I'm starting a simple spider with about 800 MB of heap space, but after running for a day or so it throws a series of OutOfMemoryErrors. Example:
Exception in thread "pool-1-thread-97" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-92" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-99" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-88" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-101" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-100" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-103" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-94" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-1-thread-105" java.lang.OutOfMemoryError: Java heap space
Here's a version of the code I'm using. The else branch is a little more complicated, but it only involves System.out.println; it doesn't write to a database or anything similar.
import java.util.List;

import us.codecraft.webmagic.*;
import us.codecraft.webmagic.processor.*;

public class App implements PageProcessor {

    private Site site = Site.me().setRetryTimes( 3 ).setSleepTime( 1000 );

    @Override
    public void process( Page page ) {
        // Queue every link found on the page for crawling.
        List<String> links = page.getHtml().links().all();
        page.addTargetRequests( links );

        // Extract the content of <meta name="generator">, if present.
        page.putField( "generator", page.getHtml().xpath( "/html/head/meta[@name=\"generator\"]/@content" ).toString() );
        String generator = page.getResultItems().get( "generator" );
        if ( generator == null ) {
            // Nothing to report for this page.
            page.setSkip( true );
        }
        else {
            System.out.println( generator );
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main( String[] args ) {
        System.setProperty( "slf4j.internal.verbosity", "WARN" );
        Spider.create( new App() ).addUrl( "https://github.com/code4craft/webmagic/" ).thread( 5 ).run();
    }
}
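One possible explanation, which is an assumption and not something confirmed in this report, is that the set of pending and already-seen URLs grows without limit because process() queues every link on every page. The sketch below is plain Java and does not use any WebMagic API. BoundedFrontier, maxPending, offer, poll and pendingCount are hypothetical names chosen for illustration. It only shows the general idea of capping how many URLs may wait in memory at once.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/**
 * Hypothetical helper (not part of WebMagic): a de-duplicating URL queue
 * that refuses new entries once maxPending URLs are already waiting.
 */
class BoundedFrontier {

    private final int maxPending;
    private final Set<String> seen = new HashSet<>();
    private final Queue<String> pending = new ArrayDeque<>();

    BoundedFrontier(int maxPending) {
        this.maxPending = maxPending;
    }

    /** Returns true if the URL was queued, false if it was a duplicate or the queue is full. */
    synchronized boolean offer(String url) {
        if (pending.size() >= maxPending) {
            return false; // full: drop the URL without remembering it
        }
        if (!seen.add(url)) {
            return false; // already queued or crawled earlier
        }
        pending.add(url);
        return true;
    }

    /** Returns the next URL to crawl, or null if none is waiting. */
    synchronized String poll() {
        return pending.poll();
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedFrontier frontier = new BoundedFrontier(3);
        int accepted = 0;
        for (int i = 0; i < 5; i++) {
            if (frontier.offer("https://example.com/page" + i)) {
                accepted++;
            }
        }
        // Only the first three URLs fit under the cap.
        System.out.println("accepted=" + accepted + " pending=" + frontier.pendingCount());
    }
}
```

Note that the seen set in this sketch still grows with every distinct URL that is accepted, so it slows memory growth and does not eliminate it. Whether anything like this applies to the spider above would have to be checked with a heap dump.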