Optimizing Crawling

I decided to go back and give Yacy another try. The first time I ran it, it worked for several months, then the SOLR database wedged and I was never able to recover it. Now I see there are options to dump the SOLR database to a file and to restore it from a file. I’m VERY happy those options have been added.

Now I am trying to optimize indexing. Part of the issue is not related to Yacy: my router (an EdgeRouter Lite) is underpowered, and between 6 and 9 pm its CPU sometimes saturates, but I’ve got a Dream Machine SE on order, so that will be resolved shortly. In the meantime, I’m trying to optimize settings as best I can. My web server has six cores and 12 threads running at 4.2 GHz, with 96 GB of memory, so I can afford to jack up the settings a bit. I’m trying to follow the Performance section of the FAQ, but I simply am not finding the settings it refers to in my Yacy installation.

In particular this item:

Increase Number of Crawl Threads

If your web-crawl is well-balanced (many domains) and crawling is still too slow (indexing queue is empty and cannot be filled fast enough by the crawler), then it is recommended to increase the maximum number of active crawl threads.

How-To

Open the Performance Page. At the ‘Thread Pool Settings’ table you see input fields for maximum active crawl threads. Increase this number, but limit it to a number that is not too big for your (cheap) router.

Effect

The number of concurrent http-fetch requests to target web servers increases. This can speed up crawling.

Side-Effect

Your router may not be able to handle so many concurrent requests.
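
To make the effect and side-effect above concrete for myself, here is a rough, generic Python sketch of how a cap on crawl threads bounds the number of fetches in flight. This is not Yacy code and does not touch any Yacy setting: `simulate_crawl`, `max_crawl_threads`, and `fetch_seconds` are names I made up for illustration, and the "fetch" is just a sleep standing in for network latency.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def simulate_crawl(num_urls, max_crawl_threads, fetch_seconds=0.05):
    """Run num_urls fake fetches with at most max_crawl_threads in flight.

    Hypothetical illustration only; returns (elapsed_seconds, peak_in_flight).
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_fetch(_url_id):
        # Count this fetch as in flight and record the highest count seen.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(fetch_seconds)  # stands in for waiting on a remote server
        with lock:
            state["active"] -= 1

    start = time.monotonic()
    # The pool size plays the role of "maximum active crawl threads".
    with ThreadPoolExecutor(max_workers=max_crawl_threads) as pool:
        list(pool.map(fake_fetch, range(num_urls)))
    return time.monotonic() - start, state["peak"]


if __name__ == "__main__":
    for limit in (2, 8):
        elapsed, peak = simulate_crawl(num_urls=16, max_crawl_threads=limit)
        print(f"limit={limit}: {elapsed:.2f}s, peak concurrent fetches={peak}")
```

With a higher limit the same batch finishes sooner, but the peak number of simultaneous requests (which is what the router has to track) rises with it, which is the trade-off the FAQ describes.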

Why is this not done by default?

To be compliant with minimum requirements of cheap network equipment, and to protect target servers from being accessed with too many requests at the same time.

On the Performance page I do not see any ‘Thread Pool Settings’ table.