Optimizing Crawling

Nanook · 23 May 2022 02:15

I decided to go back and give Yacy another try. The first time I ran it, it worked for several months, then the SOLR database wedged and I was never able to recover it. Now I see there are options to dump the SOLR database to a file and to restore it from a file. I’m VERY happy those options have been added.

Now I am trying to optimize indexing. Part of the issue is not related to Yacy: my router (an EdgeRouter Lite) is underpowered, and between 6 and 9 pm its CPU sometimes saturates, but I’ve got a Dream Machine SE on order, so that will be resolved shortly. In the meantime, I’m trying to optimize the settings as best I can. My web server has six cores and 12 threads running at 4.2 GHz, with 96 GB of memory, so I can afford to raise the settings a bit. I’m trying to follow the Performance section of the FAQ, but I simply cannot find the settings it refers to in my Yacy installation.

In particular this item:

Increase Number of Crawl Threads

If your web-crawl is well-balanced (many domains) and crawling is still too slow (indexing queue is empty and cannot be filled fast enough by the crawler), then it is recommended to increase the maximum number of active crawl threads.

How-To

Open the Performance Page. At the ‘Thread Pool Settings’ table you see input fields for maximum active crawl threads. Increase this number, but limit it to a number that is not too big for your (cheap) router.

Effect

The number of concurrent HTTP fetch requests to target web servers increases. This can speed up crawling.

Side-Effect

Your router may not be able to handle so many concurrent requests.

Why is this not done by default?

To be compliant with the minimum requirements of cheap network equipment, and to protect target servers from being accessed with too many requests at the same time.

On the Performance page I do not see any "Thread Pool Settings" table.
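
For anyone unsure what a "maximum active crawl threads" limit does in general terms, here is a minimal Python sketch of capping concurrent fetches with a thread pool. It is not Yacy's code; `run_with_limit` and `fake_fetch` are hypothetical names made up for illustration, and the fetch is simulated with a short sleep rather than a real HTTP request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(urls, fetch, max_active_threads):
    """Call fetch(url) for every URL, with at most max_active_threads at once.

    Returns (results, peak), where peak is the highest number of fetches
    that were in flight at the same moment.
    """
    lock = threading.Lock()
    active = 0
    peak = 0

    def tracked(url):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return fetch(url)
        finally:
            with lock:
                active -= 1

    # The pool size is the cap: no more than this many fetches run together.
    with ThreadPoolExecutor(max_workers=max_active_threads) as pool:
        results = list(pool.map(tracked, urls))
    return results, peak


def fake_fetch(url):
    # Stand-in for an HTTP request; sleeps instead of touching the network.
    time.sleep(0.01)
    return len(url)
```

Raising `max_active_threads` lets more fetches overlap, which is the speed-up the FAQ describes, and also means more simultaneous connections for the router to track.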

roamn · 24 September 2022 01:34

I have found that starting a crawl generates a lot of DNS requests, on the order of 80 queries a second. I only have old computers.

I tested my home router and it can only handle 65 queries a second; the rest go unanswered.
See https://github.com/jedisct1/dnsblast
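
If you want a rough idea of how many lookups get answered without installing anything, a Python sketch along these lines may help. It is not dnsblast and does not work the same way: it goes through the operating system's resolver, so caching can inflate the numbers, and a failed lookup may be a nonexistent name rather than a dropped query. The function name `measure_lookup_rate` and its parameters are made up for this example.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def measure_lookup_rate(hostnames, resolve=socket.gethostbyname, workers=20):
    """Resolve every hostname and report how many lookups were answered.

    Returns a dict with answered, failed, seconds and answered_per_second.
    """
    def attempt(name):
        try:
            resolve(name)
            return True
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            return False

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(attempt, hostnames))
    seconds = time.perf_counter() - start

    answered = sum(outcomes)
    return {
        "answered": answered,
        "failed": len(outcomes) - answered,
        "seconds": seconds,
        "answered_per_second": answered / seconds if seconds > 0 else 0.0,
    }
```

Feed it a list of distinct hostnames from your crawl start URLs and compare the failed count with and without a crawl running.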

I run 3 piholes that work together to handle the DNS load via Dnsmasq.
https://pi-hole.net/

This is one of the piholes during a crawl with yacy; the load is spread over all 3.

The forum for PiHole is at https://discourse.pi-hole.net/

Encassion · 6 November 2022 04:04

For those who need help finding the Thread Pool Settings: Administration -> System Administration tab -> Advanced Properties tab

Scroll down until you see crawler.MaxActiveThreads
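
If you would rather check the value outside the web interface, here is a minimal Python sketch. It assumes the settings are stored as plain key=value lines, which this thread does not confirm; the location of the settings file is not given here either, so the function takes the text directly. `read_setting` is a made-up helper name and the sample value is for illustration only.

```python
def read_setting(conf_text, key):
    """Return the value for key from key=value text, or None if absent."""
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


# Hypothetical excerpt; the value is made up for illustration.
sample = """
# crawler settings
crawler.MaxActiveThreads = 100
"""

print(read_setting(sample, "crawler.MaxActiveThreads"))
```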

roamn · 13 November 2022 01:33

Try 8090/PerformanceQueues_p.html
It’s at the bottom of the page.

You can also get to it on the crawler monitor page by clicking the (100) next to Loader.
I have found that you must restart yacy for this to take effect.
Pause your crawler and restart yacy.