How to speed up YaCy crawls?

Is this hardcoded or is there a parameter to tweak?
Where can I watch these numbers?
How can I double the number of crawling slots and the timeout?
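One way to check whether these limits are exposed as parameters at all is to search YaCy's configuration files for thread- and timeout-related keys. A quick sketch, assuming the usual layout with defaults/yacy.init and DATA/SETTINGS/yacy.conf under the installation directory (adjust the paths to your setup):

```python
import re
from pathlib import Path

# Paths assume a standard YaCy install directory; adjust to your setup.
CANDIDATES = [Path("defaults/yacy.init"), Path("DATA/SETTINGS/yacy.conf")]

# Key names that hint at crawler concurrency, loader slots or network timeouts.
PATTERN = re.compile(r"thread|timeout|loader|busysleep", re.IGNORECASE)

for path in CANDIDATES:
    if not path.is_file():
        continue
    print(f"--- {path} ---")
    for line in path.read_text(errors="replace").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line and PATTERN.search(line):
            print(line)
```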

I can see tons of timeout errors in the rejected URLs list, although the sites can be reached, and I do not think that all of them have “I do not want to be crawled automatically” restrictions in place.

(although this Cloudflare garbage is getting more and more popular)

If this is the reason for the poor performance, shouldn’t the “Network Access” page show a full workload on the right side (which it only does during GET robots.txt)?

Time URL Fail-Reason
2020/12/26 17:43:48 http://www.www.gasthaus-rosengarten.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.gasthaus-rosengarten.ch duration=144 for url http://www.www.gasthaus-rosengarten.ch/
2020/12/26 17:43:48 https://www.chin-min.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.chin-min.ch duration=155 for url https://www.chin-min.ch/
2020/12/26 17:43:48 https://www.kulturgut.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.kulturgut.ch/ to https://www.denkmal.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.schoenzeit.webstores.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:47 http://cajon.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://cajon.ch/ to https://cajon.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.dalucia.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.dalucia.ch duration=356 for url https://www.dalucia.ch/
2020/12/26 17:43:47 https://www.sineq.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.sineq.ch duration=468 for url https://www.sineq.ch/
2020/12/26 17:43:47 https://www.schoenshop.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.schoenshop.ch/ to https://www.schoenzeit.webstores.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.heime-consulting.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.heime-consulting.ch/ to https://www.heime-consulting.ch/home.html placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.www.telefonmonteur.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.telefonmonteur.ch duration=135 for url http://www.www.telefonmonteur.ch/
2020/12/26 17:43:46 https://suli.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://suli.ch/ to https://www.suli.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.alpsu.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://citywettingen.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://citywettingen.ch/ to https://www.citywettingen.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.alpsu.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://www.alpsu.ch/ to https://www.alpsu.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.tdcag.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://stattboden-riet.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: stattboden-riet.ch duration=73 for url https://stattboden-riet.ch/
2020/12/26 17:43:46 https://www.lenzerheide2020.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.lenzerheide2020.ch/ to https://www.biathlon-lenzerheide.swiss/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.reklamegrafik.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.reklamegrafik.ch/ to http://www.artatelier.ch/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.etude-avocat-belhocine.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.etude-avocat-belhocine.ch duration=464 for url https://www.etude-avocat-belhocine.ch/
2020/12/26 17:43:45 http://tdcag.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://tdcag.ch/ to https://www.tdcag.ch/ placed on crawler queue for double-check
2020/12/26 17:43:44 https://suxessm

If this were the case, doubling the number of queues should double the PPM.
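To get a clearer picture of what these rejections actually are (real timeouts, redirect double-checks, missing robots.txt), the fail reasons can be tallied. A small sketch, assuming the rejected-URL list is exported to a plain text file in the format shown above:

```python
from collections import Counter
import re
import sys

# Expects lines like the ones above, e.g.:
# 2020/12/26 17:43:48 https://www.chin-min.ch/ TEMPORARY_NETWORK_FAILURE cannot load: ...
LINE = re.compile(r"^\S+ \S+ (\S+) (\S+)\s*(.*)$")

reasons = Counter()
with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    for raw in f:
        m = LINE.match(raw.strip())
        if not m:
            continue
        url, code, detail = m.groups()
        if "Redirect of URL" in detail:
            reasons[code + " (redirect double-check)"] += 1
        elif url.endswith("/robots.txt"):
            reasons[code + " (robots.txt)"] += 1
        elif "Client can" in detail:
            reasons[code + " (client could not resolve/connect)"] += 1
        else:
            reasons[code] += 1

for reason, n in reasons.most_common():
    print(f"{n:6d}  {reason}")
```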

Hi.

I only had the high speed that some of you are referring to here with autocrawl. I got around 4000 PPM and easily filled up the 500-thread Loader. But when I run a manual advanced crawl, it’s extremely slow, no matter how many different URLs are processed.

I downloaded some datasets from domainsproject.org, then joined and shuffled the final file. This provided a really wide set of URL addresses. Then I loaded 1M of them into YaCy. I tried one big file and several small files, with the same result: PPM is 0-10 at best. I know YaCy is working with this domain set, because my internal DNS resolver got hammered really badly when I loaded the file, but after that it stops. When I terminate all the crawls, YaCy catches up again and the PPM rises.
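For reference, a minimal sketch of the join-and-shuffle step (the file names are placeholders for the downloaded domainsproject.org lists, and turning the bare domains into http:// start URLs is just one way to make them loadable as a crawl start list):

```python
import glob
import random

# Placeholder paths; point INPUT_GLOB at the downloaded domainsproject.org lists.
INPUT_GLOB = "domains/*.txt"
OUTPUT = "shuffled_urls.txt"

urls = set()
for name in glob.glob(INPUT_GLOB):
    with open(name, encoding="utf-8", errors="replace") as f:
        for line in f:
            domain = line.strip()
            if domain and not domain.startswith("#"):
                # Turn bare domain names into start URLs (one way to do it).
                urls.add("http://" + domain + "/")

urls = list(urls)
random.shuffle(urls)

with open(OUTPUT, "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")

print("wrote", len(urls), "shuffled start URLs to", OUTPUT)
```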


Hi casey,
every Internet provider gives you full speed in the first minutes and throttles your speed down to a minimum after a while; this is to show the best results while you run speed tests.
The second thing is DNS. Check out “unbound” for that.
Cheers

M

Well, that was not my issue. I have symmetric gigabit speed on an academic network, and there is no aggregation or FUP. Second, I had a dedicated DNS server just for the YaCy search. It helped a lot when I loaded a big DNS dataset, but not much for the crawl results. But I stopped playing with YaCy a while ago, so I don’t need a fix for that anymore. But thanks.
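For anyone else who wants to check whether DNS resolution is the bottleneck on their node, a small timing sketch (the hostnames are just examples taken from the rejected-URL list above; substitute domains from your own start list):

```python
import socket
import time

# Example hostnames taken from the rejected-URL list above;
# substitute domains from your own start list.
HOSTS = ["www.kulturgut.ch", "www.chin-min.ch", "www.dalucia.ch"]

for host in HOSTS:
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 80)
        status = "ok"
    except socket.gaierror as exc:
        status = "failed (" + str(exc) + ")"
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{host:35s} {elapsed_ms:8.1f} ms  {status}")
```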

I worked on the crawling code again and made two enhancements: one to reduce blocking, and another to speed up the start of crawls with very large URL start lists.

Hopefully this helps a bit.
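Not the actual YaCy change, but as an illustration of the second idea (feeding a very large start-URL list into the crawl queue from a background thread, so the caller is not blocked while the file is read), a minimal sketch:

```python
import queue
import threading

def feed_start_urls(path, crawl_queue, batch_size=1000):
    """Read a very large start-URL file in the background and push it
    into the crawl queue in batches, so the caller returns immediately."""
    def worker():
        batch = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                url = line.strip()
                if url:
                    batch.append(url)
                if len(batch) >= batch_size:
                    crawl_queue.put(batch)   # blocks only when the queue is full
                    batch = []
        if batch:
            crawl_queue.put(batch)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# Usage: the crawl loop can start consuming batches while the file
# ("shuffled_urls.txt" is a placeholder) is still being read.
crawl_queue = queue.Queue(maxsize=100)
feeder = feed_start_urls("shuffled_urls.txt", crawl_queue)
```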

I am having the same issue (with latest master build).

Specs:

  • CPU: Intel i7-6700 (dedicated)
  • Mem: 32 GB (24 GB allocated for YaCy)
  • Disks: SSD (RAID 1)
  • Network: 1Gbit (no throttling / Hetzner)

Usage

I monitor resource usage via Munin, but none of the resources gets even remotely close to its maximum.

Observation

Even with very many (1000+) different domains in the queue, I hardly get beyond 2-6 crawling threads. Either something further down the line is blocking (maybe the indexing?), or there are so many URLs in the queue that computing the next URL to crawl becomes slow. Currently my queue is at 13M links.

I managed to get up to 1438 PPM today, the secret being to crawl a LOT of sites simultaneously, in my case 25 sites.
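That matches what you would expect if the limit is per-host politeness rather than bandwidth: with roughly one fetch per host per second (the exact per-host delay YaCy applies is an assumption here), the ceiling is about hosts × 60 pages per minute. A quick back-of-the-envelope check:

```python
# Rough PPM ceiling when the crawler waits a fixed politeness delay between
# requests to the same host. The 1-second per-host delay is an assumption,
# not YaCy's exact configured value.
def max_ppm(parallel_hosts, per_host_delay_s=1.0):
    return parallel_hosts * 60.0 / per_host_delay_s

for hosts in (1, 5, 25, 100):
    print(f"{hosts:4d} hosts -> ~{max_ppm(hosts):6.0f} PPM ceiling")

# 25 hosts -> ~1500 PPM, the same ballpark as the 1438 PPM observed above.
```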
