How to speed up YaCy crawls?

Is this hardcoded or is there a parameter to tweak?
Where can I watch these numbers?
How can I double the number of crawling slots and the timeout?
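One way to check whether these limits are exposed as parameters at all is to search YaCy's configuration files for thread- and timeout-related keys. A quick sketch, assuming the usual layout with defaults/yacy.init and DATA/SETTINGS/yacy.conf under the installation directory (adjust the paths to your setup):

```python
import re
from pathlib import Path

# Paths assume a standard YaCy install directory; adjust to your setup.
CANDIDATES = [Path("defaults/yacy.init"), Path("DATA/SETTINGS/yacy.conf")]

# Key names that hint at crawler concurrency, loader slots or network timeouts.
PATTERN = re.compile(r"thread|timeout|loader|busysleep", re.IGNORECASE)

for path in CANDIDATES:
    if not path.is_file():
        continue
    print(f"--- {path} ---")
    for line in path.read_text(errors="replace").splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line and PATTERN.search(line):
            print(line)
```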

I can see tons of timeout errors in the rejected URLs list, although the sites can be reached, and I do not think that all of them have “I do not want to be crawled automatically” restrictions in place.

(although this Cloudflare garbage is getting more and more popular)

If this is the reason for the poor performance, shouldn’t the “Network Access” page show a full workload on the right side (which it only does during GET robots.txt)?

Time URL Fail-Reason
2020/12/26 17:43:48 http://www.www.gasthaus-rosengarten.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.gasthaus-rosengarten.ch duration=144 for url http://www.www.gasthaus-rosengarten.ch/
2020/12/26 17:43:48 https://www.chin-min.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.chin-min.ch duration=155 for url https://www.chin-min.ch/
2020/12/26 17:43:48 https://www.kulturgut.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.kulturgut.ch/ to https://www.denkmal.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.schoenzeit.webstores.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:47 http://cajon.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://cajon.ch/ to https://cajon.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.dalucia.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.dalucia.ch duration=356 for url https://www.dalucia.ch/
2020/12/26 17:43:47 https://www.sineq.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.sineq.ch duration=468 for url https://www.sineq.ch/
2020/12/26 17:43:47 https://www.schoenshop.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.schoenshop.ch/ to https://www.schoenzeit.webstores.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.heime-consulting.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.heime-consulting.ch/ to https://www.heime-consulting.ch/home.html placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.www.telefonmonteur.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.telefonmonteur.ch duration=135 for url http://www.www.telefonmonteur.ch/
2020/12/26 17:43:46 https://suli.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://suli.ch/ to https://www.suli.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.alpsu.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://citywettingen.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://citywettingen.ch/ to https://www.citywettingen.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.alpsu.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://www.alpsu.ch/ to https://www.alpsu.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.tdcag.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://stattboden-riet.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: stattboden-riet.ch duration=73 for url https://stattboden-riet.ch/
2020/12/26 17:43:46 https://www.lenzerheide2020.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.lenzerheide2020.ch/ to https://www.biathlon-lenzerheide.swiss/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.reklamegrafik.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.reklamegrafik.ch/ to http://www.artatelier.ch/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.etude-avocat-belhocine.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.etude-avocat-belhocine.ch duration=464 for url https://www.etude-avocat-belhocine.ch/
2020/12/26 17:43:45 http://tdcag.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://tdcag.ch/ to https://www.tdcag.ch/ placed on crawler queue for double-check
2020/12/26 17:43:44 https://suxessm

If this were the case, doubling the number of queues should double the PPM.
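To get a clearer picture of what these rejections actually are (real timeouts, redirect double-checks, missing robots.txt), the fail reasons can be tallied. A small sketch, assuming the rejected-URL list is exported to a plain text file in the format shown above:

```python
from collections import Counter
import re
import sys

# Expects lines like the ones above, e.g.:
# 2020/12/26 17:43:48 https://www.chin-min.ch/ TEMPORARY_NETWORK_FAILURE cannot load: ...
LINE = re.compile(r"^\S+ \S+ (\S+) (\S+)\s*(.*)$")

reasons = Counter()
with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
    for raw in f:
        m = LINE.match(raw.strip())
        if not m:
            continue
        url, code, detail = m.groups()
        if "Redirect of URL" in detail:
            reasons[code + " (redirect double-check)"] += 1
        elif url.endswith("/robots.txt"):
            reasons[code + " (robots.txt)"] += 1
        elif "Client can" in detail:
            reasons[code + " (client could not resolve/connect)"] += 1
        else:
            reasons[code] += 1

for reason, n in reasons.most_common():
    print(f"{n:6d}  {reason}")
```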

Hi.

I only had the high speed that some of you are referring to here with autocrawl. I got around 4000 PPM and easily filled up the 500-thread Loader. But when I run a manual advanced crawl, it’s extremely slow, no matter how many different URLs are processed.

I downloaded some datasets from domainsproject.org, then joined and shuffled the final file. This provided a really wide set of URL addresses. Then I loaded 1M of them into YaCy. I tried one big file and several small files, with the same result: PPM is 0-10 at best. I know YaCy is working with this domain set, because my internal DNS resolver got hammered really badly when I loaded the file, but after that it stops. When I terminate all the crawls, YaCy catches up again and the PPM rises.
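For reference, a minimal sketch of the join-and-shuffle step (the file names are placeholders for the downloaded domainsproject.org lists, and turning the bare domains into http:// start URLs is just one way to make them loadable as a crawl start list):

```python
import glob
import random

# Placeholder paths; point INPUT_GLOB at the downloaded domainsproject.org lists.
INPUT_GLOB = "domains/*.txt"
OUTPUT = "shuffled_urls.txt"

urls = set()
for name in glob.glob(INPUT_GLOB):
    with open(name, encoding="utf-8", errors="replace") as f:
        for line in f:
            domain = line.strip()
            if domain and not domain.startswith("#"):
                # Turn bare domain names into start URLs (one way to do it).
                urls.add("http://" + domain + "/")

urls = list(urls)
random.shuffle(urls)

with open(OUTPUT, "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")

print("wrote", len(urls), "shuffled start URLs to", OUTPUT)
```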


Hi casey,
every Internet provider gives you full speed in the first minutes and throttles your speed down to a minimum after a while; this is to show the best results while you run speed tests.
The second thing is DNS. Check out “unbound” for that.
Cheers

M

Well, that was not my issue. I have symmetric gigabit speed on an academic network, and there is no aggregation or FUP. Second, I had a dedicated DNS server just for the YaCy search. It helped a lot when I loaded a big DNS dataset, but not much for the crawl results. But I stopped playing with YaCy a while ago, so I don’t need a fix for that anymore. But thanks.
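For anyone else who wants to check whether DNS resolution is the bottleneck on their node, a small timing sketch (the hostnames are just examples taken from the rejected-URL list above; substitute domains from your own start list):

```python
import socket
import time

# Example hostnames taken from the rejected-URL list above;
# substitute domains from your own start list.
HOSTS = ["www.kulturgut.ch", "www.chin-min.ch", "www.dalucia.ch"]

for host in HOSTS:
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, 80)
        status = "ok"
    except socket.gaierror as exc:
        status = "failed (" + str(exc) + ")"
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{host:35s} {elapsed_ms:8.1f} ms  {status}")
```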

I worked on the crawling code again and made two enhancements: one to reduce blocking, and another to speed up the start of crawls with very large URL start lists.

Hopefully this helps a bit.
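Not the actual YaCy change, but as an illustration of the second idea (feeding a very large start-URL list into the crawl queue from a background thread, so the caller is not blocked while the file is read), a minimal sketch:

```python
import queue
import threading

def feed_start_urls(path, crawl_queue, batch_size=1000):
    """Read a very large start-URL file in the background and push it
    into the crawl queue in batches, so the caller returns immediately."""
    def worker():
        batch = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                url = line.strip()
                if url:
                    batch.append(url)
                if len(batch) >= batch_size:
                    crawl_queue.put(batch)   # blocks only when the queue is full
                    batch = []
        if batch:
            crawl_queue.put(batch)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t

# Usage: the crawl loop can start consuming batches while the file
# ("shuffled_urls.txt" is a placeholder) is still being read.
crawl_queue = queue.Queue(maxsize=100)
feeder = feed_start_urls("shuffled_urls.txt", crawl_queue)
```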

I am having the same issue (with latest master build).

Specs:

  • CPU: Intel i7-6700 (dedicated)
  • Mem: 32 GB (24 GB allocated for YaCy)
  • Disks: SSD (RAID 1)
  • Network: 1Gbit (no throttling / Hetzner)

Usage

I monitor resource usage via Munin, but none of the resources gets even remotely close to its maximum.

Observation

Even with very many (1000+) different domains in the queue, I hardly get beyond 2-6 crawling threads. Either something further down the line is blocking (maybe the indexing?), or there are so many URLs in the queue that computing the next URL to crawl becomes slow. Currently my queue is at 13M links.

I managed to get up to 1438 PPM today, the secret being to crawl a LOT of sites simultaneously, in my case 25 sites.
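That matches what you would expect if the limit is per-host politeness rather than bandwidth: with roughly one fetch per host per second (the exact per-host delay YaCy applies is an assumption here), the ceiling is about hosts × 60 pages per minute. A quick back-of-the-envelope check:

```python
# Rough PPM ceiling when the crawler waits a fixed politeness delay between
# requests to the same host. The 1-second per-host delay is an assumption,
# not YaCy's exact configured value.
def max_ppm(parallel_hosts, per_host_delay_s=1.0):
    return parallel_hosts * 60.0 / per_host_delay_s

for hosts in (1, 5, 25, 100):
    print(f"{hosts:4d} hosts -> ~{max_ppm(hosts):6.0f} PPM ceiling")

# 25 hosts -> ~1500 PPM, the same ballpark as the 1438 PPM observed above.
```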
