I have YaCy and Pi-hole running on the same device now. This may cure the problem of crawling slowing down after a while.

Today I fed a list of about 478 forum URLs into YaCy to test it with Pi-hole running on the same PC.

I started the crawl of the sites at a depth of 1. It was OK and fast for a short time, then it would slow down to about 50 PPM (pages per minute).
I started checking the YaCy logs and the query page on the Pi-hole to blacklist problem sites.
A number of sites were asking for a 10-second crawl delay, so they got blacklisted in one way or another.
I kept blacklisting sites and restarting the crawler many times.
I was also clearing the web cache and the robots.txt cache.
It took me about 2 hours to get the crawl to finish properly.
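Instead of finding the slow sites one by one in the logs, the URL list could be pre-screened for sites that demand a long crawl delay. Here is a rough Python sketch of the idea; the file name `forums.txt` and the 10-second threshold are just assumptions:

```python
# crawl_delay_check.py - pre-screen a URL list for sites whose robots.txt
# demands a long Crawl-delay, so they can be blacklisted before crawling.
# Assumes the start URLs sit one per line in "forums.txt".
import urllib.robotparser
from urllib.parse import urlparse

THRESHOLD = 10  # seconds; sites asking for this or more get flagged

with open("forums.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        continue  # robots.txt unreachable; let the crawler decide
    delay = rp.crawl_delay("*")  # None if no Crawl-delay directive
    if delay is not None and delay >= THRESHOLD:
        print(f"{parts.netloc}  Crawl-delay: {delay}s -> blacklist candidate")
```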

It's been crawling for about 4 hours now with my internet connection maxed out (see pic).
The notebook I'm using has an i5 CPU released 12 years ago and an HDD.

These are the blocklists I have created and shared, some just today.
See https://pi-hole.net/ for info on installing one.
There is a good forum as well. https://discourse.pi-hole.net/

The Blocklists.
To be entered on the YaCy peer's blocklist page:
https://smokingwheels.github.io/Pi-hole/yacy

To be added to the Pi-hole adlists:
https://smokingwheels.github.io/Pi-hole/yacypiholelist

These must be entered one by one into Pi-hole's domain blacklist; they are regex entries (see the sketch after this list):
https://smokingwheels.github.io/Pi-hole/yacypiholeregex

To be added to the Pi-hole adlists:
https://smokingwheels.github.io/Pi-hole/allhosts

The hosts file used on the computer:
https://smokingwheels.github.io/Pi-hole/hosts
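Entering the regex list one by one can be scripted. Here is a rough sketch, assuming Pi-hole v5's `pihole --regex` command and that it runs on the Pi-hole host itself with sufficient privileges:

```python
# add_regex.py - feed each line of the regex list to Pi-hole one by one.
# Assumes Pi-hole v5's `pihole --regex` subcommand is available.
import subprocess
import urllib.request

LIST_URL = "https://smokingwheels.github.io/Pi-hole/yacypiholeregex"

with urllib.request.urlopen(LIST_URL) as resp:
    lines = resp.read().decode("utf-8").splitlines()

for pattern in lines:
    pattern = pattern.strip()
    if not pattern or pattern.startswith("#"):
        continue  # skip blanks and comments
    subprocess.run(["pihole", "--regex", pattern], check=True)
```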

My YaCy/Pi-hole pic:

You can try the latest version; the changes seem to improve crawl speed.

My repo is updated now at https://github.com/smokingwheels/YaCy

The main source is here: https://github.com/yacy/yacy_search_server

I did an experiment in 2017 with YaCy and a Raspberry Pi 3 B; there is a hosts file listing there.
I found this information on a YaCy search engine I have.
The hosts file listing may help the current project.

See https://forums.raspberrypi.com/viewtopic.php?f=63&t=194208&p=1216233

Meanwhile, I added DNS lookup throttling, which reduces the number of requests once there are more than 50 simultaneous requests. I hope this will also protect home routers against request flooding.
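In concept it works roughly like the sketch below (Python here for readability; the real implementation is in YaCy's Java code, and the details may differ):

```python
# dns_throttle.py - concept sketch of the throttling idea: cap concurrent
# DNS lookups at 50 so a burst of crawler threads cannot flood the resolver
# or a home router. Not YaCy's actual code, just the principle.
import socket
import threading

MAX_CONCURRENT_LOOKUPS = 50  # the threshold mentioned above
_dns_slots = threading.Semaphore(MAX_CONCURRENT_LOOKUPS)

def throttled_lookup(hostname):
    """Resolve hostname; block while 50 lookups are already in flight."""
    with _dns_slots:  # a 51st caller waits here until a slot frees up
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            return None  # unresolvable host
```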

Thanks.
It works well.
The crawler filter works well too.

Here is a better pic.

I found a possible error; dos2unix is needed to fix it.

GitHub for Windows changes the line endings, so you need dos2unix.
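If dos2unix is not at hand, a tiny script does the same job; the file name here is just an example:

```python
# fix_line_endings.py - what dos2unix does: rewrite a file with Unix (LF)
# line endings instead of Windows (CRLF) ones.
with open("allhosts", "rb") as f:
    data = f.read()

with open("allhosts", "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))
```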

Will test soon.

From your Pi-hole screenshot it looks like the throttling is a little bit too strong? Maybe I should reduce it a little bit?

Sure, go ahead; I will test.
Note that I have 5 Pi-holes running at the same time.
I get the 150-concurrent-requests warning a lot less now.
My old router could only handle 75 queries a second; my new one about 100 queries a second.
With the current settings it takes 10 minutes to cache DNS for a list of 480 sites; not a normal sort of crawl?

There is only about 1 Mb/s left on my connection for DNS when starting a crawl.
Maybe try doing all the lookups first and then start the crawler? (A sketch of the idea is below.)
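Something like this rough sketch is what I mean: warm the DNS cache before the crawl begins, so the crawler's bandwidth isn't competing with DNS traffic. The file name and pool size are just assumptions:

```python
# warm_dns.py - resolve every hostname in the start list before the crawl
# begins, so the Pi-hole already has the answers cached.
import socket
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

with open("forums.txt") as f:
    hosts = {urlparse(line.strip()).netloc for line in f if line.strip()}

def resolve(host):
    try:
        socket.gethostbyname(host)  # answer ends up in the Pi-hole cache
    except socket.gaierror:
        pass  # dead host; the crawler will skip it anyway

with ThreadPoolExecutor(max_workers=20) as pool:
    pool.map(resolve, hosts)
```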

Here is my list of what I'm crawling:
https://ncloud.zaclys.com/index.php/s/d5N8A7ffEHce6S8

If anyone wants to use my lists in their Pi-hole, you are welcome to try them.
You copy each list to the Pi-hole's /var/www/html folder and point an adlist entry at it, e.g. http://192.168.1.7/allhosts

You do this for each file on the Pi-hole.
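To save doing it by hand, something like this sketch could fetch each list into the web root. It assumes it runs on the Pi-hole itself with write access to /var/www/html; the list names are the ones shared above:

```python
# fetch_lists.py - pull each shared list into Pi-hole's web root so the
# adlists can point at e.g. http://192.168.1.7/allhosts.
import urllib.request

BASE = "https://smokingwheels.github.io/Pi-hole/"
LISTS = ["allhosts", "yacypiholelist", "hosts"]

for name in LISTS:
    urllib.request.urlretrieve(BASE + name, "/var/www/html/" + name)
    print("fetched", name)
```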

I don’t have all of them enabled. See pics.

There is a good forum if you have any problems. https://discourse.pi-hole.net/

There is an update to Pi-hole; from my early tests it looks like it has improved.

I am load testing 1 before I upgrade the remaining 4.

2 out of 5 Pi-holes have a problem; no time to fix them until after the 25th.