I'm tired of inserting lists of 130 sites and waiting for the crawl to finish, and then repeating this again and again

There is a list of 320,000 sites. I tried to paste it in as a text file; YaCy freezes. I tried to enter this huge list manually into the input field; YaCy freezes.

Finally I found that it could handle 130 sites per crawl request, but not more.

YaCy tries to process all sites at once. How can I organize turn-based crawling on a schedule? Say I add 130 sites at a time, one batch after another, without waiting for each crawl to finish. Will YaCy queue them and process them one by one instead of all immediately?

Can this be done?

That is a significant number! If you intend to store this whole index yourself, you will need an extremely large resource, many terabytes.
From practice, just three or four medium-sized forums, fully indexed, yield an index of more than 10 GB.

Besides that, it has been observed that a large index database consumes a lot of RAM during operation.
If memory runs short, everything slows down and the interface starts to hang. So you will also need a great deal of memory. I can't say exactly how much, but definitely no less than 16 GB.

How to set up the indexing: by default YaCy follows all the links in parallel, and it won't be able to handle that many at once.
I see no way other than to split your list into a series of local files and index them sequentially.
The sequence can be set up with the built-in Scheduler, but its maximum period is one month, and it is clear that such a list cannot be fully indexed within a month.
So personally I would write my own program, a utility that talks to YaCy and submits addresses for indexing one after another, reading from your list.
Or, more simply, such a utility could keep updating the contents of a 130-address file that YaCy's indexing is pointed at.


Wow ok I have used about 480 sites in Linux.
Do you run Windows?
May I have a copy of your list to try?


Does it crash while opening that file, or later, during crawling?

The crawler documentation recommends splitting the link-list into smaller files, which could help.

There are probably three stages when crawling: loader, queue, and crawler.
I'm not sure how exactly that works, but you can examine the system with a smaller number of URLs and find out yourself.

Oh, and: do you paste the URLs into the advanced crawler form? That's not exactly a good idea.

An easier (and supported) way is to generate a link-list and use that as the source of URLs.

Once, I experimented with crawling a huge TLD list of a whole country. I generated several HTML files out of that list (just using some shell script) and fed them into YaCy as a link-list, or maybe just using the filter regexps, I don't remember exactly. It worked.

Also, if performance is a bottleneck, check the memory settings and assign more RAM to YaCy if you have it available. That helps a lot with performance! Java & YaCy are memory hungry.


I made a list with YaCy just by exporting "Only Domain" as plain text.
The list was ~280 k long. I wrote a QuickBASIC 64 program to add http and https to each domain in the list. The list was doubled to 576 k.
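The same prefixing step can be sketched in a few lines of Python (the original was a QuickBASIC 64 program; this is just an equivalent illustration, assuming one bare domain per line):

```python
def expand_schemes(domains):
    """Return an http:// and an https:// URL for each bare domain,
    doubling the list length; blank lines are skipped."""
    urls = []
    for d in domains:
        d = d.strip()
        if d:
            urls.append(f"http://{d}")
            urls.append(f"https://{d}")
    return urls
```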
I copied the list and pasted it into the Advanced Crawler; the page was busy for about 2 mins.
I then executed the crawl and waited 4-5 mins for the crawler monitor page to respond.
The DNS requests were about 250 to 300 queries per second. I can see this on my Pi-hole.

Memory peaked at 8400 MB, now 11200 MB after about 40 mins.
I'm running YaCy on Linux with Amazon Java.

Hope that helps.

I still think that pasting a several-hundred-kB list into a web form is not a good idea on any site (and yes, not limiting that in the form's processing is a vulnerability resulting in DoS, as you described).
The easiest way to use a list is to make a link-list, put the file online, and use the URL of the list as a source for crawling as a 'link-list'. See the "Start point" section of the Advanced Crawler. You can apparently even use a local file, but I personally have never tried that.

Everything that is pasted into the crawl start window is checked against robots.txt. This is done concurrently, as I expected all URLs there to be for different hosts. That may not be the case here.
Using a link-list as okybaca recommends is maybe the best solution.