Call to Terminate a Crawl

MG404 · 20 May 2024 11:10

I am using the current documentation here to start a crawl with whatever URL I feed into it. Is there also an API call to terminate a crawl? I want to crawl quite a few URLs, and if they are processed concurrently it would cause YaCy to crash due to the current garbage collector bug. So I want each crawl to run for 5-10 minutes and then terminate.

Sviatoslav · 20 May 2024 13:08

I do not know whether such an API call exists, but your task can be solved another way.
When you start indexing, you can set the maximum number of pages the crawler should process.
This is controlled by the crawlingDomMaxPages parameter, which you can lower as you see fit.
You can also reduce the indexing depth, crawlingDepth.

To reduce the overall load from the crawler, you can try lowering the
Crawler Pool parameter (on the page /PerformanceQueues_p.html#ThreadPoolSettings). This parameter limits the number of downloads the crawler can process in parallel.
Its default value is 200. For my weak peer I had to lower it to 10.

Sviatoslav · 20 May 2024 13:23

Especially for those who are too lazy to use the Google translator.

I don’t know if there is such an API call, but your problem can be solved another way.
When starting indexing, you can set the maximum number of pages that the crawler should process.
This is controlled by the crawlingDomMaxPages parameter, which you can reduce to your liking.
You can also reduce crawlingDepth.
To reduce the overall load from the crawler, you can try reducing the Crawler Pool parameter (at /PerformanceQueues_p.html#ThreadPoolSettings). This setting limits the number of downloads the crawler can process in parallel.

By default it has a value of 200. For my weak peer I had to lower this value to 10.
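
For anyone who wants to script this, here is a rough sketch of how a crawl-start URL with those two limits might be put together. It only builds the URL and does not send anything. The parameter names crawlingDomMaxPages and crawlingDepth are the ones mentioned above; everything else is an assumption on my part: the /Crawler_p.html endpoint, port 8090, the crawlingstart, crawlingMode, crawlingURL and crawlingDomMaxCheck parameters, and the helper name build_crawl_start_url (hypothetical). Please check them against the API documentation for your YaCy version.

```python
from urllib.parse import urlencode


def build_crawl_start_url(start_url, max_pages_per_domain=100, depth=1,
                          base="http://localhost:8090"):
    """Build a crawl-start URL that caps pages per domain and crawl depth.

    Assumptions (not confirmed in this thread): the endpoint name, the
    default port, and every parameter except crawlingDomMaxPages and
    crawlingDepth.
    """
    params = {
        "crawlingstart": "1",            # assumed: any value triggers the start
        "crawlingMode": "url",           # assumed: crawl from a single start URL
        "crawlingURL": start_url,
        "crawlingDepth": depth,          # lower depth means a smaller crawl
        "crawlingDomMaxCheck": "on",     # assumed: switch enabling the page cap
        "crawlingDomMaxPages": max_pages_per_domain,
    }
    # urlencode percent-escapes the start URL so it survives as one parameter
    return f"{base}/Crawler_p.html?{urlencode(params)}"


if __name__ == "__main__":
    print(build_crawl_start_url("http://example.com/", max_pages_per_domain=50, depth=2))
```

The resulting URL can be requested with any HTTP client. Pages ending in _p are normally admin-protected, so the request will likely need your peer's admin credentials.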