YaCy ignores the option Restrict to start domain(s)

erikobox · 13 February 2025 01:24

I’m using the “Restrict to start domain(s)” option to limit indexing to a predefined list of domains. However, I’ve noticed that YaCy sometimes ignores this setting and includes external sites in the index.

okybaca · 13 February 2025 08:53

Are you able to pin down the cause, i.e. when exactly it ignores the rule?

Personally, I use the Advanced Crawler with “Load Filter on URLs” and “Document Filter > Filter on URL”, for example https://www.yacy.net/.* in both cases. When I want to crawl the whole site, I also use “Unlimited crawl depth for URLs matching with”. This works well, with a single exception: when sitemaps are used as the source, external sites are loaded as well, which is probably a bug.
The first filter is a “loader” filter: it determines which URLs are crawled and have their links followed. The second is an “indexing” filter: it determines which pages are actually stored in the index.
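
To make the difference between the two filters concrete, here is a minimal sketch in Python. It is not YaCy code: the function names `should_load` and `should_index` are made up for illustration, and it assumes the filter has to match the whole URL, which is what the trailing `.*` in the pattern suggests.

```python
import re

# Pattern taken from the post above. YaCy's filters are Java regular
# expressions; Python's `re` is used here only to illustrate the idea.
LOAD_FILTER = re.compile(r"https://www.yacy.net/.*")
INDEX_FILTER = re.compile(r"https://www.yacy.net/.*")


def should_load(url: str) -> bool:
    """Loader filter: is the URL fetched and are its links followed?"""
    # Assumption: the whole URL must match, not just a part of it.
    return LOAD_FILTER.fullmatch(url) is not None


def should_index(url: str) -> bool:
    """Indexing filter: is the fetched document stored in the index?"""
    return INDEX_FILTER.fullmatch(url) is not None


if __name__ == "__main__":
    for url in ("https://www.yacy.net/docs/", "https://github.com/yacy/"):
        print(url, "load:", should_load(url), "index:", should_index(url))
```

With both filters set to the same pattern, a URL outside www.yacy.net is neither loaded nor stored. Loosening only the loader filter would let the crawler follow links through other sites while still storing only yacy.net pages.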

If it’s a bug, please try to examine it, describe it in as much detail as possible, and file an issue on GitHub. We need developers. @orbiter has not shown much interest for quite a while. I can’t do Java, but I do some docs. We need someone who will take care of the code.

erikobox · 15 February 2025 12:26

Advanced Crawler → URL http://google.com with “Restrict to start domain(s)” checked. It then indexes Google mirrors such as www.google.com.tj, www.google.com.py and other .com.* domains. It seems YaCy doesn’t recognize them as different domain names.
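
For comparison, here is a rough sketch of a filter that would keep those mirrors out. It says nothing about how YaCy actually implements “Restrict to start domain(s)”, which is unknown here; `start_domain_filter` and `url_allowed` are hypothetical helpers. The idea is only that the host must end exactly at the start domain, so that google.com.tj is not treated as part of google.com.

```python
import re
from urllib.parse import urlsplit


def start_domain_filter(start_url: str) -> str:
    """Build a regex that accepts only the start host and its subdomains."""
    host = urlsplit(start_url).hostname
    if host is None:
        raise ValueError(f"no host in start URL: {start_url!r}")
    # After the host only a port, '/', '?', '#' or the end of the URL may
    # follow, so a longer name such as google.com.tj is rejected.
    return r"https?://([^/?#@]+\.)?" + re.escape(host) + r"(:\d+)?([/?#].*)?"


def url_allowed(pattern: str, url: str) -> bool:
    """True if the whole URL matches the filter pattern."""
    return re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    pattern = start_domain_filter("http://google.com")
    print(pattern)
    for url in (
        "http://google.com/",
        "https://maps.google.com/maps?q=x",
        "http://www.google.com.tj/",
        "http://www.google.com.py/search",
    ):
        print(url, url_allowed(pattern, url))
```

The printed pattern uses only basic regex syntax, so something like it could probably be tried in “Load Filter on URLs” and “Document Filter > Filter on URL” as a workaround, though that is untested here.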

okybaca · 16 February 2025 13:24

Then it’s a bug!