I’m using the “Restrict to start domain(s)” option to limit indexing to a predefined list of domains. However, I’ve noticed that YaCy sometimes ignores this setting and includes external sites in the index.
Can you pin down the cause, i.e. when exactly it ignores the rule?
Personally, I use the Advanced Crawler and set both Load Filter on URLs and Document Filter > Filter on URL to the same pattern, for example https://www.yacy.net/.*
When I want to crawl the whole site, I also set Unlimited crawl depth for URLs matching with to that pattern. This works well, with a single exception: when a sitemap is used as the source, external sites are loaded as well, which is probably a bug.
The first filter is a “loader” filter: it determines which URLs are crawled and have their links followed. The second is an “indexing” filter: it determines which pages are actually stored in the index.
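To make the difference between the two filters concrete, here is a minimal sketch in Python. It is not YaCy's code, only an illustration of the idea: the function name `crawl`, the in-memory `SAMPLE_LINKS` map and the narrower `/docs/` index pattern are all made up for this example, and I assume the pattern has to match the whole URL.

```python
import re

def crawl(start_url, links, load_pattern, index_pattern):
    """Toy two-stage crawl: the load filter gates fetching and link
    following, the index filter gates what ends up in the index."""
    load_re = re.compile(load_pattern)
    index_re = re.compile(index_pattern)
    queue, seen, indexed = [start_url], set(), []
    while queue:
        url = queue.pop(0)
        # Loader filter: URLs that fail it are never fetched or followed.
        if url in seen or not load_re.fullmatch(url):
            continue
        seen.add(url)
        # Indexing filter: a loaded page is stored only if it matches too.
        if index_re.fullmatch(url):
            indexed.append(url)
        queue.extend(links.get(url, []))
    return indexed

# Hypothetical link graph standing in for real pages.
SAMPLE_LINKS = {
    "https://www.yacy.net/": ["https://www.yacy.net/docs/a", "https://example.org/"],
    "https://www.yacy.net/docs/a": ["https://www.yacy.net/"],
}

SITE = r"https://www\.yacy\.net/.*"
DOCS_ONLY = r"https://www\.yacy\.net/docs/.*"  # hypothetical narrower index filter

print(crawl("https://www.yacy.net/", SAMPLE_LINKS, SITE, SITE))
print(crawl("https://www.yacy.net/", SAMPLE_LINKS, SITE, DOCS_ONLY))
```

With the same pattern in both filters, everything that is loaded is also indexed. With a narrower index pattern, the start page is still loaded so its links can be followed, but only the matching pages are stored.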
If it’s a bug, please try to examine it, describe it in as much detail as possible, and file an issue on GitHub. We need developers. @orbiter has not shown much interest for quite some time. I can’t do Java, but I do some docs. We need someone who will take care of the code.
In the Advanced Crawler I entered the URL http://google.com and checked “Restrict to start domain(s)”. It then indexes Google mirrors such as www.google.com.tj, www.google.com.py and other .com.* domains. It seems YaCy doesn't recognize them as different domain names.
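One way this symptom could arise, and I am only guessing here, not quoting YaCy's source, is a domain filter that allows anything after the host name instead of requiring the host to end there. The sketch below compares such a loose pattern with a stricter one that could be typed into the URL filter fields as a workaround. Both patterns and the helper `allowed` are my own assumptions, and the pattern is assumed to match the whole URL.

```python
import re

# Hypothetical loose filter: nothing forces the host name to end after ".com".
LOOSE = r"https?://(www\.)?google\.com.*"

# Stricter variant: after the host only an optional port and a path may follow.
STRICT = r"https?://(www\.)?google\.com(:\d+)?(/.*)?"

def allowed(pattern, url):
    """Return True if the pattern matches the entire URL."""
    return re.fullmatch(pattern, url) is not None

for url in (
    "http://google.com/search",
    "https://www.google.com",
    "http://www.google.com.tj/",
    "http://www.google.com.py/",
):
    print(url, "loose:", allowed(LOOSE, url), "strict:", allowed(STRICT, url))
```

The loose pattern accepts the .com.tj and .com.py hosts because `.*` swallows the extra suffix, while the strict one rejects them. If the stricter pattern in the filter fields keeps the mirrors out, that would be a useful detail to include in the bug report.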
Then it’s a bug!