Why does YaCy ignore Allow rules in robots.txt and index everything?

This makes no sense! I don't want to index everything the site's CMS engine generates! robots.txt has rules for exactly that, so the real pages get indexed instead of the garbage, and YaCy does not care about them at all!

That's not all!

If I use

```
User-agent: *
Allow: /$
Allow: /showthread.php?t=*$
Disallow: /
```

YaCy sees only the Disallow: / line and declares that the whole site does not allow indexing. That is wrong: the more specific Allow rules should take precedence over the blanket Disallow.
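For context, the `*` and `$` wildcards and the precedence behavior expected here are the ones formalized in RFC 9309: the longest matching pattern wins, and on a tie Allow beats Disallow. Below is a minimal Java sketch of that evaluation, assuming wildcard support; the class and method names are illustrative, not YaCy's actual parser:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Minimal sketch of RFC 9309 rule resolution: the longest matching
// pattern wins, and on a tie Allow beats Disallow. All names here are
// illustrative; this is not YaCy's real robots.txt code.
public class RobotsRules {

    private record Rule(boolean allow, String pattern, Pattern regex) {}

    private final List<Rule> rules = new ArrayList<>();

    public void add(boolean allow, String pattern) {
        rules.add(new Rule(allow, pattern, compile(pattern)));
    }

    // Translate robots.txt wildcards: '*' matches any run of
    // characters, a trailing '$' anchors the end of the path.
    private static Pattern compile(String p) {
        boolean anchored = p.endsWith("$");
        if (anchored) p = p.substring(0, p.length() - 1);
        StringBuilder re = new StringBuilder("^");
        for (String part : p.split("\\*", -1)) {
            re.append(Pattern.quote(part)).append(".*");
        }
        re.setLength(re.length() - 2); // drop the ".*" added after the last part
        if (anchored) re.append("$");
        return Pattern.compile(re.toString());
    }

    // The longest matching pattern decides; unmatched paths are allowed.
    public boolean isAllowed(String path) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.regex().matcher(path).find()) {
                if (best == null
                        || r.pattern().length() > best.pattern().length()
                        || (r.pattern().length() == best.pattern().length() && r.allow())) {
                    best = r;
                }
            }
        }
        return best == null || best.allow();
    }

    public static void main(String[] args) {
        RobotsRules rules = new RobotsRules();
        rules.add(true, "/$");
        rules.add(true, "/showthread.php?t=*$");
        rules.add(false, "/");
        System.out.println(rules.isAllowed("/"));                    // true
        System.out.println(rules.isAllowed("/showthread.php?t=42")); // true
        System.out.println(rules.isAllowed("/admin/login.php"));     // false
    }
}
```

Run against the rules above, this allows / and the showthread.php URLs while disallowing everything else, which is the behavior the robots.txt was written to get.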

And that's not all!

YaCy also caches robots.txt, and even when the file changes on the server, it keeps using the stale copy!
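On the caching point: RFC 9309 does allow a crawler to cache robots.txt, but says the cached copy should not be used for more than 24 hours. A minimal sketch of such an expiry check, with illustrative names (this is not YaCy's cache implementation):

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative cache-freshness check per RFC 9309, which says a cached
// robots.txt SHOULD NOT be used for more than 24 hours. Not YaCy code.
public class RobotsCacheEntry {
    private static final Duration MAX_AGE = Duration.ofHours(24);

    private final String body;
    private final Instant fetchedAt;

    public RobotsCacheEntry(String body, Instant fetchedAt) {
        this.body = body;
        this.fetchedAt = fetchedAt;
    }

    // A stale entry should trigger a re-fetch from the origin server,
    // so changes to robots.txt are picked up within a day.
    public boolean isStale(Instant now) {
        return Duration.between(fetchedAt, now).compareTo(MAX_AGE) > 0;
    }

    public String body() { return body; }

    public static void main(String[] args) {
        RobotsCacheEntry e = new RobotsCacheEntry(
            "User-agent: *\nDisallow: /",
            Instant.now().minus(Duration.ofHours(30)));
        System.out.println(e.isStale(Instant.now())); // true -> re-fetch
    }
}
```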

Please file an issue on GitHub if the behavior is reproducible.
Sadly, in recent months there has not been much effort put into fixing issues. But for the record, it's good to keep track of bugs.

If you're able to hack on it yourself (it's written in Java), that's an advantage! The robots.txt logic should be somewhere here in the code.
