Pertaining to How YaCy Crawls Websites

I’m relatively new to YaCy, and would like to know whether it already crawls “politely” by default, or can be configured to, as detailed in these sources:

I want to make sure that I don’t cause the owners of other websites any trouble with my instance.

I followed the tweaks from this link:

YaCy’s tendency to be very impolite when it crawls websites is another problem. It does respect a Crawl-delay value if a website has set one in its robots.txt, but websites that don’t request a limit get no delay at all; against them YaCy will effectively behave like a denial-of-service attack. A look at source/net/yacy/cora/protocol/ClientIdentification.java reveals some very small default limits. These can be increased by changing a few lines:

The defaults:

    public static final int clientTimeoutInit = 10000;
    public static final int minimumLocalDeltaInit = 10;    // the minimum time difference between access of the same local domain
    public static final int minimumGlobalDeltaInit = 500;  // the minimum time difference between access of the same global domain

Change them to:

    // yacy default 10 seconds, too long to wait - try 2 seconds
    public static final int clientTimeoutInit = 2000;
    // yacy uses a default of 10 ms, which is very impolite. Use 2000 (2 seconds) instead
    public static final int minimumLocalDeltaInit = 2000;  // the minimum time difference between access of the same local domain
    public static final int minimumGlobalDeltaInit = 2000; // the minimum time difference between access of the same global domain

The same file also contains the Agent class, which holds the crawler’s identification strings:

    public static class Agent {
        public final String userAgent; // the name that is sent in the http request to identify the agent
        public final String robotIDs;  // the name that is used in robots.txt to identify the agent
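
To make the two delta constants above concrete, here is a minimal sketch of the idea (my own illustration, not YaCy’s actual crawl scheduler — the class and method names are hypothetical): before each fetch, the crawler waits until at least the configured number of milliseconds has passed since its last access to the same host.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal sketch, not YaCy code: enforce a minimum delay between
    // requests to the same host, analogous to minimumLocalDeltaInit /
    // minimumGlobalDeltaInit above (values in milliseconds).
    public class PolitenessGate {

        private final long minimumDeltaMillis; // e.g. 2000 after the tweak above
        private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

        public PolitenessGate(long minimumDeltaMillis) {
            this.minimumDeltaMillis = minimumDeltaMillis;
        }

        // Blocks until the host may be contacted again, then records the access.
        public synchronized void awaitTurn(String host) throws InterruptedException {
            long last = lastAccess.getOrDefault(host, 0L);
            long waitMillis = last + minimumDeltaMillis - System.currentTimeMillis();
            if (waitMillis > 0) Thread.sleep(waitMillis);
            lastAccess.put(host, System.currentTimeMillis());
        }
    }

With a 2000 ms delta, a loop calling awaitTurn("example.org") before every request would touch that host at most once every two seconds, no matter how many of its URLs sit in the queue. (A real crawler would schedule waits per host rather than block globally as this sketch does, but the politeness guarantee is the same.)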

ClientIdentification.java is also where a custom user-agent should be set. YaCy does have a configuration option for it, but that setting isn’t actually used.
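
For completeness, here is the server-side half of this exchange. A site operator who wants to slow the crawler down publishes a Crawl-delay (in seconds) in robots.txt, keyed by the robots token that the robotIDs field above is matched against; for a stock YaCy instance that token is yacybot:

    User-agent: yacybot
    Crawl-delay: 10

As the quoted post notes, YaCy honors this value when it is present; the tweaks above only change what happens when it is absent.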


Sorry that I took so long to get back to you. Thank you for informing me of this. I will be sure to try out these changes as soon as possible.