Pertaining to How YaCy Crawls Websites

I’m relatively new to YaCy, and would like to know whether it already crawls “politely” by default, or can be configured to, as detailed in these sources:

I want to make sure that I don’t cause the owners of other websites any trouble with my instance.

I followed the tweaks from this link:

YaCy’s tendency to be very impolite when it crawls websites is another problem. It does respect a Crawl-delay value if a website has set one in its robots.txt, but websites that don’t request a limit get no delay at all; against them YaCy will effectively behave like a denial-of-service attack. A look at source/net/yacy/cora/protocol/ClientIdentification.java reveals some very small default limits. These can be increased by changing a few lines:

The defaults:

    public static final int clientTimeoutInit = 10000;
    public static final int minimumLocalDeltaInit = 10;    // the minimum time difference between access of the same local domain
    public static final int minimumGlobalDeltaInit = 500;  // the minimum time difference between access of the same global domain

Change them to:

    // yacy default 10 seconds, too long to wait - try 2 seconds
    public static final int clientTimeoutInit = 2000;
    // yacy uses a default of 10 ms, which is very impolite. Use 2000 (2 seconds) instead
    public static final int minimumLocalDeltaInit = 2000;  // the minimum time difference between access of the same local domain
    public static final int minimumGlobalDeltaInit = 2000; // the minimum time difference between access of the same global domain

The same file also contains the Agent class, which holds the crawler’s identification strings:

    public static class Agent {
        public final String userAgent; // the name that is sent in the http request to identify the agent
        public final String robotIDs;  // the name that is used in robots.txt to identify the agent
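
To make the two delta constants above concrete, here is a minimal sketch of the idea (my own illustration, not YaCy’s actual crawl scheduler — the class and method names are hypothetical): before each fetch, the crawler waits until at least the configured number of milliseconds has passed since its last access to the same host.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal sketch, not YaCy code: enforce a minimum delay between
    // requests to the same host, analogous to minimumLocalDeltaInit /
    // minimumGlobalDeltaInit above (values in milliseconds).
    public class PolitenessGate {

        private final long minimumDeltaMillis; // e.g. 2000 after the tweak above
        private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

        public PolitenessGate(long minimumDeltaMillis) {
            this.minimumDeltaMillis = minimumDeltaMillis;
        }

        // Blocks until the host may be contacted again, then records the access.
        public synchronized void awaitTurn(String host) throws InterruptedException {
            long last = lastAccess.getOrDefault(host, 0L);
            long waitMillis = last + minimumDeltaMillis - System.currentTimeMillis();
            if (waitMillis > 0) Thread.sleep(waitMillis);
            lastAccess.put(host, System.currentTimeMillis());
        }
    }

With a 2000 ms delta, a loop calling awaitTurn("example.org") before every request would touch that host at most once every two seconds, no matter how many of its URLs sit in the queue. (A real crawler would schedule waits per host rather than block globally as this sketch does, but the politeness guarantee is the same.)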

ClientIdentification.java is also where a custom user-agent should be set. YaCy does have a configuration option for it, but that setting isn’t actually used.
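
For completeness, here is the server-side half of this exchange. A site operator who wants to slow the crawler down publishes a Crawl-delay (in seconds) in robots.txt, keyed by the robots token that the robotIDs field above is matched against; for a stock YaCy instance that token is yacybot:

    User-agent: yacybot
    Crawl-delay: 10

As the quoted post notes, YaCy honors this value when it is present; the tweaks above only change what happens when it is absent.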


Sorry that I took so long to get back to you. Thank you for informing me of this. I will be sure to try out these changes as soon as possible.