We are running into an issue where our instance gets blocked and served a reCAPTCHA challenge. I wanted to know if YaCy supports using proxies for its crawler, and whether I can detect blocked requests that end in a reCAPTCHA page to trigger a change of the proxy IP.
The way I think this would work: if a page fails with a certain fail response, I could set up a script that updates the proxy IP through the CLI/API so the crawler continues from a different IP address.
Yes, you can use a remote proxy / crawl through a proxy.
Would it be awfully evil to set the user-agent of the crawler to one of Google's?..
No, you can’t be more evil than Google if you do that.
Sure, it is OK to do creative things with YaCy, but please keep in mind that this tool should not be used to harm target servers. The whole crawl policy (like throttling down if the target host is slow, and other limits) is about being nice to content hosters - even if we bend the rules a bit (see: robots.txt). This is not a tool to be evil.
@Orbiter @StopTrackingUs @Sven792
I agree and was being a bit facetious in my suggestion.
That being said: if I, for some weird reason, wanted to index a site that goes out of its way NOT to be indexed by YaCy, I'd rather use grab-site and/or webrecorder and import the resulting WARCs (see links below).
Personally though, I'd rather just domain-block them from being crawled by my YaCy;
I mean, it's very much their loss, not being indexed… isn't it?
For making WARCs:
GitHub - ArchiveTeam/grab-site: The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
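For the record, driving grab-site from a script could look like the sketch below. It assumes grab-site is installed and on PATH; the `--no-offsite-links` flag is documented in grab-site's README as keeping the crawl on the starting domain.

```python
# Sketch: build and run a grab-site crawl that produces WARCs.
# Assumes grab-site is installed and available on PATH.
import subprocess

def grab_site_cmd(url: str, same_domain: bool = True) -> list[str]:
    """Build the grab-site command line for crawling `url` into a WARC."""
    cmd = ["grab-site"]
    if same_domain:
        # --no-offsite-links: do not follow links off the starting domain.
        cmd.append("--no-offsite-links")
    cmd.append(url)
    return cmd

def run_grab(url: str) -> None:
    """Start the crawl; grab-site writes WARCs into a per-crawl directory."""
    subprocess.run(grab_site_cmd(url), check=True)
```

The resulting WARC files can then be imported into YaCy afterwards, as suggested above.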
If the point of this topic is only to not get blocked, maybe by appearing as googlebot, then there is no need to use a proxy. Just switch on the googlebot option in the crawl start; that's already there. It's not there to fake anything: YaCy actually complies completely with robots.txt and is therefore excluded from all sites that exclude everything but the googlebot. That does not make sense to me… so just use that option.
I just imagine this…
Company CEO to Company Webmaster: