Using proxies for crawling to avoid getting blocked

StopTrackingUs · 3 February 2021 13:45

Hi All,

We are running into an issue where our instance gets blocked and is served a reCAPTCHA challenge. I wanted to know whether YaCy supports using proxies for crawling, and whether I can detect blocked requests that end in a reCAPTCHA so that I can trigger a change of the proxy IP.

The way I think this would work: if a page fails with a certain failure response, I could set up a script that updates the proxy IP through the CLI/API, so that crawling continues from a different IP address.
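A minimal sketch of that idea in Python, to make the moving parts concrete. Everything named here is an assumption for illustration: the status codes and captcha markers treated as "blocked", the `ProxyRotator` class, and the example proxy addresses. How the chosen proxy is then applied to YaCy is deliberately left out, since that depends on which setting or API is used.

```python
from typing import Sequence

# Assumed signals that a response means "we got blocked"; tune these per target site.
BLOCK_STATUSES = {403, 429}
CAPTCHA_MARKERS = ("g-recaptcha", "recaptcha")


def looks_blocked(status: int, body: str) -> bool:
    """Return True if the response looks like a block or a captcha page."""
    if status in BLOCK_STATUSES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


class ProxyRotator:
    """Hypothetical helper that cycles through a fixed list of proxies."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def rotate(self) -> str:
        """Advance to the next proxy, wrapping around at the end of the list."""
        self._index = (self._index + 1) % len(self._proxies)
        return self.current


def proxy_after_response(rotator: ProxyRotator, status: int, body: str) -> str:
    """Rotate only when the response looks blocked; otherwise keep the current proxy."""
    if looks_blocked(status, body):
        return rotator.rotate()
    return rotator.current


# Example addresses from the documentation range, not real proxies.
rotator = ProxyRotator(["192.0.2.10:3128", "192.0.2.11:3128"])
ok_proxy = proxy_after_response(rotator, 200, "<html>regular page</html>")
new_proxy = proxy_after_response(rotator, 200, '<div class="g-recaptcha"></div>')
```

The script would then push `new_proxy` into the crawler's proxy configuration by whatever means is available; that step is not shown here.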

Orbiter · 7 February 2021 11:34

Yes, you can use a remote proxy / crawl through a proxy.

stembod · 19 February 2021 20:21

@StopTrackingUs

Would it be awfully evil to set the crawler’s user-agent to one of Google’s?

Sven792 · 8 March 2021 20:46

No, you can’t be more evil than Google if you do that.

Orbiter · 8 March 2021 20:51

Sure, it is OK to do creative things with YaCy, but please keep in mind that this tool should not be used to harm target servers. The whole crawl policy (such as throttling down if the target host is slow, and other limits) is about being nice to content hosts, even if we bend the rules a bit (see: robots.txt). This is not a tool to be evil.

stembod · 10 March 2021 19:04

@Orbiter @StopTrackingUs @Sven792

I agree, and I was being a bit facetious in my suggestion.

That being said, if I for some weird reason wanted to index a site that goes out of its way to NOT be indexed by YaCy, I’d rather use grab-site and/or webrecorder and import the resulting WARCs (see links below).

Personally though, I’d rather just domain-block them from being crawled by my YaCy.

I mean, it’s very much their loss, not being indexed… isn’t it?

For making WARCs:
grab-site (ArchiveTeam/grab-site on GitHub): the archivist's web crawler, with WARC output, a dashboard for all crawls, and dynamic ignore patterns
https://webrecorder.io/

Orbiter · 10 March 2021 19:21

If this topic is only about not getting blocked, maybe by using googlebot, then there is no need to use a proxy. Just switch on the googlebot option in the crawl start; that’s already there. It’s not there to fake anything: YaCy complies fully with robots.txt and is therefore shut out of every site that excludes everything but the googlebot. That does not make sense to me… so just use that option.
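To illustrate the kind of robots.txt being described, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the URL, and the `yacybot` agent token are assumptions made up for the example; the point is only that a crawler which honours such rules is excluded under its own name, while `Googlebot` is let in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that admits Googlebot and excludes every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/some/page.html"

# A compliant crawler checks its own agent token before fetching.
googlebot_allowed = rules.can_fetch("Googlebot", url)
yacy_allowed = rules.can_fetch("yacybot", url)

print("Googlebot allowed:", googlebot_allowed)
print("yacybot allowed:", yacy_allowed)
```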

stembod · 10 March 2021 19:36

I just imagine this…

Company CEO to Company Webmaster: