Scraping proxy vs sites that use cookies

lingvini · 14 October 2021 08:57

Hi All,

I did some experiments with the scraping proxy feature, which should in theory trigger a crawl based on the visited pages. However, as far as I can see, only a small part of the visited pages gets crawled in the end.

I see there is a warning on the proxy settings page stating that no pages will be crawled that store any private information, e.g. where a POST/GET action is performed or where cookies are used.

The POST/GET part is more or less fine, but is this cookie restriction really necessary? Most pages use cookies nowadays (e.g. for controlling advertisements), and this restriction keeps the percentage of crawled pages (triggered by the proxy) very low.

Is there any workaround for this?

Thanks,

lingvini · 14 October 2021 15:41

More details here: http://localhost:8090/ProxyIndexingMonitor_p.html

Indexing with Proxy

YaCy can be used to ‘scrape’ content from pages that pass the integrated caching HTTP proxy. When scraping proxy pages then no personal or protected page is indexed; those pages are detected by properties in the HTTP header (like Cookie-Use, or HTTP Authorization) or by POST-Parameters (either in URL or as HTTP protocol) and automatically excluded from indexing.
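
To make the rule above concrete, here is a minimal Python sketch of that kind of exclusion check. It is not YaCy's actual code: the function name `is_indexable`, the header sets, and the decision to treat any URL query string as "parameters in the URL" are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit

# Assumption: these header names stand in for the "Cookie-Use" and
# "HTTP Authorization" properties mentioned in the description above.
PRIVATE_REQUEST_HEADERS = {"cookie", "authorization"}
PRIVATE_RESPONSE_HEADERS = {"set-cookie", "www-authenticate"}


def is_indexable(method, url, request_headers, response_headers):
    """Return True only if nothing suggests the page is personal or protected."""
    # Anything other than a plain GET (e.g. a POST form submission) is excluded.
    if method.upper() != "GET":
        return False
    # Parameters passed in the URL are treated as potentially personal.
    if urlsplit(url).query:
        return False
    # HTTP header names are case-insensitive, so compare in lower case.
    request_names = {name.lower() for name in request_headers}
    response_names = {name.lower() for name in response_headers}
    if request_names & PRIVATE_REQUEST_HEADERS:
        return False
    if response_names & PRIVATE_RESPONSE_HEADERS:
        return False
    return True


# A page that only sets an advertising cookie is still excluded.
print(is_indexable("GET", "http://example.org/news.html", {}, {"Set-Cookie": "ad=1"}))
```

Under these assumptions a single cookie header is enough to exclude a page, whatever the cookie is used for, which would explain why so few proxied pages end up in the index.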

Orbiter · 21 October 2021 15:31

Scraping with the proxy is very dangerous because it could capture personalized data. It therefore applies the most restrictive criteria, to prevent any mistake a user could make.
Personalized data includes authentication tokens in cookies: very common, very dangerous.

You are right that cookies are very common, and these restrictions do make the scraping feature almost completely unusable. In past years I tried to remove it from the code, but each time someone asked me not to remove it… So I just did not advertise it.

So in most cases it is right to say: don’t use the scraping proxy for indexing. Don’t use the proxy!

pporozin · 19 April 2022 13:23

Hello,

How should the setup for an intranet search engine look, then? An unauthenticated crawler? The majority of intranets are protected by credentials, and YaCy cannot use/index protected pages, right? Might the scraping proxy be the solution?

Boris

Orbiter · 23 May 2022 12:39

The right way to index an intranet is using the crawler in intranet mode.