Crawl (my) website with username/password protected pages

Hello,

I am hosting a domain-specific document repository website, protected from general public access via HTTP ‘basic’ username/password authentication.

I am also trialling YaCy to index these documents so that they are searchable. I would like to index the documents directly via their public (password-protected) URLs on my server for easy access, but this requires providing YaCy with the username and password. I could not find how to do this. Can anyone indicate how this could be achieved?

cheers,
Chris

Hey, this is a common question to which I have always, sadly, responded with “sorry, not possible”. The problem is that, for the search index to operate safely, YaCy would need the same accounts, and therefore the same authentication system, as your target web pages. Otherwise YaCy would expose your protected content through an unprotected search.

To make this possible, we would need to find an authentication standard (such as SAML or AD accounts) to integrate into YaCy as well, thereby connecting a protected system with a protected search. This is a complex task that I never wanted to attempt; too many things are involved that I don’t know how to do.

The only way to go in this direction would be to set up a project in which the customer (you) agrees to hire a consultant who brings knowledge of the required authentication technologies, together with the customer’s permission to modify their own system to make everything compatible. Don’t get me wrong: I am not looking for a job here; the consultant required would not be me!

A different approach would be to find a competent person somewhere in the community who has the required knowledge and is willing to work on YaCy to integrate this.

If the search and web hosting are on the same host, you could adjust the authentication settings to allow access from localhost/127.0.0.1 without a password, or set up an exception for the known IP of your YaCy instance. Not exactly secure, but it could help.
Don’t forget to switch YaCy to Robinson mode, or your ‘secrets’ will be broadcast to the p2p network :-P.
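
As a rough illustration of the exception described above, the decision the web server has to make can be sketched in Python. This is neither web server configuration nor anything YaCy provides; the function name `may_skip_password` and the addresses are assumptions made up for the example (203.0.113.7 is a documentation address standing in for a known crawler IP).

```python
import ipaddress

# Hypothetical allowlist: loopback plus one example address standing in
# for the crawler's known IP. All entries are assumptions for illustration.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("127.0.0.0/8"),
    ipaddress.ip_network("::1/128"),
    ipaddress.ip_network("203.0.113.7/32"),
]


def may_skip_password(remote_addr, trusted=TRUSTED_NETWORKS):
    """Return True if a request from remote_addr may bypass the password."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Anything that is not a valid IP address is never trusted.
        return False
    # Only compare against networks of the same IP version.
    return any(addr.version == net.version and addr in net for net in trusted)
```

Every address outside the list would still be asked for credentials, which is why this only helps when the crawler's address is stable and not shared with anyone else.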

That may be easier, at least to get myself started. Note that the queries do not come from localhost, because a DDNS URL is being accessed rather than the local host directly, as I want the indexed URLs to also be accessible from outside my intranet.

For the time being I’ve configured Apache to let traffic from my public IP bypass basic auth. (Requests appear to come from a domain name with my ISP’s assigned dynamic IP embedded in it, so this is no different from using the IP address directly, but it is obviously not a proper solution.)

I should mention that YaCy is also mapped to a subfolder, with Apache providing https access; it is accessed via the same credentials and is for my use only.

Also, yes, I am running in Robinson mode as it is my own domain-specific search.

Overall I still think having YaCy handle basic authentication would be preferable, and it is compatible with the use case of private intranet indexing. I will try hacking this functionality into my YaCy instance in an ad hoc manner; it doesn’t look too difficult. Proper integration, e.g. with other credentials such as those used in Apache, would be much more difficult though.
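
For reference, the client side of basic authentication is small: the crawler sends an `Authorization` header containing the Base64 encoding of `username:password`. The sketch below shows this in Python purely as an illustration; it is not YaCy code (YaCy is written in Java), and the helper names, host name and credentials are all made up for the example.

```python
import base64
import urllib.request
from urllib.parse import urlsplit


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(url, username, password, allowed_host):
    """Create a request, attaching credentials only for allowed_host.

    Restricting by host avoids leaking the password to other sites
    that a crawler might reach by following links.
    """
    request = urllib.request.Request(url)
    if urlsplit(url).hostname == allowed_host:
        request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Hypothetical values for illustration only.
req = build_request("https://docs.example.org/private/report.pdf",
                    "crawler", "s3cret", allowed_host="docs.example.org")
# urllib.request.urlopen(req) would then fetch the protected page.
```

Base64 is an encoding, not encryption, so the credentials should only ever be sent over https.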