Crawling user-defined pages in the HTDOCS folder

I am in the process of setting up YaCy for a cluster of private systems to catalog content independently generated on each cluster peer. The content is not very large. It consists of many small files. Currently there are 4145 files using only 31MB of disk storage, uncompressed.

I created a symbolic link named www in the HTDOCS folder, pointing to the folder with these files. They are HTML files, and the folder has an index.html file with links to every one of the other 4145 files in that folder.

I can enter the URL http://localhost:/www/index.html and see the list of links to the 4145 files in that folder, and clicking on any of those links displays the contents of that file.

I started a crawl using http://localhost:/www/index.html and I get the graphic with localhost, but nothing else. Then I see the URL was rejected with a 401 error, unauthorized access.

I also tried crawling “http://www.127.0.0.1:57093/index.html”, which failed with: Reason: scraper cannot load URL: Client can’t execute: www.127.0.0.1 duration=1 for url http://www.127.0.0.1:57093/index.html/. I also tried http://www.localhost:57093/index.html and http://127.0.0.1:57093/www/index.html.

On the advanced crawler page the www prefix is necessary; otherwise the URL entry box shows the yellow “!” saying the website’s robot rules don’t allow crawlers.

What does it take to allow my local pages to be searched but disallow external robots like Google, Bing, etc.?

All of the above was done on a headless Debian server. I also tried to do this on my laptop with YaCy set up on port 8007. I get a different error, but the crawl still has a rejected URL, this time with
http://localhost:8007/www/index.html FINAL_LOAD_CONTEXT url does not match must-match filter https?://(www.)?\Qlocalhost\E/www/index.html.*

I’m using the simple “Load web pages, crawler” form under First Steps. Where do I find that regex filter, and why is it using https instead of http (server.https=false, remotesearch.https.preferred=false in yacy.conf)? Is that why it’s failing?

Where do I configure these “must match” filters?

I’d suspect the form of the hostname.
http://localhost:/www/index.html should be http://localhost/www/index.html, because “:” is a delimiter for specifying the port number, which is not used here.
And “http://www.127.0.0.1:57093/index.html” is most probably also not correct, because the hostname is “127.0.0.1” (an IP address), which is not usually combined with a hostname part like “www.”. 57093 is a port number; are you sure your HTTP process listens on port 57093? (The default port for www is 80.) I’d try http://127.0.0.1:57093/index.html or http://127.0.0.1/index.html instead. First, try the address in a browser.
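
If it helps to see how these URLs break down, here is a small Java sketch (my own illustration, nothing from YaCy itself) showing which part Java takes as the hostname and which as the port; note that java.net.URL is lenient and does not check whether the host actually resolves.

    import java.net.URL;

    public class UrlParts {
        public static void main(String[] args) throws Exception {
            // the URLs from this thread
            String[] urls = {
                "http://localhost:/www/index.html",      // ":" with no digits -> no port (-1)
                "http://www.127.0.0.1:57093/index.html", // host is "www.127.0.0.1", not 127.0.0.1
                "http://127.0.0.1:57093/index.html"      // host "127.0.0.1", port 57093
            };
            for (String s : urls) {
                URL u = new URL(s);
                System.out.println(s + " -> host=" + u.getHost() + ", port=" + u.getPort());
            }
        }
    }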

You can try your regular expression (https?://(www.)?\Qlocalhost\E/www/index.html.*) in Production > Target Analysis > RegexTest (/RegexTest.html) against your example URLs to find out whether the error is in the regexp. Please mind it’s a Java pattern, not a common regexp.
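
If you want to check this outside of YaCy, the same test is a few lines of Java. This is only my own sketch; my guess is that the filter fails because it was generated without the :8007 port while your start URL contains it, and note that “https?” also matches plain http, since the “?” only makes the “s” optional.

    import java.util.regex.Pattern;

    public class MustMatchCheck {
        public static void main(String[] args) {
            // The must-match filter from the rejection message; "https?" matches
            // both http and https because the "?" makes the "s" optional.
            Pattern filter = Pattern.compile("https?://(www.)?\\Qlocalhost\\E/www/index.html.*");

            // The rejected start URL contains the port :8007, but the filter has
            // no port, so the whole URL does not match.
            System.out.println(filter.matcher("http://localhost:8007/www/index.html").matches()); // false
            System.out.println(filter.matcher("http://localhost/www/index.html").matches());      // true
        }
    }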

Firstly, thank you very much for your reply.

The regex is not mine; it’s built into YaCy somewhere. I’m just reporting the rejected-URL error information.

I am certain of the ports I use. I use different ports for different systems and tests on those systems. I made several attempts to formulate a URL based on the regex error information, and yes, some of them were not good attempts. I’m just very puzzled by the rejections and how to understand them, especially since the config doesn’t enable HTTPS connections, yet the regex reports the URL as using https.

I finally decided to focus on the local laptop to first understand the configuration, then migrate it to the headless server once I do, but I’m still very puzzled.

I can’t even seem to repeat my first success using YaCy in Intranet mode on my laptop, where I was able to crawl the local content I had on my nginx web server without involving any external crawling or results from elsewhere. It was very fast. I was able to get a similar result, but it took much longer and (apparently) involved results from other peers, despite it being Intranet mode.

I’d like to accomplish that, then see if I can move that content under the YaCy web server, crawl it there, and get the same results.

If I can do all that, I should be able to repeat the same process on my headless server. But I have to say, I’m an IT pro, highly experienced with Linux and with setting up hosts for various applications around the world, and I’m quite frustrated trying to get a simple local crawl to work with YaCy.

I’m investing so much effort into YaCy because it offers the unique capability of a distributed and decentralized search index. But the limited community activity on this forum (at least in English) and the condition of the documentation are a hindrance.

Sorry, Thommas! I was puzzled by those port numbers; now I’m clear that you know what you’re doing ;-))

What about /robots.txt on these HTTP servers?
(Yes, it’s possible to disallow other robots with user-agent settings.) What do the HTTP error logs say?
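
For the “disallow external robots” part, a robots.txt along these lines on the web server that serves the pages would be the usual way. This is just a sketch; YaCy’s crawler normally identifies itself as “yacybot”, but check the agent string your peer actually sends (e.g. in the web server’s access log):

    # allow only the YaCy crawler, refuse all other robots
    User-agent: yacybot
    Disallow:

    User-agent: *
    Disallow: /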

Exactly! I’ve been using YaCy for two years now, for the same reason, and it’s still suboptimal and frustrating. I don’t write Java, so things are not easy for me to fix. At least I try to fix the docs and report the bugs, but usually with no reply.

I suppose the main problem is that the main developer/founder doesn’t have time, as he is working on a new project, and hasn’t delegated rights to anyone else, so bugs are not fixed and even pull requests hang for several months.