Change crawl depth on intranet

H-BLOGX · 20 January 2022 10:44

Good morning, I use YaCy purely as an intranet search engine. On a virtualized Ubuntu Server 20.04 (running under Proxmox 7.*) I have a USB hard drive (NTFS) mounted. On the mount directory I run the lighttpd web server (https://www.lighttpd.net/), which shows me the file structure as an HTTP page (similar to FTP), i.e. the root directory of the web server is accessible at http://192.168.1.1. With YaCy I then crawl http://192.168.1.1, which also works very well.

But I notice that it does not go deeper than 3 directory levels, i.e. I often see only the first 3 directories, although there is another directory below them containing e.g. 2 PDFs.

What do I have to set, and where, so that EVERYTHING on the USB hard drive is crawled?

Kind regards - H-BLOGX

Orbiter · 5 February 2022 19:55

How do you start the crawl? If you actively increase the depth of the crawl, you should also get more than 3 directories deep.

H-BLOGX · 26 February 2022 07:47

Hi Orbiter,

Sorry for the late feedback. YaCy is just an add-on that I’m trying to get running cleanly. I deleted everything again, went to “Production” => “Advanced Crawler” and entered http://192.168.1.1 as “Start Point”. Under “Crawler Filter” I entered 0 for “Crawling Depth”. For “Unlimited crawl depth for URLs matching with” I also entered http://192.168.1.1. Then I pressed “Start New Crawl Job” at the bottom.

Under “Administration” => “Process Scheduler” I set the “Event Trigger” to “run regular at 07:00 h” (every morning at 7 o’clock).

If I interpret it correctly, “Crawling Depth” means the number of further domains the crawler moves on to, right?
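
For comparison, crawlers usually count depth in link hops from the start URL rather than in domains. The following is a minimal sketch of that idea, not YaCy's implementation: the `LINKS` table and the `crawl` function are hypothetical and only mimic a directory listing in which each page links to its subdirectories.

```python
from collections import deque

# Hypothetical link graph: each directory-listing page links to its children.
LINKS = {
    "/": ["/a/"],
    "/a/": ["/a/b/"],
    "/a/b/": ["/a/b/c/"],
    "/a/b/c/": ["/a/b/c/doc1.pdf", "/a/b/c/doc2.pdf"],
}


def crawl(start, max_depth):
    """Breadth-first crawl returning every URL reachable within max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links found on pages at the depth limit
        for child in LINKS.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return sorted(seen)
```

In this toy tree, `crawl("/", 0)` returns only the start page, while the two PDFs in the fourth-level directory are only reached with `crawl("/", 4)`.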

I would like to have ONLY the files in my intranet indexed. If, for example, a Word or PDF document contains a link to https://www.amazon.de, YaCy should NOT switch to Amazon and continue crawling there!

But I want ALL my intranet directories and ALL files under them to be indexed.
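
A requirement like this is typically expressed as a regular-expression filter on URLs. The sketch below only illustrates the matching logic; the pattern and the `allowed` helper are assumptions for illustration, and whether YaCy's filter fields treat the same pattern identically is not confirmed here.

```python
import re

# Assumed pattern: accept only URLs on the intranet host 192.168.1.1.
# The dots are escaped so that they match literal dots only.
INTRANET_ONLY = re.compile(r"http://192\.168\.1\.1(/.*)?")


def allowed(url):
    """Return True if the whole URL matches the intranet-only pattern."""
    return INTRANET_ONLY.fullmatch(url) is not None
```

With this pattern, `allowed("http://192.168.1.1/docs/a.pdf")` is accepted, while `allowed("https://www.amazon.de")` and a neighbouring host such as `http://192.168.1.10/` are rejected.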

The whole thing should be updated once a day. For speed, searches should be served from a cache, but the daily crawl should ensure that deleted or moved files, for example, are removed from the cache and re-indexed accordingly.

What else do I still have to set? Sorry, I’m still very much a layman here, and I can’t get along with the old YaCy wiki at all.

Kind regards - H-BLOGX