Good morning, I use YaCy purely as an intranet search engine for myself. On a virtualized Ubuntu Server 20.04 (running under Proxmox 7.*) I have a USB hard drive (NTFS) mounted. On the mount directory I run the web server lighttpd (https://www.lighttpd.net/), which shows me the file structure as an HTTP directory listing (similar to FTP), i.e. the root directory of the web server is accessible at http://192.168.1.1. With YaCy I then crawl http://192.168.1.1, which also works very well.
But I notice that it does not go deeper than 3 directories, i.e. I often see only the first 3 directory levels, although there is another directory below them containing e.g. 2 PDFs.
What do I have to set, and where, so that EVERYTHING on the USB hard drive is crawled?
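To illustrate the symptom: a crawl depth limit would produce exactly this cut-off. The following is only a rough sketch, assuming that each directory listing links only to its direct children, so that link depth equals folder depth. The example paths, the limit of 3, and the `path_depth` helper are made up for illustration and are not part of YaCy.

```python
from urllib.parse import urlparse


def path_depth(start_url: str, url: str) -> int:
    """Return how many path segments `url` lies below `start_url`."""
    start = [s for s in urlparse(start_url).path.split("/") if s]
    target = [s for s in urlparse(url).path.split("/") if s]
    if target[:len(start)] != start:
        raise ValueError("url is not below start_url")
    return len(target) - len(start)


START = "http://192.168.1.1/"
MAX_DEPTH = 3  # hypothetical depth limit, chosen to mirror the symptom

# Hypothetical URLs as they might appear in the directory listings.
urls = [
    "http://192.168.1.1/a/",
    "http://192.168.1.1/a/b/c/",
    "http://192.168.1.1/a/b/c/d/",
    "http://192.168.1.1/a/b/c/d/file.pdf",
]

# Everything deeper than MAX_DEPTH would never be fetched.
reachable = [u for u in urls if path_depth(START, u) <= MAX_DEPTH]
```

Under these assumptions, the fourth directory level and the PDF inside it fall outside the limit, which would match what I am seeing.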
Sorry for the late feedback. YaCy is just an add-on that I’m trying to get running cleanly. I deleted everything again, went to “Production” => “Advanced Crawler” and entered http://192.168.1.1 as “Start Point”. Under “Crawler Filter” I entered 0 for “Crawling Depth”. For “Unlimited crawl depth for URLs matching with” I also entered http://192.168.1.1. Then I pressed “Start New Crawl Job” at the bottom.
Under “Administration” => “Process Scheduler” I set the “Event Trigger” to “run regular at 07:00 h” (every morning at 7 o’clock).
If I interpret it correctly, “Crawling Depth” means the number of further domains the crawler follows, right?
I would like to have ONLY the files in my intranet indexed. If, for example, a Word or PDF document contains a link such as https://www.amazon.de, YaCy should NOT follow it to Amazon and continue crawling there!
But I want ALL my intranet directories and ALL files under them to be indexed.
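Expressed as a URL filter, the requirement might look like the sketch below. It assumes the crawler's filter fields are treated as regular expressions that must match the entire URL, which should be checked against the YaCy version in use. The pattern and the helper name `is_intranet` are illustrative assumptions, not YaCy settings.

```python
import re

# Assumed pattern: any URL on the intranet host, at any depth.
# Dots are escaped so that "." matches only a literal dot.
INTRANET = re.compile(r"https?://192\.168\.1\.1(:\d+)?(/.*)?")


def is_intranet(url: str) -> bool:
    """Return True only if the whole URL matches the intranet pattern."""
    return INTRANET.fullmatch(url) is not None


# For comparison: the bare start URL used as a pattern, without ".*".
# Under whole-URL matching it matches only the start page itself.
BARE = re.compile(r"http://192.168.1.1")

deep_file = "http://192.168.1.1/a/b/c/d/file.pdf"
keeps_deep_file = is_intranet(deep_file)
rejects_amazon = not is_intranet("https://www.amazon.de")
bare_misses_deep_file = BARE.fullmatch(deep_file) is None
```

If the assumption about whole-URL matching holds, entering just http://192.168.1.1 in a filter field would not cover the subdirectories, whereas a pattern ending in `.*` would.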
The whole thing should be updated once a day. Access should be served from a cache for speed, but the daily crawl should ensure that e.g. deleted or moved files are no longer in the cache and are re-indexed accordingly.
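The behaviour I am after can be described as a simple set comparison between what is already indexed and what the daily crawl finds. This is only a sketch of the idea, not how YaCy implements it; the function names and example URLs are hypothetical.

```python
def stale_urls(indexed: set[str], seen_today: set[str]) -> set[str]:
    """URLs that are in the index but were not found by today's crawl."""
    return indexed - seen_today


def new_urls(indexed: set[str], seen_today: set[str]) -> set[str]:
    """URLs found by today's crawl that are not yet in the index."""
    return seen_today - indexed


# Hypothetical example: report.pdf was moved from /old/ to /new/.
indexed = {
    "http://192.168.1.1/old/report.pdf",
    "http://192.168.1.1/docs/manual.pdf",
}
seen_today = {
    "http://192.168.1.1/new/report.pdf",
    "http://192.168.1.1/docs/manual.pdf",
}

to_remove = stale_urls(indexed, seen_today)
to_add = new_urls(indexed, seen_today)
```

In this example the old location would be dropped from the cache and the new location indexed, while the unchanged file stays as it is.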
What else do I have to set? Sorry, I am still very much a layman here, but I do not get along with the old YaCy wiki at all.