Issue with crawler downloading large files (.iso, .tar.gz, etc.)

Happy New Year, all!
First off, I just want to say YaCy seems to be a great tool: easy to set up, and it provides quick and useful results with the default configuration…

However, one issue I’m seeing: the crawler eventually works its way to a mirror site for Linux ISOs. Based on traffic analysis, the crawler seems to be downloading the full ISO from each mirror site (each ISO file is almost 1 GB). This ends up quickly saturating my internet connection, and the crawler’s PPM drops significantly within about 10 minutes of starting the crawler.

This is a tremendous waste of bandwidth, but I can’t seem to find a way to get the crawler to abandon the large downloads and move on to other content… Is there a way to filter out ISO files (and large archives such as .zip, .tar.gz, etc.)?

Here’s my current setup:
I’m running the latest git clone from a few days ago (dated 12/30/22); the Advanced Crawler is started with all defaults (crawl depth = 3, etc.), and the starting URL is the “rsync - ArchWiki” page.

Thanks!

One way is to limit the crawler using regular expressions in the “filters” section of the Advanced Crawler. For example, “.*\.tar\.gz” in the “Load Filter on URLs” field of the “Crawler Filter” section means that no .tar.gz files will be fetched. You can combine multiple patterns with the “or” (|) operator: for example, “.*\.tar\.gz|.*\.zip” will ignore URLs that end with .tar.gz or .zip, and adding “.*\.iso” the same way covers the ISO images from the original question (see the sketch just below for where the pattern goes).
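If I remember the Crawl Start (Expert) page correctly (double-check the exact field labels in your build), the exclusion pattern goes into the must-not-match part of that filter, while the must-match part keeps its default “.*”:

```
Load Filter on URLs
  must-match:      .*                            (default: load everything)
  must-not-match:  .*\.iso|.*\.tar\.gz|.*\.zip
```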
There are two separate filters: one for crawling (the “Crawler Filter”, which decides which URLs are loaded at all) and one for actual indexing (the “Document Filter”, which only controls what ends up in the index).
Note that the regexp syntax is not “normal regexp” (POSIX/PCRE) but a “java pattern”, i.e. java.util.regex.Pattern.
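If you want to sanity-check a filter before starting a crawl, here’s a minimal sketch using java.util.regex.Pattern; the combined pattern and the URLs are just illustrative, not anything YaCy-specific:

```java
import java.util.regex.Pattern;

public class CrawlFilterTest {
    public static void main(String[] args) {
        // Illustrative must-not-match pattern covering ISOs and common archives
        Pattern skip = Pattern.compile(".*\\.iso|.*\\.tar\\.gz|.*\\.zip");

        String[] urls = {
            "https://mirror.example.org/iso/archlinux-x86_64.iso",  // hypothetical mirror URL
            "https://mirror.example.org/pkg/foo-1.0.tar.gz",
            "https://wiki.archlinux.org/title/Rsync"
        };

        for (String url : urls) {
            // matches() requires the WHOLE string to match, which is why
            // each alternative starts with ".*"
            boolean skipped = skip.matcher(url).matches();
            System.out.println((skipped ? "skip:  " : "crawl: ") + url);
        }
    }
}
```

Running this should print “skip:” for the first two URLs and “crawl:” for the wiki page; if a URL you expect to be excluded prints “crawl:”, the pattern needs adjusting before you feed it to the crawler.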