Auto-blacklisting crawl-delay offenders

Hi all,

I’ve been experimenting with a non-invasive way to reduce crawler stalls caused by hosts that repeatedly enforce crawl-delays, without touching YaCy core logic (HostQueue, CrawlQueues, etc.).

Instead of patching Java, this approach observes YaCy’s own log output and feeds the result back into the existing blacklist mechanism.


Motivation

When YaCy encounters repeated messages like:

HostQueue * forcing crawl-delay of 85 milliseconds for www.example.com

the crawler can end up spending a disproportionate amount of time idling on a small set of slow or defensive hosts.
This is not aggressive crawling; quite the opposite: the idea is to back off from such hosts early and let the rest of the queue proceed.


Design goals

  • :white_check_mark: No Java changes

  • :white_check_mark: Fully reversible

  • :white_check_mark: Respects robots.txt

  • :white_check_mark: Uses YaCy’s existing blacklist engine

  • :white_check_mark: Can be turned off instantly

  • :white_check_mark: Reputation-safe


How it works

  1. A small shell script scans yacy00.log

  2. Extracts hosts for which YaCy already enforced a crawl-delay

  3. Checks whether the enforced delay exceeds a threshold (e.g. 50 ms)

  4. If it does, adds the host to a dedicated blacklist file

Example blacklist file:

DATA/LISTS/autoblack.crawldelay.black

This keeps manual blacklists and auto-generated rules clearly separated.


Example script (simplified)

#!/bin/bash

YACY_LOG="DATA/LOG/yacy00.log"
BLACKLIST="DATA/LISTS/autoblack.crawldelay.black"
CRAWL_DELAY_THRESHOLD=50   # milliseconds

touch "$BLACKLIST"

grep "forcing crawl-delay" "$YACY_LOG" | while read -r line; do
    # Extract the enforced delay (in ms) and the host from the log line.
    delay=$(sed -n 's/.*forcing crawl-delay of \([0-9]\+\).*/\1/p' <<< "$line")
    host=$(sed -n 's/.* for \([^: ]\+\).*/\1/p' <<< "$line")

    # Skip malformed lines.
    [[ -z "$delay" || ! "$delay" =~ ^[0-9]+$ ]] && continue
    [[ -z "$host" ]] && continue
    # Never blacklist bare IPv4 addresses.
    [[ "$host" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]] && continue

    if (( delay >= CRAWL_DELAY_THRESHOLD )); then
        rule="$host/*"
        # Append each host only once.
        if ! grep -Fxq "$rule" "$BLACKLIST"; then
            echo "# auto-blacklisted: crawl-delay ${delay}ms" >> "$BLACKLIST"
            echo "$rule" >> "$BLACKLIST"
        fi
    fi
done


Then restart YaCy so the updated blacklist takes effect.

Scheduling (recommended)

Run every 5 minutes via cron:

*/5 * * * * flock -n /tmp/autoblack.lock /path/to/autoblack.sh

This prevents overlapping runs.


Why not integrate into HostQueue?

That may be worth discussing later, but keeping this external has advantages:

  • avoids fork maintenance

  • avoids accidental global behavior changes

  • easier to test, tune, or discard

  • keeps YaCy’s reputation logic conservative

This behaves more like adaptive politeness, not aggression.


Feedback welcome

I’m interested in:

  • whether others see similar crawl-delay bottlenecks

  • thoughts on a sensible threshold value

  • ideas for optional expiry (e.g. auto-remove after N days)

  • whether this should stay external or become a configurable feature later on
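
On the expiry idea, here is one possible shape, as a sketch only: it assumes the generator script were extended to append an ISO datestamp to each comment line (the current script does not do this), and it assumes the blacklist holds strict comment/rule line pairs.

```shell
#!/bin/bash
# Hypothetical expiry sketch. Assumes entries look like:
#   # auto-blacklisted: crawl-delay 85ms 2026-01-09
#   www.example.com/*
# Pairs whose datestamp is older than the cutoff are dropped together.
expire_blacklist() {
    local file="$1" cutoff="$2" tmp
    tmp=$(mktemp)
    # Read comment/rule pairs; keep only those stamped on or after the cutoff.
    while IFS= read -r comment && IFS= read -r rule; do
        local stamp=${comment##* }          # last field = ISO date
        if ! [[ "$stamp" < "$cutoff" ]]; then
            printf '%s\n%s\n' "$comment" "$rule" >> "$tmp"
        fi
    done < "$file"
    mv "$tmp" "$file"
}

# Example: drop entries older than 14 days (GNU date syntax).
BLACKLIST="DATA/LISTS/autoblack.crawldelay.black"
if [ -f "$BLACKLIST" ]; then
    expire_blacklist "$BLACKLIST" "$(date -d '-14 days' +%Y-%m-%d)"
fi
```

ISO dates compare correctly as plain strings, which keeps the check dependency-free; the pairwise read is fragile if the file format drifts, so a real version would want validation.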


DNS / Crawl Stability Notes (Unbound vs Pi-hole, RAM queue test)

I normally run Unbound directly, but I did some testing with Pi-hole in front of it. In my setup, I’m using a cron job to keep DNS pressure under control.

I have a crontab entry running autoblack.sh, and I restart Unbound every minute.
Surprisingly, this does help under heavy crawl load — fewer DNS stalls and fewer long crawl-delay escalations.

I’m also currently testing a RAM-backed queue directory (tmpfs) for the crawler queues to reduce disk contention.
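
For the tmpfs experiment, a minimal setup might look like the following. The queue path is an assumption (point it at wherever your instance keeps its crawler queues), and YaCy should be stopped before remounting:

```shell
# Stop YaCy first. The path below is hypothetical -- adjust to your install.
sudo mount -t tmpfs -o size=2G,mode=0750 tmpfs /opt/yacy/DATA/QUEUES

# Equivalent /etc/fstab line to survive reboots:
# tmpfs  /opt/yacy/DATA/QUEUES  tmpfs  size=2G,mode=0750  0  0
```

Keep in mind tmpfs contents vanish on unmount or reboot, so anything queued there is lost unless synced back to disk.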

Current test conditions:

  • ~100k URLs in queue

  • ~2k PPM crawl rate (until a robots.txt crawl-delay of ~10000 ms kicks in)

  • DNS stability noticeably improved with frequent Unbound restarts

  • Queue latency drops when running on RAM vs disk

This is still experimental, but early results look promising, especially under sustained high-PPM crawling. I’ll report back once I’ve got longer-run numbers.

If anyone else is stress-testing DNS + crawl pipelines at this scale, I’d be keen to compare notes.

I updated agent-smokingwheels with my blocklists from agent-smoke and found this site:

I 2026/01/09 10:55:57 REJECTED Kompetenzstelle für nachhaltige Beschaffung - Datenschutz - url in blacklist

Why this is NOT a bug

:one: JSESSIONID in path is a known crawler hazard

This site is doing classic Java servlet session tracking via path parameters:

page.html;jsessionid=ABC123.internet612

That creates:

  • Infinite URL variants

  • Same content, different URL every visit

  • URLDB explosion if not blocked

Blocking these is standard crawler hygiene.
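
To see why this counts as a hazard, a tiny illustration (made-up URLs): the same page handed out under two different session IDs collapses to one URL once the path parameter is stripped.

```shell
#!/bin/sh
# Two visits to the same page, each with a fresh session ID in the path.
url1="https://example.org/page.html;jsessionid=ABC123.internet612"
url2="https://example.org/page.html;jsessionid=XYZ789.internet612"

# Strip a ;jsessionid=... path parameter (case-insensitive, POSIX sed).
normalize() {
    printf '%s\n' "$1" | sed 's/;[jJ][sS][eE][sS][sS][iI][oO][nN][iI][dD]=[^/?#]*//'
}

normalize "$url1"   # -> https://example.org/page.html
normalize "$url2"   # -> same URL: a duplicate, not new content
```

A crawler without this normalization (or a blacklist rule) sees a brand-new URL on every visit.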

:two: Silent multipliers you are STILL exposed to

These are not large files, and not blocked by crawler.ignore.filetype.

They create millions of unique URLs with identical content.

:fire: Tier-1 multipliers (critical)

You caught one of these already, but not all:

Multiplier         | Example             | Effect
Path session IDs   | ;jsessionid=ABC123  | Infinite URL variants
PHP sessions       | ?PHPSESSID=xyz      | Same
Generic sid=       | ?sid=123            | Same
ASP.NET            | ?ASPSESSIONID=…     | Same

You must block all of them explicitly.


:fire: Tier-2 multipliers (tracking & marketing)

These don’t look dangerous, but they multiply crawl space silently:

  • utm_*

  • fbclid

  • gclid

  • msclkid

  • yclid

  • ref=

  • source=

A page with 5 tracking variants = 5 URLs, same content.

At scale, this is brutal.
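
The multiplication is easy to demonstrate with shell brace expansion (the URL and parameter values are made up):

```shell
#!/bin/bash
# One page, two tracking parameters, three values each: brace expansion
# enumerates the distinct URLs a crawler would see for identical content.
printf '%s\n' 'https://example.org/page?utm_source='{a,b,c}'&fbclid='{x,y,z} | wc -l
# -> 9  (one page, nine URL variants)
```

Every additional parameter multiplies the count again, which is exactly how one page becomes thousands of queue entries.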


:fire: Tier-3 multipliers (CMS / calendar traps)

Very common on European government + NGO sites (like the one you just hit):

  • ?page=1, ?page=2, ?page=99999

  • ?month=, ?year=

  • ?view=print

  • ?lang=de, ?lang=en (sometimes OK, sometimes explosive)

These are why you hit 25M links in 10 days.


:three: High-impact rules you should add NOW

Add these to your URL blacklist (global or crawler).
These are safe, battle-tested, and high-return.

:white_check_mark: Session ID killers (mandatory)

(?i);jsessionid=
(?i)[?&](jsessionid|phpsessid|aspsessionid|sid|sessionid)=

:white_check_mark: Tracking parameter killers

(?i)[?&](utm_[^=]+|fbclid|gclid|msclkid|yclid|ref|source)=

:white_check_mark: Calendar / pagination traps

(?i)[?&](page|p|start|offset|year|month|day|calendar)=\d+

:white_check_mark: Print / duplicate views

(?i)[?&](print|view)=

:white_check_mark: Download wrapper traps

(?i)/(download|dl|file|getfile)\.php\?

These do not reduce content quality — they reduce duplication.
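
Before deploying rules like these, it is worth sanity-checking them against sample URLs. The patterns use PCRE-style (?i), so GNU grep -P can evaluate them; all URLs below are made up, and this is a sketch, not the exact rule set you end up with:

```shell
#!/bin/bash
# Sanity-check blacklist regexes against sample URLs (requires GNU grep -P).
patterns=(
  '(?i);jsessionid='
  '(?i)[?&](jsessionid|phpsessid|aspsessionid|sid|sessionid)='
  '(?i)[?&](utm_[^=]+|fbclid|gclid|msclkid|yclid|ref|source)='
  '(?i)[?&](page|p|start|offset|year|month|day|calendar)=\d+'
)

urls=(
  'https://example.org/page.html;jsessionid=ABC123'
  'https://example.org/index.php?PHPSESSID=xyz'
  'https://example.org/article?utm_source=newsletter'
  'https://example.org/events?month=12'
  'https://example.org/plain/page.html'
)

for url in "${urls[@]}"; do
  hit="no match"
  for re in "${patterns[@]}"; do
    if grep -qP "$re" <<< "$url"; then
      hit="blocked by $re"
      break
    fi
  done
  printf '%-50s %s\n' "$url" "$hit"   # the last URL should report "no match"
done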


:four: Why this changes everything (numbers you’ll feel)

Based on your measured history:

  • agent-samsung-t7: ~86 GB/day

  • agent-smoke: ~33 GB/day

Blocking silent multipliers typically yields:

  • 30–60% URLDB reduction

  • 25–50% disk growth reduction

  • dramatically fewer retries

  • lower SSD write amplification

Well, I will clean my index on agent-smoke and try again.

Yeah, that’s the reason I suggested this feature: to strip out some URL parameters in the crawler.

But take a look at the “canonical only” option in the advanced crawler: some sites implement the “canonical” meta tag, which declares the “official” URL of a page, and you can choose not to index pages whose canonical URL differs from the real URL. Check the “No Indexing when Canonical present and Canonical != URL” tick in the Document Filter section.
You can also easily limit dynamic pages (URLs containing ?).