Auto-blacklisting crawl-delay offenders

Hi all,

I’ve been experimenting with a non-invasive way to reduce crawler stalls caused by hosts that repeatedly enforce crawl-delays, without touching YaCy core logic (HostQueue, CrawlQueues, etc.).

Instead of patching Java, this approach observes YaCy’s own log output and feeds the result back into the existing blacklist mechanism.


Motivation

When YaCy encounters repeated messages like:

HostQueue * forcing crawl-delay of 85 milliseconds for www.example.com

the crawler can end up spending a disproportionate amount of time idling on a small set of slow or defensive hosts.
This is not aggressive crawling; quite the opposite: the idea is to back off from such hosts early and let the rest of the queue proceed.


Design goals

  • :white_check_mark: No Java changes

  • :white_check_mark: Fully reversible

  • :white_check_mark: Respects robots.txt

  • :white_check_mark: Uses YaCy’s existing blacklist engine

  • :white_check_mark: Can be turned off instantly

  • :white_check_mark: Reputation-safe


How it works

  1. A small shell script scans yacy00.log

  2. Extracts hosts for which YaCy already enforced a crawl-delay

  3. Checks whether the enforced delay exceeds a threshold (e.g. 50 ms)

  4. If it does, adds the host to a dedicated blacklist file

Example blacklist file:

DATA/LISTS/autoblack.crawldelay.black

This keeps manual blacklists and auto-generated rules clearly separated.


Example script (simplified)

#!/bin/bash

YACY_LOG="DATA/LOG/yacy00.log"
BLACKLIST="DATA/LISTS/autoblack.crawldelay.black"
CRAWL_DELAY_THRESHOLD=50   # milliseconds

touch "$BLACKLIST"

grep "forcing crawl-delay" "$YACY_LOG" | while read -r line; do
    # Extract the enforced delay (in ms) and the host from the log line.
    delay=$(sed -n 's/.*forcing crawl-delay of \([0-9]\+\).*/\1/p' <<< "$line")
    host=$(sed -n 's/.* for \([^: ]\+\).*/\1/p' <<< "$line")

    # Skip malformed lines.
    [[ -z "$delay" || ! "$delay" =~ ^[0-9]+$ ]] && continue
    [[ -z "$host" ]] && continue
    # Never blacklist bare IPv4 addresses.
    [[ "$host" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]] && continue

    if (( delay >= CRAWL_DELAY_THRESHOLD )); then
        rule="$host/*"
        # Append each host only once.
        if ! grep -Fxq "$rule" "$BLACKLIST"; then
            echo "# auto-blacklisted: crawl-delay ${delay}ms" >> "$BLACKLIST"
            echo "$rule" >> "$BLACKLIST"
        fi
    fi
done


Then restart YaCy so the updated blacklist takes effect.

Scheduling (recommended)

Run every 5 minutes via cron:

*/5 * * * * flock -n /tmp/autoblack.lock /path/to/autoblack.sh

This prevents overlapping runs.


Why not integrate into HostQueue?

That may be worth discussing later, but keeping this external has advantages:

  • avoids fork maintenance

  • avoids accidental global behavior changes

  • easier to test, tune, or discard

  • keeps YaCy’s reputation logic conservative

This behaves more like adaptive politeness, not aggression.


Feedback welcome

I’m interested in:

  • whether others see similar crawl-delay bottlenecks

  • thoughts on a sensible threshold value

  • ideas for optional expiry (e.g. auto-remove after N days)

  • whether this should stay external or become a configurable feature later on
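
On the expiry idea, here is one possible shape, as a sketch only: it assumes the generator script were extended to append an ISO datestamp to each comment line (the current script does not do this), and it assumes the blacklist holds strict comment/rule line pairs.

```shell
#!/bin/bash
# Hypothetical expiry sketch. Assumes entries look like:
#   # auto-blacklisted: crawl-delay 85ms 2026-01-09
#   www.example.com/*
# Pairs whose datestamp is older than the cutoff are dropped together.
expire_blacklist() {
    local file="$1" cutoff="$2" tmp
    tmp=$(mktemp)
    # Read comment/rule pairs; keep only those stamped on or after the cutoff.
    while IFS= read -r comment && IFS= read -r rule; do
        local stamp=${comment##* }          # last field = ISO date
        if ! [[ "$stamp" < "$cutoff" ]]; then
            printf '%s\n%s\n' "$comment" "$rule" >> "$tmp"
        fi
    done < "$file"
    mv "$tmp" "$file"
}

# Example: drop entries older than 14 days (GNU date syntax).
BLACKLIST="DATA/LISTS/autoblack.crawldelay.black"
if [ -f "$BLACKLIST" ]; then
    expire_blacklist "$BLACKLIST" "$(date -d '-14 days' +%Y-%m-%d)"
fi
```

ISO dates compare correctly as plain strings, which keeps the check dependency-free; the pairwise read is fragile if the file format drifts, so a real version would want validation.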


DNS / Crawl Stability Notes (Unbound vs Pi-hole, RAM queue test)

I normally run Unbound directly, but I did some testing with Pi-hole in front of it. In my setup, I’m using a cron job to keep DNS pressure under control.

I have a crontab entry running autoblack.sh, and I restart Unbound every minute.
Surprisingly, this does help under heavy crawl load — fewer DNS stalls and fewer long crawl-delay escalations.

I’m also currently testing a RAM-backed queue directory (tmpfs) for the crawler queues to reduce disk contention.
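
For the tmpfs experiment, a minimal setup might look like the following. The queue path is an assumption (point it at wherever your instance keeps its crawler queues), and YaCy should be stopped before remounting:

```shell
# Stop YaCy first. The path below is hypothetical -- adjust to your install.
sudo mount -t tmpfs -o size=2G,mode=0750 tmpfs /opt/yacy/DATA/QUEUES

# Equivalent /etc/fstab line to survive reboots:
# tmpfs  /opt/yacy/DATA/QUEUES  tmpfs  size=2G,mode=0750  0  0
```

Keep in mind tmpfs contents vanish on unmount or reboot, so anything queued there is lost unless synced back to disk.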

Current test conditions:

  • ~100k URLs in queue

  • ~2k PPM crawl rate (until a robots.txt crawl-delay of ~10000 ms kicks in)

  • DNS stability noticeably improved with frequent Unbound restarts

  • Queue latency drops when running on RAM vs disk

This is still experimental, but early results look promising, especially under sustained high-PPM crawling. I’ll report back once I’ve got longer-run numbers.

If anyone else is stress-testing DNS + crawl pipelines at this scale, I’d be keen to compare notes.

I updated agent-smokingwheels with my blocklists from agent-smoke and found this site:

I 2026/01/09 10:55:57 REJECTED Kompetenzstelle für nachhaltige Beschaffung - Datenschutz - url in blacklist

Why this is NOT a bug

:one: JSESSIONID in path is a known crawler hazard

This site is doing classic Java servlet session tracking via path parameters:

page.html;jsessionid=ABC123.internet612

That creates:

  • Infinite URL variants

  • Same content, different URL every visit

  • URLDB explosion if not blocked

Blocking these is standard crawler hygiene.
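
To see why this counts as a hazard, a tiny illustration (made-up URLs): the same page handed out under two different session IDs collapses to one URL once the path parameter is stripped.

```shell
#!/bin/sh
# Two visits to the same page, each with a fresh session ID in the path.
url1="https://example.org/page.html;jsessionid=ABC123.internet612"
url2="https://example.org/page.html;jsessionid=XYZ789.internet612"

# Strip a ;jsessionid=... path parameter (case-insensitive, POSIX sed).
normalize() {
    printf '%s\n' "$1" | sed 's/;[jJ][sS][eE][sS][sS][iI][oO][nN][iI][dD]=[^/?#]*//'
}

normalize "$url1"   # -> https://example.org/page.html
normalize "$url2"   # -> same URL: a duplicate, not new content
```

A crawler without this normalization (or a blacklist rule) sees a brand-new URL on every visit.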

:two: Silent multipliers you are STILL exposed to

These are not large files, and not blocked by crawler.ignore.filetype.

They create millions of unique URLs with identical content.

:fire: Tier-1 multipliers (critical)

You caught one of these already, but not all:

Multiplier         | Example             | Effect
Path session IDs   | ;jsessionid=ABC123  | Infinite URL variants
PHP sessions       | ?PHPSESSID=xyz      | Same
Generic sid=       | ?sid=123            | Same
ASP.NET            | ?ASPSESSIONID=…     | Same

You must block all of them explicitly.


:fire: Tier-2 multipliers (tracking & marketing)

These don’t look dangerous, but they multiply crawl space silently:

  • utm_*

  • fbclid

  • gclid

  • msclkid

  • yclid

  • ref=

  • source=

A page with 5 tracking variants = 5 URLs, same content.

At scale, this is brutal.
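
The multiplication is easy to demonstrate with shell brace expansion (the URL and parameter values are made up):

```shell
#!/bin/bash
# One page, two tracking parameters, three values each: brace expansion
# enumerates the distinct URLs a crawler would see for identical content.
printf '%s\n' 'https://example.org/page?utm_source='{a,b,c}'&fbclid='{x,y,z} | wc -l
# -> 9  (one page, nine URL variants)
```

Every additional parameter multiplies the count again, which is exactly how one page becomes thousands of queue entries.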


:fire: Tier-3 multipliers (CMS / calendar traps)

Very common on European government + NGO sites (like the one you just hit):

  • ?page=1, ?page=2, ?page=99999

  • ?month=, ?year=

  • ?view=print

  • ?lang=de, ?lang=en (sometimes OK, sometimes explosive)

These are why you hit 25M links in 10 days.


:three: High-impact rules you should add NOW

Add these to your URL blacklist (global or crawler).
These are safe, battle-tested, and high-return.

:white_check_mark: Session ID killers (mandatory)

(?i);jsessionid=
(?i)[?&](jsessionid|phpsessid|aspsessionid|sid|sessionid)=

:white_check_mark: Tracking parameter killers

(?i)[?&](utm_[^=]+|fbclid|gclid|msclkid|yclid|ref|source)=

:white_check_mark: Calendar / pagination traps

(?i)[?&](page|p|start|offset|year|month|day|calendar)=\d+

:white_check_mark: Print / duplicate views

(?i)[?&](print|view)=

:white_check_mark: Download wrapper traps

(?i)/(download|dl|file|getfile)\.php\?

These do not reduce content quality — they reduce duplication.
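
Before deploying rules like these, it is worth sanity-checking them against sample URLs. The patterns use PCRE-style (?i), so GNU grep -P can evaluate them; all URLs below are made up, and this is a sketch, not the exact rule set you end up with:

```shell
#!/bin/bash
# Sanity-check blacklist regexes against sample URLs (requires GNU grep -P).
patterns=(
  '(?i);jsessionid='
  '(?i)[?&](jsessionid|phpsessid|aspsessionid|sid|sessionid)='
  '(?i)[?&](utm_[^=]+|fbclid|gclid|msclkid|yclid|ref|source)='
  '(?i)[?&](page|p|start|offset|year|month|day|calendar)=\d+'
)

urls=(
  'https://example.org/page.html;jsessionid=ABC123'
  'https://example.org/index.php?PHPSESSID=xyz'
  'https://example.org/article?utm_source=newsletter'
  'https://example.org/events?month=12'
  'https://example.org/plain/page.html'
)

for url in "${urls[@]}"; do
  hit="no match"
  for re in "${patterns[@]}"; do
    if grep -qP "$re" <<< "$url"; then
      hit="blocked by $re"
      break
    fi
  done
  printf '%-50s %s\n' "$url" "$hit"   # the last URL should report "no match"
done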


:four: Why this changes everything (numbers you’ll feel)

Based on your measured history:

  • agent-samsung-t7: ~86 GB/day

  • agent-smoke: ~33 GB/day

Blocking silent multipliers typically yields:

  • 30–60% URLDB reduction

  • 25–50% disk growth reduction

  • dramatically fewer retries

  • lower SSD write amplification

Well, I will clean my index on agent-smoke and try again.

Yeah, that’s the reason I suggested this feature: to strip out some URL parameters in the crawler.

But take a look at the “canonical only” option in the advanced crawler: some sites implement the “canonical” meta tag, which declares the “official” URL of a page, and you can choose not to index pages whose canonical URL differs from the real URL. Check the “No Indexing when Canonical present and Canonical != URL” tick in the Document Filter section.
You can also easily limit dynamic pages (URLs containing ?).