Hardware for decent size node

Sure. The main thing I changed wasn’t CPU, it was I/O discipline.

  • Disk latency matters more than raw horsepower. Once random I/O is under control, PPM rises naturally.

  • I’m not using pure mmap for Solr; a small in-RAM cache with controlled merges keeps write amplification down.

  • Crawl-delay is conservative; the gain comes from reducing fan-in (fewer simultaneous hosts / loader threads), not from being aggressive per host.

  • An overly long crawl list also performs poorly.

  • Network stability mattered too — fixing router/NAT stalls stopped LoaderDispatcher waits.

  • I also filter early: Quad9 DNS + Pi-hole block a lot of known junk, trackers, and dead domains before they ever hit the crawler.

That’s why the burn rate stays low (~55 GB/day) even when PPM looks high — less churn, fewer pointless writes, more useful pages.
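
To put the ~55 GB/day figure in perspective, here is a rough back-of-the-envelope calculation in Python. The 600 TBW endurance rating is a made-up example value, not a number from my drives, so check your own datasheet. It also only counts host writes, not whatever write amplification the SSD adds internally.

```python
def yearly_writes_tb(gb_per_day: float) -> float:
    """Convert a daily write volume in GB to TB per year (decimal units, 365 days)."""
    return gb_per_day * 365 / 1000


def years_until_tbw(gb_per_day: float, rated_tbw: float) -> float:
    """Years of writing at this rate before reaching a drive's rated TBW."""
    return rated_tbw / yearly_writes_tb(gb_per_day)


if __name__ == "__main__":
    burn = 55.0        # GB/day, the burn rate quoted above
    rated_tbw = 600.0  # hypothetical endurance rating, NOT a measured value
    print(f"{yearly_writes_tb(burn):.1f} TB written per year")
    print(f"{years_until_tbw(burn, rated_tbw):.1f} years to reach {rated_tbw:.0f} TBW")
```

At 55 GB/day that works out to roughly 20 TB of writes per year.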

Startup config (Windows – start.smoke.bat)

This is the Java startup block I’m using atm:

:STARTJAVA
set javacmd=%javacmd% ^
 -Djava.awt.headless=true ^
 -Dfile.encoding=UTF-8 ^
 -Djava.io.tmpdir=R:\agent-smoke\tmp ^
 -Dsolr.directoryFactory=solr.NRTCachingDirectoryFactory ^
 -Dsolr.nrtCachingDirectoryFactory.maxMergeSizeMB=1024 ^
 -Dsolr.nrtCachingDirectoryFactory.maxCacheMB=8192 ^
 -Dsun.zip.disableMemoryMapping=true ^
 -Dsun.nio.fs.disableFastCopy=true ^
 -Xms8g ^
 -Xmx64g ^
 -XX:MaxMetaspaceSize=1024m ^
 -XX:+UseG1GC ^
 -XX:MaxGCPauseMillis=200 ^
 -XX:InitiatingHeapOccupancyPercent=30 ^
 -XX:+ParallelRefProcEnabled

This setup prioritises low write amplification and stable merges over raw aggression, which keeps disk burn predictable while sustaining higher PPM.


Auto-blacklisting slow / abusive domains

I’m also using a simple log-driven auto-blacklister that watches for crawl-delay offenders and appends them to the crawler blacklist.
The downside is that you have to restart YaCy and start the crawl again.

It’s based on this thread:

https://community.searchlab.eu/t/auto-blacklisting-crawl-delay-offenders/3361

The idea is to drop domains that repeatedly force long crawl-delays (login portals, infinite catalogs, bot-traps) so they stop consuming loader slots and causing disk churn. This helps keep PPM high while reducing pointless write amplification.

Combined with DNS filtering (Quad9 + Pi-hole), it removes a lot of junk before it ever reaches Solr.
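
For anyone who wants the general shape of a log-driven blacklister, here is a minimal Python sketch. It is not the script from the thread or the repo. The log line shape in `OFFENDER_RE`, the thresholds `min_delay_ms` and `min_hits`, and the `host/.*` entry format are all assumptions for illustration, so adapt them to what your own logs and blacklist files actually look like.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical line shape, NOT YaCy's real log format: adapt the regex to your logs.
OFFENDER_RE = re.compile(r"crawl-delay host=(?P<host>[\w.-]+) delay=(?P<ms>\d+)")


def find_offenders(lines, min_delay_ms=10_000, min_hits=5):
    """Return hosts that logged a delay >= min_delay_ms at least min_hits times."""
    hits = Counter()
    for line in lines:
        m = OFFENDER_RE.search(line)
        if m and int(m.group("ms")) >= min_delay_ms:
            hits[m.group("host").lower()] += 1
    return sorted(host for host, n in hits.items() if n >= min_hits)


def append_to_blacklist(blacklist_path, hosts):
    """Append hosts that are not already listed; return the entries that were added."""
    path = Path(blacklist_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = set(text.splitlines())
    # Assumed entry shape "host/.*" (host plus a path regex); verify against your own file.
    new_entries = [f"{h}/.*" for h in hosts if f"{h}/.*" not in existing]
    if new_entries:
        with path.open("a", encoding="utf-8") as fh:
            if text and not text.endswith("\n"):
                fh.write("\n")  # avoid gluing onto an unterminated last line
            fh.write("".join(entry + "\n" for entry in new_entries))
    return new_entries
```

You would feed `find_offenders` the lines of a crawler log and pass the result to `append_to_blacklist`. As noted above, the new entries only take effect after a restart and a fresh crawl start.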

I’ve put the working setup and related changes here:

https://github.com/smokingwheels/yacy_smoke

I’ve made some changes to YaCy purely as an experiment.

Hope that helps.
