Sure. The main thing I changed wasn’t CPU, it was I/O discipline.
-
Disk latency matters more than raw horsepower. Once random I/O is under control, PPM (pages per minute) rises naturally.
-
I’m not using pure mmap for Solr; a small in-RAM cache with controlled merges keeps write amplification down.
-
Crawl-delay is conservative; the gain comes from reducing fan-in (fewer simultaneous hosts / loader threads), not being aggressive per host.
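To illustrate what I mean by reducing fan-in, here is a minimal Python sketch that caps how many loaders run at once. It is not YaCy code: `MAX_ACTIVE_LOADERS`, `crawl` and the `fetch` callback are hypothetical names, and the sleep just stands in for a real page fetch.

```python
import threading
import time

MAX_ACTIVE_LOADERS = 4  # hypothetical cap for illustration, not a YaCy setting


def crawl(hosts, fetch, max_active=MAX_ACTIVE_LOADERS):
    """Call fetch(host) for every host, with at most max_active running at once.

    Returns the peak number of loaders that were active simultaneously.
    """
    gate = threading.BoundedSemaphore(max_active)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(host):
        nonlocal active, peak
        with gate:  # blocks while max_active loaders are already busy
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                fetch(host)
            finally:
                with lock:
                    active -= 1

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    hosts = [f"host{i}.example" for i in range(20)]  # placeholder host names
    peak = crawl(hosts, lambda host: time.sleep(0.01))  # sleep stands in for a fetch
    print("peak concurrent loaders:", peak)
```

However many hosts are queued, the peak never exceeds the cap, so the disk sees a bounded number of concurrent writers.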
-
A crawl list that is too long also performs poorly.
-
Network stability mattered too — fixing router/NAT stalls stopped LoaderDispatcher waits.
-
I also filter early: Quad9 DNS + Pi-hole block a lot of known junk, trackers, and dead domains before they ever hit the crawler.
That’s why the burn rate stays low (~55 GB/day) even when PPM looks high — less churn, fewer pointless writes, more useful pages.
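To put the ~55 GB/day figure in perspective, this short Python sketch converts it to an average write rate and a yearly total. It assumes decimal gigabytes (the figure could equally be GiB), and the 1000 PPM used for the per-page estimate is a made-up example value, not a measurement from my node.

```python
GB = 10**9  # assumption: decimal gigabytes


def burn_rate(gb_per_day):
    """Convert a daily write volume into an average MB/s and a TB/year total."""
    bytes_per_day = gb_per_day * GB
    return {
        "mb_per_second": bytes_per_day / 86_400 / 10**6,
        "tb_per_year": bytes_per_day * 365 / 10**12,
    }


def kb_per_page(gb_per_day, ppm):
    """Average KB written per crawled page at a sustained pages-per-minute rate."""
    pages_per_day = ppm * 60 * 24
    return gb_per_day * GB / pages_per_day / 10**3


if __name__ == "__main__":
    rate = burn_rate(55)
    print(f"{rate['mb_per_second']:.2f} MB/s average")
    print(f"{rate['tb_per_year']:.1f} TB/year")
    # 1000 PPM is a hypothetical value used only to show the calculation.
    print(f"{kb_per_page(55, 1000):.1f} KB written per page")
```

Averaged over a day that is well under 1 MB/s, which is why the disk stays comfortable as long as the writes are not scattered random I/O.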
Startup config (Windows – start.smoke.bat)
This is the Java startup block I’m using at the moment:
:STARTJAVA
set javacmd=%javacmd% ^
-Djava.awt.headless=true ^
-Dfile.encoding=UTF-8 ^
-Djava.io.tmpdir=R:\agent-smoke\tmp ^
-Dsolr.directoryFactory=solr.NRTCachingDirectoryFactory ^
-Dsolr.nrtCachingDirectoryFactory.maxMergeSizeMB=1024 ^
-Dsolr.nrtCachingDirectoryFactory.maxCacheMB=8192 ^
-Dsun.zip.disableMemoryMapping=true ^
-Dsun.nio.fs.disableFastCopy=true ^
-Xms8g ^
-Xmx64g ^
-XX:MaxMetaspaceSize=1024m ^
-XX:+UseG1GC ^
-XX:MaxGCPauseMillis=200 ^
-XX:InitiatingHeapOccupancyPercent=30 ^
-XX:+ParallelRefProcEnabled
This setup prioritises low write amplification and stable merges over raw aggression, which keeps disk burn predictable while sustaining higher PPM.
Auto-blacklisting slow / abusive domains
I’m also using a simple log-driven auto-blacklister that watches crawl-delay offenders and appends them to the crawler blacklist.
The downside is that you have to restart YaCy and start the crawl again.
It’s based on this thread:
https://community.searchlab.eu/t/auto-blacklisting-crawl-delay-offenders/3361
The idea is to drop domains that repeatedly force long crawl-delays (login portals, infinite catalogs, bot-traps) so they stop consuming loader slots and causing disk churn. This helps keep PPM high while reducing pointless write amplification.
Combined with DNS filtering (Quad9 + Pi-hole), it removes a lot of junk before it ever reaches Solr.
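For anyone who wants the gist without reading the whole thread, here is a rough Python sketch of the log-driven idea. It is not the script from the thread or from my repo. The log pattern in `OFFENDER_RE`, the threshold, and the `host/.*` blacklist entry format are all assumptions, so check them against your own log lines and blacklist file before using anything like this.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical pattern: replace it with one that matches the crawl-delay
# lines your own log actually contains.
OFFENDER_RE = re.compile(r"crawl-delay.*?host=(?P<host>[A-Za-z0-9.-]+)", re.IGNORECASE)


def find_offenders(log_lines, threshold=5, pattern=OFFENDER_RE):
    """Return hosts that appear in at least `threshold` matching log lines."""
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group("host").lower()] += 1
    return sorted(host for host, n in counts.items() if n >= threshold)


def append_to_blacklist(blacklist_path, hosts):
    """Append `host/.*` entries that are not already in the file.

    The `host/.*` entry format is an assumption about the blacklist file.
    Returns the list of entries that were actually added.
    """
    path = Path(blacklist_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = set(text.splitlines())
    new_entries = [f"{h}/.*" for h in hosts if f"{h}/.*" not in existing]
    if new_entries:
        with path.open("a", encoding="utf-8") as out:
            if text and not text.endswith("\n"):
                out.write("\n")  # keep the first new entry on its own line
            for entry in new_entries:
                out.write(entry + "\n")
    return new_entries


if __name__ == "__main__":
    # Made-up sample lines, only to show the flow.
    sample = ["W crawl-delay exceeded host=slow.example delay=30000"] * 6
    sample += ["W crawl-delay exceeded host=ok.example delay=30000"]
    print(find_offenders(sample))
```

The new entries only take effect after the restart mentioned above, so it makes sense to run something like this between crawls rather than continuously.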
I’ve put the working setup and related changes here:
https://github.com/smokingwheels/yacy_smoke
I’ve made some changes to YaCy purely as an experiment.
Hope that helps.