Architecture Decision Records, some YaCy design decisions

thk · 22 January 2025 17:58

I’ve recently learned about architecture decision records (ADRs). For a small project they don’t need to be complex or follow a specific template. In fact, any format will do as long as it documents why a specific decision was taken.

Many projects actually use commit messages to document design decisions.

For YaCy it could be helpful to have a folder in the code repository with small text documents, one for every design decision.

Just today I discovered the paper Description of the YaCy Distributed Web Search Engine.

The paper documents some design decisions, and I’m very curious how they were made and whether they still hold today:

The reverse word index (RWI) gets distributed.

Advantages (maybe):

Less burden on crawled sites (negligible today?)
Less processing (parsing) (negligible today?)

Disadvantages (maybe):

requires trust in other peers
uses local storage for data from others
is complex
stores data locally that might never get used

Alternatives (maybe):

Every node only stores data from own crawls
If redundancy is desired, crawls are repeated locally
- This also provides multiple snapshots, in case nodes also serve archives.
Search queries get sent to (trusted) online peers
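
A minimal sketch of this alternative, to make the idea concrete. It assumes a toy in-memory word index and direct method calls instead of network requests; the `Node` class and all of its names are hypothetical and not taken from YaCy.

```python
from collections import defaultdict


class Node:
    """Hypothetical peer that indexes only its own crawls."""

    def __init__(self, name):
        self.name = name
        self.index = defaultdict(set)  # word -> URLs crawled by this node
        self.trusted_peers = []        # other Node objects this node trusts

    def add_crawl(self, url, text):
        # Index every word of a locally crawled page.
        for word in text.lower().split():
            self.index[word].add(url)

    def local_search(self, word):
        # Return a copy so callers cannot modify the index.
        return set(self.index.get(word.lower(), set()))

    def search(self, word):
        # Merge the local hits with the answers of all trusted peers.
        hits = self.local_search(word)
        for peer in self.trusted_peers:
            hits |= peer.local_search(word)
        return hits


a, b = Node("a"), Node("b")
a.add_crawl("https://example.org/1", "distributed search engine")
b.add_crawl("https://example.net/2", "peer to peer search")
a.trusted_peers.append(b)
result = a.search("search")
```

Here node `a` never stores anything from `b`; it only asks `b` at query time, so the trust decision reduces to who is on the `trusted_peers` list.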

Peer Hash Position

If nodes don’t exchange indexes and queries get forwarded to all reachable peers, then there is no need for positioning nodes relative to others. Thus any node could just choose a public-private key pair as identifier.

Query peer selection

Sending a query only to a subset of nodes might save remote nodes a few CPU cycles, but:

A remote node might have recently crawled data that it has not yet shared via the hash table
Every interaction is a chance to measure response time and trustworthiness, and it replaces a ping
By sending more queries but skipping the pre-selection and hash sharing, the overall system might actually save CPU cycles
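
The point about measuring response time with every query can be sketched roughly as follows. This is only an illustration under assumptions: peers are modelled as plain Python callables, and `query_all` and the example peers are hypothetical names, not part of YaCy.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_all(peers, word):
    """Send `word` to every peer; return (merged hits, latency per peer)."""

    def ask(item):
        name, peer = item
        start = time.perf_counter()
        try:
            urls = set(peer(word))
        except Exception:
            # A failing peer contributes no hits and gets no latency value.
            return name, set(), None
        return name, urls, time.perf_counter() - start

    hits, latency = set(), {}
    with ThreadPoolExecutor(max_workers=max(1, len(peers))) as pool:
        for name, urls, seconds in pool.map(ask, peers.items()):
            hits |= urls
            latency[name] = seconds
    return hits, latency


def fast_peer(word):
    return {"https://example.org/" + word}


def slow_peer(word):
    time.sleep(0.05)  # simulate a slow network round trip
    return {"https://example.net/" + word}


def dead_peer(word):
    raise ConnectionError("peer offline")


peers = {"fast": fast_peer, "slow": slow_peer, "dead": dead_peer}
hits, latency = query_all(peers, "yacy")
```

The `latency` dictionary doubles as the ping result: a value of `None` marks a peer that did not answer, and the measured times could feed whatever ranking of peers a node keeps.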