I’ve recently learned about architecture decision records (ADR). For a small project they don’t need to be complex or follow a specific template. In fact any format can do as long as they document, why a specific decision has been taken.
Many projects actually use commit messages to document design decisions.
For YaCy it could be helpful to have a folder in the code with small text documents, one for every design decision.
Just today I first discoved the paper Description of the YaCy Distributed Web Search Engine.
There are some designs documented in the paper that I’m very curious about how they were made and whether they are actually still this way today:
The reversed word index (RWI) gets distributed.
Advantages (maybe):
- Less burden on crawled sites (negligible today?)
- Less processing (parsing) (negligible today?)
Disadvantes (maybe):
- requires trust in other peers
- uses local storage for data from others
- is complex
- stores data locally that might never get used
Alternatives (maybe):
- Every node only stores data from own crawls
- If redundancy is desired, crawls are repeated locally
- This also provides multiple snapshots, in case nodes also serve archives.
- Search queries gets sent to (trusted) online peers
Peer Hash Position
If nodes don’t exchange indexes and queries get forwarded to all reachable peers, then there is no need for positioning nodes relative to others. Thus any node could just choose a public-private key pair as identifier.
Query peer selection
Sending a query only to a subset of nodes might save remote nodes a few CPU cycles, but:
- A remote node might have recently crawled data not yet shared as hash table
- Every interaction is a chance to measure response time, trustworthiness and replaces a ping
- It might be that by sending more queries but saving the pre-selection and hash sharing, the overall system might actually save CPU cycles