Smaller Indexes

Hi, thanks for YaCy.

I tried to use YaCy a number of times. One thing I noticed is that cached pages and indexes take a good chunk of disk space… Is it possible to generate, maintain, use, and exchange always-compressed indexes?

When adding domains to crawl, you can try:
Advanced Crawler > Expert Crawl Start > Document Cache
and uncheck “Store to Web Cache”

Maybe also filter image files out, since they can be much larger than plain HTML documents.

  1. Compressed indexes would be a good thing, as index exchange and maintenance would be more efficient… As for the Web Cache, why not compress it as well? Nowadays computers have hardware compression support, and it is even possible to DML a database in Java in compressed mode (random access into zip and 7zip archives…); see the sketch after this list.
  2. Can YaCy index images like Google does? If so, YaCy could also have an option to lossily reduce the size of the stored images.
  3. The use of the Web Cache could be an extra good thing as a way to have a WayBackMachine of sorts: save a page or website and make it available offline, versioned by time (does it do that already?)… This is important nowadays, as web history tends to vanish while some try to rewrite the past. In that case the Web Cache, stored as content differentials (one per version), could be exchanged compressed as well…
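
To make point 1 more concrete, here is a rough sketch (not YaCy's actual cache code; class and method names are made up for illustration) of how cached page bodies could be stored gzip-compressed with nothing but the JDK's built-in streams:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: store and load web-cache entries gzip-compressed on disk.
// Names are illustrative only; YaCy's real cache code works differently.
public class CompressedCacheEntry {

    // Write the raw page bytes as <hash>.gz instead of storing them uncompressed.
    static void store(Path cacheDir, String urlHash, byte[] pageBytes) throws IOException {
        Path file = cacheDir.resolve(urlHash + ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(pageBytes);
        }
    }

    // Read a cached entry back, decompressing on the fly.
    static byte[] load(Path cacheDir, String urlHash) throws IOException {
        Path file = cacheDir.resolve(urlHash + ".gz");
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("webcache");
        byte[] html = "<html><body>example page</body></html>".getBytes(StandardCharsets.UTF_8);
        store(dir, "AAAAAAAAAAAA", html);              // compress on write
        byte[] roundTrip = load(dir, "AAAAAAAAAAAA");  // decompress on read
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```

HTML typically shrinks several-fold under gzip, and since web servers already gzip responses over HTTP, the extra CPU cost on write and read should be modest.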

Just some suggestions, thanks.

Just want to add that if you want to save space by skipping images entirely: in Expert Crawl Start, go to the Index Attributes section and uncheck “Index Media”.

I agree very much with #3. It is one of YaCy’s greatest strengths in this age of censorship. One can also browse the web even if an internet connection is not immediately available.

Hi, about #3: I don’t think YaCy has anything like the WayBackMachine functionality… so that WOULD be a good new feature. Again, the compression question that no one has answered would be very important…
About compression of indexes: I found out that Lucene/Solr has some index compression options, which I guess is upstream of YaCy, but could still be used; I’m just not sure how. Media such as pictures are an important part that should be indexed and searchable… there are lossy compression methods that could save a lot of space…
One can always use an NTFS compressed folder, or squashfs, ZFS, etc., to compress the data folder, but I was wondering if a solution more native to YaCy is possible.
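
For what it’s worth, the Solr-level option seems to look roughly like this. A minimal sketch, assuming YaCy’s embedded Solr core reads a standard solrconfig.xml (where exactly that file lives inside a YaCy installation is a guess on my part):

```xml
<!-- solrconfig.xml: switch stored-field compression from the default BEST_SPEED -->
<codecFactory class="solr.SchemaCodecFactory">
  <str name="compressionMode">BEST_COMPRESSION</str>
</codecFactory>
```

BEST_COMPRESSION trades some indexing and retrieval CPU for noticeably smaller stored fields, and it only applies to newly written segments, so existing data would need to be re-indexed or merged before it shrinks.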

I don’t believe that real time compression is a feature yet. You can try opening an issue on YaCy’s git to request it.

@#3 WayBackMachine functionality might be implementable, but it would only be “usable” if 1) every Yacy peer remains online indefinitely/uninterrupted and 2) all Yacy peers contain, in aggregate, the index of the entire WWW. Chances of both 1 & 2 happening anytime soon are low.

The analogy to Wayback was made in unt’s post:

The use of the Web Cache could be an extra good thing as a way to have a WayBackMachine

Locally cached pages are similar to archive.org’s archive in that they are snapshots of a site that you can view without having to connect to the site. If you set up your crawler to cache all pages and reject DHT from peers, you essentially have your own web archive.

I’ve been using YaCy on and off for a while, and optimizing the index is probably an area where we could get a decent bit of improvement. A lot of the extra compute cost of compressing the index at rest can be alleviated by using Bloom filters in the indexing. Bloom filters can be near-optimally dense and could allow you to decompress only the needed portions of the index (with an adjustable false-positive chance). Additionally, by using something like NGD (normalized Google distance) on the YaCy indices, you could use a semantic hashing algorithm for your Bloom filter, which would mean even false positives are likely to be semantically relevant.
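
Roughly what I mean by gating decompression with a Bloom filter; a generic sketch (not YaCy’s actual index structures), where each compressed index segment keeps a small filter and is only decompressed when the filter says a query term might be present:

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Generic Bloom filter sketch: answers "possibly present" or "definitely absent".
// Not YaCy code; just illustrates how a per-segment filter could gate decompression.
public class BloomFilter {
    private final BitSet bits;
    private final int size;    // number of bits in the filter
    private final int hashes;  // number of hash functions k

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(byte[] data, int i) {
        int h1 = mix(data, 0x9747b28c);
        int h2 = mix(data, 0x85ebca6b);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Simple FNV-style mixing; good enough for a sketch.
    private static int mix(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= b & 0xff;
            h *= 0x01000193;
        }
        return h;
    }

    public void add(String word) {
        byte[] d = word.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashes; i++) bits.set(index(d, i));
    }

    public boolean mightContain(String word) {
        byte[] d = word.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(d, i))) return false;  // definitely not present
        }
        return true;                                   // possibly present
    }

    public static void main(String[] args) {
        BloomFilter segmentFilter = new BloomFilter(1 << 16, 4);
        segmentFilter.add("compression");
        segmentFilter.add("yacy");
        // Only decompress the segment if the filter says the term might be there.
        System.out.println(segmentFilter.mightContain("compression")); // true
        System.out.println(segmentFilter.mightContain("wayback"));     // false (almost surely)
    }
}
```

A false positive only costs one wasted decompression, a true negative skips the segment entirely, and the filter itself needs just a few bits per entry.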
