Very Large YaCy Folder

After a new installation of YaCy, we ran a few crawls on some publications. This rapidly resulted in a very large folder (over 30GB).

YaCy is engineered so that large storage isn’t usually needed. Why would such a large folder result from about 20 crawls?

How big is your index? And how many sites are indexed? Do you cache the sites?


The index was about 32GB. I was using the default settings and tried running some crawls on quite large sites. I think I tried about 10 or more sites.

I wouldn’t mind caching the results if they could be of use to the rest of the YaCy users. I could see some improvement in my search results after those few crawls, which were still in progress when the allocated space filled up.

By the way, if I want to reinstall YaCy on a larger drive, is it as simple as copying a folder into the new installation, so I can avoid having to go through the crawling again?


TW1920 - thank you for asking.

I don’t know if we are caching the websites, but if that is the default setting, then we probably are. What are the benefits of caching the crawled websites? If we turned off caching, would the cache be automatically cleared? Could the cache benefit others, if we are going to delete it?

This is crawling about 12 sites to a depth of 4.

Looking at the YaCy folder there are a few gigantic files:

\YaCy\DATA\INDEX\freeworld\SEGMENTS\solr_6_6\collection1\data\index
\YaCy\DATA\INDEX\freeworld\SEGMENTS\default\text.index.20200703184026253.blob
\YaCy\DATA\INDEX\freeworld\SEGMENTS\default\text.index.20200704095302555.blob
And there is a large folder with lots of files:
\YaCy\DATA\HTCACHE\file.array

It is all over 300GB now, after just a couple of days of crawling.
This sort of storage usage might discourage new users. If the cache is the problem, perhaps the default setting ought to be changed so that pages are not cached.
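To see which parts of the DATA folder are actually taking the space, a quick script along these lines works (just a sketch; the install path is an example and will differ on your system):

```python
import os

# Example YaCy data directory; adjust to your own install location.
DATA_DIR = r"C:\YaCy\DATA"

def folder_size(path):
    """Sum the sizes of all files below `path`, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # skip files that disappear or can't be read while a crawl is running
    return total

# Report each top-level subfolder (INDEX, HTCACHE, ...), largest first.
sizes = {
    entry: folder_size(os.path.join(DATA_DIR, entry))
    for entry in os.listdir(DATA_DIR)
    if os.path.isdir(os.path.join(DATA_DIR, entry))
}
for entry, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entry:20s} {size / 1024**3:8.2f} GB")
```

That makes it easy to tell whether the bulk is in the Solr index under INDEX or in the page cache under HTCACHE.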

Regarding “quite large sites”: a crawl depth of 4 on a large site with many links could be a problem.

Personally, I limit the depth to 0 or 1 for large sites, and sometimes go a little higher for very small sites with few links; but if the indexing takes very long, I’ll halt it.

But I try to index for quality rather than quantity, and mostly like to know what I’m indexing.

If just one initial site has, say, 1000 links to additional sites that each have lots of links of their own, a crawl at depth 4 could result in billions of pages being indexed: 1000 links per level compounds to roughly 1000^4 URLs in the worst case.

Indexing 20 large sites at a depth of 4 could potentially result in indexing nearly the entire internet.
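A rough back-of-the-envelope calculation shows why. This is only a sketch with made-up numbers (an assumed 1000 new links per page), not a measurement of any real crawl:

```python
# Worst-case number of URLs reached at each crawl level, assuming every page
# links to `branching` previously unseen pages. Real crawls revisit pages and
# respect robots.txt, so actual numbers are lower, but the shape of the growth
# is the same.
branching = 1000   # assumed links per page (illustrative figure only)
depth = 4

total = 0
frontier = 1       # the single start URL
for level in range(depth + 1):
    total += frontier
    print(f"depth {level}: {frontier:>16,} new URLs (running total {total:,})")
    frontier *= branching
```

With these assumptions the crawl frontier reaches 1000^4, about a trillion URLs, at depth 4, which is why a depth of 0 or 1 is usually plenty for large sites.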


This is a great explanation. Some kind of graphic near the setting, illustrating the consequences of increasing the crawl depth, would be helpful.

https://www.bridgespan.org/getmedia/23b4be8f-8aca-4f85-a325-2f366957d179/exponential-organizations-linear-vs-sxponential-graph1-600x450.aspx