Advice on crawling science and culture related sites

Hi, we’re setting up a YaCy server for public access, mainly for indexing science/culture sites and repositories. We’re indexing nasa, noaa, nist, eso.org, wikipedia, and some user forums with valuable information and discussions. Is there any advice on how to crawl these sites? Our server currently has a RAID-1 500GB HDD. We started testing this hardware for indexing, and it has already indexed 1.5 million documents and used 45GB of disk space. We are mostly interested in documents, so we’d rather not index every .gif or favicon image on these sites, though I don’t know whether that is actually happening. Any advice on optimizing these crawls would be appreciated, thanks very much.

Also, I have a few other questions:
1- When does a crawl terminate? I launched some crawls on websites, and after several days it seems they never end. Is this because the crawl waits for new content to index, or simply because the site is too big and the crawl hasn’t finished it yet?

2- When I terminate a crawl and re-launch it later, does it crawl the whole site again, or does it resume where it left off?

3- In the crawl monitor I see a lot of images being indexed. How much of the index storage space do they account for? I see many of them; this is an example of 8 running crawls.

This is the crawl log

Thanks!

Hi and welcome!

wikipedia,

Wikipedia could be imported from published dumps. That worked for me, but I’m not sure whether the feature is broken right now. See the Import/Export section in Administration.
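
A minimal sketch of fetching a current English Wikipedia article dump before importing it, assuming the usual dumps.wikimedia.org layout and the "latest" file name (check the dump site for the exact file you want); the import itself is then started from YaCy's web interface, not from code:

```java
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FetchWikipediaDump {
    public static void main(String[] args) throws Exception {
        // "latest" alias for the English Wikipedia article dump; check
        // dumps.wikimedia.org for the exact dump you actually want.
        String dumpUrl = "https://dumps.wikimedia.org/enwiki/latest/"
                + "enwiki-latest-pages-articles.xml.bz2";
        Path target = Path.of("enwiki-latest-pages-articles.xml.bz2");

        // Plain streaming download; the dump is then imported through
        // YaCy's Import/Export pages in Administration.
        try (InputStream in = URI.create(dumpUrl).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("Saved " + Files.size(target) + " bytes to " + target);
    }
}
```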

1- When does a crawl terminate? I launched some crawls on websites, and after several days it seems they never end. Is this because the crawl waits for new content to index, or simply because the site is too big and the crawl hasn’t finished it yet?

When it’s finished (technically: when there are no more URLs in the Loader for that particular domain). Some large sites have millions of pages (from memory, nytimes.com had something like 10M, bbc.co.uk ~700K…).

2- When I terminate a crawl and re-launch it later, does it crawl the whole site again, or does it resume where it left off?

If you choose ‘no doubles’ in the Advanced Crawler, it should ignore already indexed pages. I’m not sure how well that works. Generally, though, every time you start a crawl, it starts from the beginning. The queue will survive a restart, crash, etc.

3- In the crawl monitor I see a lot of images being indexed. How much of the index storage space do they account for? I see many of them; this is an example of 8 running crawls.

There is a setting in the Advanced Crawler for that: untick “Index Media”.

You can also limit the indexing of various parts of the sites using regular expressions; see the sketch below for an example.

Basically, as pages are crawled, links from them are added to the Loader, which is the queue (entries can be deleted there, based on regular expressions). Pages that comply with the indexing rules are then indexed by the Crawler. The others are only used as sources of links.
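
To make the mechanics concrete, here is a rough sketch (plain java.util.regex, not YaCy code) of the kind of “must-not-match” URL filter you could set in the crawler so images and favicons stay out of the index while the pages themselves are still followed for links. The pattern and the sample URLs are only illustrative:

```java
import java.util.List;
import java.util.regex.Pattern;

public class CrawlFilterDemo {
    public static void main(String[] args) {
        // Hypothetical "must-not-match" filter: skip common image/icon URLs.
        Pattern skip = Pattern.compile(".*\\.(gif|jpg|jpeg|png|ico)(\\?.*)?$",
                Pattern.CASE_INSENSITIVE);

        List<String> urls = List.of(
                "https://www.nasa.gov/missions/artemis.html",
                "https://www.nasa.gov/favicon.ico",
                "https://eso.org/public/images/eso1907a.jpg");

        for (String url : urls) {
            boolean indexed = !skip.matcher(url).matches();
            System.out.println((indexed ? "index: " : "skip:  ") + url);
        }
    }
}
```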

Hope that helped

You should have enough space for 200 days.

I find my peers consume ~2.25 GB a day and more if crawling.
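
Rough arithmetic behind that figure, assuming the ~45 GB already used and a steady ~2.25 GB/day: (500 GB − 45 GB) ÷ 2.25 GB/day ≈ 200 days.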

Thank you both. Now I realize that I started the crawl with the easy crawler, not the advanced one, so I think I should start all over again with these tweaks. It’s now at 2 million documents and 70GB of storage; at this rate, I’m only going to get a few million documents indexed. Is there any way to reset/delete the current index and start from zero again?

Well, I think I figured it out. I went to Index Management, deleted all the indexes, and started all over again. I’ll keep you posted!

Best regards.

You can also delete from the index selectively:
Index Administration > Index Deletion > Delete by URL Matching

There you can use a regular expression (in fact, a “Java Pattern”), such as .*\.jpg, to delete just jpg files. Simulate Deletion prints out the count of documents that would be deleted by that pattern.
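
As a rough illustration of what Simulate Deletion does with such a pattern (again plain java.util.regex, with made-up sample URLs), this just counts how many URLs `.*\.jpg` would hit:

```java
import java.util.List;
import java.util.regex.Pattern;

public class SimulateDeletionDemo {
    public static void main(String[] args) {
        // Same kind of Java Pattern you would paste into "Delete by URL Matching".
        Pattern pattern = Pattern.compile(".*\\.jpg");

        List<String> indexedUrls = List.of(   // made-up sample of indexed URLs
                "https://www.noaa.gov/report.pdf",
                "https://eso.org/public/images/eso1907a.jpg",
                "https://en.wikipedia.org/wiki/Very_Large_Telescope");

        long wouldDelete = indexedUrls.stream()
                .filter(u -> pattern.matcher(u).matches())
                .count();
        System.out.println(wouldDelete + " of " + indexedUrls.size()
                + " sample URLs match and would be deleted");
    }
}
```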