Advice on crawling science and culture related sites

Hi, we’re setting up a YaCy server for public access, mainly for indexing science/culture sites and repositories. We’re indexing nasa, noaa, nist, eso.org, wikipedia, and some user forums with valuable information and discussions. Is there any advice on how to crawl these sites? Our server currently has a RAID-1 500GB HDD. We started testing this hardware for indexing, and it has already indexed 1.5 million documents and used 45GB of disk space. We are mostly interested in documents, so we’d rather not index every .gif or favicon image on these sites, though I don’t know whether that is actually happening. Any advice on optimizing these crawls would be appreciated, thanks very much.

Also, I have a few other questions:
1- When does a crawl terminate? I launched some crawls on websites, and after several days it seems they never end. Is this because the crawl waits for new content to index, or simply because the site is too big and the crawl hasn’t finished it yet?

2- When I terminate a crawl and re-launch it later, does it crawl the whole site again, or does it resume where it left off?

3- In the crawl monitor I see a lot of images being indexed. How much of the index storage space do they account for? I see many of them; this is an example of 8 running crawls.

This is the crawl log

Thanks!

Hi and welcome!

wikipedia,

Wikipedia could be imported from published dumps. That worked for me, but I’m not sure whether the feature is broken right now. See the Import/Export section in Administration.
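
A minimal sketch of fetching a current English Wikipedia article dump before importing it, assuming the usual dumps.wikimedia.org layout and the "latest" file name (check the dump site for the exact file you want); the import itself is then started from YaCy's web interface, not from code:

```java
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class FetchWikipediaDump {
    public static void main(String[] args) throws Exception {
        // "latest" alias for the English Wikipedia article dump; check
        // dumps.wikimedia.org for the exact dump you actually want.
        String dumpUrl = "https://dumps.wikimedia.org/enwiki/latest/"
                + "enwiki-latest-pages-articles.xml.bz2";
        Path target = Path.of("enwiki-latest-pages-articles.xml.bz2");

        // Plain streaming download; the dump is then imported through
        // YaCy's Import/Export pages in Administration.
        try (InputStream in = URI.create(dumpUrl).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("Saved " + Files.size(target) + " bytes to " + target);
    }
}
```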

1- When does a crawl terminate? I launched some crawls on websites, and after several days it seems they never end. Is this because the crawl waits for new content to index, or simply because the site is too big and the crawl hasn’t finished it yet?

When it’s finished (technically: when there are no more URLs in the Loader for that particular domain). Some large sites have millions of pages (from memory, nytimes.com had something like 10M, bbc.co.uk ~700K…).

2- When I terminate a crawl and re-launch it later, does it crawl the whole site again, or does it resume where it left off?

If you choose ‘no doubles’ in the Advanced Crawler, it should ignore already indexed pages. I’m not sure how well that works. Generally, though, every time you start a crawl, it starts from the beginning. The queue will survive a restart, crash, etc.

3- In the crawl monitor I see a lot of images being indexed. How much of the index storage space do they account for? I see many of them; this is an example of 8 running crawls.

There is a setting in the Advanced Crawler for that: untick “Index Media”.

You can also limit the indexing of various parts of the sites using regular expressions; see the sketch below for an example.

Basically, as pages are crawled, links from them are added to the Loader, which is the queue (entries can be deleted there, based on regular expressions). Pages that comply with the indexing rules are then indexed by the Crawler. The others are only used as sources of links.
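
To make the mechanics concrete, here is a rough sketch (plain java.util.regex, not YaCy code) of the kind of “must-not-match” URL filter you could set in the crawler so images and favicons stay out of the index while the pages themselves are still followed for links. The pattern and the sample URLs are only illustrative:

```java
import java.util.List;
import java.util.regex.Pattern;

public class CrawlFilterDemo {
    public static void main(String[] args) {
        // Hypothetical "must-not-match" filter: skip common image/icon URLs.
        Pattern skip = Pattern.compile(".*\\.(gif|jpg|jpeg|png|ico)(\\?.*)?$",
                Pattern.CASE_INSENSITIVE);

        List<String> urls = List.of(
                "https://www.nasa.gov/missions/artemis.html",
                "https://www.nasa.gov/favicon.ico",
                "https://eso.org/public/images/eso1907a.jpg");

        for (String url : urls) {
            boolean indexed = !skip.matcher(url).matches();
            System.out.println((indexed ? "index: " : "skip:  ") + url);
        }
    }
}
```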

Hope that helped

You should have enough space for 200 days.

I find my peers consume ~2.25 GB a day and more if crawling.
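
Rough arithmetic behind that figure, assuming the ~45 GB already used and a steady ~2.25 GB/day: (500 GB − 45 GB) ÷ 2.25 GB/day ≈ 200 days.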

Thank you both. Now I realize that I started the crawl with the easy crawler, not the advanced one, so I think I should start all over again with these tweaks. It’s now at 2 million documents and 70GB of storage; at this rate, I’m only going to get a few million documents indexed. Is there any way to reset/delete the current index and start from zero again?

Well, I think I figured it out. I went to Index Management, deleted all the indexes, and started all over again. I’ll keep you posted!

Best regards.

You can also delete from the index selectively:
Index Administration > Index Deletion > Delete by URL Matching

There you can use a regular expression (in fact, a “Java Pattern”), such as .*\.jpg, to delete just jpg files. Simulate Deletion prints out the count of documents that would be deleted by that pattern.
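
As a rough illustration of what Simulate Deletion does with such a pattern (again plain java.util.regex, with made-up sample URLs), this just counts how many URLs `.*\.jpg` would hit:

```java
import java.util.List;
import java.util.regex.Pattern;

public class SimulateDeletionDemo {
    public static void main(String[] args) {
        // Same kind of Java Pattern you would paste into "Delete by URL Matching".
        Pattern pattern = Pattern.compile(".*\\.jpg");

        List<String> indexedUrls = List.of(   // made-up sample of indexed URLs
                "https://www.noaa.gov/report.pdf",
                "https://eso.org/public/images/eso1907a.jpg",
                "https://en.wikipedia.org/wiki/Very_Large_Telescope");

        long wouldDelete = indexedUrls.stream()
                .filter(u -> pattern.matcher(u).matches())
                .count();
        System.out.println(wouldDelete + " of " + indexedUrls.size()
                + " sample URLs match and would be deleted");
    }
}
```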