Advice on crawling science and culture related sites

Hi and welcome!

wikipedia,

Wikipedia can be imported from its published dumps. That worked for me, but I’m not sure whether the feature is currently broken. See the Import/Export section in Administration.

1- When does a crawl terminate? I launched some crawls on websites, and it’s been days and they seem to never end. Is this because the crawl waits for new content to index, or is the site simply too big and the crawl just hasn’t finished?

It terminates when it’s finished (technically: when there are no more URLs in the Loader for that particular domain). Some large sites have millions of pages (if I remember correctly, nytimes.com had something like 10M, bbc.co.uk ~700K…).

2- When I terminate a crawl and re-launch it later, does it crawl the whole site again, or does it resume where it left off?

If you choose ‘no doubles’ in the Advanced Crawler, it should ignore already-indexed pages. I’m not sure how well that works, though. Generally, every time you start a crawl it starts from the beginning, but the Queue will survive a restart, crash, etc.
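To make the ‘no doubles’ idea concrete, here is a minimal Python sketch of a crawl queue (this is an illustration of the general mechanism, not YaCy’s actual implementation; `get_links` is a hypothetical stand-in for fetching a page and extracting its links). Passing a pre-populated `visited` set is what skipping already-indexed pages on a re-launch amounts to:

```python
from collections import deque

def crawl(start_url, get_links, visited=None):
    """Breadth-first crawl sketch.

    `get_links(url)` stands in for fetching a page and extracting
    its outgoing links. A pre-populated `visited` set mimics the
    'no doubles' option: already-indexed URLs are skipped when the
    crawl is launched again.
    """
    visited = set() if visited is None else visited
    queue = deque([start_url])        # the "Loader" queue
    crawled = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue                  # already indexed: skip
        visited.add(url)
        crawled.append(url)           # the page would be indexed here
        queue.extend(get_links(url))  # enqueue the page's links
    return crawled
```

For example, with a tiny in-memory site `{"/": ["/a", "/b"], "/a": ["/b"], "/b": []}`, a fresh crawl visits all three pages, while a re-launch with `visited={"/a"}` skips `/a` entirely.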

3- In the crawl monitor I see a lot of images being indexed. How much of the indexing storage space do they account for? I see many of them; this is an example of 8 running crawls.

There is a setting in the Advanced Crawler for that: untick “Index Media”.

You can also limit which parts of a site get indexed using regular expressions (example).

Basically, as pages are crawled, the links extracted from them are added to the Loader, which is the queue (entries can be deleted there, based on regular expressions). Pages that comply with the indexing rules are indexed by the Crawler later; the others are only used as sources of links.
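The load-versus-index distinction above can be sketched with two regex checks (the patterns and function names here are hypothetical examples, not YaCy’s configuration keys): one filter decides whether a URL enters the queue at all, the other decides whether a fetched page is indexed or only mined for links.

```python
import re

# Hypothetical filters: index only article pages, and never
# enqueue media or asset files in the first place.
MUST_MATCH    = re.compile(r"^https?://example\.org/articles/")
MUST_NOT_LOAD = re.compile(r"\.(jpg|jpeg|png|gif|css|js)$", re.IGNORECASE)

def should_load(url):
    """Should this URL enter the Loader queue (be fetched at all)?"""
    return not MUST_NOT_LOAD.search(url)

def should_index(url):
    """Should the fetched page be indexed, rather than
    being used only as a source of further links?"""
    return bool(MUST_MATCH.match(url))
```

So `https://example.org/about` would still be loaded and its links followed, but only pages under `/articles/` would end up in the index, and `logo.png` would never be queued.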

Hope that helped
