While crawling news sites you will probably find they are full of rubbish that will clutter your index.
You want to index pages containing actual information, not the link text of “other interesting articles”.
Typical news sites contain plenty of links to other articles. Index pages for authors, categories, tags… all of these will top your search results instead of the articles themselves.
You can refine your crawl settings to get just the articles, not the other pages.
The best way is to use regular expressions in the Advanced Crawler.
Here is an example of the settings I most commonly use for WordPress sites.
As the Start point, set a top-level page, e.g. https://www.example.org
You could use a sitemap.xml as well, but XML processing is currently still broken in YaCy. (BUG)
In the Crawler Filter section I set a Crawling Depth of 0 and enable “Unlimited crawl depth for URLs matching with” `https://www.example.org/.*`.
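If you want to check what a pattern like that actually matches, here is a minimal Java sketch (my own illustration, not YaCy’s internal code; the class name and sample URLs are made up) that tests URLs against it with `java.util.regex`:

```java
import java.util.regex.Pattern;

public class DepthPatternDemo {
    public static void main(String[] args) {
        // The "unlimited crawl depth" pattern from the setting above.
        Pattern unlimited = Pattern.compile("https://www.example.org/.*");

        // URLs matching the pattern are followed regardless of the
        // Crawling Depth of 0; everything else stops at the start page.
        String[] urls = {
            "https://www.example.org/2024/05/some-article/",
            "https://elsewhere.example.com/linked-page/"
        };
        for (String url : urls) {
            System.out.println(url + " -> unlimited depth: "
                    + unlimited.matcher(url).matches());
        }
    }
}
```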
There are two more regexp filters.
The first one filters which pages the robot will load (meaning it just opens them and extracts the links).
In “Load Filter on URLs” I check “Use filter” with `https://www.example.org/.*` as the must-match pattern and `https://www.example.org/wp-json/.*` as the “must-not-match” pattern.
That’s because I don’t want to index all the API results from WordPress.
Other pages will be crawled.
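As a rough mental model of how the two fields combine (again just a sketch of my own, not YaCy’s implementation; names and URLs are made up), a URL is loaded only if it matches the must-match pattern and does not match the must-not-match pattern:

```java
import java.util.regex.Pattern;

public class LoadFilterDemo {
    static final Pattern MUST_MATCH =
            Pattern.compile("https://www.example.org/.*");
    static final Pattern MUST_NOT_MATCH =
            Pattern.compile("https://www.example.org/wp-json/.*");

    // A URL passes the load filter only if both conditions hold.
    static boolean shouldLoad(String url) {
        return MUST_MATCH.matcher(url).matches()
                && !MUST_NOT_MATCH.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldLoad("https://www.example.org/an-article/"));       // true
        System.out.println(shouldLoad("https://www.example.org/wp-json/wp/v2/posts")); // false: API endpoint
        System.out.println(shouldLoad("https://elsewhere.example.com/page"));          // false: off-site
    }
}
```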
The second filter controls the actual indexing, meaning which pages will be included in your search index.
Here I try to keep out all the pages that are just lists of links, so that only the articles remain.
My usual must-not-match line is something like:
`https://www.example.org/wp-json/.*|https://www.example.org/tag/.*|https://www.example.org/category/.*|https://www.example.org/author/.*|.*pagedTop.*|https://www.example.org/.*/page/.*`
Mind that these regexps are Java Patterns, not POSIX regular expressions.
The character `|` is a logical “or”, which allows you to combine multiple regular expressions in one field, and `.*` matches any sequence of characters.
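To see which pages such a line keeps out, you can compile it as a Java Pattern and test some sample URLs against it. This is only an illustration of the filter logic as I understand it, with made-up URLs:

```java
import java.util.regex.Pattern;

public class IndexFilterDemo {
    // The must-not-match line from above, split for readability.
    static final Pattern EXCLUDE = Pattern.compile(
            "https://www.example.org/wp-json/.*"
            + "|https://www.example.org/tag/.*"
            + "|https://www.example.org/category/.*"
            + "|https://www.example.org/author/.*"
            + "|.*pagedTop.*"
            + "|https://www.example.org/.*/page/.*");

    public static void main(String[] args) {
        String[] urls = {
            "https://www.example.org/2024/05/actual-article/",  // indexed
            "https://www.example.org/tag/politics/",            // excluded: tag index
            "https://www.example.org/category/world/page/3/"    // excluded: paginated listing
        };
        for (String url : urls) {
            boolean excluded = EXCLUDE.matcher(url).matches();
            System.out.println((excluded ? "skip  " : "index ") + url);
        }
    }
}
```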
There is also a “Content Filter”, which should allow indexing only the portion of the page with the article text itself, but I wasn’t able to get it working.
Did you?
In the Indexing section at the bottom of the Advanced Crawler I check “Index text” and deselect “Index media”, since I’m not interested in indexing images.
Then I click “Start New Crawl Job” and watch the crawling process. If some unwanted pages still get through, I terminate the job, copy it in the Crawler Scheduler, refine the settings and run it again.
Do you use different indexing strategies?