While crawling news sites you will probably find they are full of rubbish that will clutter your index.
You want to index pages containing actual information, not the link text of “other interesting articles”.
Typical news sites contain plenty of links to other articles. Index pages for authors, categories, tags… all of these will top your search results instead of the articles themselves.
You can refine your crawl settings to get just the articles, not the other pages.
The best way is to use regular expressions in the Advanced Crawler.
Here is an example of the settings I most commonly use for WordPress sites.
As the Start point, set a top-level page, e.g. https://www.example.org
You could use a sitemap.xml as well, but XML processing is currently still broken in YaCy. (BUG)
In the Crawler Filter section I set a Crawling Depth of 0 and enable “Unlimited crawl depth for URLs matching with” `https://www.example.org/.*`.
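If you want to check what a pattern like that actually matches, here is a minimal Java sketch (my own illustration, not YaCy’s internal code; the class name and sample URLs are made up) that tests URLs against it with `java.util.regex`:

```java
import java.util.regex.Pattern;

public class DepthPatternDemo {
    public static void main(String[] args) {
        // The "unlimited crawl depth" pattern from the setting above.
        Pattern unlimited = Pattern.compile("https://www.example.org/.*");

        // URLs matching the pattern are followed regardless of the
        // Crawling Depth of 0; everything else stops at the start page.
        String[] urls = {
            "https://www.example.org/2024/05/some-article/",
            "https://elsewhere.example.com/linked-page/"
        };
        for (String url : urls) {
            System.out.println(url + " -> unlimited depth: "
                    + unlimited.matcher(url).matches());
        }
    }
}
```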
There are two more regexp filters.
The first one filters which pages the robot will load (meaning it just opens them and extracts the links).
In “Load Filter on URLs” I check “Use filter” with `https://www.example.org/.*` as the must-match pattern and `https://www.example.org/wp-json/.*` as the “must-not-match” pattern.
That’s because I don’t want to index all the API results from WordPress.
Other pages will be crawled.
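As a rough mental model of how the two fields combine (again just a sketch of my own, not YaCy’s implementation; names and URLs are made up), a URL is loaded only if it matches the must-match pattern and does not match the must-not-match pattern:

```java
import java.util.regex.Pattern;

public class LoadFilterDemo {
    static final Pattern MUST_MATCH =
            Pattern.compile("https://www.example.org/.*");
    static final Pattern MUST_NOT_MATCH =
            Pattern.compile("https://www.example.org/wp-json/.*");

    // A URL passes the load filter only if both conditions hold.
    static boolean shouldLoad(String url) {
        return MUST_MATCH.matcher(url).matches()
                && !MUST_NOT_MATCH.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldLoad("https://www.example.org/an-article/"));       // true
        System.out.println(shouldLoad("https://www.example.org/wp-json/wp/v2/posts")); // false: API endpoint
        System.out.println(shouldLoad("https://elsewhere.example.com/page"));          // false: off-site
    }
}
```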
The second filter controls the actual indexing, meaning which pages will be included in your search index.
Here I try to keep out all the pages that are just lists of links, so that only the articles remain.
My usual must-not-match line is something like:
`https://www.example.org/wp-json/.*|https://www.example.org/tag/.*|https://www.example.org/category/.*|https://www.example.org/author/.*|.*pagedTop.*|https://www.example.org/.*/page/.*`
Mind that these regexps are Java Patterns, not POSIX regular expressions.
The character `|` is a logical “or”, which allows you to combine multiple regular expressions in one field, and `.*` matches any sequence of characters.
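To see which pages such a line keeps out, you can compile it as a Java Pattern and test some sample URLs against it. This is only an illustration of the filter logic as I understand it, with made-up URLs:

```java
import java.util.regex.Pattern;

public class IndexFilterDemo {
    // The must-not-match line from above, split for readability.
    static final Pattern EXCLUDE = Pattern.compile(
            "https://www.example.org/wp-json/.*"
            + "|https://www.example.org/tag/.*"
            + "|https://www.example.org/category/.*"
            + "|https://www.example.org/author/.*"
            + "|.*pagedTop.*"
            + "|https://www.example.org/.*/page/.*");

    public static void main(String[] args) {
        String[] urls = {
            "https://www.example.org/2024/05/actual-article/",  // indexed
            "https://www.example.org/tag/politics/",            // excluded: tag index
            "https://www.example.org/category/world/page/3/"    // excluded: paginated listing
        };
        for (String url : urls) {
            boolean excluded = EXCLUDE.matcher(url).matches();
            System.out.println((excluded ? "skip  " : "index ") + url);
        }
    }
}
```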
There is also a “Content Filter”, which should allow indexing only the portion of the page with the article text itself, but I wasn’t able to get it working.
Did you?
In the Indexing section at the bottom of the Advanced Crawler I check “Index text” and deselect “Index media”, since I’m not interested in indexing images.
Then I click “Start New Crawl Job” and watch the crawling process. If some unwanted pages still get through, I terminate the job, copy it in the Crawler Scheduler, refine the settings and run it again.
Do you use different indexing strategies?