The Story of YaCy Grid

Orbiter · 15 June 2019 16:30

It took a long time to get here, but this story finally had to be told…

In 2015 YaCy had become a well-recognized and already mature search engine software. The platform was intended for private individuals, but there was also demand from professional users. Designed as peer-to-peer software, the architecture had some inherent flaws:

no stable search: consecutive queries do not produce the same search result, not on one peer and never across different peers. A stable search result in YaCy would contradict censorship resistance.
incompleteness: we distribute our search index to a set of remote peers and have no control over how long they store that index. This causes poor recall.
lacking speed: peer-to-peer search is meta-search, so the retrieval process is only as fast as the slowest peer. If we apply a time-out, we lose information and increase incompleteness.

All these problems are ok if we insist on having a “freedom index” but for professional use, these problems must be solved. As the demand was there, I was trying to find a good concept for a full re-design!

In late 2015 I met Ilya Kreymer in San Francisco (check out his repository, it's amazing!). He was a former employee of the Internet Archive and worked on a free version of the Wayback Machine: openwayback. He created numerous other WARC-based tools, and this told me that a re-design of YaCy should not only consist of separated components but should also use standards as ‘glue’ between the parts of the software.
This convinced me that we should have a WARC middle layer in a new YaCy architecture and re-use all these amazing tools. The new YaCy architecture could, e.g., have a crawler archive of all crawled pages which looks like the Internet Archive.

A Supercomputer Platform Architecture

So I designed slides to advertise a redesign of YaCy, at this time called “Kaskelix”.

These components would be either constructed from recycled code from YaCy or they would consist of external, standardized software modules. This design contained also optional elements - like “Moderation”, “Add-On Content” which are not obligatory for the whole construction but would leave room for a commercial assignment.

To prove feasibility, I added the following to-do picture:

In January 2016, the OpenWebIndex initiative was started with a conference consisting mostly of members of the suma-ev. The idea was that the OWI creates a large search index but not a search interface. Users of the OWI would have to create their own interface, so the OWI could not be compared with other search indexes on a user-experience level but only on scientific attributes.

YaCy Grid still existed only as a concept, but I was sure that my approach was best suited for the OWI. Unfortunately, it turned out that no software was ever developed for the OWI; the approach was purely political at that time. We did not even reach the point where I was able to propose my architecture, which was most disappointing.

Implementation for a Business Partner

At the end of 2016, I actually found a business partner to implement some of the proposed components.
The architecture required an orchestration element which I called (ironically) MCP - “Master Connect (sic!) Program”. It also required a queueing mechanism which provided interfaces and scaling to the other grid elements.
The following modules had been implemented for YaCy Grid:

yacy_grid_mcp - grid orchestration. This software not only runs once in the grid as a daemon, it is also deeply integrated into the crawler, parser, loader and search elements as a git submodule. The MCP also includes an elasticsearch client and acts as an indexing client for the grid.
yacy_grid_crawler - crawler, which includes the crawl start servlet, host-oriented crawling balancing and filter logic for the crawl jobs.
yacy_grid_loader - loader and headless browser. Today many (maybe most) web pages no longer have static content; their content is loaded dynamically. As loading web content with a headless browser is very complex, this component must be scalable.
yacy_grid_parser - the YaCy Parser as it is implemented in “legacy” YaCy. We have an extremely rich metadata schema in YaCy and YaCy Grid inherits this schema.
yacy_grid_search - the query parser and search back-end API for search front-ends. In the fashion of stateless microservices, this component can be scaled up according to the load on the search front-end.
yacy_webclient_bootstrap - a demonstration search client that looks exactly the same as the search front-end built into legacy YaCy.

These parts must be combined with the following standard software:

elasticsearch - instead of solr, YaCy Grid now uses elasticsearch as a search index
kibana - dashboards for monitoring
RabbitMQ - Queues for high-performance computing
an FTP server - storage for WARC and flat-index files

Creating a search front-end for YaCy was also part of Google Summer of Code within the FOSSASIA Community - this created the following component:

susper.com - a Google look-alike search interface

All together in one picture:

Going Online in a Cloud-Hosted Kubernetes Cluster

A huge YaCy Grid installation went online at the beginning of 2018. Our partner, who is running YaCy Grid for http://land.nrw, uses Kubernetes cloud hosting for the YaCy Grid Docker containers:

This provides a search index for the public administration documents and web pages of all communities (cities, villages, more than 1000) in the state of NRW/Germany.

We can monitor crawling behavior with kibana:

The load status of crawl queues and the queues of other grid components can be monitored with the dashboard of RabbitMQ:

YaCy Grid: A Scalable Search Appliance

YaCy provides not only a rich, opensearch-based search API but also an implementation of the Google Search Appliance (GSA) XML API. That means YaCy Grid may serve as a drop-in replacement for existing GSA users. As Google has abandoned the GSA, users should switch to YaCy Grid.
With YaCy Grid we finally achieved:

index stability - all search results for the same query are the same
completeness - we can find everything that was crawled
speed - this construction provides unlimited scaling: for crawling, indexing and for search.

The story is still going on:

more monitoring features and operational support (like re-crawling of failed loadings) are currently being developed for YaCy Grid
we should develop a concept to integrate (or join) YaCy Grid with (the old) “legacy YaCy”.
The OpenWebIndex initiative has new people on board and we are currently trying to integrate one part (yacy_grid_parser) of YaCy Grid into the OWI architecture.
we need documentation. Creating a platform for documentation is now required…

To be continued…

If you like this story then you are invited to share your ideas in the comments! It is a huge challenge to join old and new YaCy components towards a better platform: would you like to contribute? What can be done by the community? Are you a professional user of the old YaCy software and would you like to switch to YaCy Grid?

Orbiter · 15 June 2019 16:43

Please share: https://www.reddit.com/r/YaCy/comments/c0z9sm/the_story_of_yacy_grid_development_of_a_large/

https://twitter.com/0rb1t3r/status/1141005180034519043

zooom · 17 June 2019 07:55

Yes, I am a professional user of the “old” yacy software and would like to switch to the new yacy grid.

What about data migration? My personal experience with yacy is that there are two important parts to take care of: 1. Starturls/Domains and 2. Blacklists (“the opposite”). Both mean a lot of work and manual interaction, so I am working on a management system for these topics.

Another interesting topic is automatic content classification (e.g. part of speech tagging). This implies the knowledge and availability of reference data like taxonomies and structured lists of named entities.

This is the second topic I am working on and I would be happy to contribute to the new yacy grid.

What is the best starting point? Can I set up a sandbox? Is the grid compatible with freeworld?

Best regards

Markus

Orbiter · 17 June 2019 20:51

Hi Markus, a lot of good questions…

Yes! There are several things to consider:

index migration: Legacy YaCy runs on Solr and the RWI index, while YaCy Grid runs on elasticsearch. With the “Index Export” function in Legacy YaCy you can export the Solr index into elasticsearch bulk format. Have a look at the option; it explains the full process.
index startup migration: The crawler in YaCy Grid can be called with a curl command which accepts the same crawl start attributes as you see in the process scheduler. So instead of migrating the search index, you can restart indexing using the crawl start URLs. However, you must rewrite part of the start URL to the new path of the YaCy Grid Crawler component. I will explain details later.
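As a rough sketch of the URL rewriting mentioned in the second option: the helper below keeps the query string (the crawl start attributes) of an existing crawl start URL and swaps only scheme, host and path. The actual path of the YaCy Grid Crawler endpoint is not given here, so it is passed in as an argument; the legacy URL and the grid endpoint in the usage lines are purely illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget_crawl_start(legacy_url, grid_endpoint):
    """Keep the query string of legacy_url, take scheme, host and path from grid_endpoint."""
    legacy = urlsplit(legacy_url)
    grid = urlsplit(grid_endpoint)
    # The crawl start attributes live in the query string and are copied unchanged.
    return urlunsplit((grid.scheme, grid.netloc, grid.path, legacy.query, ""))


# Both URLs are placeholders, not real YaCy or YaCy Grid endpoints.
legacy = "http://localhost:8090/legacy-crawl-start?crawlingURL=https%3A%2F%2Fexample.org%2F&crawlingDepth=2"
print(retarget_crawl_start(legacy, "http://grid.example/hypothetical/crawlstart"))
```

The resulting URL could then be passed to curl once the real crawler path is known.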

As explained above! Second Option

I dropped the blacklists as their syntax was terrible and confusing all the time. To have blacklists in YaCy Grid, you must translate them into a (maybe very long) must-not-match list, to be used as a parameter in the crawl start.
There is no automation for this yet; it's just a concept.
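A minimal sketch of such a translation, assuming a simplified blacklist syntax of `host/path` entries in which `*` is the only wildcard (real legacy blacklists may allow more than that). The function name is hypothetical; the result is a single regular expression that could serve as a must-not-match pattern.

```python
import re


def blacklist_to_must_not_match(entries):
    """Translate simplified 'host/path' wildcard entries into one regular expression."""
    alternatives = []
    for entry in entries:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comment lines
        host, _, path = entry.partition("/")
        # Escape regex metacharacters, then turn the escaped '*' back into '.*'.
        host_re = re.escape(host).replace(r"\*", ".*")
        path_re = re.escape(path).replace(r"\*", ".*") if path else ".*"
        alternatives.append(f"https?://{host_re}/{path_re}")
    return "|".join(alternatives)


pattern = blacklist_to_must_not_match(["*.ads.example/*", "example.org/private/*"])
print(bool(re.fullmatch(pattern, "https://example.org/private/a.html")))
```

With many entries the pattern grows long, which matches the "maybe very long" caveat above.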

I also dropped content classification for now. I left this out as a possible commercial or user-constructed function which can be applied to the parser result. The parser creates flat json files, the same as you get with the legacy YaCy export-to-elasticsearch files. What you must do is: parse the json, match the content against your vocabulary and write the classification back to the index dump file.
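The three steps could look roughly like the sketch below. It assumes one JSON object per line and uses hypothetical field names (`text` for the document text, `classification` for the result); the real schema fields may differ. Lines without the text field, such as bulk action lines, are passed through unchanged.

```python
import json


def classify_dump(lines, vocabulary, text_field="text", target_field="classification"):
    """Yield JSON lines, adding the matched vocabulary categories to each document."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get(text_field)
        if isinstance(text, str):
            lowered = text.lower()
            # A category matches if any of its terms occurs in the text.
            record[target_field] = sorted(
                category
                for category, terms in vocabulary.items()
                if any(term.lower() in lowered for term in terms)
            )
        yield json.dumps(record, ensure_ascii=False)


vocabulary = {"energy": ["solar", "wind"], "transport": ["railway"]}
for out in classify_dump(['{"text": "New solar park opened"}'], vocabulary):
    print(out)
```

Plain substring matching is only a placeholder here; a real vocabulary would probably need tokenization and word boundaries.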

Yes, there is no general solution. Every user of a possible content classification brings its own files. We must see where we can go here with a community contribution. In fact this would be a great project for community work.

Right now the only documentation is the readme in the project repositories. Start with yacy_grid_mcp/README.md at development · yacy/yacy_grid_mcp · GitHub and then read the README of the other grid components.
I also want to write a nice manual, but only after we have migrated the home page of yacy.net into a new CMS with the ability to write longer documentation texts.
Or maybe earlier, as an enhancement of the READMEs.

Only in some special way:

the code of the MCP and the parser is largely taken from legacy YaCy. The crawler and loader are mainly rewritten. So there is compatibility at the code level and this will be used to merge YaCy Grid code back into legacy YaCy
the Loader in YaCy Grid produces WARC files and legacy YaCy can import these with the surrogate framework (just put them into the surrogate/in/ path)
legacy YaCy can export into YaCy Grid dump index file format
the search API is identical. Completely the same. A search front-end for legacy YaCy also fits YaCy Grid and vice versa. This also applies to the GSA (Google Search Appliance) XML API, which is also a search API in YaCy.

Future functions which may enhance compatibility:

I am planning that a crawl start in legacy YaCy can be sent to YaCy Grid as a crawl start, but that is not implemented
I am planning a “connect to YaCy Grid” function in legacy YaCy to use a YaCy Grid search API as metasearch element.
It may also be possible that the Crawler in legacy YaCy could create WARC files and put them into the YaCy Grid indexing queue.

So let's see where this leads. YaCy Grid is by definition NOT a p2p network, it is ‘just’ a massively scaling search engine tool set. Maybe we have enough experience here to compose a collection of such Grid implementations which could share data, like:

a common repository of crawled WARC files
a common repository of parsed WARC files
a common list of indexes

Please share your thoughts about that.

Orbiter · 19 June 2019 13:26

a Twitter user asked about YaCy Grid:

https://twitter.com/agnelvishal/status/1141321183817658368

So the answer

https://twitter.com/yacy_search/status/1141323609153183744

points to required action items here. There is not yet a fixed answer on how to connect Legacy YaCy with YaCy Grid. But even if I don’t find time to create something here, everyone could connect the systems using their open APIs. But it would be good to have a concept.

The first thing we could easily do is have a kind of registry of YaCy Grid installations where YaCy Grid users can subscribe automatically - but voluntarily! So we could have a meta-search over the Grid installations.
Another thing could be that users can join an existing YaCy Grid with their own Grid Loader & Grid Parser. That would also require the registry mentioned above.
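To make the meta-search idea a bit more concrete, here is a small sketch of how ranked result lists from several registered installations might be merged. It is only an illustration: the result format (a dict with a `url` key) and the round-robin interleaving are assumptions, and fetching the results from each installation is left out.

```python
from itertools import zip_longest


def merge_results(result_lists, limit=10):
    """Interleave ranked result lists round-robin and drop duplicate URLs."""
    seen = set()
    merged = []
    # Each row holds the hits that share the same rank across all installations.
    for rank_row in zip_longest(*result_lists):
        for hit in rank_row:
            if hit is None or hit["url"] in seen:
                continue
            seen.add(hit["url"])
            merged.append(hit)
            if len(merged) >= limit:
                return merged
    return merged


grid_a = [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}]
grid_b = [{"url": "https://example.org/b"}, {"url": "https://example.org/c"}]
print([hit["url"] for hit in merge_results([grid_a, grid_b])])
```

Round-robin merging treats all installations as equally trustworthy; a real registry would probably want some weighting.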

Challenge:

One concept of YaCy Grid is to use simple standard software to implement parts of the platform. So what kind of ready-made URL registry would you suggest?

agnelvishal · 20 June 2019 09:56

Do we need something like https://aws.amazon.com/pub-sub-messaging/

Orbiter · 20 June 2019 12:01

Yes, for that the RabbitMQ is part of the architecture.

ForestFriend · 26 June 2019 18:25

First, thanks for all the important work on YaCy and congrats on the redesign.

I do think that the name “YaCy Grid” is a little confusing. The name implies something that is more P2P than original YaCy, not less, perhaps like a connected “grid” of instances with chosen P2P connection policies. It is more like “YaCy Platform” or “YaCy Engine” than a “grid”.

Orbiter · 26 June 2019 19:27

YaCy Grid refers to the actual definition of “Grid Computing” as you can find in Wikipedia: Grid computing - Wikipedia

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application.

So YaCy Grid is exactly that: it’s distributed (not necessarily but potentially decentralized) but the parts are not interacting (it’s using a hub - a broker) and cluster components have different tasks (the grid components).
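The hub idea can be illustrated with a toy sketch: components never call each other, they only consume from one queue and publish to the next. The in-memory `Broker` class below is a stand-in for a real message broker, and the queue names and stage functions are hypothetical, not the actual YaCy Grid queues.

```python
from collections import deque


class Broker:
    """A toy in-memory stand-in for a message broker with named queues."""

    def __init__(self):
        self.queues = {}

    def publish(self, queue, message):
        self.queues.setdefault(queue, deque()).append(message)

    def consume(self, queue):
        pending = self.queues.get(queue)
        return pending.popleft() if pending else None


def run_stage(broker, source, target, work):
    """Drain one queue, apply this component's task, publish to the next queue."""
    handled = 0
    while (message := broker.consume(source)) is not None:
        broker.publish(target, work(message))
        handled += 1
    return handled


broker = Broker()
broker.publish("loader", "https://example.org/")
# Each stage has a different task and talks only to the broker.
run_stage(broker, "loader", "parser", lambda url: {"url": url, "raw": "<p>hi</p>"})
run_stage(broker, "parser", "indexer", lambda doc: {"url": doc["url"], "text": "hi"})
indexed = broker.consume("indexer")
print(indexed)
```

Because every stage only depends on the broker, more instances of a slow stage can be added without touching the others.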

It’s not about more or less p2p, it’s just another computational approach.

Tom_Booth · 30 July 2019 18:08

I’m a little confused in regard to the overall intent and purpose of YaCy Grid, (as opposed to legacy YaCy.)

My general impression is that Legacy YaCy is geared toward “personal use” whereas YaCy Grid is intended for “professional” use, though:

What I’m interested in is YaCy as a peer-to-peer search engine for the public internet. period.

So, in that regard, is there any possibility for YaCy Grid becoming a search engine for use on the World Wide Web, by the general public, or is it intended for corporate or organizational use, such as on an intranet.

Legacy YaCy, I know, can be used in both ways. I don’t, however, see much emphasis, if any, or any discussion at all really, on using YaCy Grid across the public internet.

Orbiter · 30 July 2019 21:17

Hi Tom,

…is described by my first posting following of:

So:

Yes, for the “Legacy YaCy” user it’s a long way:

[*] extract the “cherries” from legacy YaCy to become parts of YaCy Grid (done!)

[*] YaCy Grid to become a working infrastructure for a large-scale search engine (first customer!)

shakedown - YaCy Grid shall become stable and a complete concept

re-integrate / replace YaCy Grid components into Legacy YaCy

one code for two applications: p2p and large-scale scientific/professional use

We will go that way, but if you are considering using YaCy Grid instead of Legacy YaCy because of the known flaws, you can do this right now!

That's the reason I came up with this topic: to get the preliminary results of YaCy Grid into the hands of the users!

Tom_Booth · 31 July 2019 04:27

Above it reads:

Then:

Finally:

I think I understand.

I’m interested in:

I’m not entirely sure what exactly you mean by “URL registry”. What are some examples of “ready made URL registries”?

Tom_Booth · 31 July 2019 05:46

How?

I’d also like to try this, though I don’t know when I’ll have the time.

I’m still trying to get up to speed on Legacy YaCy. So far I’m finding it quite fantastic.

Ryu945 · 5 September 2020 00:03

I was looking into setting up Yacy for the first time in a Docker container. I saw this thread and wondered if the software was about to change from a legacy version to a more modern version. Should I wait until this has occurred? I looked at the Docker container for Yacy revamp and see it is 2 years old. It makes me wonder if the overhaul is a dead project and only legacy Yacy is being worked on now.

Some details on what is going on would be nice.

Orbiter · 5 September 2020 07:22

YaCy Grid is a completely different approach from legacy YaCy and I am actively maintaining it for a customer. The last fix was in July in the parser component https://github.com/yacy/yacy_grid_parser/commits/master

Connections between Legacy YaCy and YaCy Grid may occur first in a common file storage solution, maybe in the context of yesterday's posting about S3/minio.

TheHolm · 10 September 2020 09:54

Is YaCy Grid ready for “production” in its current state? Is there any document describing the general architecture of YaCy Grid and what components are missing? All I found is the Readme on the yacy_grid_mcp github.

Orbiter · 10 September 2020 17:34

Yes, this is production state.
If you are looking for a document describing the architecture and what is possibly missing - read the first posting of this thread.
The readme should be sufficient to get this running.

isle · 12 November 2020 22:34

If you are able to incorporate some data visibility into YaCy grid, that is both helpful and motivating. For example, the “leaderboard” on the YaCy stats site encourages you to keep crawling. It also provides an indication about which sites (like Wikipedia) are already being heavily crawled.

Some examples of graphics could be:

Size in documents of site to be crawled
% of site already crawled
Staleness of information
Popularity of site with others crawling
etc

The main thing I think is that improvements in engineering for Grid YaCy are put into the peer-to-peer YaCy.

At the moment, the world is at the mercy of proprietary search engines, so that is my main concern.

Any update on when we can easily try YaCy Grid?

Orbiter · 29 November 2020 11:58

Yes, all of that makes sense … it's just too much work at once. We must break this into pieces and we need volunteers who want to contribute to it.

Pieces may be

find a concept on how to integrate YaCy Grid with legacy YaCy and its environment (like the new yacystats)
find an easy way to deploy all YaCy Grid parts
and write a documentation for the deployment
extend YaCy Grid with a dashboard/leaderboard as you suggested

Furthermore I consider renovations in YaCy Grid:

maybe replace RabbitMQ with Apache Kafka
add S3 storage as crawl result space including set-up of minio as self-hosted bucket store

We need more developers who want to join in to start this work!

Orbiter · 31 October 2021 23:14

The story continues with the Searchlab: The Searchlab Project