The Searchlab Project

Orbiter · 29 October 2021 13:37

I have been working for quite some time on a concept for a portal that uses YaCy Grid as its crawler and indexing engine. About two years ago I tried to find a sponsor for the portal. Today the project has not only started, it has also reached its first milestone!

The searchlab portal is actually live right now, but I will not share the temporary link until the new portal can be hosted (maybe you guessed it) here, at the same place where the forums are. This means we must migrate this forum to a subdomain of searchlab.eu. I will keep you updated.

The YaCy Searchlab project is kindly sponsored by NLnet and YaCy Patreon patrons. I would like to ask you to become a patron as well to support my work on YaCy and the searchlab.

The six milestones of the project explain pretty well where the project is going and what you can expect:

Many more details are contained in the README of the searchlab repository on GitHub.

If you actually want to see the portal yourself, you can easily do so by using docker:

docker run -d --rm -p 8400:8400 --name searchlab yacy/searchlab

.. and then open http://localhost:8400 in your browser.
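The container may need a moment before the portal answers. A minimal Python sketch for waiting until it does, assuming the port mapping from the docker command above; the function name and the timeout values are illustrative only:

```python
import time
import urllib.error
import urllib.request


def wait_for_portal(url="http://localhost:8400", timeout_s=60.0, interval_s=2.0):
    """Poll url until it sends any HTTP response or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet (e.g. connection refused); try again below.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_for_portal() right after docker run returns True as soon as the portal responds and False if it does not respond within the timeout.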

If you have any ideas, suggestions or questions, please let me know!

Orbiter · 24 January 2022 08:57

Milestone M2 is now implemented! Unfortunately this is not yet publicly visible, because that requires moving the searchlab portal to searchlab.eu, which is currently here, the location of the forum. I will therefore move the searchlab forum to community.searchlab.eu.

Orbiter · 2 February 2022 08:47

The searchlab is now publicly available where the forum was: https://searchlab.eu

This is a going-public with a “small bang” (as opposed to a “big bang”), with a small set of functions as described in M2. Most notable are the search function and the index, which is provided by a small set of test crawls. You can test it at Search - Searchlab

That search function is only the first in a series of many, because the next milestone M3 will provide search apps. Those apps will be hosted in a separate repository: GitHub - yacy/searchlab_apps: Search Apps for the Searchlab

okybaca · 3 February 2022 13:52

Great, and congrats!
Once again, what is the planned future of “legacy” YaCy? Will it be actively developed?
While a lot of folks still use old YaCy and crawl the web, is it still worth the investment of (human & computer) time, or is it a slowly dying project?
Will the grid version include RWI and P2P functions and be somehow backward compatible with the legacy one?
thanx & good luck

Orbiter · 3 February 2022 17:19

“Legacy YaCy” is an important project; it is ongoing and will go on.

In YaCy Grid there will be no RWI and no P2P function, because the purpose was to build a high-performance search portal that does not need networking with other peers to work.

A “backward compatibility” of YaCy Grid/Searchlab to P2P YaCy is partly already there (the crawl start API and the search API are completely identical, only the paths differ) and I plan to implement some kind of “forward compatibility” in P2P YaCy: it would be great to make the searchlab apps usable for the old YaCy as well. Some new interfaces will be required, and I will make sure that this is possible.
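To illustrate “identical API, different paths”, here is a minimal Python sketch that builds a search URL for either back-end. The host http://localhost:8090 and the path /yacysearch.json with the query and maximumRecords parameters are those of a default legacy YaCy installation. The Searchlab path /api/yacysearch.json is an assumption and should be checked against the Searchlab README before use.

```python
from urllib.parse import urlencode, urljoin

# (base address, search path) per back-end.
LEGACY_YACY = ("http://localhost:8090", "/yacysearch.json")
SEARCHLAB = ("https://searchlab.eu", "/api/yacysearch.json")  # assumed path


def search_url(base, path, query, count=10):
    """Build a search request URL; only base and path depend on the back-end."""
    params = urlencode({"query": query, "maximumRecords": count})
    return urljoin(base, path) + "?" + params


if __name__ == "__main__":
    for backend in (LEGACY_YACY, SEARCHLAB):
        print(search_url(*backend, "free software"))
```

A client written against one back-end then only needs a different (base, path) pair to talk to the other.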

Orbiter · 26 February 2022 12:49

The Searchlab Apps are being implemented!

Right now there are three apps available, created from older YaCy stand-alone search web interfaces:

These apps will work with both searchlab and YaCy installations; you only have to change the address of the back-end.

The apps will appear on searchlab.eu in an apps section and will be shown in an app-store-like UI. It will be easy for everyone to add another search app, because all of them are hosted in a new git repository: GitHub - yacy/searchlab_apps: Search Apps for the Searchlab

The way you can extend the apps is defined in the README of that repository:

Contributing Your Own Apps

If you like please give us a pull request with your new app!
We love to extend the searchlab apps with community-created content.

To do so, please..

- Create a new subfolder within htdocs/app/ with the name of your app.
- Create an app.json and fill it with an app description using at least the same fields as used in htdocs/app/websearch_lit/app.json. The app.json is used within https://searchlab.eu to show a proper visualization of your app.
- You must create an index.html file within your app folder.
- You must create a screenshot.png file with the exact size of 1024x1024. The image should not contain any transparency and it should show a mostly proper screenshot of your app when it is producing something useful for the user.
- You can use all css and js code as given in htdocs/app/_/css and htdocs/app/_/js, but you MUST NOT add any files to those directories. If you need any other css and js code, please link them directly from the internet or add those to your app folder in a separate css/js-path within your app folder.
- Your app must be published under the CC0 license.
- Make a pull request where only files within your app folder are added/modified, not anything else.
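Before opening a pull request, the file requirements above can be checked locally. The following Python sketch is not part of the searchlab_apps repository; the function names are hypothetical. It only tests that app.json parses as a JSON object (which fields are required is defined by websearch_lit/app.json and is not checked here), that index.html exists, and that screenshot.png is 1024x1024 without an alpha channel. Transparency declared through a tRNS chunk is not detected.

```python
import json
import struct
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_info(path):
    """Return (width, height, color_type) read from the IHDR chunk of a PNG."""
    header = Path(path).read_bytes()[:26]
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height, header[25]


def check_app_folder(app_dir):
    """Return a list of problems found in app_dir; empty means all checks passed."""
    app_dir = Path(app_dir)
    problems = []

    manifest = app_dir / "app.json"
    if not manifest.is_file():
        problems.append("app.json is missing")
    else:
        try:
            if not isinstance(json.loads(manifest.read_text(encoding="utf-8")), dict):
                problems.append("app.json is not a JSON object")
        except json.JSONDecodeError as e:
            problems.append(f"app.json is not valid JSON: {e}")

    if not (app_dir / "index.html").is_file():
        problems.append("index.html is missing")

    shot = app_dir / "screenshot.png"
    if not shot.is_file():
        problems.append("screenshot.png is missing")
    else:
        try:
            width, height, color_type = png_info(shot)
            if (width, height) != (1024, 1024):
                problems.append(f"screenshot.png is {width}x{height}, expected 1024x1024")
            if color_type in (4, 6):  # grayscale+alpha or RGBA
                problems.append("screenshot.png has an alpha channel")
        except ValueError as e:
            problems.append(str(e))
    return problems
```

For example, check_app_folder("htdocs/app/myapp") returns an empty list when all three files pass, where myapp stands for your own app folder name.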

Mix and Merge with Searchlab

Everything that is inside the htdocs folder of the searchlab_app folder is hosted in searchlab.eu/en/, for example:

.. but with future versions of the searchlab, the content will be available with a customized user-account path which then accesses only user-account generated content. The user-account paths will be https://searchlab.eu/<user>

That means, if a user named freedom has an account, one web interface for searching the freedom index is, e.g., https://searchlab.eu/freedom/app/websearch_bootstrap, which can easily be embedded elsewhere.

Orbiter · 8 June 2022 10:02

Web Crawler and Data Warehouse

The next milestone is reached:

- You can now “pseudo-login” to your own searchlab account. Currently there is no authentication, only anonymous accounts that you get when you click on the “login” button. If you remember your login-id, you can re-use that id later to access your personal assets store.
- You can now start a web crawl with not-yet-everything-enabled options. The web crawl is executed by a YaCy Grid network and results of that crawl go into the search index and into the assets store (as long as you checked the storage options).
- There is now a Data Warehouse which hosts the assets of web crawls:
  - a corpus database (a table which describes the content of your search index)
  - a crawl start history (json objects which can be used for detailed analyses and to automate crawl starts)
  - a graph database for each crawl (link structure between the documents in your index as json objects)
  - an index dump for each crawl (json objects with parsed documents, the same that has been pushed into elasticsearch)
  - original warc files from the crawl-loader for each crawl (compressed and optional)

There is currently no limitation on crawl starts, the start location, the crawl depth and other crawl options. You can start crawls anonymously without login, and you can also do that using the login and the anonymous user-ids. That will change slightly as part of the work on the next milestone, where proper authentication (possibly oauth) is implemented.

For now, I invite everyone to try out the new functions that make the searchlab actually usable. However, when the authentication is implemented, all currently created indexes will be cleared so we can start fresh with real accounts.

Orbiter · 29 July 2022 23:02

tl;dr: You can now log in and create your own index and search your own index!

We have another milestone before the final one, Integration of an ACL Framework for Account Management and Account Rights: you can now log in with an existing github, patreon or twitter account.

To test this, click on “login” on https://searchlab.eu or directly open https://searchlab.eu/en/login/

For Authentication there is now a self-management zone where the user can edit account details:

This page can be found at the link with your user-id in the main menu, e.g. https://searchlab.eu/273584169/home/

This is also the place where you can assign a patreon and/or github sponsors account. If you are a patreon or github sponsor, extended abilities for your account are switched on. However, this works only with the completion of the next and final milestone. For now every user has extended rights to generate an index.

The Access Control List gives detailed information about the user rights (e.g. bandwidth, frequency of crawls, speed of crawls, frequency of API requests, asset storage space, number of search queries etc.)

This page can be found at https://searchlab.eu/273584169/access/services/ - We might change details and extend this for more service levels.

Finally we added the option for the deletion of an account:

We tried to be minimalistic about the amount of data that we collect from twitter/github/patreon: we store only the user's name and the user's email address. The email address is used to identify the user, which makes it possible to use different sign-in methods while still accessing the same searchlab account, as long as the remote oauth services send back the same email address for that user.
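The idea of one account behind several sign-in methods can be sketched in a few lines of Python. This is an illustration of the principle only, not the Searchlab implementation: the class and field names are made up, and treating email addresses as case-insensitive is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    email: str
    name: str
    providers: set = field(default_factory=set)


class AccountStore:
    """Keeps one account per email address, whatever oauth provider was used."""

    def __init__(self):
        self._by_email = {}

    def sign_in(self, provider, name, email):
        # Assumption: addresses are compared case-insensitively.
        key = email.strip().lower()
        account = self._by_email.get(key)
        if account is None:
            # First sign-in with this address: store only name and email.
            account = Account(email=key, name=name)
            self._by_email[key] = account
        account.providers.add(provider)
        return account
```

Signing in once via github and once via patreon with the same email address returns the same Account object; a different email address creates a separate account.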

“self”-Option in Account Settings:
This option is switched on when you create a new account. It makes all search requests use only index entries that the user produced themselves. As a consequence, all search results are empty immediately after a log-in until you start a crawl. With the new accounts you can offer your own search portal content!

Share your search portal
The address of your search portal after you have logged in contains a user-id. You can share this address and other users can then use your search index without logging in!

Orbiter · 25 September 2022 17:16

The final milestone of the Searchlab Project is reached! Now

you can log in with your github sponsor account or your patreon account and the searchlab is able to read your sponsoring level and assign you the proper Service Level.

everyone can read platform statistics for User Activity, Index Size and Crawling Activity

The platform is currently growing by itself: here and there people are using the opportunity to start a web crawl. There is a small number of users registered to create and provide their own index.
I hope that the Searchlab will grow with users who find this useful.

Integration with YaCy

Searchlab is made out of YaCy technology, which was implemented in the YaCy Grid services. The Searchlab uses the same APIs for crawl starts and index retrieval. There should be a chance to interconnect both projects in a proper way. I will work on this as well!

Sponsoring

The next weeks and months of my time will be filled with bugfixes and activities to support the platform while it is growing. At a certain point it will be necessary to buy a hosted Elasticsearch account outside my own server. To finance this, I ask the Searchlab and YaCy users to sponsor me via my Patreon account.

Orbiter · 26 October 2022 09:05

Final-plus-one Milestone!!

Just this week I added a major element to the searchlab to prevent a lock-in situation there: you can export the search index created at the searchlab into a jsonlist file and import it again into your own YaCy instance!

The whole process is simple and beautiful and I show it in this new YaCy tutorial!

Highlights:

export and import is done as an asynchronous task so it is possible to do this with even very big index files
while the export/import is running there is a monitor which shows the progress of the export
exported dump files are simple jsonlist files (files where each line is a json document) that are stored in the searchlab built-in asset store
the compatibility of search index documents is actually simple because the searchlab is using almost the same index schema as legacy YaCy.
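Because a jsonlist file holds one json document per line, such a dump can be processed line by line without loading the whole file, which matters for very big index files. A minimal Python sketch of a generic reader; the function name is illustrative and nothing is assumed about the fields inside the documents:

```python
import json


def read_jsonlist(path):
    """Yield one parsed json document per non-empty line of a jsonlist file."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"{path}:{line_number}: invalid JSON: {e}") from e
```

For example, sum(1 for _ in read_jsonlist("dump.jsonlist")) counts the lines of a downloaded dump while holding only one document in memory at a time; dump.jsonlist is a placeholder file name.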

I hope this encourages everyone to try out the searchlab because you can always take your index created there and move to your own YaCy instance.