Idea for Hugging Face results export

YaCy users, I have an idea: export curated datasets directly from YaCy searches to Hugging Face!
This would make it possible to build custom datasets for specific needs :white_check_mark:

See also yacy/yacy_expert on GitHub ("A search engine which will answer to all questions"). I still need to review that in detail.


Publishing YaCy index dumps as datasets on Hugging Face is in general a good idea. I doubt that we need a built-in export-to-Hugging-Face feature for that, but a well-documented process would be a good first step.

Maybe we should try a proof-of-concept example with a dataset that has potential to be beneficial for a large audience. Here are some ideas:

  • Wikipedia Dataset
  • News Articles Dataset
  • Scientific Papers Dataset: medicine, physics, computer science, etc.

We may just need a how-to page on yacy.net that the dataset README can refer to.

I made some experiments here to find out what it takes to publish on Hugging Face. It’s not as simple as pushing to a git repository; some extra tasks are required.

The test dataset I created there is a repository containing all German laws, about 110,000 paragraphs:

I used a document format that makes it easy to read the data into a Solr index using a Solr-standard bulk index import format (see the repository for instructions). I also made the schema of the JSON files similar to the YaCy index format. With some small changes in the YaCy code it is possible to import these files directly into a local YaCy instance.
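As a rough illustration of splitting an index dump into numbered files, here is a minimal sketch. It assumes one JSON document per line, which is what the .jsonl extension suggests; the real files may carry additional lines required by the bulk import format. The helper `write_shards` and the field names `url_s`, `title` and `text_t` are hypothetical placeholders, not the actual schema (see the repository instructions for that). Only the `_N_of_M.jsonl` naming follows the files in the dataset.

```python
import json
import math
from pathlib import Path


def write_shards(records, out_dir, prefix, shard_count):
    """Write records as JSON Lines, split across shard_count numbered files.

    Hypothetical helper for illustration; not part of YaCy or Hugging Face.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Number of records per file, rounded up so nothing is dropped.
    per_shard = math.ceil(len(records) / shard_count) if records else 0
    paths = []
    for i in range(shard_count):
        chunk = records[i * per_shard:(i + 1) * per_shard]
        path = out_dir / f"{prefix}_{i + 1}_of_{shard_count}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in chunk:
                # One JSON object per line; ensure_ascii=False keeps umlauts readable.
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Placeholder records with assumed field names.
    docs = [
        {"url_s": f"https://example.org/law/{n}", "title": f"Paragraph {n}", "text_t": "..."}
        for n in range(1, 6)
    ]
    for p in write_shards(docs, "out", "example_index_bulk", 2):
        print(p)
```

Keeping each file at a moderate size makes it practical to import them one at a time.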

To import the files in YaCy, copy a link to one of the jsonl files, for example
https://huggingface.co/datasets/orbiter/bundestag_gesetze_index_bulk_20240507/resolve/main/bundestag_gesetze_index_bulk_1_of_4.jsonl?download=true
and paste it into the “JSON List Index Dump File Import” servlet in YaCy at /IndexImportJsonList_p.html. Put the link into the “URL” input field and click “Import JsonList File”. Be patient: your peer then downloads the >50 MB file from Hugging Face and starts an import process. Repeat this for all four files.
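To avoid copying each link by hand, the four URLs can be generated with a few lines of Python. This is only a sketch: it assumes the remaining files follow the same `_N_of_4.jsonl` naming as the example link above, and `dump_urls` is a hypothetical helper name.

```python
# Base of the example link above, up to the file name.
BASE = (
    "https://huggingface.co/datasets/orbiter/"
    "bundestag_gesetze_index_bulk_20240507/resolve/main"
)


def dump_urls(total=4):
    """Return the download URLs for all parts, assuming the _N_of_M naming."""
    return [
        f"{BASE}/bundestag_gesetze_index_bulk_{i}_of_{total}.jsonl?download=true"
        for i in range(1, total + 1)
    ]


if __name__ == "__main__":
    # Print one URL per line, ready to paste into the "URL" input field.
    for url in dump_urls():
        print(url)
```

Each printed URL can then be pasted into the import servlet, one after the other.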

This is only possible with the latest YaCy version from https://release.yacy.net/ because the importer received small changes to support the jsonl format.

While this is only a first test, we can build upon it and find out how to make the process smoother.