I ran some experiments here to find out what it takes to publish on Hugging Face. It is not as simple as pushing to a git repository; some extra tasks are required.
The test dataset I created there is a repository containing all German laws, about 110,000 paragraphs.
I used a document format that makes it easy to load the data into a Solr index using Solr's standard bulk index import format (see repository for instructions). I also made the schema of the JSON files similar to the YaCy index format. With some small changes in the YaCy code it is possible to import these files directly into a local YaCy instance.
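To get a feel for what such a file looks like to a program, here is a minimal Python sketch of reading JSON Lines data. It assumes each line holds one standalone JSON object, as the .jsonl extension suggests. The field names in the sample are made up for illustration; the real ones are defined by the files in the repository.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical two-document sample; these field names are NOT taken
# from the real dataset, they only illustrate the line-per-document idea.
sample = io.StringIO(
    '{"id": "1", "title": "Example law", "text_t": "Paragraph 1 text"}\n'
    '\n'
    '{"id": "2", "title": "Another law", "text_t": "Paragraph 2 text"}\n'
)

docs = list(read_jsonl(sample))
for doc in docs:
    print(doc["id"], doc["title"])
```

For a real file you would pass an opened file handle instead of the `io.StringIO` sample, so the documents are streamed line by line rather than loaded all at once.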
To import the files into YaCy, copy a link to one of the jsonl files, for example
https://huggingface.co/datasets/orbiter/bundestag_gesetze_index_bulk_20240507/resolve/main/bundestag_gesetze_index_bulk_1_of_4.jsonl?download=true
and paste it into the “JSON List Index Dump File Import” servlet in YaCy at /IndexImportJsonList_p.html. Put the link into the “URL” input field and click “Import JsonList File”. Be patient: your peer then downloads the file (more than 50 MB) from Hugging Face and starts an import process. Repeat this for all four files.
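If you do not want to copy each link by hand, the four links can be generated with a few lines of Python. This is only a sketch: it assumes the other three files follow the same `_<n>_of_4` naming pattern as the example link above, which I took from that one file name. The function name `dump_urls` is just something I made up for this example.

```python
# Repository name and URL layout copied from the example link above.
REPO = "orbiter/bundestag_gesetze_index_bulk_20240507"
BASE = f"https://huggingface.co/datasets/{REPO}/resolve/main"


def dump_urls(parts: int = 4) -> list[str]:
    """Return one download link per dump file.

    Assumption: the files are named ..._<n>_of_<parts>.jsonl,
    as seen in the single example link.
    """
    return [
        f"{BASE}/bundestag_gesetze_index_bulk_{n}_of_{parts}.jsonl?download=true"
        for n in range(1, parts + 1)
    ]


if __name__ == "__main__":
    # Print the links so they can be pasted into the "URL" field one by one.
    for url in dump_urls():
        print(url)
```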
This only works with the latest YaCy version from https://release.yacy.net/ because the importer needed small changes to support the jsonl format.
While this is only a first test, we can build upon it and find out how to smooth the process.