Self-hosted S3 Buckets for Distributed Data Collection

I recently discovered min.io, a free implementation of the Amazon S3 protocol.
This technology is fascinating, and it also fits a requirement that I implemented in the YaCy Grid technology: there I used an FTP server to collect parsed web pages, which are then indexed later by other processes.
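As a rough illustration, here is a minimal sketch of how a peer could push a parsed document into a self-hosted S3 bucket, taking over the role the FTP server has in YaCy Grid. It uses the MinIO Java SDK; the endpoint, credentials, bucket and object names are placeholders, not an existing YaCy configuration:

```java
import io.minio.BucketExistsArgs;
import io.minio.MakeBucketArgs;
import io.minio.MinioClient;
import io.minio.UploadObjectArgs;

public class DonationUpload {
    public static void main(String[] args) throws Exception {
        // connect to a self-hosted MinIO/S3 endpoint (placeholder values)
        MinioClient s3 = MinioClient.builder()
                .endpoint("https://s3.example.org:9000")
                .credentials("ACCESS_KEY", "SECRET_KEY")
                .build();

        String bucket = "yacy-donations"; // hypothetical bucket name
        if (!s3.bucketExists(BucketExistsArgs.builder().bucket(bucket).build())) {
            s3.makeBucket(MakeBucketArgs.builder().bucket(bucket).build());
        }

        // upload one parsed web page (a "surrogate" file) into the bucket
        s3.uploadObject(UploadObjectArgs.builder()
                .bucket(bucket)
                .object("parsed/example.org/0001.json")
                .filename("/tmp/0001.json")
                .build());
    }
}
```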

Idea
It would be interesting to bring up the idea of “connected self-hosted S3 storage” as a new p2p sharing layer, which could be a technology layer for a next-generation YaCy version. But here we should take a broader approach to semantic data collection: not only web data, but a wide range of annotated documents. The extended YaCy document target would include any kind of information: images, conversations, lists, databases, etc. Data would need to be annotated, which means each object must be accompanied by a metadata file that describes the content, such as a Dublin Core Metadata Set.
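One possible convention for such annotations, sketched below with the MinIO Java SDK, would be to store every data object together with a sidecar object that carries a Dublin Core description. The sidecar naming scheme ("<object>.dc.json") and the field values are assumptions for illustration only:

```java
import io.minio.MinioClient;
import io.minio.PutObjectArgs;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class MetadataSidecar {
    // hypothetical convention: "<object>.dc.json" holds the Dublin Core description of "<object>"
    static void putWithMetadata(MinioClient s3, String bucket, String key,
                                byte[] payload, String contentType) throws Exception {
        // the data object itself
        s3.putObject(PutObjectArgs.builder()
                .bucket(bucket).object(key)
                .stream(new ByteArrayInputStream(payload), payload.length, -1)
                .contentType(contentType)
                .build());

        // a Dublin Core metadata sidecar describing the object (placeholder values)
        String dc = "{\n" +
                "  \"dc:title\": \"Example Conversation Log\",\n" +
                "  \"dc:creator\": \"anonymous-peer\",\n" +
                "  \"dc:date\": \"2021-03-01\",\n" +
                "  \"dc:type\": \"Text\",\n" +
                "  \"dc:format\": \"" + contentType + "\",\n" +
                "  \"dc:rights\": \"CC0-1.0\"\n" +
                "}";
        byte[] dcBytes = dc.getBytes(StandardCharsets.UTF_8);
        s3.putObject(PutObjectArgs.builder()
                .bucket(bucket).object(key + ".dc.json")
                .stream(new ByteArrayInputStream(dcBytes), dcBytes.length, -1)
                .contentType("application/json")
                .build());
    }
}
```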

Connection to YaCy
Each YaCy peer could designate such a storage location where it places a kind of “Data Donation”. Donations can be parsed web fulltexts (aka “surrogate” files in legacy YaCy, index files in YaCy Grid), domain lists, or even index files such as YaCy shares with its p2p protocol.
In YaCy Grid we have multi-stage document processing (load, parse, annotate/enrich, index) and a broker system that sets up a high-performance computing infrastructure to work through that processing queue. The new S3 bucket approach should be able to serve the same purpose without a broker.
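A sketch of how the broker could be replaced: the object key prefix itself can encode the processing stage, and a worker simply lists one prefix, processes each object, and moves it to the next prefix. The stage names and method below are assumptions, not an existing YaCy Grid API:

```java
import io.minio.CopyObjectArgs;
import io.minio.CopySource;
import io.minio.ListObjectsArgs;
import io.minio.MinioClient;
import io.minio.RemoveObjectArgs;
import io.minio.Result;
import io.minio.messages.Item;

public class StagePrefixQueue {
    // hypothetical stage layout: parsed/... -> enriched/... -> indexed/...
    static void drainStage(MinioClient s3, String bucket,
                           String fromPrefix, String toPrefix) throws Exception {
        Iterable<Result<Item>> listing = s3.listObjects(ListObjectsArgs.builder()
                .bucket(bucket).prefix(fromPrefix).recursive(true).build());

        for (Result<Item> result : listing) {
            String key = result.get().objectName();
            String nextKey = toPrefix + key.substring(fromPrefix.length());

            // ... the actual processing step (parse/enrich/index) would run here ...

            // move the object to the next stage prefix: copy, then delete the original
            s3.copyObject(CopyObjectArgs.builder()
                    .bucket(bucket).object(nextKey)
                    .source(CopySource.builder().bucket(bucket).object(key).build())
                    .build());
            s3.removeObject(RemoveObjectArgs.builder().bucket(bucket).object(key).build());
        }
    }
}
```

For example, a worker could call drainStage(s3, "yacy-donations", "parsed/", "indexed/") periodically; the listing of the prefix then plays the role of the broker's queue.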

Challenges
To give users good control over the content of such S3 buckets, they must have exclusive write rights to their own S3 storage. To set up data distribution, a “mix and share” mechanism must be implemented. The ability to find the S3 bucket that contains specific information must also be shared in a common search index. Finally, the storage paths of objects must be structured in such a way that queued object processing, as in YaCy Grid, remains possible.
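One way to get exclusive write rights while still letting other peers read the donations is the standard S3 bucket-policy mechanism: the bucket owner keeps the access keys private and publishes a policy that allows anonymous read access only. The following is a sketch using the MinIO Java SDK; the policy shown is the common public-read pattern, and whether this is the right access model for YaCy is exactly the open question:

```java
import io.minio.MinioClient;
import io.minio.SetBucketPolicyArgs;

public class PublicReadPolicy {
    // grant anonymous read access; writes stay restricted to the owner's credentials
    static void makeBucketPublicRead(MinioClient s3, String bucket) throws Exception {
        String policy = "{\n" +
                "  \"Version\": \"2012-10-17\",\n" +
                "  \"Statement\": [\n" +
                "    {\n" +
                "      \"Effect\": \"Allow\",\n" +
                "      \"Principal\": {\"AWS\": [\"*\"]},\n" +
                "      \"Action\": [\"s3:GetBucketLocation\", \"s3:ListBucket\"],\n" +
                "      \"Resource\": [\"arn:aws:s3:::" + bucket + "\"]\n" +
                "    },\n" +
                "    {\n" +
                "      \"Effect\": \"Allow\",\n" +
                "      \"Principal\": {\"AWS\": [\"*\"]},\n" +
                "      \"Action\": [\"s3:GetObject\"],\n" +
                "      \"Resource\": [\"arn:aws:s3:::" + bucket + "/*\"]\n" +
                "    }\n" +
                "  ]\n" +
                "}";
        s3.setBucketPolicy(SetBucketPolicyArgs.builder().bucket(bucket).config(policy).build());
    }
}
```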

Call for Ideas

  • What would you like to push from YaCy to such an object storage location?
  • What (other) kinds of information would you like to share publicly (therefore objects with a free license)?

It is a bit of a strange idea for a search engine. I can see only one legitimate use case:
to distribute URL seed lists and blacklists. But there are already plenty of ways to do that outside of YaCy Grid.

Interesting idea. I could imagine a directory of start URLs / domains, structured and tagged so that crawling can be started quickly.