Can (or could) YaCy do conceptual indexing? (or be the search engine for "the semantic web")

Tom_Booth · 20 July 2019 16:44

The best I could do for now would be to change the extension from .prc to .html. I don’t have access to the actual server, wherever it may be. As far as I’m aware, there is no actual “right” MIME type for my made-up .prc file/index/database.

It primarily consists of a list of metadata - URL pairs, or in other words, the search engine’s index.

Changing the extension, presumably, would allow YaCy to find matches to the regex used, but the idea is not just to index the page as containing a match, but to slurp up the metadata for each URL and add it to the peer index.

Tom_Booth · 2 August 2019 07:52

I’m trying to hammer something together:

Probably this wouldn’t work, but I guess it doesn’t hurt to ask.

Suppose I wish to create a vocabulary to access metadata in the head of a document.

For example, I put the file “agriculture.vocabulary” in the /yacy/DATA/DICTIONARIES/autotagging/ folder.

I want to access a controlled language that, theoretically, could exist in some metadata on a web page.

Could I say: agriculture:<meta name=YaCy-code content=AGR>

OR perhaps better:

agriculture:[regex]

BTW the tutorial “Dev:GenericFacets – YaCyWiki.html” (located in www) shows:

An actual vocabulary file would look like this:

This is the content of a file named ‘OperationSystem.vocabulary’

list of terms

Linux=Debian,Ubuntu,Fedora,Mandriva,openSUSE,Mint
Windows=Vista,Windows2000,WindowsXP

But an actual vocabulary file I created through the YaCy interface looks like:

The file was written using a colon rather than an equal sign. I was just wondering if this matters, or does YaCy accept either syntax/format?

My reasoning for all this is that this “faceted” search using simple word substitutions is not explicit enough. I fear that in many or most cases it could only lead to more ambiguity and confusion, and therefore to less relevant rather than more relevant results. Relating “Linux” to, among other things, “Mint” could associate Linux with mint tea, mint candy, mint room freshener, mint scented candles, mouthwash, toothpaste, breath mints, herbal mint, seeds of the mint plant, mint gold, mint coins, mint condition automobiles, mint julep, etc.
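The ambiguity can be illustrated with a minimal sketch. It assumes a naive, case-insensitive whole-word match against the synonym list; this is not YaCy's actual autotagging code, and the function names `load_vocabulary` and `tag_text` are made up for the illustration. Accepting either `=` or `:` as the separator is a convenience of the sketch only and says nothing about which syntax YaCy itself accepts.

```python
import re

# Vocabulary lines copied from the wiki example quoted above.
VOCABULARY = """\
Linux=Debian,Ubuntu,Fedora,Mandriva,openSUSE,Mint
Windows=Vista,Windows2000,WindowsXP
"""


def load_vocabulary(text):
    """Map each lowercased synonym (and the term itself) to its term."""
    synonym_to_term = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Splitting on '=' or ':' is an assumption of this sketch only.
        parts = re.split(r"[=:]", line, maxsplit=1)
        term = parts[0].strip()
        synonyms = parts[1].split(",") if len(parts) > 1 else []
        synonym_to_term[term.lower()] = term
        for synonym in synonyms:
            if synonym.strip():
                synonym_to_term[synonym.strip().lower()] = term
    return synonym_to_term


def tag_text(text, synonym_to_term):
    """Return the set of vocabulary terms whose synonyms occur in the text."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {synonym_to_term[w] for w in words if w in synonym_to_term}


vocab = load_vocabulary(VOCABULARY)
print(tag_text("How to install Ubuntu on an old laptop", vocab))
print(tag_text("Fresh mint tea and mint julep recipes", vocab))
```

Under this naive matching, both sentences end up tagged with the term Linux, although only the first one is about an operating system.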

A Ranganathan type faceted classification code string like:

YaCycode=“SDM50FRMAGR…42N075W.LIB534DOCHMs7yebhd8ao”

is completely unambiguous, and a regex could easily be used to isolate the subject field AGR.

And the subject.vocabulary file could be used to unambiguously translate that, or rather, vice versa.
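A minimal sketch of that idea follows. The field layout of the code string is not spelled out here, so the regex assumes, purely for illustration, that the subject field is the three capital letters immediately after `FRM`. The meta tag name `YaCy-code` is taken from the example above; the class `YaCyCodeParser`, the function `subject_of`, and the lookup table `SUBJECT_LABELS` are hypothetical names. The `…` in the sample string is the elision from the post, left as-is.

```python
import re
from html.parser import HTMLParser


class YaCyCodeParser(HTMLParser):
    """Collect the content of every <meta name="YaCy-code" ...> tag."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name == "yacy-code" and attr.get("content"):
            self.codes.append(attr["content"])


# Assumed layout (not confirmed): subject = three capitals right after "FRM".
SUBJECT_RE = re.compile(r"FRM\s*([A-Z]{3})")

# Hypothetical code-to-term table, standing in for a subject.vocabulary file.
SUBJECT_LABELS = {"AGR": "agriculture"}


def subject_of(code):
    """Return the subject field of a code string, or None if absent."""
    match = SUBJECT_RE.search(code)
    return match.group(1) if match else None


page = (
    '<html><head><meta name="YaCy-code" '
    'content="SDM50FRMAGR…42N075W.LIB534DOCHMs7yebhd8ao"></head>'
    "<body>Growing mint for tea</body></html>"
)

parser = YaCyCodeParser()
parser.feed(page)
for code in parser.codes:
    subject = subject_of(code)
    print(subject, "->", SUBJECT_LABELS.get(subject, "unknown"))
```

Because the subject comes from a fixed position in the code rather than from words in the page body, the word "mint" in the body plays no part in the classification.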

I don’t know if I’m explaining this in a way that can be understood, but it can easily be demonstrated. Well not that easily, but it can be shown to work. (in a centralized database)

I’m just wondering if there are any tools already available in YaCy that might be able to utilize something like this, and have it work in a distributed context.

Tom_Booth · 2 August 2019 17:21

Just to give an idea:

Here is an example of the code/markup requirements for an EVENT using Schema.org (i.e., the metadata standard used by Google):

{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Jan Lieberman Concert Series: Journey in Jazz",
  "startDate": "2025-01-01T19:30",
  "endDate": "2025-01-01T23:00",
  "location": {
    "@type": "Place",
    "name": "Santa Clara City Library, Central Park Library",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "2635 Homestead Rd",
      "addressLocality": "Santa Clara",
      "postalCode": "95051",
      "addressRegion": "CA",
      "addressCountry": "US"
    }
  },
  "image": [
    "https://example.com/photos/1x1/photo.jpg",
    "https://example.com/photos/4x3/photo.jpg",
    "https://example.com/photos/16x9/photo.jpg"
  ],
  "description": "Join us for an afternoon of Jazz with Santa Clara resident and pianist Andy Lagunoff. Complimentary food and beverages will be served.",
  "offers": {
    "@type": "Offer",
    "url": "https://www.example.com/event_offer/12345_201803180430",
    "price": "30",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock",
    "validFrom": "2024-10-20T16:00"
  },
  "performer": {
    "@type": "PerformingGroup",
    "name": "Andy Lagunoff"
  }
}

Example provided by Google:

It includes a few significant facets: type:event, location[address], start date[date]
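As a rough illustration of how those facets could be pulled out of such markup, here is a minimal sketch using only the Python standard library. The JSON is a trimmed copy of the example above; the function name `extract_facets` and the choice of which fields count as facets are assumptions of the sketch, not part of Schema.org or YaCy.

```python
import json

# Trimmed copy of the Event example above.
EVENT_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Event",
  "name": "Jan Lieberman Concert Series: Journey in Jazz",
  "startDate": "2025-01-01T19:30",
  "location": {
    "@type": "Place",
    "name": "Santa Clara City Library, Central Park Library",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "2635 Homestead Rd",
      "addressLocality": "Santa Clara",
      "postalCode": "95051",
      "addressRegion": "CA",
      "addressCountry": "US"
    }
  }
}
"""


def extract_facets(doc):
    """Pick out type, start date and address parts as flat facet values."""
    location = doc.get("location") or {}
    address = location.get("address") or {}
    return {
        "type": doc.get("@type"),
        "startDate": doc.get("startDate"),
        "locality": address.get("addressLocality"),
        "region": address.get("addressRegion"),
        "country": address.get("addressCountry"),
    }


facets = extract_facets(json.loads(EVENT_JSONLD))
for key, value in facets.items():
    print(f"{key}: {value}")
```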

This is wonderful, the fact that this can be accomplished at all. However, I see a few problems.

I think location by street address is a bit cumbersome, as well as less than universal worldwide. What about events in wilderness areas, international waters, a lake, a public park, a desert region, or other places where there is no government-assigned postal address? Some geocode (latitude/longitude) is better, I think, in many ways.
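Schema.org does allow a Place to carry a `geo` property of type GeoCoordinates instead of (or alongside) a postal address. The sketch below assumes that a fragment such as `42N075W` means whole degrees of latitude and longitude, which is a guess about the notation; the function names and the place name are made up for the example.

```python
import json
import re

# Assumed notation: two-digit latitude + N/S, three-digit longitude + E/W.
GEOCODE_RE = re.compile(r"(\d{2})([NS])(\d{3})([EW])")


def geocode_to_latlon(code):
    """Convert e.g. '42N075W' to signed whole degrees, or None if no match."""
    match = GEOCODE_RE.search(code)
    if not match:
        return None
    lat = int(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = int(match.group(3)) * (1 if match.group(4) == "E" else -1)
    return lat, lon


def place_with_geo(name, lat, lon):
    """Build a Schema.org Place that uses coordinates, not a street address."""
    return {
        "@type": "Place",
        "name": name,
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": lat,
            "longitude": lon,
        },
    }


lat, lon = geocode_to_latlon("42N075W")
# "Lakeside clearing" is a hypothetical place with no postal address.
print(json.dumps(place_with_geo("Lakeside clearing", lat, lon), indent=2))
```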

The main problem with this, though, is, as far as I’m aware, all this (above) metadata would have to be hand coded by something of an expert in this kind of metadata. It puts it largely outside the hands of the general public, or your average web designer.

Essentially, this is just one facet type - EVENT

It is not language independent, which is a HUGE drawback.

Don’t get me wrong, it is great. It is the RIGHT idea, but is it going to be used?

One of the biggest and best uses for this sort of metadata could be, or should be, people’s “Me” type pages: Hi I’m Bob and I like cars… Hi I’m Jane and I like cooking…

Well, of course there is Facebook for that sort of thing, wherein lies another problem: it is proprietary, commercialized, and privacy infringing, with a lack of creative control.

But Bob and Jane are not very familiar with HTML or metadata structures so they are locked into using proprietary systems like Facebook rather than designing their own web pages from scratch. It has all become too complicated. Who does that anymore?

Compare that with something like this: YaCycode=SDM50FRM AGR …42N075W.LIB534DOCHMs7yebhd8ao

It has none of the drawbacks described above and encodes far more significant data.

I wrote these programs, when was it? Well, I last updated the Perl script in 2006.

This is merely a bare proof-of-concept effort. No programmers I talked to about it back in the ’90s thought it was possible, so I had to study computer programming in my spare time for about ten years.

I think it is significant that the “backend” consists of two Perl scripts of less than 20kb each.

I would very much love to combine this with YaCy.

I think it is still more-or-less working, but could be vastly improved in many ways.

http://megamapper.com/

Tom_Booth · 4 August 2019 06:09

I have been attempting to do something with this, experimenting with it, but I’m not too sure what I’m doing and haven’t been able to determine whether it is having any effect on search results.

Tom_Booth · 5 August 2019 14:52

I found an online PDF translator:

It looks to have done a fairly good job, with the exception of skipping a few words here and there.

Thanks! There is, for me anyway, much new information to study up on, including some things I have not heard of before, like triplestore.