Synonymes, stemming/lemmatisation

okybaca · 4 December 2023 10:13

recently, i tried to understand how yacy uses synonyms.

one thing is still not clear to me:
are synonyms applied at indexing time or at search time?

the most logical way would be to enrich the query with synonyms at search
time, because indexing synonyms doesn’t add any new information and would
index words that are not even in the original text. in other words, it
increases the index size without bringing any advantage.

some languages are inflected (english, french, german, czech, polish…):
the words “left” and “leaves” are both forms of “leave” and should be
indexed under a single entry. a lot of search engines use stemming or,
better, lemmatisation to reduce each word to its base form for
indexing or search.
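
a tiny python sketch of the difference, for illustration only: the suffix list and the lemma table are made up for this example and are nowhere near a real stemmer or dictionary.

```python
# Toy illustration only: both the suffix list and the lemma table are
# hypothetical and far too small for real use.
SUFFIXES = ("ing", "es", "s")

LEMMAS = {"left": "leave", "leaves": "leave", "leaving": "leave"}


def naive_stem(word):
    """Strip the first matching suffix; knows nothing about irregular forms."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatise(word):
    """Look the word up in a dictionary of inflected forms."""
    return LEMMAS.get(word, word)


for w in ("leave", "leaves", "leaving", "left"):
    print(w, naive_stem(w), lemmatise(w))
```

the suffix stripper maps “leaves” and “leaving” to “leav” but leaves “left” and “leave” untouched, so the forms still end up under different entries. the dictionary lookup puts all of them under “leave”, but a real lemmatiser also has to handle ambiguity: “left” as in “left hand” is not a form of “leave”.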

“exact phrase” search becomes a problem then, since it looks for an occurrence
of exactly the same sentence, while the index only holds the stemmed words.

stopwords are language specific as well and should be treated separately for
each language.
for example, the german article ‘den’ means ‘day’ in czech, where it is a word
that actually carries meaning.

ideal approach imho (for the discussion):

indexing:

content → lang detect → stopword removal (lang specific) → stemming
(lang specific) → indexing of stemmed words

search:

query → lang detect → stopword removal (lang specific) → stemming
→ synonym enrichment (lang specific) → stemming expansion (?) → search
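
the two pipelines above could be sketched roughly like this in python. everything here is a hypothetical stand-in: the stopword, lemma and synonym tables are invented, language detection is assumed to have happened already (the `lang` argument), and none of it is yacy or solr code. the “stemming expansion (?)” step is left out.

```python
# Sketch of the indexing and search pipelines. All tables and helpers are
# hypothetical; language detection is assumed to be done by the caller.
from collections import defaultdict

STOPWORDS = {
    "de": {"den", "der", "die", "das", "und"},
    "cs": {"a", "v", "na", "je"},
    "en": {"the", "a", "of", "and"},
}
LEMMAS = {"en": {"left": "leave", "leaves": "leave"}}
SYNONYMS = {"en": {"leave": {"depart"}}}


def normalise(text, lang):
    """Lowercase, drop language-specific stopwords, map words to a base form."""
    stop = STOPWORDS.get(lang, set())
    lemmas = LEMMAS.get(lang, {})
    return [lemmas.get(w, w) for w in text.lower().split() if w not in stop]


def index_document(index, doc_id, text, lang):
    """Indexing: only normalised words from the text itself are stored."""
    for term in normalise(text, lang):
        index[(lang, term)].add(doc_id)


def search(index, query, lang):
    """Search: same normalisation, then synonym enrichment of the query."""
    hits = set()
    for term in normalise(query, lang):
        # Synonyms are added here, on the query side, not in the index.
        candidates = {term} | SYNONYMS.get(lang, {}).get(term, set())
        for candidate in candidates:
            hits |= index.get((lang, candidate), set())
    return hits


index = defaultdict(set)
index_document(index, "doc1", "she leaves the house", "en")
index_document(index, "doc2", "they depart at noon", "en")
index_document(index, "doc3", "jeden den v Praze", "cs")

print(sorted(search(index, "left", "en")))  # ['doc1', 'doc2']
print(sorted(search(index, "den", "cs")))   # ['doc3']
print(sorted(search(index, "den", "de")))   # []
```

‘den’ is kept and found for czech but dropped as a stopword for german, which is the reason the stopword step has to come after language detection. the sketch simply ORs the query terms together; ranking and phrase search are out of scope.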

also not sure how this would interact with the planned vector search…

Sviatoslav · 4 December 2023 15:08

Being able to do a real exact search would be a huge advantage because the major search engines have lost this capability.

Orbiter · 19 December 2023 16:24

Synonyms and stemming are two different things:

Solr is able to do stemming; however, YaCy can only poorly make use of it.
Synonyms are handled before indexing: the text is enriched with synonyms, provided a matching synonym library is attached. YaCy can do that. I once had a large synonym library file that magically jumped to me from that product with the two ‘oo’… we cannot distribute that. But YaCy can use it for synonym enrichment. Some years ago someone created an alternative library, but I don’t know where it is right now; maybe I can find it somewhere.

okybaca · 9 January 2024 11:25

Yes, a good synonym dictionary is one thing.
I stumbled upon one collection while going through GitHub Issues, but as far as I could assess, it is not of great quality.

Another way could be to generate one from Wiktionary. This Finnish project could be an inspiration. It actually deals with inflection, but Wiktionary contains synonyms as well, at least for some languages.

The question is whether the language is detected before synonym enrichment, since synonyms (and stopwords) are language-specific.

The second question is whether it should be done at indexing time. With a huge synonym dictionary, the index (and its RWI distribution) then grows. The other approach would be synonym enrichment of the query at search time, which seems logical to me and could at least be an option…
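
To make the trade-off concrete, here is a minimal Python sketch comparing both approaches on toy data. The synonym groups and documents are invented for the example, and the code has nothing to do with how YaCy actually stores its index; it only shows where the synonym lookup happens and what that does to the number of postings.

```python
# Hypothetical toy data: index-time versus query-time synonym enrichment.
from collections import defaultdict

GROUPS = [{"car", "automobile", "auto"}, {"fast", "quick", "rapid"}]
DOCS = {"d1": "fast car", "d2": "quick automobile"}


def expand(term):
    """Return the synonym group containing the term, or just the term."""
    for group in GROUPS:
        if term in group:
            return group
    return {term}


def build_index(docs, enrich):
    """Build term -> doc ids; with enrich=True synonyms are indexed too."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            for t in (expand(term) if enrich else {term}):
                index[t].add(doc_id)
    return index


def postings(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(ids) for ids in index.values())


def search(index, query, enrich_query):
    """OR search; with enrich_query=True synonyms are added to the query."""
    hits = set()
    for term in query.split():
        for t in (expand(term) if enrich_query else {term}):
            hits |= index.get(t, set())
    return hits


plain = build_index(DOCS, enrich=False)
enriched = build_index(DOCS, enrich=True)

print(postings(plain), postings(enriched))                   # 4 12
print(sorted(search(plain, "car", enrich_query=True)))       # ['d1', 'd2']
print(sorted(search(enriched, "car", enrich_query=False)))   # ['d1', 'd2']
```

Both variants return the same documents for “car”, but in this toy case the enriched index holds three times as many postings, and all of them would have to be distributed as well.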