recently, i tried to understand how yacy uses synonyms.
one thing is still not clear to me:
are synonyms applied at indexing time or at search time?
the most logical way would be synonym enrichment at search time,
because indexing the synonyms doesn’t add any new information and
would index words that are not even in the original text. in other
words, it increases the index size without bringing any advantage.
some languages are inflected (english, french, german, czech,
polish…): the words “left” and “leaves” both carry the meaning of “leave” and
should be indexed under one word entry. a lot of search engines use
stemming or, better, lemmatisation to bring each word into a neutral base form for
indexing and search.
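to make the difference between stemming and lemmatisation concrete, here is a minimal sketch. the suffix list and the lemma table are made up for illustration (a real lemmatiser would use a full per-language dictionary), and nothing here is taken from yacy itself.

```python
# toy lemma table (hypothetical); a real one would be a per-language dictionary
LEMMAS = {"left": "leave", "leaves": "leave", "leaving": "leave"}


def naive_stem(word: str) -> str:
    """strip a few common english suffixes (very crude, illustration only)"""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatise(word: str) -> str:
    """look the word up in the lemma table, fall back to the lowercased word"""
    return LEMMAS.get(word.lower(), word.lower())


for w in ("left", "leaves", "leave"):
    print(w, "| stem:", naive_stem(w), "| lemma:", lemmatise(w))
```

the crude stemmer produces three different keys ("left", "leav", "leave"), while the dictionary lookup maps all three forms to "leave". irregular forms like "left" are where suffix stripping alone falls short.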
the “exact phrase” search is a problem then, since there we look for an occurrence
of exactly the same sentence, which a stemmed index no longer holds.
stopwords are language specific as well and should be treated separately for
each language.
the german article ‘den’ means ‘day’ in czech, a word which actually holds a
meaning.
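a small sketch of what per-language stopword handling could look like. the stopword lists below are tiny, hypothetical samples, and the language code is passed in by hand (language detection is assumed to happen elsewhere).

```python
# tiny hypothetical stopword lists; real lists would be much longer
STOPWORDS = {
    "de": {"der", "die", "das", "den", "und"},
    "cs": {"a", "je", "na", "se", "v"},
}


def remove_stopwords(tokens, lang):
    """drop only the words that are stopwords in the given language"""
    stop = STOPWORDS.get(lang, set())
    return [t for t in tokens if t.lower() not in stop]


print(remove_stopwords(["ich", "sehe", "den", "hund"], "de"))  # 'den' is dropped
print(remove_stopwords(["dobrý", "den"], "cs"))                # 'den' is kept
```

with one shared list for all languages, the czech "den" would be thrown away together with the german article.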
ideal approach imho (for the discussion):
indexing:
content → lang detect → stopwords removal (lang specific) → stemming
(lang specific) → indexing of stemmed words
search:
query → lang detect → stopwords removal (lang specific) → stemming
→ synonym enrichment (lang specific) → stemming expansion (?) → search
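the two pipelines above could be sketched roughly like this. everything here is an assumption for the sake of discussion: `detect_lang` is a placeholder that always answers "en", the stopword, lemma and synonym tables are toy data, and the "stemming expansion (?)" step is left out. it is not how yacy works internally.

```python
from collections import defaultdict

# toy per-language resources (hypothetical data)
STOPWORDS = {"en": {"the", "a", "of", "on", "is"}}
LEMMAS = {"en": {"left": "leave", "leaves": "leave", "departs": "depart"}}
SYNONYMS = {"en": {"leave": {"depart"}, "depart": {"leave"}}}


def detect_lang(text):
    # placeholder: a real system would call a language detector here
    return "en"


def normalise(text, lang):
    """shared steps: tokenise, remove stopwords, reduce to base forms"""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS.get(lang, set())]
    lemmas = LEMMAS.get(lang, {})
    return [lemmas.get(t, t) for t in tokens]


def index_document(index, doc_id, text):
    """indexing: only base forms of words from the text are stored"""
    lang = detect_lang(text)
    for term in normalise(text, lang):
        index[(lang, term)].add(doc_id)


def search(index, query):
    """search: synonyms are added to the query, never to the index"""
    lang = detect_lang(query)
    hits = None
    for term in normalise(query, lang):
        variants = {term} | SYNONYMS.get(lang, {}).get(term, set())
        docs = set()
        for v in variants:
            docs |= index.get((lang, v), set())
        # every query term (or one of its synonyms) must match
        hits = docs if hits is None else hits & docs
    return hits or set()


index = defaultdict(set)
index_document(index, 1, "the train leaves on time")
index_document(index, 2, "the bus departs on time")

print(sorted(search(index, "train left")))  # [1]
print(sorted(search(index, "depart")))      # [1, 2]
```

the index holds "leave" only for document 1 and "depart" only for document 2, so its size does not grow with the synonym list. the query "depart" still finds both documents because the expansion happens at search time.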
also not sure how this would work with the planned vector search…