Documentation unclear about webgraph vs Reverse Link Index

niadmin · 22 March 2025 10:20

The documentation isn’t clear about the advantages of one over the other. It looks like the reverse link index is the better choice, maybe because it’s newer, but I’m not sure. I understand it takes more resources from the server, but what are the advantages of webgraph, and should I enable it?

What is the difference between Citation Reference (reverse link index) and Webgraph?

Both contain the same thing: the links leading from page to page, which are used to calculate each page’s CitationRank and hence its ‘popularity’.

The only difference is in storage: “Webgraph” is stored in a second Solr core, while “Citation Reference” is stored internally (e.g. DATA/INDEX/freeworld/SEGMENTS/default/citation*).

The number of Solr Webgraph entries is limited to 2147483519, a limit that is reached after several million pages have been indexed. This limitation could be overcome by using a Solr cluster.
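
To get a rough feel for when that limit is reached, here is a small back-of-the-envelope sketch. The value 2147483519 is 2^31 - 1 - 128. The sketch assumes the webgraph core holds one entry per link, and the links-per-page figures are made-up averages for illustration, not measurements; the function name is hypothetical too.

```python
# Entry limit quoted above; equal to 2**31 - 1 - 128.
WEBGRAPH_MAX_ENTRIES = 2_147_483_519


def pages_until_limit(avg_links_per_page: int) -> int:
    """Estimate how many pages fit before the webgraph limit is hit.

    ASSUMPTION: one webgraph entry is written per link found on a page.
    """
    if avg_links_per_page <= 0:
        raise ValueError("avg_links_per_page must be positive")
    return WEBGRAPH_MAX_ENTRIES // avg_links_per_page


# Hypothetical averages, only to show the order of magnitude.
for links in (50, 100, 200):
    print(f"{links} links/page: about {pages_until_limit(links):,} pages")
```

Under these assumptions the limit falls somewhere in the tens of millions of pages, which is in line with the “several million pages” mentioned above.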

okybaca · 24 March 2025 13:44

I wrote that part, and it’s certainly unclear, because I don’t really know either.

I’m trying to write at least some documentation, but I’m just a user as well, so it’s more like describing a black box while discovering it.

Something might turn up by searching this forum; there is also an old-forum archive, if you want to dig. After that, we have to resort to reading the source code and trying to understand how it works…

@orbiter, as the main developer, could certainly help to clarify this, but he has been silent for the last few months…

okybaca · 24 March 2025 13:52

Maybe this helps, from the old forum:

Date: 2013-04-13 17:57:58

OK, this needs a bit of explanation: the new fields must be filled by a web crawl to become functional, and the formula as given above is purely experimental. It treats the number of external links to a web page and the number of different external domains as important, and increases the ranking further if the web page has a low click depth. All values which appear in the formula are computed in a two-pass process:

First, the documents are indexed and a web structure index is generated in parallel. The references and clickdepth values are filled with dummy values, and each document also gets a ‘ready for postprocessing’ flag.

Then, when all crawls are finished, a postprocessing step is performed: all documents with the postprocessing flag are filled with the actual values after a clickdepth computation and a reference count. This can only be done after all crawls, because only then is the information present.

That means that right after the crawl is finished, the ranking formula using these values will not work yet; you must additionally wait until the postprocessing is finished. This can currently only be monitored in the log, not in the web interface. However, this process is pretty fast.

The counting of external references and the clickdepth can be considered something like a ‘poor man’s citation rank’, which can be the basis for a page-rank-like second postprocessing step. Before development of this can start, we need more experience with the current formula.
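
For anyone trying to picture the two-pass process described in that old post, here is a minimal sketch of the idea. It is not YaCy’s actual code: the function names, the field names (`clickdepth`, `references_external`, `external_hosts`, `needs_postprocessing`) and the input format are all assumptions made for illustration, and click depth is simplified to a breadth-first search from a single start URL.

```python
from collections import deque
from urllib.parse import urlsplit


def index_pass(pages):
    """Pass 1: store each document with dummy values and a postprocessing flag,
    and record the link structure alongside.

    `pages` maps url -> list of outgoing link urls (hypothetical input format).
    """
    docs, links = {}, {}
    for url, outlinks in pages.items():
        docs[url] = {
            "clickdepth": -1,            # dummy value
            "references_external": -1,   # dummy value
            "external_hosts": -1,        # dummy value
            "needs_postprocessing": True,
        }
        links[url] = list(outlinks)
    return docs, links


def postprocess(docs, links, start_url):
    """Pass 2: run only after all crawls have finished; fills in real values."""
    # Click depth: breadth-first search over the recorded links.
    depth = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target in docs and target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)

    # Reference count: links arriving from a different host, plus the
    # number of distinct hosts those links come from.
    ext_count = {url: 0 for url in docs}
    ext_hosts = {url: set() for url in docs}
    for source, targets in links.items():
        source_host = urlsplit(source).hostname
        for target in targets:
            if target in docs and urlsplit(target).hostname != source_host:
                ext_count[target] += 1
                ext_hosts[target].add(source_host)

    for url, doc in docs.items():
        if doc["needs_postprocessing"]:
            doc["clickdepth"] = depth.get(url, -1)  # -1 stays if unreachable
            doc["references_external"] = ext_count[url]
            doc["external_hosts"] = len(ext_hosts[url])
            doc["needs_postprocessing"] = False
```

The point of the sketch is only the ordering: between `index_pass` and `postprocess` the documents still carry dummy values, which is why a ranking formula that uses them cannot work until postprocessing is done.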

okybaca · 24 March 2025 13:54

I believe the only difference is in how the references are stored: one uses Solr, the other kelondro. The function should be the same.