Pages getting deleted from index when crawled twice

Hello, I run an independent YaCy instance that I and some others add crawl requests to (I've set up a system where people can call the quickcrawllink API to request pages). An issue has come up: when a request is sent to crawl a page that is already in the index, not only does the crawl fail, but the page gets deleted from the index and cannot be re-added unless I go in and add it manually.
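For context, the system just fires an HTTP request at the peer for each submitted link. Here is a rough sketch of what it does; the servlet path, the single `url` parameter, and the digest-auth credentials are assumptions from memory, so adjust to your own setup:

```python
# Sketch of how a crawl request is submitted to the peer.
# Assumptions: the servlet is QuickCrawlLink_p.xml, it accepts a "url"
# parameter, and the admin pages are protected with HTTP digest auth.
import urllib.parse
import urllib.request

YACY_HOST = "http://localhost:8090"   # local peer (assumption)
ADMIN_USER = "admin"                  # placeholder credentials
ADMIN_PASS = "changeme"

def request_crawl(page_url: str) -> str:
    """Ask the peer to crawl and index a single page."""
    query = urllib.parse.urlencode({"url": page_url})
    endpoint = f"{YACY_HOST}/QuickCrawlLink_p.xml?{query}"

    # Admin servlets require authentication; digest auth is the default.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, YACY_HOST, ADMIN_USER, ADMIN_PASS)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_mgr)
    )
    with opener.open(endpoint, timeout=30) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    print(request_crawl("https://example.org/some/page"))
```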

The message printed to the log when this happens is:
SWITCHBOARD * postprocessing deactivated: no reference index avilable; activate citation index or webgraph
SWITCHBOARD * postprocessing deactivated: constraints violated
SWITCHBOARD * cleanup post-processed 0 documents

Does anyone have any insight into how to resolve this? I've come up blank on how I would activate the options mentioned, or whether that would even fix it.

Hi and welcome, Izzy!

I’m not sure whether the “postprocessing” log messages are related to this problem. I believe postprocessing just fills the webgraph or citation database with URLs linked from the indexed page… or something along those lines.

I’d rather suspect the “delete before start” option (or something similar) in the crawler settings, see here.

The best way is probably to debug one by one.

What API call do you use to add a URL?

I figured out how to enable the citation index, which seems to have solved the problem. I won’t pretend to understand what it’s doing, but duplicates aren’t getting deleted anymore.

If anyone’s browsing this in the future: go to the config settings and set either of the keys below to true (a quick sketch of applying the change follows the list):
core.service.citation.tmp
core.service.webgraph.tmp (haven’t tried it, but it seems like this one would fix it too)
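For reference, these keys end up in the peer’s properties file, so you can also flip them by hand. This is only a sketch, assuming the standard DATA/SETTINGS/yacy.conf location and that the peer is stopped while you edit it (changing the value through the admin UI’s advanced properties page should amount to the same thing):

```python
# Sketch: set one of the keys directly in the peer's config file.
# Assumptions: the peer is stopped, and the settings file lives at
# DATA/SETTINGS/yacy.conf relative to the YaCy install directory.
from pathlib import Path

CONF = Path("DATA/SETTINGS/yacy.conf")   # adjust to your install path
KEY = "core.service.citation.tmp"        # or core.service.webgraph.tmp

lines = CONF.read_text(encoding="utf-8").splitlines()
found = False
for i, line in enumerate(lines):
    if line.split("=", 1)[0].strip() == KEY:
        lines[i] = f"{KEY}=true"         # overwrite the existing value
        found = True
if not found:
    lines.append(f"{KEY}=true")          # add the key if it is missing
CONF.write_text("\n".join(lines) + "\n", encoding="utf-8")
```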
