Does YaCy index PDFs?

algorimas1 · 25 October 2023 09:08

I want to index some PDF papers, but YaCy handles them inconsistently. I made a simple PDF with a unique search term like “jeopardyfish” and put it on my website. After crawling, I was able to find this PDF. Unfortunately, it did not work with bigger PDF files or eBooks in PDF format. Does YaCy index the content of files or not?

okybaca · 26 October 2023 10:15

Hi & welcome!

Yes, YaCy should definitely be parsing PDFs.

The question is how to debug your case. In the search results, you can click “Metadata” or “Parser” to see exactly what text was indexed from a particular PDF.

Some PDFs contain real text (the easiest way to tell is whether the text is selectable with the mouse in a PDF viewer); others are just scanned images of pages, and then no text is indexed because the PDF contains no actual text. You could OCR the text out of the images stored in the PDF, for example with the open-source Tesseract, but that is computationally heavy and usually needs fine-tuning (and YaCy doesn’t do that, afaik). Sometimes, even if the PDF does contain text, that text is quite broken, as you can see if you print out the PDF text with the Ghostscript tool.
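As a rough way to check what text a PDF actually carries, the Ghostscript step can be scripted. This is only a minimal sketch: it assumes the `gs` binary is installed and on the PATH, it uses Ghostscript's `txtwrite` output device, and the helper names (`gs_text_command`, `count_words`, `extract_text`) are made up for illustration. It is not how YaCy parses PDFs internally.

```python
import re
import subprocess


def gs_text_command(pdf_path):
    """Build a Ghostscript command line that writes the PDF's text layer to stdout."""
    return [
        "gs",
        "-q",                 # suppress the startup banner
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit after the last page
        "-sDEVICE=txtwrite",  # text-extraction output device
        "-sOutputFile=-",     # "-" means stdout
        pdf_path,
    ]


def count_words(text):
    """Count word-like tokens in the extracted text."""
    return len(re.findall(r"\w+", text))


def extract_text(pdf_path):
    """Run Ghostscript (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        gs_text_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `count_words(extract_text("book.pdf"))` on a scanned-image PDF should give a count at or near zero, while a text PDF should give a count roughly in line with its length. Reading through the output of `extract_text` also shows whether the embedded text is broken.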

For a personal e-library I used recoll & xapian in the past, with success.

algorimas1 · 26 October 2023 21:31

Thank you for your explanation. Yes, I know there are image PDFs which can’t be indexed. I followed your advice and looked at the metadata of the PDF that was missing from the search results even though it contains the keyword.

The word count in the metadata was 100 words, but this is an ebook of 210 pages. With four smaller PDFs there was no problem: YaCy found them by their keyword and reported a word count of around 500.
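The mismatch is easy to put in numbers: 100 words over 210 pages is less than half a word per page. A small sanity check along these lines could flag such documents. The function name and the threshold of 50 words per page are arbitrary assumptions for illustration, not anything YaCy provides.

```python
def looks_truncated(word_count, page_count, min_words_per_page=50):
    """Return True when the indexed word count is implausibly low for the page count.

    min_words_per_page is an assumed threshold; tune it for your documents.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    return word_count / page_count < min_words_per_page


# The ebook from this thread: 100 indexed words for 210 pages.
print(looks_truncated(100, 210))
```

A document flagged this way is either a scan without a text layer or one where parsing stopped early, so it is worth checking in the “Parser” view.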

OK, maybe YaCy has a limit when crawling big PDFs, as they would probably cause a heavy workload. My hope was that I could also include books in my search results.

Orbiter · 27 October 2023 10:13

I took the opportunity to upgrade the integrated PDF parser (Apache PDFBox) to the latest version, 3.0.0, and the release notes for that release are really long: Release Notes - ASF JIRA

I did a quick test on my own PDF files ( Index of /material ) and for me it looks fine. It should work for all PDF files which have plain text inside and not only scanned images. However, modern scanners should do OCR themselves and therefore include the plain text as well, which should then show up in YaCy.

Orbiter · 27 October 2023 10:16

Ah, I forgot: there was actually a limit of 200 MB on PDF file size, and I removed it in upgraded pdfbox to 3.0.0 · yacy/yacy_search_server@5ba5fb5 · GitHub

mainly because they changed the API and the limit option was not there any more.