Crawling JavaScript pages

sms · 25 October 2024 09:42

Is it possible to crawl JavaScript-based pages?
I’m thinking about integrating HTMLUnit (https://htmlunit.sourceforge.io/), which needs far fewer resources than a headless browser.
Thanks for any pointers!

okybaca · 2 November 2024 11:04

Seems interesting, would you go for that?

sms · 4 November 2024 09:59

Well, these days many sites are partly or completely dynamic (on the client side). So if there were an easy and fast way to render these pages in YaCy, it would be an interesting feature.
In the past I experimented with HTMLUnit and the performance (compared to headless browsers) was not too bad.

okybaca · 13 November 2024 07:58

Will you try to implement that into YaCy?
The best way, IMHO, would be to let the user choose the crawler via a config option, or at crawl start, so that both options are available…
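A rough sketch of what such a config option could look like. Everything in it is hypothetical: the key `crawler.fetchMode`, the `FetchMode` values and the stub fetchers are invented for illustration and are not existing YaCy settings or classes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

final class FetchModeSketch {
    // Hypothetical config key; not an existing YaCy setting.
    static final String KEY = "crawler.fetchMode";

    // Hypothetical modes: plain HTTP fetch vs. JavaScript rendering.
    enum FetchMode { PLAIN, RENDERED }

    // Read the mode from the config, falling back to PLAIN when the
    // key is missing or holds an unknown value.
    static FetchMode modeFrom(Map<String, String> config) {
        String raw = config.getOrDefault(KEY, "plain").trim().toUpperCase(Locale.ROOT);
        try {
            return FetchMode.valueOf(raw);
        } catch (IllegalArgumentException e) {
            return FetchMode.PLAIN;
        }
    }

    // Stub fetchers that only label the URL. A real PLAIN fetcher would
    // download the page; a real RENDERED one would pass it through a
    // JavaScript-capable engine such as HTMLUnit.
    static UnaryOperator<String> fetcherFor(FetchMode mode) {
        switch (mode) {
            case RENDERED:
                return url -> "rendered:" + url;
            default:
                return url -> "plain:" + url;
        }
    }

    public static void main(String[] args) {
        FetchMode mode = modeFrom(Map.of(KEY, "rendered"));
        System.out.println(fetcherFor(mode).apply("https://example.org/"));
    }
}
```

A per-crawl choice at crawl start could go through the same `modeFrom` lookup, fed with the start parameters instead of the global config.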

sms · 13 November 2024 09:36

Maybe my question was not clear: can the current YaCy crawl pure JavaScript pages?
The use of HTMLUnit for this feature was just an idea.

okybaca · 13 November 2024 09:44

Not sure, but I rather think it can’t.

An HTMLUnit implementation would be nice.