Is it possible to crawl JavaScript-based pages?
I’m thinking about integrating HTMLUnit (https://htmlunit.sourceforge.io/), which needs far fewer resources than a headless browser.
Thanks for any pointers!
Seems interesting, would you go for that?
Well, these days many sites are partly or completely dynamic on the client side. So if there were an easy and fast way to render these pages in YaCy, it would be an interesting feature.
In the past I experimented with HTMLUnit and the performance (compared to headless browsers) was not too bad.
Will you try to implement that in YaCy?
The best way, IMHO, would be to let the user choose the crawler via a config option or at crawl start, so that both options are available.
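To make the idea of a selectable crawler a bit more concrete, here is a minimal Java sketch of how the choice could work, with a global config default and an optional override at crawl start. Everything in it is hypothetical: the `PageLoader` interface, the two loader classes, and the `crawler.loader` key are names made up for illustration and are not part of YaCy. The loaders only return placeholder strings; a real rendering loader would call HTMLUnit there.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: none of these names exist in YaCy.
class CrawlerSelection {

    /** Strategy interface: turns a URL into the HTML that should be indexed. */
    interface PageLoader {
        String name();
        String load(String url);
    }

    /** Placeholder for a plain HTTP fetch that does not execute scripts. */
    static final class PlainLoader implements PageLoader {
        public String name() { return "plain"; }
        public String load(String url) {
            return "<!-- raw HTML of " + url + " -->";
        }
    }

    /** Placeholder for a loader that would render the page (e.g. with HTMLUnit) first. */
    static final class RenderingLoader implements PageLoader {
        public String name() { return "htmlunit"; }
        public String load(String url) {
            return "<!-- rendered DOM of " + url + " -->";
        }
    }

    // Assumed config key, invented for this example.
    static final String CONFIG_KEY = "crawler.loader";

    /**
     * Picks a loader. A non-null per-crawl override wins; otherwise the
     * global config value is used, falling back to "plain".
     */
    static PageLoader select(Properties config, String perCrawlOverride) {
        String choice = perCrawlOverride != null
                ? perCrawlOverride
                : config.getProperty(CONFIG_KEY, "plain");
        switch (choice.trim().toLowerCase(Locale.ROOT)) {
            case "htmlunit":
                return new RenderingLoader();
            case "plain":
                return new PlainLoader();
            default:
                throw new IllegalArgumentException("Unknown loader: " + choice);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(CONFIG_KEY, "plain");

        PageLoader globalDefault = select(config, null);    // from config
        PageLoader thisCrawl = select(config, "htmlunit");  // chosen at crawl start

        System.out.println(globalDefault.name() + " / " + thisCrawl.name());
    }
}
```

Keeping the plain loader as the default would leave existing crawls unchanged, and the more expensive rendering would only be paid for crawls that ask for it.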
Maybe I was not clear with my question: can the current YaCy crawl pure JavaScript pages?
The use of HTMLUnit for this feature was just an idea.
I’m not sure, but I don’t think it is.
An HTMLUnit integration would be nice.