Using YaCy as an alternative to Google CSE for searching across Twitter profiles

Hi, I follow a bunch of scientists and PhDs on Twitter, and I use Google CSE (now called Programmable Search Engine) to create a little custom search engine that searches across these accounts (link). Unfortunately, it only allows 10 URLs max.

I’ve seen that YaCy is pretty powerful for global search, and I want to use it for this custom search, but when I crawled https://twitter.com/username/, it:

  1. crawled far too deep, indexing content from other profiles,
  2. indexed 20+ language versions of the same URL (with the ?lang=XX parameter),
  3. indexed garbage like login pages, Twitter’s TOS, etc.

I also tried the advanced crawler with the crawl depth set to 2. It’s a bit better, but it still indexes the other language versions and the garbage pages. URL patterns don’t seem to work either.

My objective is this: I want YaCy to index all the users’ tweets, quote tweets, and reply tweets in the English language (i.e., index every https://twitter.com/username/status/*), but I’m not quite sure how to make YaCy do that.

Please help. Thanks

I think what you’re looking for is to crawl only the subpaths of the starting URL: https://twitter.com/username/status/

Go to Load Web Pages, Crawler > Site Crawl Start section > Path parameter.
There, you can choose to load only files in a sub-path of the given URL, so that YaCy indexes only https://twitter.com/username/status/*
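
If the sub-path option alone still lets the ?lang=XX duplicates through, a stricter URL pattern might help. Here is a rough sketch in Python for trying such a pattern against sample URLs before pasting it into a crawler filter field. The account list and the helper names (build_status_pattern, is_wanted) are made up for illustration, and I'm assuming the filter is compared against the whole URL. YaCy's filters use Java regular expressions, but this simple pattern should behave the same way there.

```python
import re

# Hypothetical account names; replace with the profiles you follow.
USERNAMES = ["username"]


def build_status_pattern(usernames):
    """Build a regex that matches only tweet permalinks of the given accounts."""
    names = "|".join(re.escape(name) for name in usernames)
    # Scheme, host, account, /status/, numeric tweet ID, optional trailing slash.
    # No query string is allowed, so ?lang=XX variants are rejected.
    return re.compile(r"https://twitter\.com/(?:%s)/status/\d+/?" % names)


def is_wanted(url, pattern):
    """Return True only if the whole URL matches the pattern."""
    return pattern.fullmatch(url) is not None


pattern = build_status_pattern(USERNAMES)

samples = [
    "https://twitter.com/username/status/1234567890",
    "https://twitter.com/username/status/1234567890?lang=de",
    "https://twitter.com/login",
    "https://twitter.com/other/status/1",
]
for url in samples:
    print(is_wanted(url, pattern), url)

# The pattern string itself is what would go into a filter field.
print(pattern.pattern)
```

One caveat: the crawler still has to fetch the profile page itself to discover the status links, so a pattern this strict is probably better suited to an indexing filter than to the crawl filter, if your YaCy version offers the two separately.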
