I’ve been running YaCy on and off for quite some time. Every time I’ve installed a new node I had to search for domains to start the crawl with all over again.
Did you really crawl the stuff yourself? Quite a few of them time out, and there are tons of subdomains, especially in the porn sector.
I downloaded similar data from domainlists.io. To filter out the ones that actually have content behind them, I use “subbrute” and “massdns” together with a Perl script:
# Parses massdns output (the dig-style format with ";; ANSWER" headers), read from
# stdin or file arguments, and prints the domain plus its IP, tab-separated, for the
# first A record in each answer section.
while (<>) {
    if (/^;; ANSWER/) {
        $_ = <>;    # the first record line follows the ANSWER header
        if (/([a-z0-9\-]+)\.(\w+)\. (\d+) IN A (.+)/) {
            print "$1.$2\t$4\n";    # $1.$2 = domain name, $4 = IP address
        }
    }
}
tb0hdan was scanning for domain names, not web services. So the right approach would be to feed the list to nmap to find out which servers are responding on port 80, and then feed only those to YaCy.
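Something like this would do as a quick pre-filter - just a sketch, not something I actually run; it does a plain TCP connect on port 80 instead of a proper nmap scan, and expects one host name per line on stdin:

use strict;
use warnings;
use IO::Socket::INET;

# Keep only hosts that accept a TCP connection on port 80.
while (my $host = <>) {
    chomp $host;
    next unless length $host;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 3,      # give up quickly on dead hosts
    );
    if ($sock) {
        print "$host\n";    # something is listening, worth feeding to YaCy
        close $sock;
    }
}

It probes hosts one at a time, so for big lists an actual nmap run will be much faster.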
I just filtered all names starting with “www.” and crawled them. Getting pretty good results.
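For what it’s worth, that filter is just a Perl one-liner (the file names here are placeholders):

perl -ne 'print if /^www\./' all-names.txt > www-only.txt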
@zooom Yes, I do. The crawler code itself is open source - https://github.com/tb0hdan/domains-crawler - just the file reader and the TLDs used to configure it are not. There are bugs (as always), but I’m working on getting them fixed.
I’ve used additional sources as crawler input to speed up dataset growth; all of them are listed in the dataset readme.
Regarding subdomains - there are some limits in place, but I still wanted to keep those so that others can do doorway detection. I’m working on an autovacuum process that
will filter out invalid (i.e. expired) domain names.
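The basic idea behind the autovacuum is just to drop names that no longer resolve, roughly like this (a rough sketch of the idea only, not the actual implementation; it reads one domain per line and prints the ones that still have an A record):

use strict;
use warnings;
use Socket qw(inet_aton);

while (my $domain = <>) {
    chomp $domain;
    next unless length $domain;
    # inet_aton() returns undef when the name no longer resolves,
    # which is a cheap first-pass signal that the domain has expired.
    print "$domain\n" if defined inet_aton($domain);
}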
Regarding domainlists.io - I strongly believe that a domain list should be publicly available, not sold.
@TheHolm Yes, your approach with nmap seems to be the best so far.
Here is a funny thing: I once applied to the Deutsche Nationalbibliothek to build a German domain search engine (that was a government request). They did not accept my proposal, but as a reference for a good harvesting start point I submitted a 1095-page, 477119-domain start list as a PDF (a PDF was requested).