Launch of

Thanks, interesting stats.
I have noticed that the number of inactive peers has been growing for a month, while the number of active ones is steady. Is that a monitoring methodology artifact or reality?


Nice to hear that you like the page, even if it still does not show everything I have in my mind. :slight_smile:

You are right. Explanations are very important. I am working on it.

The number of Active and Inactive peers is being calculated, of course.
I decided to call a peer active if it fulfills the following criteria over a 7-day window. The data is collected every hour:

  • The ppm and qph values must have changed
  • The links and words values must have changed, whether positively or negatively

The ppm and qph criteria are quite hard for a peer to meet, because it has to show such a value right at the time my script runs.

If these values are zero the whole time, the peer seems to be inactive.
The problem is that if a peer was crawling a page but happened to have lost those values at the exact moment of sampling, it would still be counted as inactive. Right?
That’s what my script cannot detect. Well, it could if it collected data every second, but I don’t have such resources. :slight_smile:
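For what it’s worth, the classification described above could be sketched roughly like this. This is my reading of the criteria, treating a change in any of the four tracked metrics across the hourly samples as a sign of activity; the record format and names are hypothetical:

```python
from typing import Iterable, Mapping

# Hypothetical sample format: one record per hourly collection run, e.g.
# {"ppm": 0, "qph": 0, "links": 1200, "words": 54000}
def is_active(samples: Iterable[Mapping[str, int]]) -> bool:
    """Classify a peer as active if any of ppm, qph, links or words
    changed at least once across the sampled window (7 days of hourly
    records). A peer whose metrics never move looks inactive."""
    samples = list(samples)
    if len(samples) < 2:
        return False  # not enough data points to observe a change
    for key in ("ppm", "qph", "links", "words"):
        values = [s[key] for s in samples]
        if any(v != values[0] for v in values[1:]):
            return True
    return False
```

As the post notes, this only sees changes at the sampling instants, so a peer that is busy between two samples but idle at both can still be misclassified.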


Hey @akdk7, that’s fantastic! Nicely done!
Tweeted it.


Hi @Orbiter. Thanks a lot. :slight_smile:

For a few days now I have been collecting the “Index Browser” page from peers that allow a connection from the internet. So far, my script has been able to connect to a total of 221 peers and 13,728 domains. It’s quite interesting how many peers are reachable from the free world.
The overview page shows the top 5 of domains/links, tlds/links and peers/domains. You can also browse through the complete list of domains here.

Interesting, this brings up one element that fits into a “data collection idea” I have had in mind for some time. Here is my posting about it: Self-hosted S3 Buckets for distributed Data Collection
Your collection of domains would be one piece that can be shared in that S3/minio place.

That is an interesting idea @Orbiter. I’m curious about the results.

I’ve released a big update. You can expand each row to see more details, for example the peers that are crawling the address.
I also changed the method of loading the data from the server. It should be a lot faster now, as it only preloads a small part of the complete list.
If you’d like to see more or other details, let me know.
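The “preload only a small part of the list” approach amounts to server-side pagination. A minimal sketch of the idea, using SQLite and made-up table/column names for illustration:

```python
import sqlite3

PAGE_SIZE = 50  # hypothetical page size

def fetch_page(conn: sqlite3.Connection, page: int) -> list[tuple]:
    """Return one page of the domain list instead of the whole table,
    so the client only loads a small part of the complete list at a time."""
    return conn.execute(
        "SELECT name, links FROM domains ORDER BY links DESC "
        "LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
```

The client then requests further pages on demand (e.g. when a row is expanded or the user scrolls), which keeps the initial load small regardless of the total list size.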

@akdk7 is not responding but is responding…

Yay, it’s up again! \o/

Are you sure you see all crawled domains of every instance if you do not own the admin pw?


Yep, had some cert issues, but I could fix it. :slight_smile:

I can only crawl instances that are reachable from my server. And the page that contains the crawled URLs is not password-protected. At least I haven’t found an option to do so.


Knowing which domain really provides valuable content is a tricky task. You have to crawl a domain several times and save the status each time. If you only feed a list of domains into YaCy, crawl it, and later export the domain list and compare what you’ve got, you will notice that you lose a lot of domains due to timeouts or other reasons. I am working on an external database mechanism which collects the domains in MySQL and counts how often a domain was available and delivered proper content.
The other issue is the stability of YaCy. Crawling lists of > 100K domains has never succeeded for me. After a restart the status is undefined and you have to start again.
I now select only a few thousand domains which have not been crawled for a longer time, and periodically feed the domain export from YaCy back in, which increases each domain’s “still alive” counter by 1.
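The counting step described above could look roughly like this. This is a minimal sketch of the idea, using SQLite as a stand-in for the MySQL database mentioned in the post; table and column names are made up:

```python
import sqlite3

def record_export(conn: sqlite3.Connection, exported_domains: list[str]) -> None:
    """Each time a domain appears in a YaCy domain export, bump its
    "still alive" counter; domains seen for the first time start at 1."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS domains ("
        "  name TEXT PRIMARY KEY,"
        "  still_alive INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO domains (name, still_alive) VALUES (?, 1) "
        "ON CONFLICT(name) DO UPDATE SET still_alive = still_alive + 1",
        [(d,) for d in exported_domains],
    )
    conn.commit()
```

Domains with a high counter have repeatedly survived a crawl-and-export round trip, which is a reasonable proxy for “was available and delivered proper content”.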



I’ve decided to make a big change to the data collection process. At the moment the data is collected every hour. But I’ve noticed that the calculation process takes up to 5 minutes in total. During that time the page is not available, which I hate, of course. :slight_smile:
So I decided to reduce the collection interval from once an hour to once a day.

What does that mean for you if you are running a public YaCy instance?
The data collection process will run at 0:00 (TZ=Europe/Berlin) from now on.
That means I had to delete a lot of data from the database (peers, domains, urls). Currently, the main site is still collecting every hour, but I have set up a new server with a new database that uses the new strategy (once a day). I want to let the process stabilize over the next days while I check for unwanted side effects.
This information might be interesting if you want to share your domain list only for this purpose. You could open the port in the firewall via a cronjob, for example.
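The cronjob idea could be sketched like this, assuming an iptables-based firewall and YaCy on its default port 8090 (adjust to your setup; the rules and times here are only illustrative, bracketing the nightly 0:00 Europe/Berlin collection run):

```
# Hypothetical /etc/cron.d entry (server time = Europe/Berlin):
# open the YaCy port shortly before the nightly collection, close it after.
55 23 * * * root iptables -I INPUT -p tcp --dport 8090 -j ACCEPT
30  0 * * * root iptables -D INPUT -p tcp --dport 8090 -j ACCEPT
```

This way the instance stays unreachable from the internet except during the collection window.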

I like your idea of having a huge database of domains to crawl. Keep going. :slight_smile:


Ok, but why does it take so long? Is it caused by YaCy’s response time? Which API?

First, one SQL statement is very complex and joins over 10 tables, each with 5.5 million entries. :slight_smile:
Second, it’s PostgreSQL: it can only use one CPU core per execution of a statement.

Ah ok, then that challenge is not on the YaCy side.
5.5 million entries, why?

No, no, it has nothing to do with YaCy itself.

This number of records is the result of collecting data every hour for 5 months now.
For example: every hour there are 1,500 peers online and visible from one YaCy instance.
1,500 (peers per hour) × 24 (hours a day) × 30 (days a month) × 5 (months since start) = 5.4 million records.

But while talking to you I maybe have a solution in my mind. :slight_smile:

I am observing a funny development in the number of peers recorded on my stats site. During the last week the number has increased significantly. When I have a look into the peers list, I can see a lot of new names with the pattern “agent-[random name]-[w|ufe]-[number]”. Anyone deploying tons of new instances? :slight_smile:
Unfortunately they are not accessible from the internet, so I cannot crawl the index browser…

I’ve been anticipating for some time that before much longer YaCy would likely go viral. There is so much dissatisfaction over mainstream internet platforms of late, people are actively seeking alternatives. (That is how and why I discovered YaCy not that long ago).

Well… I changed the naming schema for pre-set names.

About the number of peers: well, I have been running a bunch of test-peers in a Kubernetes cluster for some time (since sometime this January). But those should not appear here; they are currently only junior peers, and they should just make a spike, not a ramp…

Maybe the installation notes … I enhanced them and added Docker images for ARM / Raspberry Pi