In 2016 I started developing a database structure to store statistics of yacy peers since the original project yacystats.de shut down. But then I had to stop for personal reasons.
From time to time I ran a yacy instance on my private server. Unfortunately, it sent my private internet to the graveyard and I hab shut it down again (damn Fritzbox).
About a month ago I started to develop a new database structure version and had a really good start with the scripts that are fetching the data, importing into the database, creating statistics and so on.
The new site I’ve put together is 2 weeks online now.
Here is what I have:
- The network stats that can be collected from one peer installation that shows current statistics like ppm, qph, links, words etc.
- The “seedlist” with all “public” peers
- And… index browser pages…
The first two things should be clear. Some values and the official names of the peers. I am collecting these pages every hour.
But what I was really interested in over the last two days was the index browser page. It shows all web pages that had been indexed. If you collect this page from every peer you can create an overall index of every website.
I could write a lot more about it but now I’d like to hear your thoughts and maybe ideas!?
Thanks & greetings
Thanks very much for your work and for sharing it with us.
I think the GUI would be more intuitive if you could mouse hover and see a more detailed explanation of each data feature, with a short example to illustrate.
Thanks, interesting stats.
I have notice that number of Inactive peers was growing for an month, while number of active one is steady. Is it an monitoring methodology artifact or reality.
Nice to hear that you like the page, even if it still does not show everything I have in my mind.
You are right. Explanations are very important. I am working on it.
The number of Active and Inactive peers are beeing calculated, of course.
I decided to call a peer active if it fullfills following criteria are based on 7 days. The data is being collected every hour:
- The ppm and qph values had to be changed
- The values of links and words had to be changed to the positive or negative
The ppm and qph values are quite hard to achieve by the peer because it has to have this value right at the time when my script runs.
If these values are sero all over the time the peer seems to be inactive.
The problem is that if it was crawling a page and at the exact same time lost these exact same values it would be still inactive. Right?
That’s what my script cannot detect. Well, it would be able if it collects data every second, but I don’t have such ressources.
Hi @Orbiter. Thanks a lot.
For a few days now I am collecting the “Index Browser” page from peers that allow a connection from the internet. So far, my script was able to connected to a total of 221 peers and 13,728 domains. Makes it quite interesting how many peers are available from the free world.
The overview page shows the top 5 of domains/links, tlds/links and peers/domains. You can also browse through the complete list of domains here.
interesting, this brings up one element that fits into a “data collection idea” that I have in mind for some time. Here is my posting about it: Self-hosted S3 Buckets for distributed Data Collection
Your collection of domains would be one piece that can be shared in that S3/minio place.
That is an interesting idea @Orbiter. I’m curious about the results.
I’ve released a big update. You can expand each row to see more details. For example the peers that are crawling the address.
I also changed the method of loading the data from the server. It should be a lot faster now as it only preloads a small part of the complete list.
If you like to see more or other details let me know.
https://yacy-stats.de is not responding but
https://yacy-stats.de/?page=3 is responding…
Are you sure you see all crawled domains of every instance if you do not own the admin pw?
Yep, had some cert issues, but could fixed it.
I can only crawl instances that are available from my server. And the page that contains the crawled urls is not pw secured. At least I haven’t found an option to do so.
Knowing, which domain really provides valuable content is a tricky task. You have to crawl a domain several times and save the status each time. If you only feed a list of domains into YaCy, crawl it and later export the domain list and compare what you’ve got, you will recognize that you lose a lot of domains due to timeouts or other reasons. I am working on an external database mechanism which collects the domains in MySQL and counts the times a domain was available and delivered proper content.
The other issue is stability of YaCy. Crawling lists of > 100K domains never succeeded for me. After a restart the status is undefined and you have to start again.
I now select only a few thousand domains which were not crawled for a longer time and feed the domain export from yacy periodically which increases the “still alive” counter each time by 1.
I’ve decided to make a big change to the data collect process. At the moment the data is being collected every hour. But I’ve recognized that the calculation process is taking up to 5 minutes in total. During that time the page is not available, what I hate of course.
So I decided to reduce the period of data collecting from once an hour to once a day.
What does that mean to you who are running a public yacy instance?
The data collection process will run at 0 o’clock (TZ=Europe/Berlin) from now on.
That means that I had to delete a lot of data from the database (peers, domains, urls). Currently, the main site yacy-stats.de is still collecting every hour but I set up a new server with a new database that is using the new strategy (one a day). I want to let the process stabilize over the next days and I am checking for unwandet side effects.
This information might be interesting if you want to share your domain list only for this purpose. You could open the port in the firewall by cronjob for example.
I like your idea of having a huge database of domains to crawl. Keep going.
ok, but why does it take so long? Is it caused by YaCy response time? Which API?
First, one sql statement is very complex and combines over 10 tables with each 5,5 million entries
Second, is postgresql. It can only use one cpu core per execution of a statement.
ah ok then that challenge not on YaCy side.
5.5 million entries, why?
No, no, it has nothing to do with yacy itself.
This number of records are a result of collecting data every hour for 5 months now.
For example: Every hour there are 1500 peers online and available from one yacy instance.
1500 (peers per hour) x 24 (hours a day) x 30 (days a month) x 5 (months since start) = 5.4 million records.
But while talking to you I maybe have a solution in my mind.
I am observing a funny development in the amount of peers recorded on my stats site. During the last week the number has increased significantly. When I have a look into the peers list (https://yacy-stats.de/?page=2) I can see a lot of new names with the pattern “agent-[random name]-[w|ufe]-[number]”. Anyone deploying tons of new instances?
Unfortunately they are not accessable from the internet so I cannot crawl the index browser…