2016-08-03

Fedora Statistics: Questions and answers.

I work for Red Hat as a system administrator for the Fedora Project, where I get to do a lot of neat and interesting things daily. One of my tasks is gathering various statistics for the Fedora Project Leader's state of the hat speeches, like the one he has just given at Flock. This means I also get to help answer regular questions on mailing lists and IRC channels like "How many users does Fedora have?", "How many downloads of Fedora are there?", "How does this compare to ?"

All of these questions are where Matthew Miller pulls out a clip from a dinosaur movie of a lawyer being eaten on a toilet, or a similar comedic point. Why? Because raptors live here and will eat you if you do not have a proper escape policy.

The question "How does this compare to ?" is the easiest to answer: "Nothing I give you can be compared to what any other distribution probably says."  This doesn't mean I or they are lying... it just means the terms being used aren't well defined and we are probably using slightly different ones. 

What is a user? Someone who created a Fedora account, did something and never logged in again? Someone who created a Fedora account and logs into FAS daily? weekly? monthly? yearly? Maybe that person who never logs in just answers things on IRC or bugzilla or mailing lists? Maybe the person who logs in daily is just a script that does that because a developer decided to test a cron script and then forgot about it.

What is a download? If I go by raw "GET .*iso .*" in various logs I could say we have had millions of downloads weekly. But anyone who knows weblogs knows that is as fishy as the 1990s website counters and being told "We got N million hits". Those downloads are inflated for several reasons:
  1. Some people mirror everything. They may not install any of it.. but just in case they ever need it.. they have it.
  2. Some people try to be helpful in promoting their OS by downloading the OS over and over again so that any count of downloads will be larger than the next guy's. There are multiple IP addresses which download Fedora isos hourly. No one needs 24 copies of Fedora 8 every day... especially when they had just downloaded 24 copies of Fedora Zod.
  3. Some "web companies" do the same and then send us mailers about how we can see how they have increased our downloads and if we used them we could see even further growth. 
  4. A lot of people use specialized download tools which try to torrent downloads via http. What they do is ask 20-100 times for a mirror and then use each one to download a bit of the file as a speed booster of some sort. This shows up as even more downloads.
  5. Then there are the people who are stuck on NAT over NAT or satellite links. They may show up as a dozen or so IP addresses in their attempt to grab a single ISO.
On the other side, there are multiple people behind a single NAT, so the 200 downloads from IBM are probably not the same system.. but maybe they are? There are ways to get clearer results via various 'fingerprinting' techniques, but I don't think they can really help when you have companies whose basic job is to bump up numbers so you can lie about how well your product is doing on the web. [I am going to avoid commenting on the morality of deep fingerprinting because I am in a rather cranky mood today.] Depending on how one looks at the data, you can keep discounting stuff further and further down until your only answer is that you know you had more than 1 download and fewer than whatever your maximum number of non-spider hits was.

So what does that leave us, with so many "unknown unknowns" and too few "known knowns"? What we have been doing is not looking at downloads at all and instead focusing on actual users who have installed the OS and are using yum or dnf to update their systems. This can give us a rough lower-bound number of 'active' systems. Going through the mirror logs, we create a large number of tuples of (date, ip address, hardware arch, fedora release). We then deduplicate this list to deal with the various users who have cron set up to do a yum update every 10 minutes. It has the bad effect of making the thousands of systems behind the Red Hat NAT be counted as 1 system, but we hope that is made up for by the person doing a yum update over their Verizon phone and showing up as 20 IPs as they get IP shifted every now and then.

[Graph: Fedora OS yum checkins (oldest highest)]

Now again these are 'reasonable' minimum numbers. The N thousands of ARM IoT systems with Fedora on them do not do yum updates so aren't counted. The systems which are hard-configured to use a local mirror aren't counted. And so on and so on.
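
For anyone who wants to see the counting idea in code, here is a rough sketch in Python. It is not the actual script we run; the function name, the sample records and the assumption that the logs have already been parsed into (date, ip address, hardware arch, fedora release) tuples are all made up for illustration.

```python
from collections import Counter


def count_checkins(records):
    """Count unique checkins per (date, release).

    records: iterable of (date, ip, arch, release) tuples. Parsing the
    mirror logs into these tuples is assumed to have happened already.
    """
    # Putting the tuples in a set collapses repeated checkins from the
    # same IP on the same day (e.g. a cron job running every 10 minutes).
    unique = set(records)
    # Each remaining tuple counts as one 'active' system for that day.
    return Counter((date, release) for date, _ip, _arch, release in unique)


# Hypothetical sample data using documentation-only IP ranges.
sample = [
    ("2016-08-01", "192.0.2.10", "x86_64", "24"),
    ("2016-08-01", "192.0.2.10", "x86_64", "24"),  # duplicate from cron
    ("2016-08-01", "192.0.2.11", "x86_64", "24"),
    ("2016-08-01", "192.0.2.10", "armhfp", "24"),  # same IP, other arch
    ("2016-08-01", "198.51.100.7", "x86_64", "23"),
    ("2016-08-02", "192.0.2.10", "x86_64", "24"),
]
counts = count_checkins(sample)
```

Note that the sketch has the same blind spots as the real numbers: everything behind one NAT with the same arch and release is a single tuple, and one roaming phone can show up as many.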

[Caveats:
  1. The above graph is a hack job I created using awk and gnuplot with a 7 day moving average calculated via python pandas. I expect that it could be made prettier or cleaner through various ways. 
  2. The drop off in late 2008 was due to the webservers being mostly off during the security incident of that time.
  3. The hockey stick drop off in late 2014 is due to Fedora dropping support for RC4 encryption and various systems hard coded to use it not being able to check in any more. All of these systems were way past End of Life for each release so they were not getting any updates.
  4. Most releases have a growth pattern of continually growing until the day the next release comes out. That pattern changes for Fedora 8, 14 and 22 which seem to have continued to grow even after they were end of lifed (or close to it). These seem to be due to some VPS, cloud or similar provider continually basing images off of these releases.
]
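
The 7 day moving average mentioned in the caveats is simple enough to sketch. The graph itself was smoothed with python pandas (something like `Series.rolling(window=7).mean()` does this); the version below is a standard-library-only stand-in so it runs anywhere, and the function name and sample numbers are made up.

```python
from collections import deque


def moving_average(values, window=7):
    """Trailing moving average over the last `window` values.

    Returns a list the same length as `values`; positions before the
    window has filled up are None.
    """
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be pushed out of the deque.
            total -= buf[0]
        buf.append(v)
        total += v
        out.append(total / window if len(buf) == window else None)
    return out


# Hypothetical daily checkin counts.
daily = [10, 20, 30, 40, 50, 60, 70, 80]
smoothed = moving_average(daily)
```

A trailing window is only one choice; a centered window would line the smoothed curve up with the raw data differently.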

I think somewhere in this post I lost my original focus. Sorry about that.. I looked at the picture and got all oooh I need to talk about that.
