Using Distrowatch
This article is about the "Hits Per Day" (HPD) score on Distrowatch, what it can be used for and how you can read a lot of different information out of it.
On Distrowatch, http://distrowatch.com/, you can follow the "popularity" of almost any distro of your choice. I put quotes around "popularity" because it is open to question what the score actually means.
When you get to the bottom of it, Distrowatch is an impressive one-man job. It simply presents a lot of Linux (and BSD) distributions - mostly from a factual point of view. Based on a simple template it gives users a basic first-hand view of a distro (how actively it is developed, how up to date it is, etc.), and when users click on a certain distro's information, it counts the number of user hits for that particular distro. The more hits, the more "popular" the distro is.
The HPD statistical score (on the Distrowatch frontpage) doesn't really say much in itself, but it serves as the best "barometer" on Linux distributions on the internet.
No doubt distro maintainers from all over the world are attracted by Distrowatch. Any distro, no matter how big or small, can get its 2 minutes of fame when announcing its new releases on Distrowatch, and no doubt these releases feed extra activity - which increases the HPD score. This way the Distrowatch pages fuel their own success in a clever way.
Users of the data
When thinking about who should read Distrowatch scores and why, the picture can be a bit unclear.
Perhaps users are interested in locating popular distros. I know for a fact that some users only look at the top 10 distros and select a Linux distribution based on this. Personally I would not hesitate to say: look at the top 12. The distros in the top league change position from time to time, but it seems to be (more or less) the same 12 distros that dominate the top. I have compared this with my registrations from last spring. So, what do all the users of "second class" distros use it for? I don't know.
Anyway, when my choice fell on Zenwalk in spring 2005, I didn't consider it a second class distro at all. On the contrary, it felt like a first class match to my needs and demands. I just had to look further down the Distrowatch list of distros to find it. It was named Minislack at the time, a catchy name, and at that point it was positioned in the mid thirties.
From a distro maintainer's point of view, Distrowatch could be an interesting page for monitoring the popularity of their distro, and potentially for planning activities that keep the score up. For this purpose it would be nice to have raw (unsmoothened) data. I mentioned this to the maintainer, Ladislav Bodnar, and he gave me this URL where a file with raw data can be downloaded:
http://distrowatch.com/text/newhpd.csv
When downloading this file I was surprised to see that Distrowatch is watching almost 500 different Linux distributions. Only the top 100 show up in the Distrowatch barometer on the front page.
What the statistics show
When it comes to who will click on a link on Distrowatch and count as an HPD, it is fairly obvious that it will be people searching for information, most likely people who don't already use the given distribution. The Distrowatch HPD therefore shows, as it were, how hard the "accelerator" is pushed down: each HPD is potentially a new user of that particular distro. This picture is mixed with people like myself, who read about all distros and their activities to see what is going on. Sometimes I click on the Linux distro that I prefer, to see what's new, to check whether the info is right, to see whether any new reviews are mentioned, etc. We can safely conclude that the HPD score represents a mix of various factors.
Data pollution risks
It is possible that there are users out there who choose to go onto Distrowatch and, as a daily routine, click on their preferred distro, if for nothing else, then to support the "popularity" of that distro. This pollutes the data, and such a hit is really worthless for that distro: readers of the ranking don't get the information they think they are getting. Only a distro in trouble, needing more popularity (to attract the curiosity of new users), could theoretically benefit from such pollution. This conclusion is based on the thought that if you rank higher on the Distrowatch score list, you are also more likely to attract the interest of new users. One user clicking on Distrowatch on a daily basis doesn't make a difference, though, and I would find it surprising if a community evolved around such an activity.
The worst-case scenario for polluting the data would be a distro that ships with a script which touches the Distrowatch page for that particular distro every day. That way all users would be forced to assist in the apparent increase in popularity on Distrowatch. I don't think users would accept it, but if such a function was hidden well enough, it could go on for a while and create unnaturally high HPD scores (for no purpose at all).
Types of statistics
As mentioned, the HPD score is a meter for new users, like an accelerator measurement - not a measurement of how big the active user base is (like a speed measurement). Speed in this case represents the active user base of a distribution: the users who participate in the development of a distro and therefore generate progress. Measuring "speed" would be a better measure of success, but I don't know of any overview that shows how many active users a given distro has. Each distro has its own homepage and forum page, usually where users can register to become active members. The number of registered users could be a measure of success for a distro. The scores on Distrowatch don't give you this kind of data.
Distrowatch gives you the HPD for 6 months by default, but you can also get 1 month, 3 months and 12 months statistics. The longer the period, the more smoothened your data will be. Distrowatch also gives you the option to look at previous years, like 2005 and 2004.
I think the 12 months statistics are so smoothened that most of what goes on disappears into the dark. They hide the irregularities that a distro maintainer needs for studying the impact of various happenings - but a normal user can read a trend line out of them. If you read the 12 months statistics every month for, e.g., 3 months, you can see whether things are going up or down in the long run (now as opposed to a year ago). They remove seasonal changes since they include the whole year. While following the Distrowatch statistics for Zenwalk I have seen seasonal changes in activity. For young distros the 12 month average stretches over too long a period to give good information. I think Ladislav Bodnar, who maintains the Distrowatch page, realizes this, and this is why the default is the 6 month statistics.
For long-established distros, like Slackware, 12 months can sound insufficient. Slackware has existed for a decade, and I guess there's only so much new that has happened within a year of the lifespan of such a distribution. I also guess there's a natural limit to how high you can score on Distrowatch - I mean, there is a limit on the number of Linux users in the world and therefore a limit on how many of them will find one particular distro of interest (in the Distrowatch way). For high-scoring distros, it is the number of users that is interesting, not so much the "acceleration" (the incoming new users); you'll be more interested in keeping your existing users happy. Again, this is where I make the split between success in getting new users to pay attention to you, versus success in keeping your users happy with your product.
Measuring the market share of Linux distributions is a very hard thing to do, but the barometer found on Distrowatch is a clever alternative.
The 3 months statistics are, from a user's point of view, a bit too detailed to make sense - all distros move up and down in popularity - but for managing "happenings" in the life of a distro, they could be used. I think it is really the 6 months statistics that give a user the right picture of which distros are more or less popular, whereas the 1 month and 3 months statistics can be used by maintainers of a distro, if recorded in detail. It is easier, though, to download the newhpd.csv file and make your own statistical analysis.
Regarding seasonal activity I loaded the entire newhpd.csv file into an OpenOffice spreadsheet and analyzed the activity level across all distros. Seasonal activity on Distrowatch is definitely identifiable and probably reflects the activity level in the Linux world in general.
Notice, I have smoothened the data over 11 days (5 days ahead and 5 days behind, to get the peaks on the right date). This smoothening prevents a noisy picture of raw data (which for completeness is also visible on the graph). From the smoothened data I read:
- The first peak at around 1.17 is located at 8. to 11. April 2005.
- The first big bottom at around 0.8 is located at 29. and 30. June. This bottom and the period around it is obviously during summer vacation. Activity is back to full level (crossing 1.0) on 19. August.
- The second peak above 1.1 is located at 8. and 9. October.
- The second bottom at just above 0.8 (83% activity level) is located at 27. and 28. December 2005. Apparently people are committing their Christmas vacation to their families instead of Linux.
- The third peak above 1.1 is located at 17. to 20. March 2006.
The peaks and dips seem to follow a weekly pattern - for example, the beginning of the week (the weekend or around Monday) could be dominated by a high level of activity, with less activity in the middle of the week (around Wednesday).
The 2005/2006 statistics shouldn't be considered facts. Other years may be different but to me (and probably many others) it sounds plausible that summer vacation time is not the most active period of the year.
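The activity level across all distros can be computed along the following lines. This is a minimal Python sketch, not the spreadsheet used for the graph. It assumes the daily hits have already been read into one list per distro, all of equal length, and it assumes that a level of 1.0 stands for the average over the whole period (the scale is not spelled out above). The name `relative_activity` is made up for illustration.

```python
def relative_activity(hits_by_distro):
    """Return the total daily hits across all distros, scaled so the mean is 1.0.

    hits_by_distro: dict mapping a distro name to a list of daily hits.
    All lists are assumed to cover the same days in the same order.
    """
    series = list(hits_by_distro.values())
    # Add up the hits of every distro, day by day.
    daily_totals = [sum(day) for day in zip(*series)]
    # Assumption: 1.0 corresponds to the average day of the whole period.
    mean_total = sum(daily_totals) / len(daily_totals)
    return [total / mean_total for total in daily_totals]


# Made-up numbers for two distros over three days.
example = {"distro_a": [100, 200, 300], "distro_b": [100, 200, 300]}
print(relative_activity(example))  # [0.5, 1.0, 1.5]
```

A value of 0.8 on this scale would then mean that the day saw 80% of the average activity.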
What does it mean for a maintainer of a distribution? Well, given that we look at 2005 and assuming that the activity level of distros on Distrowatch can be used as a measure for the activity level in the Linux world in general, June and July were a bad period to try and capture some interest for your distro. It would have been an uphill battle. It is much more fruitful to be a visible distro when the activity level in the Linux world is high. During high activity there's a higher chance to capture users for your distribution - on the other hand, activity never stops (completely) in the Linux world, so don't put too much importance on these variations.
At some point my interest in Distrowatch changed from curiosity to a more in-depth view. I am a user of Zenwalk, and also a supporter of that distro. I started to record all the HPDs on a daily basis. It would look like this (in this case data from 1. February 2006 onwards):
| Date | HPD 1-month | HPD 3-month | HPD 6-month | HPD 12-month |
|---|---|---|---|---|
| 01-02-06 | 242 | 248 | 250 | 241 |
| 02-02-06 | 242 | 249 | 250 | 241 |
| 03-02-06 | 243 | 249 | 250 | 242 |
| 04-02-06 | 244 | 249 | 250 | 242 |
| 05-02-06 | 243 | 249 | 250 | 242 |
Unveiling the data
Theoretically, when you register so much information, you should be able to "unsmooth" the data to get the daily HPD score. After all, for each day you get 4 data inputs, so the set of equations should be overdetermined. Reality is a bit different, but it is worth a shot anyway. If not for any other reason, then at least because I get personal pleasure out of fiddling with numbers.
The first hurdle is to register enough data. For every day in a month you get 4 inputs. The one-month statistics start biting their own tail; I call it self-dependence. Let me try to explain. Any smoothened data contains info from the past, so no matter where you start, the past is unknown; only the average for that month is known at the start. You can only guess what happened on a day-to-day basis - but, as you roll ahead day by day during the month, everything basically depends on your starting point.
When you drop in a new day's worth of numbers and guess the HPD for that particular day, one day's score also drops out of your statistics. To make the numbers fit, you can choose to adjust the score either at the end or at the beginning of your 1-month statistical space.
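The self-dependence follows from how a sliding average updates: the sum over the window gains the newest day and loses the oldest one. The Python sketch below illustrates this under two assumptions that are not confirmed by Distrowatch: a 30-day window for the 1-month statistics, and exact (unrounded) averages. The name `newest_day` is hypothetical.

```python
def newest_day(dropped_day, old_average, new_average, window=30):
    """Daily hits entering the window when the average moves from old to new.

    dropped_day: the (guessed) daily score that leaves the window.
    window: number of days in the average; 30 is an assumption.
    """
    # window * average is the sum over the window; the sum changes by
    # (newest day - dropped day) when the window slides one day ahead.
    return dropped_day + window * (new_average - old_average)


# Toy example with a 3-day window: 10, 20, 30 (average 20) becomes
# 20, 30, 40 (average 30) on the next day.
print(newest_day(10, 20, 30, window=3))  # 40
# A guess for the dropped day that is 5 too high gives a newest day
# that is 5 too high as well.
print(newest_day(15, 20, 30, window=3))  # 45
```

An error in the guessed starting values is therefore carried forward unchanged, one window length later, which is why everything depends on the starting point.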
For every day in 3 months you'll end up in the same situation as above, but with 3 months of data. Furthermore, some of the 1 month data will "collide" with the 3 months data. It starts happening after only 2 months of registering data. I call this double dependency. I'd say this is where the data not only depends on its predecessors, but gets involved with a completely different set of data. This is where the fun starts and you cannot (not easily anyway) manage to fit the data by hand. You need computer power to process the data. The reason is that it gets too complicated to regulate your starting conditions (data from 6 months ago until 3 months ago) and predict, some 30 steps later, what the influence will be. I haven't made the computer work yet. Experience with manual fiddling of data should help create the best computer software to do the job.
The double dependency, though complicated, will keep your guesses within limits. After all, the HPD score is an average, rounded into a whole number - so a score of 249 can be anything from 248.500 to 249.499. With double dependency you have much less "play" to make it all work since both the 1-month and 3-months scores must be within limits to be correctly rounded - anything else, and you can safely assume that you're wrong - or you claim that something went wrong in the Distrowatch calculations, which sounds unlikely to me.
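The rounding limits can be turned into a simple test for any guessed series of daily scores. The Python sketch below assumes 30- and 90-day windows for the 1-month and 3-months statistics, and it assumes that a score of 249 covers everything from 248.5 up to, but not including, 249.5. Neither assumption is confirmed by Distrowatch, and the function names are made up for illustration.

```python
def average_is_consistent(daily_hits, reported):
    """True if the mean of daily_hits rounds to the reported whole number."""
    mean = sum(daily_hits) / len(daily_hits)
    return reported - 0.5 <= mean < reported + 0.5


def guess_fits(daily_hits, reported_1m, reported_3m, days_1m=30, days_3m=90):
    """Check a guessed daily series (oldest first) against both averages."""
    if len(daily_hits) < days_3m:
        raise ValueError("need at least days_3m daily values")
    # Both the short and the long window must round to the published score.
    return (average_is_consistent(daily_hits[-days_1m:], reported_1m)
            and average_is_consistent(daily_hits[-days_3m:], reported_3m))


# Made-up guess: 60 days at 250 hits followed by 30 days at 242 hits.
guess = [250] * 60 + [242] * 30
print(guess_fits(guess, reported_1m=242, reported_3m=247))  # True
print(guess_fits(guess, reported_1m=242, reported_3m=249))  # False
```

A guess that passes the 1-month test but fails the 3-months test has to be rejected, which is how the double dependency narrows down the possibilities.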
Anyway, when I was fiddling with the data I noticed that the scores on Distrowatch do something weird. With the accurate data available in newhpd.csv I could not reproduce the scores that Distrowatch was showing in the barometer (on the front page of Distrowatch.com - on the right side). Ladislav Bodnar investigated this, and based on my registrations it looks like a change was made in the calculations around 20. March, which makes the HPD statistics more correct.
Getting started with some data
Okay - time to help you get started with your own data. First, download the newhpd.csv file from Distrowatch. For example do as follows:
wget -t 2 http://distrowatch.com/text/newhpd.csv
The file is big, about 450 kB. Each line contains a distro with a collection of data from the last year. You can filter out the rest with the following command (substitute the name of the distro you are interested in):
sed -n '/^zenwalk/ {p}' ./newhpd.csv > ./zenwalk_newhpd.csv
Each input (the name and each number) is separated by a caret: ^. You can change the file from a single line to one number per line (e.g. for import into a spreadsheet). For example like this:
sed -e 's/\^/\n/g' ./zenwalk_newhpd.csv > zenwalk_"today"_newhpd.csv
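The same filtering and splitting could also be done in a short Python script instead of sed. This is a minimal sketch, assuming (as described above) that each line starts with the distro name followed by caret-separated whole numbers. The function names and the latin-1 encoding are assumptions, not something documented by Distrowatch.

```python
def parse_hpd_line(line):
    """Split one caret-separated line into (name, list of daily scores)."""
    fields = line.strip().split("^")
    name = fields[0]
    # Skip empty fields (for example after a trailing caret).
    scores = [int(field) for field in fields[1:] if field.strip()]
    return name, scores


def load_distro(path, distro):
    """Return the daily scores for one distro, or None if it is not found."""
    prefix = distro.lower() + "^"
    with open(path, encoding="latin-1") as handle:
        for line in handle:
            # Only parse the line that starts with the wanted distro name.
            if line.lower().startswith(prefix):
                return parse_hpd_line(line)[1]
    return None


name, scores = parse_hpd_line("zenwalk^186^172^180\n")
print(name, scores)  # zenwalk [186, 172, 180]
```

Calling `load_distro("newhpd.csv", "zenwalk")` on the downloaded file would then give the list of numbers for that distro in the order they appear in the file.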
Be aware that the file is updated just past midnight (GMT), so give some slack and download it at some other time during the day.
I created an OpenOffice spreadsheet containing every day for an entire year prior to the reading/download of the file, as shown here (just a small excerpt):
| DATE | Daily HPD score |
|---|---|
| 01-02-06 | 186 |
| 02-02-06 | 172 |
| 03-02-06 | 180 |
| 04-02-06 | 161 |
| 05-02-06 | 146 |
This is what the graph looks like for Zenwalk - daily HPD score, and a smoothened set of data.
The smoothened curve is in my case a 7 days average, looking 3 days ahead and 3 days behind - this way the peak appears at the right time. This 7-day average gives a good judgement of the area underneath the graph - the area expresses how much "weight" the event should have.
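A centered average of this kind could be computed as in the following Python sketch. It is an illustration rather than the spreadsheet formula used for the graph. The function name is hypothetical, and shrinking the window at both ends of the series (where fewer neighbours exist) is an assumption.

```python
def centered_average(values, half_window=3):
    """Average each value with up to half_window neighbours on either side."""
    smoothed = []
    for i in range(len(values)):
        # Clip the window at the start and end of the series.
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Small example with one neighbour on each side (a 3-day average).
print(centered_average([1, 2, 3, 4, 5], half_window=1))
# [1.5, 2.0, 3.0, 4.0, 4.5]
```

With `half_window=3` this gives the 7-day average; `half_window=5` gives an 11-day average, looking 5 days ahead and 5 days behind.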
Each peak is related to an event in the history of Zenwalk.
| Peak | Event |
|---|---|
| 1st peak | Beginning of data range, inaccurate data |
| 2nd peak | Release of Minislack 0.4 on 25-03-2005 (peaks 26-03-2005) |
| 3rd peak | Release of Minislack 1.0 on 23-04-2005 (peaks 24-04-2005) |
| 4th peak | Release of Minislack 1.0.1 on 03-05-2005 (peaks 04-05-2005) |
| 5th peak | Release of Minislack 1.1 on 10-06-2005 (peaks 11-06-2005) |
| 6th peak | Article on DesktopLinux on 05-07-2005 (peaks 06-07-2005) |
| 7th peak | (twin peak on 12-07-2005, see above and the followup articles re. OSnews) |
| 8th peak | (very small peak on 25-07-2005 : 282 & 26-07-2005 : 268, reason unknown) |
| 9th peak | Release of Zenwalk 1.2 on 12-08-2005 (peaks that day) |
| 10th peak | Release of Zenwalk 1.3 on 15-10-2005 (peaks that day) |
| 11th peak | Release of Zenwalk Core 2.0 on 27-11-2005 (peaks that day) |
| 12th peak | Release of Zenwalk 2.0.1 on 04-12-2005 (peaks that day) |
| 13th peak | Article on OSnews + brief on LinuxToday re. OSnews article on 29-12-2005 |
| 14th peak | Release of Zenwalk Core 2.1 + article on golem.de on 18-01-2006 |
| 15th peak | Release of Zenwalk 2.2 on 16-02-2006 (peaks on 17-02-2006) |
Based on this information you can see which events had a profound effect on the rise or fall of your distribution. For example, it is surprising to see that the smoothened peak no. 6 goes way above 750 HPD (a 7 day average!) and at the same time spreads wide in time. This particular event is most likely the one that sparked the most interest in the Zenwalk distribution among new users. You can also see how the "valley" between the peaks lies higher afterwards than before.
With many peaks at the beginning of the graph and fewer peaks in the second half, Zenwalk experienced a period where the 12-month average was higher than the 6-month average. Anyway, the fact of the matter is that Zenwalk has been good at keeping up the interest - through a combination of the maturity of the project and, apparently, people being informed about the project and therefore checking out the Distrowatch information. Keeping some activities going on a regular basis is probably a very good idea, whether it be the release of a new version or some other activity that catches people's interest.
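The effect of an eventful first half year on the two averages can be shown with a few lines of Python. The numbers below are invented purely for illustration and are not Zenwalk data; the window lengths of 182 and 365 days and the name `trailing_average` are assumptions as well.

```python
def trailing_average(daily_hits, days):
    """Mean of the most recent `days` values (series ordered oldest first)."""
    recent = daily_hits[-days:]
    return sum(recent) / len(recent)


# Invented year: a busy first half (300 hits per day) followed by a
# quieter second half (200 hits per day).
year = [300] * 183 + [200] * 182

six_months = trailing_average(year, 182)
twelve_months = trailing_average(year, 365)
print(six_months < twelve_months)  # True
```

The 12-month figure still contains the busy early period, while the 6-month figure has already forgotten it.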
I download the latest newhpd.csv file on a weekly basis to update the spreadsheet and follow the progress/pulse of the Zenwalk distribution. Being able to follow the popularity of your distro is an invaluable tool for evaluating your success with certain initiatives.



