| 1 | = Statscollector for libpurple based clients = |
| 2 | |
| 3 | First off, I would like to extend my thanks to all the Pidgin/libpurple developers who have given me this opportunity to work on a GSoC project. |
| 4 | |
| 5 | This project aims to collect useful statistics about users of libpurple-based clients. Because the project is tied to Pidgin, I have mostly focused on Pidgin and Finch, both of which use libpurple. The motivation is twofold: first, to let developers know which features to work on or optimize, and second, to gather some interesting facts about how people use IM services these days. This page is split into sections describing the types of statistics collected, followed by information on the client (plugin) and the server. |
| 6 | |
| 7 | [[TOC(inline, noheading)]] |
| 8 | |
| 9 | == For the crazy and impatient == |
| 10 | |
| 11 | If you just want to see the final result, visit the stats website at [http://stats.pidgin.im]. The plugin source for the 2.x.y Pidgin branch is housed at [http://hg.pidgin.im/soc/2012/sanket/statscollector-2.x.y 2.x.y-plugin], and the server source at [http://hg.pidgin.im/soc/2012/sanket/www-statscollector server]. |
| 12 | |
| 13 | == Statistics collected == |
| 14 | |
| 15 | If you visit [http://stats.pidgin.im] you can see a host of statistics that are currently collected. I will summarize them in the form of a list here: |
| 16 | |
| 17 |  * System information |
| 18 |    * Type of Operating System -- Windows (breakdown), Apple (breakdown), Linux |
| 19 |    * Architecture information |
| 20 |      * Hardware |
| 21 |      * Operating System |
| 22 |      * Pidgin Code |
| 23 |    * Type of processor -- x86, x86_64, ppc, ppc64, etc. |
| 24 |  * Client information |
| 25 |    * Version of libpurple in use |
| 26 |    * UI in use -- Pidgin/Finch (haven't tested with Adium et al.) |
| 27 |  * Protocols |
| 28 |    * Purple Protocols -- jabber/irc/... |
| 29 |    * Average user count for each protocol |
| 30 |    * Breakdown on servers for jabber/irc (see note(1) below) |
| 31 |  * Plugins |
| 32 |    * Count of plugins |
| 33 | |
| 34 | NOTE(1): A breakdown by server can leak private information if the server is not public. For that reason, I am developing a simple hash-based mechanism to determine whether a server is public before accepting raw names. This avoids sharing any private information. |
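
One way such a check could work, as a minimal sketch rather than the actual implementation: the stats server knows the digests of well-known public server names, and a raw name is accepted only when its digest is in that set. The names `PUBLIC_SERVERS`, `digest`, `is_public` and `report_server`, the example hostnames, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical allowlist of public servers; the real list is not given on this page.
PUBLIC_SERVERS = {"jabber.org", "irc.freenode.net"}


def digest(server_name: str) -> str:
    """Hex SHA-256 digest of a normalized server name (SHA-256 is an assumption)."""
    normalized = server_name.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Digests the stats server would know in advance.
PUBLIC_DIGESTS = {digest(name) for name in PUBLIC_SERVERS}


def is_public(server_digest: str) -> bool:
    """True if the digest matches a known public server."""
    return server_digest in PUBLIC_DIGESTS


def report_server(server_name: str) -> str:
    """Report the raw name only for public servers, otherwise a placeholder."""
    return server_name if is_public(digest(server_name)) else "private"
```

With this shape, a private hostname never leaves the client in readable form: only names already known to be public are reported as raw strings.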
| 35 | |
| 36 | == Plugin == |
| 37 | It's a plain old libpurple plugin that does some crazy stuff to collect information about the client (both native and libpurple). You can always look at the source, so I will only mention a few challenges associated with writing the client. |
| 38 | |
| 39 | === OS/Hardware specific information === |
| 40 | The variety of operating systems, such as Windows, Macintosh, Linux (in its myriad flavors) and some crazier ones, makes it difficult to collect common information such as the architecture type or the bitness of the hardware/OS. I had to go through a complicated regime of #ifdefs to complete this task. One interesting observation, though, is that POSIX compliance can generally save the day. In my case, I could classify systems as either POSIX or Windows, much like IE versus the rest of the world :-) |
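
The plugin does this in C with preprocessor checks. Purely as an illustration of the same POSIX/Windows split, and not the plugin's code, the idea can be sketched in Python with the standard `os`, `platform` and `struct` modules. The function name `system_info` and the dictionary keys are hypothetical.

```python
import os
import platform
import struct


def system_info() -> dict:
    """Collect coarse system information, split into POSIX and Windows families."""
    # Anything that is not Windows is treated as POSIX in this sketch.
    family = "windows" if os.name == "nt" else "posix"
    return {
        "family": family,
        "os": platform.system(),        # e.g. "Linux", "Darwin", "Windows"
        "machine": platform.machine(),  # e.g. "x86_64", "ppc64"
        # Pointer size gives the bitness of the running program, which may
        # differ from the bitness of the hardware or the operating system.
        "app_bits": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    for key, value in system_info().items():
        print(f"{key}: {value}")
```

The separate `app_bits` entry mirrors the distinction made in the statistics list between the bitness of the hardware, the operating system and the Pidgin code.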
| 41 | |
| 42 | === Privacy Concerns === |
| 43 | As the plugin runs on the client side, it can potentially collect secret information. No worries, though: you should believe in the disclaimer we are about to flash ;-). Ensuring that ONLY public information is published was an important consideration throughout. For example, in order to track whether the user is enabling the same account twice, we store only the hash of the user's uid instead of the uname@service string. This ensures that we do not store any sensitive information inside stats.xml (the file which contains all stats data)! You should definitely have a look inside stats.xml (it resides inside your pidgin/libpurple home directory, ~/.purple in my case). |
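
The account-tracking idea above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the plugin is written in C, and the hash function it uses is not named here, so SHA-256 and the names `account_hash`, `record_account` and `seen` are hypothetical.

```python
import hashlib

# Digests of accounts that have already been counted.
seen = set()


def account_hash(username: str, service: str) -> str:
    """Digest of the uname@service string; only this digest would be stored."""
    identifier = f"{username}@{service}"
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def record_account(username: str, service: str) -> bool:
    """Return True if the account is new, False if it was already recorded."""
    h = account_hash(username, service)
    if h in seen:
        return False
    seen.add(h)
    return True
```

Enabling the same account a second time produces the same digest, so the duplicate is detected without the readable uname@service string ever being written out.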
| 44 | |
| 45 | |
| 46 | == Server == |
| 47 | The server is basically a collator which collects all the stats.xml files and transforms them into a useful database (we obviously don't want to be working on raw XML). For those interested, it's written in [http://djangoproject.com/ Django] and uses the awesome [http://highcharts.com/ Highcharts] JavaScript library. Thanks, Eion, for the recommendation on the charts library :-)! |
| 48 | |
| 49 | === Processing Stats === |
| 50 | One major challenge for this server was processing the XML files efficiently. The server will ultimately receive a lot of traffic, so rendering information has to be efficient. I followed this workflow: when a stats.xml is submitted, the server breaks the file down and stores it in a database schema. All user queries for date ranges then reduce to the simple form `select * from db where date >= d1 and date <= d2`. MySQL or any other RDBMS is well suited to such queries. I had to make sure that Django's abstraction did not hurt efficiency, because your logic can change the type of query you make -- without you knowi