
Changes between Initial Version and Version 1 of GSoC2012/Statscollector


Timestamp: Aug 18, 2012, 6:01:43 AM
Author: sanket

= Statscollector for libpurple based clients =

First off, I would like to thank all the Pidgin/libpurple developers who gave me this opportunity to work on a GSoC project.

This project aims to collect useful statistics about the users of clients based on libpurple. As libpurple is tied to Pidgin, I have mainly focused on Pidgin and Finch, which both use libpurple. The motivation is twofold: first, to let developers know which features to work on or optimize, and second, to gather some interesting facts about how people use IM services these days. The sections below describe the types of statistics collected, followed by information on the client (plugin) and the server.

[[TOC(inline, noheading)]]

== For the crazy and non-patient ==

For those who just want to see the final result, feel free to visit the stats website at [http://stats.pidgin.im]. The plugin source for the 2.x.y Pidgin branch is housed at [http://hg.pidgin.im/soc/2012/sanket/statscollector-2.x.y 2.x.y-plugin], and the server source at [http://hg.pidgin.im/soc/2012/sanket/www-statscollector server].

== Statistics collected ==

If you visit [http://stats.pidgin.im], you can see the host of statistics that are currently collected. They are summarized in the list below:

 * System information
  * Type of Operating System -- Windows (breakdown), Apple (breakdown), Linux
  * Architecture information
   * Hardware
   * Operating System
   * Pidgin Code
  * Type of processor -- x86, x86_64, ppc, ppc64, etc.
 * Client information
  * Version of libpurple in use
  * UI in use -- Pidgin/Finch (not yet tested with Adium et al.)
 * Protocols
  * Purple Protocols -- jabber/irc/...
  * Average user count for each protocol
  * Breakdown of servers for jabber/irc (see NOTE(1) below)
 * Plugins
  * Count of plugins

NOTE(1): A breakdown by server can leak private information if the server is not public. For that reason, I am developing a simple hash-based mechanism to determine whether a server is public before accepting its raw name. This avoids sharing any private information!

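To make NOTE(1) a bit more concrete, here is a minimal sketch in Python of one way such a hash-based check could work. It is not the mechanism used by the plugin or the server: the choice of SHA-256, the `KNOWN_PUBLIC_SERVERS` list and both function names are assumptions made up for illustration. The idea is that the client submits only a digest of the server name, and the raw name is recovered only when the digest matches a server already known to be public.

```python
import hashlib

# Hypothetical list of well-known public servers (an assumption for this sketch;
# the real list, if any, is not given on this page).
KNOWN_PUBLIC_SERVERS = ["jabber.org", "irc.freenode.net", "chat.facebook.com"]


def server_digest(name):
    """Return the SHA-256 hex digest of a normalized server name."""
    normalized = name.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Lookup table from digest back to the public server name.
PUBLIC_DIGESTS = {server_digest(name): name for name in KNOWN_PUBLIC_SERVERS}


def resolve_public_server(digest):
    """Map a submitted digest to a public server name, or None if unknown."""
    return PUBLIC_DIGESTS.get(digest)
```

A digest for a private host, such as a company-internal server, matches nothing in the table, so its raw name never has to leave the client.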
== Plugin ==
It's a plain old libpurple plugin which does some crazy stuff to collect information about the client (native and libpurple). You can always have a look at the source, so I will only mention a few challenges associated with writing the client.

=== OS/Hardware specific information ===
Operating systems such as Windows, Macintosh, Linux (in its myriad flavors) and some crazy ones make it difficult to collect common information such as the architecture type or the bitness of the hardware/OS. I had to go through a complicated regime of #ifdefs to complete this task. One interesting observation, though, is that POSIX compliance can generally save your day. In my case, I could classify the systems as either POSIX or Windows, much like IE vs. the rest of the world :-)

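The plugin does this classification in C with preprocessor checks. Purely as an illustration of the things being told apart (OS family, machine type, and the bitness of the running code), here is a rough Python analogue using only the standard library. The function name and the dictionary keys are hypothetical and are not taken from the plugin source.

```python
import os
import platform
import struct


def system_summary():
    """Collect coarse OS and architecture facts about the running process."""
    if os.name == "nt":
        family = "windows"
    elif os.name == "posix":
        family = "posix"
    else:
        family = "other"
    return {
        "family": family,                          # POSIX vs. Windows split
        "os": platform.system(),                   # e.g. "Linux", "Darwin", "Windows"
        "machine": platform.machine(),             # machine type as reported by the OS
        "process_bits": struct.calcsize("P") * 8,  # pointer size of this process
    }
```

A 32-bit build can run on 64-bit hardware, which is why the bitness of the running code is reported separately from the machine type.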
=== Privacy Concerns ===
As the plugin runs on the client side, it can potentially collect secret information. No worries, you should believe in the disclaimer we are about to flash though ;-). Ensuring that ONLY public information is published was an important consideration throughout. For example, in order to track whether the user has enabled the same account twice, we store only a hash of the user's uid instead of the uname@service string. This ensures that we do not store any sensitive information inside stats.xml (the file which contains all stats data)! You should definitely have a look inside stats.xml (it resides inside your Pidgin/libpurple home directory, ~/.purple in my case).

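The duplicate-account check described above can be sketched as follows. This is a Python illustration under assumptions, not the plugin's C code: the hash function (SHA-256 here), the function names and the in-memory `seen` set are all hypothetical stand-ins for whatever the plugin actually does.

```python
import hashlib


def account_fingerprint(username, service):
    """Return a SHA-256 hex digest standing in for the uname@service string."""
    uid = f"{username}@{service}"
    return hashlib.sha256(uid.encode("utf-8")).hexdigest()


def register_account(seen, username, service):
    """Record an account by fingerprint; return False if it was already seen."""
    fingerprint = account_fingerprint(username, service)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True
```

Only the fingerprints end up in `seen`, so the same account enabled twice is detected without the readable uname@service string ever being stored.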
== Server ==
The server is basically a collator which collects all the submitted stats.xml files and transforms them into a useful database (we obviously don't want to be working on raw XML). For the interested, it's written in [http://djangoproject.com/ Django] and uses the awesome [http://highcharts.com/ Highcharts] JavaScript library. Thanks, Eion, for the recommendation on the charts library :-)!

=== Processing Stats ===
One major challenge for this server was to process the XML files efficiently, because ultimately it's going to see a lot of traffic and rendering the information should be fast. I have followed this workflow: when a stats.xml is submitted, the server breaks the file down and stores it in a database schema. All user queries for date ranges will then be of the simple form select * from db where date >= d1 and date <= d2. MySQL or any RDBMS is ideally suited for these queries. I had to make sure that Django's abstraction did not screw up the efficiency, because your logic can change the type of query you make -- without you knowing.
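The workflow above (break a submitted file down into rows, then answer date-range queries with a plain select) can be sketched roughly like this. Everything specific here is an assumption: the layout of the sample XML is invented for the example and is not the real stats.xml format, the table and function names are hypothetical, and Python's built-in `sqlite3` stands in for the Django/MySQL setup so that the sketch runs on its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample submission; the real stats.xml layout is not shown on this page.
SAMPLE = (
    '<stats date="2012-08-01">'
    '<protocol id="prpl-jabber" accounts="2"/>'
    '<protocol id="prpl-irc" accounts="1"/>'
    "</stats>"
)


def create_schema(conn):
    """Create a flat table for per-protocol counts, indexed by submission date."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS protocol_stats ("
        "submitted TEXT NOT NULL, protocol TEXT NOT NULL, accounts INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_submitted ON protocol_stats (submitted)"
    )


def store_submission(conn, xml_text):
    """Break one submitted XML document down into rows; return the row count."""
    root = ET.fromstring(xml_text)
    day = root.get("date")
    rows = [
        (day, node.get("id"), int(node.get("accounts")))
        for node in root.findall("protocol")
    ]
    conn.executemany("INSERT INTO protocol_stats VALUES (?, ?, ?)", rows)
    return len(rows)


def accounts_per_protocol(conn, d1, d2):
    """Sum account counts per protocol for submissions with d1 <= date <= d2."""
    cursor = conn.execute(
        "SELECT protocol, SUM(accounts) FROM protocol_stats "
        "WHERE submitted >= ? AND submitted <= ? "
        "GROUP BY protocol ORDER BY protocol",
        (d1, d2),
    )
    return dict(cursor.fetchall())
```

Dates are stored as ISO strings (YYYY-MM-DD) in this sketch, so the range comparison in the WHERE clause orders them correctly as text, and the index on the date column keeps the range scan cheap.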