Statscollector for libpurple based clients
First off, I would like to extend my thanks to all the Pidgin/libpurple developers who have given me this opportunity to work on a GSoC project.
This project aims to collect useful statistics about the users of libpurple-based clients. As it is tied to Pidgin, I have focused mainly on Pidgin and Finch, which both use libpurple. The motivation is, first, to let developers know which features to work on or optimize and, second, to gather some interesting facts about how people use such a widely used IM service these days. I have split this page into sections describing the types of statistics collected, followed by information on the client (plugin) and the server.
For the crazy and non-patient
For those who just want to see the final result, feel free to visit the stats website at http://stats.pidgin.im. The source for the 2.x.y pidgin branch is housed at 2.x.y-plugin and at server.
Statistics collected
If you visit http://stats.pidgin.im you can see a host of statistics that are currently collected. I will summarize them in the form of a list here:
- System information
  - Type of Operating System -- Windows (breakdown), Apple (breakdown), Linux
  - Architecture information
    - Hardware
    - Operating System
    - Pidgin Code
  - Type of processor -- x86, x86_64, ppc, ppc64 etc.
- Client information
  - Version of libpurple in use
  - UI in use -- Pidgin/Finch (haven't tested with Adium et al.)
- Protocols
  - Purple Protocols -- jabber/irc/...
  - Avg user count for each protocol
  - Breakdown on servers for jabber/irc (see NOTE(1) below)
- Plugins
  - Count of plugins
NOTE(1): The breakdown on servers can leak private information if a server is not public. For that reason I am developing a simple hash-based mechanism to determine whether a server is public before accepting raw names. This will avoid sharing any private information!
Plugin
It's a plain old libpurple plugin which does some crazy stuff to collect information about the client (native and libpurple). You can always have a look at the source, so I will only mention a few of the challenges associated with writing the client.
OS/Hardware specific information
Operating systems such as Windows, Macintosh, Linux (in its myriad flavors) and some crazy ones make it difficult to collect common information such as the architecture type or the bitness of the hardware/OS. I had to go through a complicated maze of #ifdefs to complete this task. One interesting observation, though, is that POSIX compliance can generally save your day. In my case, I could classify the systems as POSIX/Windows, much like IE/rest of the world :-)
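To make the POSIX/Windows split concrete, here is a rough sketch in Python. It is not the plugin's code (the plugin is C and relies on #ifdefs); it is only an analogue built on the Python standard library, and the function name and dictionary keys are made up for illustration.

```python
import os
import platform
import struct


def classify_system():
    """Collect coarse system facts, split into a Windows and a POSIX family."""
    # os.name is "nt" on Windows and "posix" on Linux, macOS, the BSDs, etc.
    family = "windows" if os.name == "nt" else "posix"
    return {
        "family": family,
        "os": platform.system(),        # e.g. "Linux", "Darwin", "Windows"
        "machine": platform.machine(),  # e.g. "x86_64", "ppc64"
        # Pointer size of the running process, standing in here for the
        # bitness of the application binary (hypothetical key name).
        "app_bits": struct.calcsize("P") * 8,
    }
```

The point of the sketch is the shape of the decision: one branch for Windows, one for everything POSIX-like, with the finer details filled in by whatever the platform reports.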
Privacy Concerns
As the plugin runs on the client side, it can potentially collect secret information. No worries, you should believe in the disclaimer we are about to flash though ;-). Ensuring that ONLY public information is published was an important consideration throughout. For example, in order to track whether the user is enabling the same account twice, we only store a hash of their uid instead of the uname@service string. This ensures that we do not store any sensitive information inside stats.xml (the file which contains all the stats data)! You should definitely have a look inside stats.xml (it resides inside your pidgin/libpurple home directory, ~/.purple in my case).
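The idea of storing a hash rather than the uname@service string can be sketched as follows. This is a Python illustration, not the plugin's C code; the choice of SHA-256, the "protocol:username" input format, and the function names are all assumptions, since the page does not say which hash or input the plugin uses for accounts.

```python
import hashlib


def account_fingerprint(username, protocol):
    """Return a hex digest that stands in for the raw account identity."""
    # Assumption: protocol and username are joined before hashing so the same
    # username on two different protocols counts as two accounts.
    raw = f"{protocol}:{username}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def count_unique_accounts(accounts):
    """Count distinct accounts while only ever keeping their hashes."""
    seen = set()
    for username, protocol in accounts:
        seen.add(account_fingerprint(username, protocol))
    return len(seen)
```

Enabling the same account twice yields the same digest, so it is counted once, and the raw identifier never has to be written to disk.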
Server
The server is basically a collator which collects all the stats.xml files and transforms them into a useful database (we obviously don't want to be working on raw XML). For the interested, it's written in Django and uses the awesome Highcharts JavaScript library. Thanks Eion, for the recommendation on the charts library :-)!
Processing Stats
One major challenge for this server was to process the XML files efficiently, because ultimately it's going to get a lot of traffic and rendering the information has to be fast. I have followed this workflow: on submission of a stats.xml, the server breaks the file down and stores it in a database schema. All user queries for date ranges are then of the simple form select * from db where date >= d1 and date <= d2. MySQL or any other RDBMS is ideally suited to these queries. I had to make sure that Django's abstraction did not hurt efficiency, because your logic can change the type of query you make -- without you knowing it!
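As a rough illustration of the "store once, then answer with a simple date-range select" workflow, here is a sketch using Python's built-in sqlite3 module as a stand-in for Django on MySQL. The table name, columns and sample values are hypothetical; the real schema is not described on this page.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical one-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stats (submitted TEXT NOT NULL, os TEXT, purple_version TEXT)"
    )
    # An index on the date column keeps range queries cheap as the table grows.
    conn.execute("CREATE INDEX idx_stats_submitted ON stats (submitted)")
    return conn


def add_submission(conn, submitted, os_name, purple_version):
    """Store one already-parsed stats.xml submission as a row."""
    conn.execute(
        "INSERT INTO stats VALUES (?, ?, ?)", (submitted, os_name, purple_version)
    )


def os_counts_between(conn, d1, d2):
    """Count submissions per OS with date >= d1 and date <= d2 (ISO dates)."""
    rows = conn.execute(
        "SELECT os, COUNT(*) FROM stats "
        "WHERE submitted >= ? AND submitted <= ? GROUP BY os ORDER BY os",
        (d1, d2),
    )
    return dict(rows.fetchall())
```

Dates are stored as ISO strings (YYYY-MM-DD) so that plain string comparison matches chronological order. With an ORM in front of the database, it is worth checking the generated SQL to confirm that it still looks like this single indexed range query.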
Ensuring server names in prpl-jabber/irc are public
One problem, very rightly pointed out by elb (Ethan), with displaying the Jabber/IRC breakdown is that it may reveal private servers, which can in turn reveal the identity of users. We don't want that, obviously :-). Also, if a user is running a local server for development purposes, we don't want that either. To solve these problems, the following mechanism is provided:
Plugin
- The plugins will ask for a trusted list of servers from the Stats server
- This list will actually contain md5 hashes
- If the current server is in the list then we can simply put its name in stats.xml, else
- The server is yet to be determined as public
- In both cases, the current stats will count as evidence towards its being public
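The plugin-side steps above can be sketched like this. It is a Python illustration rather than the plugin's C code; the function names and the dictionary layout are hypothetical, and fetching the trusted list from the Stats server is left out (the list is simply passed in as a set of md5 hex digests).

```python
import hashlib


def md5_hex(name):
    """Return the md5 hex digest of a server name."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


def server_entry(server_name, trusted_hashes):
    """Build the record for one jabber/irc server that would go into stats.xml."""
    digest = md5_hex(server_name)
    # The hash is always reported, so every submission counts as evidence.
    entry = {"hash": digest}
    # The raw name is only included when the server is already trusted.
    if digest in trusted_hashes:
        entry["name"] = server_name
    return entry
```

For a server that is not on the trusted list, only the digest leaves the machine, which is what keeps private or local development servers out of the published breakdown.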
Server
- The server will check whether the incoming stats contain only the hash or both the hash and the name
- If only the hash is present, it will increment the confidence of that server being public, else
- If both hash and name are present, it will check that md5(name) == hash and that hash is in trusted_list
- If both conditions are satisfied, the name will be displayed, otherwise not
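The server-side checks can be sketched in the same spirit. This is a minimal Python illustration, not the actual Django code; the class and attribute names are hypothetical, and how a hash eventually gets promoted into the trusted list is not described here, so it is left out.

```python
import hashlib
from collections import Counter


def md5_hex(name):
    """Return the md5 hex digest of a server name."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()


class ServerTally:
    """Track evidence for unknown servers and display counts for trusted ones."""

    def __init__(self, trusted_hashes):
        self.trusted = set(trusted_hashes)
        self.confidence = Counter()  # hash -> hash-only submissions seen
        self.displayed = Counter()   # name -> submissions shown on the site

    def record(self, digest, name=None):
        if name is None:
            # Only the hash was sent: count it as evidence of being public.
            self.confidence[digest] += 1
            return
        # Hash and name were sent: both conditions must hold before the name
        # is allowed to appear in the displayed breakdown.
        if md5_hex(name) == digest and digest in self.trusted:
            self.displayed[name] += 1
```

Checking md5(name) == hash guards against a submission that pairs a trusted hash with some other name, and the membership test guards against a correctly hashed but untrusted name.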