Monday, September 3, 2012

Data analysis: trickier than you'd expect

Well, at least it's trickier than you'd expect if you expected it to be easier than it ought to be.

Now that I'm gathering data, I'm starting to look at ways to analyze it. My first attempt has been to take some off-the-shelf tools and try to see how students are moving through the website. I've got a nice pile of data to look at (535 megabytes of BZip2-compressed log data), so I decided to analyze a day's worth of logs using Statviz. I've come to a couple of conclusions:
  1. I need to put a lot more work into data cleaning and massaging before I'll get useful results out of this data. There is just too much noise in these logs to give useful output.
  2. I need a faster machine! I set Statviz running on my PhD machine, and gave up three days later when it was still processing. I started up a run on another, much faster machine in the house; it took over 24 hours, but it eventually came up with a result.
This was just one day's results - in other words, the website is generating log files faster than I will be able to process them. On the bright side, the entirety of the log data isn't my primary concern - it's the usage patterns of the social tools that concern me, which will involve a smaller subset of the log files, plus the contents of the databases.
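
As a rough sketch of what that cleaning step might look like, here is one way to stream a BZip2-compressed log and keep only the requests that matter. Everything specific in it is an assumption rather than a description of my actual setup: it assumes the logs are in Apache common/combined format, that static assets and non-2xx responses count as noise, and that the social tools live under a hypothetical /social/ path prefix.

```python
import bz2
import re

# Assumption: Apache common/combined log format. The real format may differ.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) ')

# Assumption: requests for static assets are noise for usage analysis.
STATIC_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

# Hypothetical URL prefix for the social tools; replace with the real one.
SOCIAL_PREFIX = "/social/"


def is_relevant(line, prefix=SOCIAL_PREFIX):
    """Return True for successful, non-static requests under the prefix."""
    match = LOG_LINE.match(line)
    if match is None:
        return False  # unparseable line, treat as noise
    path, status = match.group(4), match.group(5)
    path_only = path.split("?", 1)[0]  # ignore the query string
    if path_only.lower().endswith(STATIC_SUFFIXES):
        return False
    if not status.startswith("2"):
        return False
    return path_only.startswith(prefix)


def filter_log(src_path, dst_path, prefix=SOCIAL_PREFIX):
    """Stream a .bz2 log line by line and write the relevant lines out.

    Returns the number of lines kept. The file is never loaded whole,
    so memory use stays flat regardless of log size.
    """
    kept = 0
    with bz2.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if is_relevant(line, prefix):
                dst.write(line)
                kept += 1
    return kept
```

Calling filter_log("access.log.bz2", "social.log") (both file names are placeholders) would leave a much smaller plain-text file to hand to a visualisation tool, which should help with both the noise and the processing time.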

As an example, here's one of the graphs generated from the log files:

It's substantially shrunk down from the original, which is 49680 by 6460 pixels in size.