Saturday, March 1, 2014

R



For the last week I've been playing with a new tool for analysing data - R. It's quite an amazing tool. It's a programming language, but one specialized for vector and matrix mathematics, which makes it ideal for statistical analysis. My initial test data set was ticket tracking data from work. I'd been using Excel, and had a very large, clumsy spreadsheet with the same formula repeated thousands of times in order to calculate which tickets were open at what time. After a few hours of reading the R documentation and experimenting, I was able to turn the simple line graph I had in Excel into a really nice stacked bar graph in R. In a fraction of a second it reads my CSV of 128 tickets, works out from the opening and closing dates how many tickets were open at each point in time, and builds a bar chart showing how many were open, when, and for which of the services we offer (over 60 services are represented).
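
For anyone curious, the open-ticket calculation can be sketched roughly like this. It is a minimal illustration rather than my actual script: the column names (service, opened, closed), the dates and the tiny data frame are all made up for the example, and with a real file the data frame would come from read.csv() instead.

```r
# Hypothetical ticket data: column names and values are invented for
# illustration and are not taken from the real CSV.
tickets <- data.frame(
  service = c("email", "email", "wiki", "vpn"),
  opened  = as.Date(c("2014-02-01", "2014-02-03", "2014-02-02", "2014-02-04")),
  closed  = as.Date(c("2014-02-04", "2014-02-05", "2014-02-03", NA))  # NA = still open
)

# One entry per day in the (assumed) reporting window.
days <- seq(min(tickets$opened), as.Date("2014-02-06"), by = "day")

# Treat tickets that are still open as closing after the window ends.
closed <- tickets$closed
closed[is.na(closed)] <- max(days) + 1

# is_open[i, j] is TRUE when ticket i was open on day j
# (opened on or before that day, and not yet closed on it).
is_open <- outer(seq_len(nrow(tickets)), seq_along(days),
                 function(i, j) tickets$opened[i] <= days[j] & closed[i] > days[j])

# Add up the ticket rows within each service, giving a
# service-by-day matrix of open-ticket counts.
open_counts <- rowsum(is_open * 1, group = tickets$service)
colnames(open_counts) <- format(days)

# A matrix passed to barplot() is drawn as stacked bars, one bar per column.
barplot(open_counts, legend.text = TRUE, las = 2, ylab = "Open tickets")
```

The whole "same formula repeated thousands of times" part of the spreadsheet collapses into the single outer() call, which is the kind of vectorised operation R is built around.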

I was very impressed by R's power and expressiveness. Some things seem a bit harder than they need to be, but that's probably just because I don't know my way around properly yet.

I've now started to work on my log file data. It took quite a while to read in 16 million lines of logs, but now operations across those log file lines are surprisingly quick - in six seconds it'll give me a list of all the IP addresses, and how many accesses came from each (which has led to one surprising, non-PhD related discovery, which I'll need to follow up at work). Some operations seem to take more memory than I have on my machine (which is now 32GB), but so far I've found a different way to do each calculation, so it's not too bad. And I can do things like a sweet logarithmic scale histogram of lengths of user sessions (spoiler: they're mostly short!), with only a few minutes of playing around to figure out what a histogram is.
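
As a rough sketch of those two operations, here is the general idea on toy data. The IP addresses, the session lengths and the variable names are invented for the example, and the real thing runs over millions of rows rather than a handful.

```r
# Hypothetical miniature access log: the addresses are made up.
logs <- data.frame(
  ip = c("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"),
  stringsAsFactors = FALSE
)

# table() counts how often each distinct value occurs;
# sorting in decreasing order puts the busiest address first.
hits_per_ip <- sort(table(logs$ip), decreasing = TRUE)
print(hits_per_ip)

# Hypothetical session lengths in seconds (mostly short, a few very long).
session_secs <- c(2, 5, 8, 12, 30, 45, 90, 600, 3600)

# Taking log10 of the lengths before binning gives a histogram
# whose x axis is effectively on a logarithmic scale.
hist(log10(session_secs), breaks = 10,
     main = "Session lengths", xlab = "log10(seconds)")
```

Taking the logarithm before calling hist() is just one way to get a log-scale histogram; it keeps the handful of very long sessions from squashing all the short ones into a single bar.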











