Tuesday, March 18, 2014
R - loving it
I'm still impressed by R - I'm finding it delightfully powerful. I'm still trying to figure out what to do with my PhD data, but I'm finding it incredibly useful for looking at work data to uncover patterns. And the speed with which you can tease patterns out of data means that I can do things like sit down with staff and explore a data set to start to work out why a piece of software is playing up - or at least show patterns in the data which will lead good developers to understand how a piece of code is misbehaving. I'm now annoyed that I hadn't discovered this earlier :)
Saturday, March 1, 2014
R
For the last week I've been playing with a new tool for analysing data - R. It's quite an amazing tool. It's a programming language, but one specialized for vector and matrix mathematics - which is ideal for statistical analysis. My initial test data set was ticket tracking data from work - I've been using Excel, and had a very large, clumsy spreadsheet with the same formula repeated thousands of times in order to calculate which tickets were open at what time. After a few hours of reading the R documentation and experimenting, I was able to turn the simple line graph I had in Excel into a really nice stacked bar graph in R. In a fraction of a second it reads my CSV of 128 tickets, works out from the opening and closing dates how many tickets were open at each point in time, and builds a bar chart showing how many were open, when, and for which of the services we offer (over 60 services are represented).
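To give a flavour of the open-ticket calculation described above, here is a minimal sketch in base R. It is not the actual script: the column names (`service`, `opened`, `closed`), the four sample tickets, and the rule that a ticket stops counting as open on its closing day are all assumptions made for illustration.

```r
# Hypothetical ticket data; the real CSV's columns are not shown in the post.
tickets <- data.frame(
  service = c("email", "email", "wiki", "vpn"),
  opened  = as.Date(c("2014-02-03", "2014-02-05", "2014-02-04", "2014-02-06")),
  closed  = as.Date(c("2014-02-06", "2014-02-07", "2014-02-05", NA)),  # NA = still open
  stringsAsFactors = FALSE
)

# One bar per day in the (assumed) reporting window.
days <- seq(as.Date("2014-02-03"), as.Date("2014-02-07"), by = "day")

# Treat tickets with no closing date as open past the end of the window.
closed <- tickets$closed
closed[is.na(closed)] <- max(days) + 1

# For each day, count per service the tickets opened on or before that day
# and closed strictly after it.
services <- sort(unique(tickets$service))
open_counts <- sapply(seq_along(days), function(i) {
  is_open <- tickets$opened <= days[i] & closed > days[i]
  tabulate(match(tickets$service[is_open], services), nbins = length(services))
})
dimnames(open_counts) <- list(services, format(days))

# A matrix passed to barplot() is drawn as stacked bars by default:
# one bar per column (day), one segment per row (service).
barplot(open_counts, legend.text = TRUE, las = 2,
        xlab = "Day", ylab = "Open tickets")
```

The comparison inside `sapply()` is vectorised over all tickets at once, which is the part that replaces a spreadsheet formula copied down thousands of rows.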
I was very impressed by R's power and expressiveness. Some things seem a bit harder than they need to be, but that's probably just because I don't know my way around properly yet.
I've now started to work on my log file data. It took quite a while to read in 16 million lines of logs, but now operations across those log file lines are surprisingly quick - in six seconds it'll give me a list of all the IP addresses, and how many accesses came from each (which has led to one surprising, non-PhD related discovery, which I'll need to follow up at work). Some operations seem to take more memory than I have on my machine (which is now 32GB), but so far I've found a different way to do each calculation, so it's not too bad. And I can do things like a sweet logarithmic scale histogram of lengths of user sessions (spoiler: they're mostly short!), with only a few minutes of playing around to figure out what a histogram is.
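As a rough illustration of those two operations, here is a small base R sketch. Everything in it is assumed rather than taken from the real data: the log line layout (client IP as the first space-separated field), the sample lines, and the made-up session lengths in seconds. It also takes "logarithmic scale" to mean a histogram of the log10 of the session length, which is only one reading of the description above.

```r
# Hypothetical log lines; the real log format is not shown in the post.
log_lines <- c(
  '10.0.0.1 - - [01/Mar/2014:10:00:00] "GET / HTTP/1.1" 200',
  '10.0.0.2 - - [01/Mar/2014:10:00:03] "GET /a HTTP/1.1" 200',
  '10.0.0.1 - - [01/Mar/2014:10:00:07] "GET /b HTTP/1.1" 404',
  '10.0.0.3 - - [01/Mar/2014:10:01:12] "GET / HTTP/1.1" 200',
  '10.0.0.1 - - [01/Mar/2014:10:02:40] "GET /c HTTP/1.1" 200',
  '10.0.0.2 - - [01/Mar/2014:10:03:05] "GET /d HTTP/1.1" 200'
)

# Keep everything before the first space as the client IP.
ip <- sub(" .*$", "", log_lines)

# table() counts accesses per IP; sort so the busiest addresses come first.
hits <- sort(table(ip), decreasing = TRUE)
print(hits)

# Made-up session lengths in seconds, skewed towards short sessions.
session_seconds <- c(2, 5, 8, 12, 30, 45, 60, 300, 1800, 7200)

# Histogram of log10(length), so short and very long sessions fit on one axis.
h <- hist(log10(session_seconds),
          xlab = "log10(session length in seconds)",
          main = "Session lengths (log scale)")
```

Both `sub()` and `table()` work on the whole vector at once, which fits the observation that operations are quick once the lines are in memory.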