Sunday, August 25, 2013

Cleaning the CAM database

I'm now pulling nice big piles of data into my CAM database, but I'm finding it rather messy. I'm parsing log files, and there are over 100 distinct URL patterns I need to handle. This obviously brings in a lot of places I can make mistakes - mis-parsing URLs, overly broad matching of a URL pattern, missing variants in a particular URL pattern. So I'm now working on cleaning that all up. The main items I'm looking at are called PBLs and TAs, and they have clearly defined patterns regarding what they are.

It's slow going - I'm gradually fixing up and improving those 100 patterns, and finding all sorts of ways they can be wrong. I'm also finding "errors" in the data - which appear to be incorrect URLs formed by users playing with things in the address bar of their browser, or occasionally from javascript incorrectly accessing unusual URLs. I'm not sure what to do with these events; it's probably the case that I should just delete those URLs from my analysis database, as they will just make analysis harder, with no real benefit.

I'm interacting directly with the database here, using the SQL console. It's the quickest and most trustworthy way to get at the data; but I'm starting to think having some other visualization tools might come in handy at this point to confirm that I'm getting the patterns I expect.
Here's some of what I've done:

Phase 1: add a "pattern" tag to events, to allow mapping of parsing patterns to items in the database. This will let me check that each pattern is catching the right items, and that the code isn't catching and misinterpreting any false positive matches

Phase 2: Look through every single pattern and ensure that it catches URLs that will be parsed correctly by its parsing rule. Use commands like:

select distinct followedlink from event where tags='pattern60';

This will take a long time to do.

Phase 3: Look at items, ensure that they are being mapped correctly. Initially, I was getting PBLs with values like "4635" and TAs with values like "4.04" - I had mis-mapped items. Use commands like:

select  item.title,item.itemid,count(*) from item,eventitem where type='TA' and item.itemid=eventitem.itemid and not item.title ~ '^[0-9]+$' group by item.title,item.itemid;

To find non-correct TAs, and

select  item.title,item.itemid,count(*) from item,eventitem where type='pbl' and item.itemid=eventitem.itemid  group by item.title,item.itemid order by count;

to find all the PBLs and work out which ones are OK.

No comments:

Post a Comment