It's slow going - I'm gradually fixing up and improving those 100 patterns, and finding all sorts of ways they can be wrong. I'm also finding "errors" in the data - which appear to be incorrect URLs formed by users playing with things in the address bar of their browser, or occasionally from javascript incorrectly accessing unusual URLs. I'm not sure what to do with these events; it's probably the case that I should just delete those URLs from my analysis database, as they will just make analysis harder, with no real benefit.
I'm interacting directly with the database here, using the SQL console. It's the quickest and most trustworthy way to get at the data; but I'm starting to think having some other visualization tools might come in handy at this point to confirm that I'm getting the patterns I expect.
Here's some of what I've done:
Phase 1: add a "pattern" tag to events, to allow mapping of parsing patterns to items in the database. This will let me check that each pattern is catching the right items, and that the code isn't catching and misinterpreting any false positive matches
Phase 2: Look through every single pattern and ensure that it catches URLs that will be parsed correctly by its parsing rule. Use commands like:
select distinct followedlink from event where tags='pattern60';
This will take a long time to do.
Phase 3: Look at items, ensure that they are being mapped correctly. Initially, I was getting PBLs with values like "4635" and TAs with values like "4.04" - I had mis-mapped items. Use commands like:
select item.title,item.itemid,count(*) from item,eventitem where type='TA' and item.itemid=eventitem.itemid and not item.title ~ '^[0-9]+$' group by item.title,item.itemid;
To find non-correct TAs, and
select item.title,item.itemid,count(*) from item,eventitem where type='pbl' and item.itemid=eventitem.itemid group by item.title,item.itemid order by count;
to find all the PBLs and work out which ones are OK.
No comments:
Post a Comment