Fraunhofer Institute
CAM Schema (seems to also be done by the fine folks at Fraunhofer)
As well as a few academic papers that describe the schema, such as:
Muñoz-Merino, PJ, Pardo, A, Kloos, CD, Munoz-Organero, M, Wolpers, M, Niemann, K & Friedrich, M 2010, ‘CAM in the semantic web world’, ACM Press, New York, New York, USA, p. 1.
Scheffel, M, Niemann, K, Pardo, A, Leony, D, Friedrich, M, Schmidt, K, Wolpers, M & Kloos, CD 2011, ‘Usage Pattern Recognition in Student Activities’, Springer Berlin Heidelberg, pp. 341–355.
However, the information available online about it seems awfully sparse, compared to my experiences with open source software. There are a bunch of downloadable items, but no real "getting started with CAM" documents, nor SQL create scripts, etc. The SQL binding is distributed as a PNG file or GraphML, not as SQL. This is not a major hurdle; it's just not what I'm used to from my experience as a software developer. I spent a couple of hours searching for sample SQL create scripts; in the end I spent twenty minutes writing my own based on the available documentation.
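For what it's worth, the kind of create script I ended up writing can be sketched in a few lines. Here it is as a self-contained Python/sqlite3 demo; the table and column names (item, event, and their fields) are my own illustrative guesses from the documentation, not the official CAM binding:

```python
import sqlite3

# Illustrative sketch of a CAM-style core (my own names, not the
# official SQL binding): items are the things users interact with,
# events record the interactions.
SCHEMA = """
CREATE TABLE item (
    item_id   INTEGER PRIMARY KEY,
    uri       TEXT NOT NULL
);
CREATE TABLE event (
    event_id  INTEGER PRIMARY KEY,
    item_id   INTEGER REFERENCES item(item_id),
    action    TEXT,
    timestamp TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO item (uri) VALUES (?)", ("http://example.org/page",))
conn.execute(
    "INSERT INTO event (item_id, action, timestamp) VALUES (?, ?, ?)",
    (1, "view", "2012-01-01T00:00:00"),
)
count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(count)  # 1
```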
The other oddity is the number of different versions of the schema, which seem to differ quite radically. I've settled on version 1.5, this one:
though I think I'll need to make some changes (the same Item is referenced by events in different Feeds; rather than keeping a separate copy of the Item for each Feed, I'd prefer to link the feed directly to the Events table).
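The change I have in mind would look roughly like this (again a sqlite3 demo with illustrative names, not the official binding): put the feed reference on each event, so a single Item row can be shared by events from different Feeds:

```python
import sqlite3

# Illustrative sketch: the feed reference lives on the event rather
# than the item, so one Item row serves events from multiple Feeds.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feed  (feed_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE item  (item_id  INTEGER PRIMARY KEY, uri  TEXT);
CREATE TABLE event (
    event_id INTEGER PRIMARY KEY,
    feed_id  INTEGER REFERENCES feed(feed_id),  -- feed linked here
    item_id  INTEGER REFERENCES item(item_id)
);
""")
conn.executemany("INSERT INTO feed (name) VALUES (?)",
                 [("feed A",), ("feed B",)])
conn.execute("INSERT INTO item (uri) VALUES (?)",
             ("http://example.org/page",))
# The same item seen in two different feeds: one Item row, two events.
conn.executemany("INSERT INTO event (feed_id, item_id) VALUES (?, ?)",
                 [(1, 1), (2, 1)])
shared_feeds = conn.execute(
    "SELECT COUNT(DISTINCT feed_id) FROM event WHERE item_id = 1"
).fetchone()[0]
print(shared_feeds)  # 2
```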
It seems pretty good - there is space in the schema for most of the things I'd like to log. I've had a chat with Abelardo Pardo who, in addition to being very intelligent, is an author on both of the papers mentioned above, and he seemed to agree that the schema isn't set in stone; it's a flexible framework that lets you use the various components as you need them. The version he was using also had a User table, which will be useful and which I'm considering adding to my own schema.
So I have my data in there. I'm very pleased with my session splitting code (for detecting user sessions in the log data). My importer is loading over a thousand records per second* (which is important when each day's logs are five to thirty thousand lines - over half a million lines for January, which is a very light month in terms of site usage). Currently I'm recreating the database each time and loading all the data, but at some point I'll set it up to run with each day's logs as they are created, and just have a database sitting there with all my data, ready to go. Also on the list to investigate (at Abelardo's suggestion) is a NoSQL database - which apparently simplifies queries and improves performance.
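For the curious, session splitting boils down to a timeout heuristic: a new session starts whenever the gap to the user's previous event gets too big. A minimal sketch (the 30-minute cutoff and the function name are illustrative, not necessarily what my code uses):

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group one user's event timestamps into sessions: start a new
    session whenever the gap since the previous event exceeds `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # too long a gap: new session
    return sessions

events = [
    datetime(2012, 1, 1, 9, 0),
    datetime(2012, 1, 1, 9, 10),
    datetime(2012, 1, 1, 14, 0),  # hours later: a new session
]
print(len(split_sessions(events)))  # 2
```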
* From the latest import: "636600 events inserted in 416 seconds - 1530 events per second". I am fond of my shiny new iMac.