Tuesday, December 18, 2012

Columnar Databases: a grey area of database management systems

When it comes to storing data, there seem to be two database management system (DBMS) camps: the traditional, posh relational databases (RDBMS), represented by IBM DB2, MySQL, MS SQL Server, etc., and the cool, trendsetting NoSQL databases, most famously represented by a number of open source DBs running on Hadoop. And while one camp may distrust the other, there is generally a clear understanding of when to use one over the other. The main factors to consider are how structured your data is, how much data you have, and how much control over query execution you want. The first is usually a no-brainer: if your data is unstructured, then creating a strict RDBMS schema is close to impossible. Second, when you're dealing with huge data, NoSQL usually wins, since you can scale it out onto multiple servers or into the cloud using platforms like Hadoop. An RDBMS can be scaled out as well, but it is usually much harder, and in the process the DB typically has to be denormalized to avoid expensive JOINs, losing its data integrity advantage along the way. Third, query complexity. SQL, being a declarative language, lets the programmer simply state what data they want, leave the details of how to get it to the system, and rely on the query optimizer for performance. NoSQL usually (there are exceptions, of course) makes you work for your data, but gives you more control over how to execute the query in the most efficient way.

For me, the decision to go with one DBMS or the other is not so clear cut. I prefer it when the data is represented in as structured a way as possible. As an OO programmer, I know first hand how great it is to have data abstraction, when you don't have to worry (much) about messing up data retrieval and can focus on the analysis. And what if only some part of my data is unstructured? Would I want to give up all structure because of a few unstructured fields? Also, since I'm not a big fan of spending lots of money on one expensive server, would the ability to scale out be enough to turn me into a NoSQL-er?

If you have similar doubts and also can't make up your mind about these camps, I invite you to look at one of the "grey areas" of DBMS. What I want to mention here is not a hybrid approach, which I think is very cool as well, but a columnar database. On the face of it, it's just a flavour of a regular RDBMS. You still store your data in tables. You can still have relationships between tables. You still use SQL to access it. But with a row-based RDBMS, a DB table is like a spreadsheet: you have one key per row, all the attributes are stored sequentially, and each attribute (column) in a row takes as much space as the largest value you need to store. With a columnar DB, you have a key per attribute, the data is stored column by column, and each value takes only as much space as it needs. On top of that, some DBs sort the values in a column and store only unique values, as well as apply more advanced compression techniques, which leads to drastic space reduction and makes them feasible for large datasets. In addition, since data in a columnar DB is not stored in one monolithic spreadsheet-like table structure, it is much easier to scale out into a cluster. As a result, such databases are usually less normalized than traditional row stores, which makes them a good fit for semi-structured data. And since the data takes so much less space, one technique to improve query performance is to create extra tables optimized for frequently executed queries. Duplicating data like that doesn't sound like such a good idea, but I had an opportunity to evaluate a proprietary columnar DB and compare it to another proprietary row DB. I was very surprised that, with the same initial schema but three times as many tables for the columnar DB, it took four times less space on disk.
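To make the layout difference a bit more concrete, here is a toy sketch in Java (my own illustration, not tied to any particular columnar product; the table, column names, and values are all made up) that stores the same small table once row by row and once column by column, with a simple dictionary encoding on the repetitive column:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of row vs. column layout; all names and values are made up.
public class StorageLayoutDemo {

    public static void main(String[] args) {
        // Row store: each record keeps all of its attributes together,
        // like one line in a spreadsheet.
        Object[][] rowStore = {
                {1, "Canada", 120},
                {2, "Canada", 80},
                {3, "Germany", 95},
                {4, "Canada", 120}
        };

        // Column store: one array per attribute, stored (and compressed) separately.
        int[] ids = {1, 2, 3, 4};
        int[] amounts = {120, 80, 95, 120};
        String[] countries = {"Canada", "Canada", "Germany", "Canada"};

        // Dictionary encoding of the repetitive 'country' column:
        // keep each distinct value once and store small integer codes per row.
        List<String> dictionary = new ArrayList<>();
        int[] countryCodes = new int[countries.length];
        for (int i = 0; i < countries.length; i++) {
            int code = dictionary.indexOf(countries[i]);
            if (code < 0) {
                dictionary.add(countries[i]);
                code = dictionary.size() - 1;
            }
            countryCodes[i] = code;
        }

        // A query that touches only one column (e.g. summing the amounts)
        // scans just the 'amounts' array and never reads the other columns.
        int total = Arrays.stream(amounts).sum();

        System.out.println("rows stored: " + rowStore.length + ", ids: " + ids.length);
        System.out.println("dictionary = " + dictionary);                      // [Canada, Germany]
        System.out.println("countryCodes = " + Arrays.toString(countryCodes)); // [0, 0, 1, 0]
        System.out.println("sum(amount) = " + total);                          // 415
    }
}
```

Real columnar engines go much further than this (run-length encoding, sorted projections, and so on), but even the toy version shows why a scan over a single column can skip most of the data.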

Of course, each implementation of a columnar DB has its own nuances (the same is true for RDBMS and NoSQL), so I'm not advocating for, or giving a comprehensive review of, columnar DBs here. I merely suggest exploring more options before committing to a DBMS. Just like choosing appropriate data types in programming, choosing the proper DB can make a big difference to the success of a project.

Monday, December 10, 2012

Linear Regression :: when no fanciness is needed

I am a firm believer in using the right tool for the job, and sometimes a simple hammer can be more effective than a jackhammer. When it comes to data mining, such can be the case with linear regression, so I'd like to put in a few words about it.

The most powerful and most used techniques in data mining (e.g. support vector machines, neural networks, etc.) usually follow the classification learning style. These techniques are trained on labelled data, and the resulting model predicts which class a new instance belongs to. The data they operate on is nominal (discrete values that can't be ordered) or ordinal (values that can be ordered, but with no notion of distance between them). If such algorithms need to deal with numeric data, the usual technique is to discretize it into intervals (e.g. instead of the numeric values {148, 134, 152, 80, 256, 462, 57}, have {small, medium, large} and express each value in terms of these constants). But what if this breakdown is not granular enough for you, and what you want to predict is not a class but a numeric quantity? Then regression is something you should consider.
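Just to make that kind of binning concrete, here is a tiny Java sketch (the cut-off values 100 and 300 are arbitrary choices for this illustration, not part of any standard recipe):

```java
import java.util.Arrays;

// Toy discretization of numeric values into nominal bins.
// The thresholds below are arbitrary, picked just for this example.
public class Discretize {

    static String bin(int value) {
        if (value < 100) return "small";
        if (value < 300) return "medium";
        return "large";
    }

    public static void main(String[] args) {
        int[] values = {148, 134, 152, 80, 256, 462, 57};
        String[] bins = Arrays.stream(values)
                              .mapToObj(Discretize::bin)
                              .toArray(String[]::new);
        System.out.println(Arrays.toString(bins));
        // [medium, medium, medium, small, medium, large, small]
    }
}
```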

Linear regression is the simplest of the regressions, but sometimes it does the job nicely. Use it if you have a range of numeric values over time, for example (or some other numeric quantity), and want to see where the trend is going. What linear regression does is fit a line through the data (see the graphs below), so that you can see which direction the data is heading and predict future values. For instance, in time series analysis, if you fit a linear regression on historic data, you can predict where future values will be by extending the regression line beyond the observed range. Of course, this model is only useful if it represents your data well. To find out whether it does, you need the coefficient of determination, or R2. This value, which ranges from 0 to 1, represents the proportion of the variation in the data that is explained by the model, i.e. the higher, the better. I have found one little problem with it, though. If the data does not ascend or descend, but tends toward a horizontal line, like in Figure 2, then R2 is very low, even when the data is close to the regression line. I've put two toy examples of data and their corresponding linear regressions in the two figures below. You can see that in both cases the data is pretty close to the regression line, but while R2 for Fig.1 is high as expected, Fig.2 has a very low one. I'm not quite sure why that happens, but it seems like the coefficient of determination is not a good measure when the regression line is nearly parallel to the x axis.

Figure 1. Linear regression on sample data. Coefficient of determination (R2) is 0.903
Figure 2. Linear regression on sample data. Coefficient of determination (R2) is 0.007

Nitty-gritty

The images are generated in the R statistical environment. The regression formula and coefficient of determination are calculated by my Java implementation of linear regression, found in my Git repo.
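For those who just want the gist of the calculation, here is a minimal, self-contained sketch of simple least-squares regression with R2 in Java. It is a fresh illustration written for this post, not the code from the repo, and the sample data is made up.

```java
// Minimal simple linear regression (y = a + b*x) with R^2.
// A fresh illustration for this post, not the repo implementation.
public class SimpleLinearRegression {

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6, 7};
        double[] y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9};

        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        // Least-squares slope: b = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2)
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double b = sxy / sxx;
        double a = meanY - b * meanX;

        // R^2 = 1 - SS_res / SS_tot, where SS_res is the squared error of the fit
        // and SS_tot is the squared deviation of y from its mean.
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < n; i++) {
            double predicted = a + b * x[i];
            ssRes += (y[i] - predicted) * (y[i] - predicted);
            ssTot += (y[i] - meanY) * (y[i] - meanY);
        }
        double r2 = 1 - ssRes / ssTot;

        System.out.printf("y = %.3f + %.3f * x, R^2 = %.3f%n", a, b, r2);
    }
}
```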

Tuesday, December 4, 2012

Exploring Data :: association rules and frequent item sets

If I had to choose a propositional method (i.e. attribute based) that is close to inductive logic programming, it would be association rules. Both methods produce a set of rules that describe the data, although ILP, being logic based, is more powerful, since association rules are conjunctions of has_a(_)-type predicates, while ILP can have any kind of predicates that you define, such as before(_,_), distance_between(_,_), etc. One great thing about association rules is that you don't need any negative examples to contrast the learning against, and you don't need your data to be labeled. All you need is your data, which consists of sets of items (i.e. this is not a numeric prediction method), and some idea of the minimum accuracy for the resulting rules (this can be experimented with). So this method is a great exploratory tool, frequently used for basket analysis (e.g. when you need to find out whether buying milk leads to the purchase of bread and peanut butter).

Since, when we mine for association rules, we have no idea which attribute we are trying to predict or which attributes and values it will depend on, blindly going through all the data looking for every possible combination of attributes and values is not feasible. So the first step in finding association rules is finding frequent item sets. In other words, we are just looking for the items that frequently appear together, without inferring any relationship between them. Taking a grocery store example, we'll be looking for an item set like {milk, bread, peanut butter}. There are several algorithms that find frequent item sets; the most common is Apriori (Agrawal, Srikant; VLDB 1994). I'm not going to go through the details of the algo, since it is described pretty well in the paper and on the wiki, but generally all you need besides your data is a minimum support for the sets (i.e. the minimum number of times an item set must appear in the data). The algo starts with one-item item sets, i.e. looking at each individual item and rejecting those that don't meet the minimum support. With every iteration, it adds one more item to the item sets, again rejecting those that don't meet the minimum support, and stops when no larger item sets that meet the minimum support can be found.

Once the frequent item sets are found, each one is turned into one or more association rules with a specified minimum accuracy. But I think that in most cases finding the frequent item sets already solves the data analysis problem, and there is no need to turn the sets into rules.

Nitty-gritty

For a Java implementation of Apriori with a small example in main, see my GitHub repo.
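If you just want a feel for the algorithm without opening the repo, here is a small, unoptimized sketch of Apriori-style frequent item set mining written for this post. The transactions and the minimum support value are made up, and the full subset-pruning step of Apriori is omitted for brevity, so treat it as an illustration rather than a reference implementation.

```java
import java.util.*;

// Unoptimized sketch of Apriori-style frequent item set mining.
// Written for illustration; not the repo implementation.
public class AprioriSketch {

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("milk", "bread", "peanut butter"),
                Set.of("milk", "bread"),
                Set.of("bread", "peanut butter"),
                Set.of("milk", "bread", "peanut butter", "eggs"),
                Set.of("milk", "eggs"));
        int minSupport = 3; // minimum number of transactions an item set must appear in

        // Level 1: candidate single-item sets are all distinct items.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions)
            for (String item : t)
                candidates.add(Set.of(item));

        List<Set<String>> allFrequent = new ArrayList<>();
        while (!candidates.isEmpty()) {
            // Keep only the candidates that meet the minimum support.
            Set<Set<String>> frequent = new HashSet<>();
            for (Set<String> candidate : candidates) {
                long count = transactions.stream()
                        .filter(t -> t.containsAll(candidate)).count();
                if (count >= minSupport) frequent.add(candidate);
            }
            allFrequent.addAll(frequent);

            // Next level: join frequent k-item sets that share k-1 items
            // to build the (k+1)-item candidates.
            Set<Set<String>> next = new HashSet<>();
            for (Set<String> s1 : frequent)
                for (Set<String> s2 : frequent) {
                    Set<String> union = new HashSet<>(s1);
                    union.addAll(s2);
                    if (union.size() == s1.size() + 1) next.add(union);
                }
            candidates = next;
        }
        allFrequent.forEach(System.out::println);
    }
}
```

On this toy data, with a minimum support of 3, the sketch should report {milk}, {bread}, {peanut butter}, {milk, bread} and {bread, peanut butter} as frequent, while {milk, bread, peanut butter} gets rejected because it only appears in two transactions.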