Tuesday, December 18, 2012

Columnar Databases: grey area of the database management systems

When it comes to storing data, there seem to be two database management systems (DBMS) camps: the traditional, posh relational databases (RDBMS), represented by IBM DB2, MySQL, MS SQL Server, etc. and the cool, trendsetting NoSQL databases, most famously represented by a number of open source DBs running on Hadoop. And while one camp may distrust the other, generally there is a clear understanding on when to use one over the other. Main factors to consider are how structured your data is, how much data you have, and how much control over the query execution you want to have. First is usually a no-brainer: if your data is unstructured, then creating a strict RDBMS schema is close to impossible. Second, when you're dealing with huge data, NoSQL usually wins, since you can scale it out onto multiple servers or cloud using platforms like Hadoop. RDBMS can be scaled out as well, but usually it is much harder and, in the process, the DB usually has to be denormalized to avoid expensive JOINs, therefore losing its data integrity advantage. Third, the query complexity. SQL, being a declarative language, lets programmer simply state what data he/she wants, and leave the details of how to get it to the system and rely on the query optimizer for the performance of the query. NoSQL, usually (there are exception, of course), make you work for your data, but allows you more control over how to execute the query in the most effective way.

For me, the decision to go with one DBMS or the other is not so clear cut. I prefer it when the data is represented in as much of a structured way as possible. As an OO programmer, I know first had how great it is to have data abstraction, when you don't have to worry (much) about messing up data retrieval and only worry about the analysis. And what if only some part of my data is unstructured? Would I want to give up all structure because of some unstructured fields? Also, since I'm not a big fan of spending lots of money on one expensive server, would the ability to scale out be enough to turn be into a NoSQL-er?

If you have similar doubts, and you also can't make up your mind about these camps, I invite you to look at one of the "grey areas" of DBMS. What I want to mention here is not a hybrid approach, which I think is very cool as well. But a columnar database. One the face of it, it's just a flavour of a regular RDBMS. You still store your data in tables. You can still have relationships between tables. You still use SQL to access it. But with row-based RDBMS, the DB table is like a spreadsheet: you have one key per row, all the attributes are stored sequentially, and each attribute (column) in the row is as big as the largest one that you need to store. With columnar DB, you have a key per each attribute, the data is stored column by column, so you only take as much space as needed for the value. On top of that, some DBs sort the values in a column and store only unique values, as well as apply more advanced compression techniques, which leads to drastic space reduction and makes it feasible for large datasets. In addition, since data in columnar DB is not saved in monolithic spreadsheet-table structure, it is much easier to scale it out into a cluster. As a result, such databases are usually less normalized then traditional row stores, which makes is a good fit for semi-structured data. And since the data takes so much less space, one technique to improve query performance is to create extra tables, optimized for performance of frequently executed queries. It sounds like not such a good idea to duplicate data like that, but I had an opportunity to evaluate a proprietary columnar DB and compare it to another proprietary row DB. I was very surprised that with the same initial schema, but 3 times as many tables for a columnar DB, columnar DB took 4 times less space on the disk.

Of course, each implementation of columnar DBs have its own nuances (same true for RDBMS and NoSQL), so I'm not advocating for or giving a comprehensive review of columnar DBs here. I merely suggest to explore more options before committing to a DBMS. Because like choosing appropriate data types in programming, choosing a proper DB can make a big difference in the success of the project.

No comments:

Post a Comment