Data Quarry

Thursday, January 24, 2013

Hierarchical Agglomerative Clustering: growing a tree

With k-means clustering, as mentioned in the previous post, the big disadvantage is that you have to specify the number of clusters to divide the data into. One method to avoid this limitation is hierarchical clustering. As the name suggests, the method creates a hierarchical structure of data, thus subdividing the data into clusters naturally. The result can be presented in a tree graph, called dendogram (e.g. phylogenetic tree). There are 2 types of hierarchical clustering. Top-down approach (or divisive) starts with the whole data in one cluster, and then divides it into smaller and smaller clusters until the end is reached (or some stop condition is met). Bottom-up, or agglomerative clustering starts by treating every data instance as a cluster and then merging it into larger and larger clusters. Both methods are computationally expensive (O(2ⁿ) for divisive and O(n³) for agglomerative), but since agglomerative is faster, I'll talk about it in this post.

The general algorithm is quite simple.

First you need to define the distance (or measure of similarity) between instances of the data. As usual with clustering, this is the step that influences performance the most, both in terms of the result produced and time taken to produce it (mostly the first, not so much the latter).
Find the distances between all the instances in the data.
Pick the smallest distance between two instances and form a cluster from them (it's like finding two organisms that are most closely related and connecting them by a shortest branch in a phylogenetic tree).
Re-calculate the distances from all remaining instances to the newly formed cluster (usually the average of distances between an instance and each of the two cluster members).
Repeat from step 3 until all data is merged or some exit condition is met.

Saving the distances at each merging step will give weights to the dendogram splits.

Nitty-gritty

I had my own little experiment with agglomerative clustering. Implementing it in java, I first went the straightforward route of keeping the distances in the 2D array. However, it is not feasible, especially since this array is symmetrical and only half of it is filled with distances (i.e. distance between A and B is the same as between B and A). So dragging the whole big structure from one recursive step to another wasted a lot of resources. Not to mention the mess with indexes and additional array for sequence names. So I implemented a different version, using ConcurrentHashMap, which allows modifying the structure while iterating through it. This saves duplication of the structure at each iteration step, which saves considerable space and a bit of time. Both implementation are in my git repository: array version and hash map version. To try them out, see the UnitTest. It's still easier to understand what's going on by stepping through the array version, but hash map is definitely the winner.

Wednesday, January 16, 2013

K-Means Clustering :: the sense of community

In data analysis, we are often faced with data that we know little about. We can't contrast it with a negative dataset, as with ILP; we don't know which class the instances of the data belong to (unlabelled data). In other words, we don't know what we don't know and need to find something in this data when we don't even know what we are looking for. One technique to find order in such mess is to use clustering. Very generally, clustering is dividing the data into groups (clusters) based on the values of its attributes. For example, if we have a dataset of fruits, we may divide it into clusters based on color and size, thus potentially determining the kind of fruits (apples, oranges, etc.) we have in our dataset.

One of the simplest types of clustering is k-means. The prerequisites for this algorithm are:

your data (obviously), represented as a set of attribute values;
the number of clusters you want the data to be divided in (i.e. number k).

The second prerequisite can be very restrictive, as in many cases we don't know how many clusters the data can be divided in. There are other clustering methods that do not require this, however, even with this method, you can experiment with the number of clusters to see which one produces the best result.

Onto the algorithm now. Here are the basic steps:

Pick the initial cluster centers (centroids, or means). Usually, randomly pick k instances from the data. Although using more intelligent ways to pick initial centroids will pay off later.
Calculate the "distance" from each instance to each centroid. Here is a place to be creative. The distance could be Euclidean distance for numeric attributes (refresher: for two points (x₁,y₁) and (x₂,y₂) Euclidean distance is a square root of (x₁-x₂)² + (y₁-y₂)² ), hamming distance for strings, or arbitrary distance scale (as in the fruits example, you can define that color orange is closer to red then to blue, etc.).
Assign the instances to clusters. Based on the distances calculated in the previous step, assign an instance to the cluster to which it has the smallest distance to.
Re-evaluate the centroids. Now that the instances are assigned to the clusters, calculate new cluster centers (centroids) by averaging the attributes of all instances in the cluster. I.e. if a cluster has 3 instances: (x₁,y₁,z₁), (x₂,y₂,z₂), and (x₃,y₃,z₃), new centroid will be ( (x₁+x₂+x₃)/3, (y₁+y₂+y₃)/3, (z₁+z₂+z₃)/3 ).
Repeat from step 2 and stop when clusters stabilize (i.e. no instances are re-assigned to other clusters, compared to the previous iteration).

This methods is usually found effective. However, there is no guarantee to find global minimum, therefore the step of picking initial cluster centers is very important. For instance, look at the two graphs below. The clusters, shown by red ovals, were identified by the same algorithm, which picks the first k instances as the initial centroids. The two different results were achieved by simply re-ordering the input set.

A better way to select initial cluster centers would be to identify minimum and maximum values for each attribute and then pick k values that are evenly spread in-between. Having some initial knowledge of clusters would also help.

As an afterthought, just want to add that clustering, apart from being a useful method to investigate unlabelled data, can also be used as an additional technique in classification. For instance, as a way of selecting a most representative subset of attributes for classification, which can substantially speed up the algorithm.

Nitty-gritty

Java code mentioned in this post is available in my Git repo

Tuesday, December 18, 2012

Columnar Databases: grey area of the database management systems

When it comes to storing data, there seem to be two database management systems (DBMS) camps: the traditional, posh relational databases (RDBMS), represented by IBM DB2, MySQL, MS SQL Server, etc. and the cool, trendsetting NoSQL databases, most famously represented by a number of open source DBs running on Hadoop. And while one camp may distrust the other, generally there is a clear understanding on when to use one over the other. Main factors to consider are how structured your data is, how much data you have, and how much control over the query execution you want to have. First is usually a no-brainer: if your data is unstructured, then creating a strict RDBMS schema is close to impossible. Second, when you're dealing with huge data, NoSQL usually wins, since you can scale it out onto multiple servers or cloud using platforms like Hadoop. RDBMS can be scaled out as well, but usually it is much harder and, in the process, the DB usually has to be denormalized to avoid expensive JOINs, therefore losing its data integrity advantage. Third, the query complexity. SQL, being a declarative language, lets programmer simply state what data he/she wants, and leave the details of how to get it to the system and rely on the query optimizer for the performance of the query. NoSQL, usually (there are exception, of course), make you work for your data, but allows you more control over how to execute the query in the most effective way.

For me, the decision to go with one DBMS or the other is not so clear cut. I prefer it when the data is represented in as much of a structured way as possible. As an OO programmer, I know first had how great it is to have data abstraction, when you don't have to worry (much) about messing up data retrieval and only worry about the analysis. And what if only some part of my data is unstructured? Would I want to give up all structure because of some unstructured fields? Also, since I'm not a big fan of spending lots of money on one expensive server, would the ability to scale out be enough to turn be into a NoSQL-er?

If you have similar doubts, and you also can't make up your mind about these camps, I invite you to look at one of the "grey areas" of DBMS. What I want to mention here is not a hybrid approach, which I think is very cool as well. But a columnar database. One the face of it, it's just a flavour of a regular RDBMS. You still store your data in tables. You can still have relationships between tables. You still use SQL to access it. But with row-based RDBMS, the DB table is like a spreadsheet: you have one key per row, all the attributes are stored sequentially, and each attribute (column) in the row is as big as the largest one that you need to store. With columnar DB, you have a key per each attribute, the data is stored column by column, so you only take as much space as needed for the value. On top of that, some DBs sort the values in a column and store only unique values, as well as apply more advanced compression techniques, which leads to drastic space reduction and makes it feasible for large datasets. In addition, since data in columnar DB is not saved in monolithic spreadsheet-table structure, it is much easier to scale it out into a cluster. As a result, such databases are usually less normalized then traditional row stores, which makes is a good fit for semi-structured data. And since the data takes so much less space, one technique to improve query performance is to create extra tables, optimized for performance of frequently executed queries. It sounds like not such a good idea to duplicate data like that, but I had an opportunity to evaluate a proprietary columnar DB and compare it to another proprietary row DB. I was very surprised that with the same initial schema, but 3 times as many tables for a columnar DB, columnar DB took 4 times less space on the disk.

Of course, each implementation of columnar DBs have its own nuances (same true for RDBMS and NoSQL), so I'm not advocating for or giving a comprehensive review of columnar DBs here. I merely suggest to explore more options before committing to a DBMS. Because like choosing appropriate data types in programming, choosing a proper DB can make a big difference in the success of the project.

Monday, December 10, 2012

Linear Regression :: when no fanciness is needed

I am a firm believer in using a right tool for the job, and sometimes a simple hammer can be more effective than a jackhammer. When it comes to data mining, such can be the case with linear regression, so I'd like to put in a few words about it.

Most powerful and most used techniques in data mining (i.e. support vector machines, neural networks, etc.) are usually of classification learning style. These techniques are trained on labelled data and the resulting model predicts which class a new instance belongs to. The data they operate on are nominal (constant values that can't be compared) or ordinal (values that can be ordered, but there is no distance between them). If such algorithms need to deal with numeric data, the usual technique is to combine it into discrete intervals (i.e. instead of the numeric range of values {148, 134, 152, 80, 256, 462, 57}, have {large, medium, small} and define each value in terms of these constants). But what if this breakdown is not granular enough for you and what you want to predict is not a class, but a numeric quantity? Then regression is something you should consider.

Linear regression is a simplest of the regressions, but sometimes can do the job nicely as well. Use it if you have a range of numeric values over time, for example (or some other numeric quantity) and want to see where the trend is going. What linear regression does is it fits a line through the data (see graphs below), so that you can see which direction the data is taking and can predict future values. For instance, in time series analysis, if you use linear regression on historic data, you can predict where the future values will be based on the intersection with the regression line. Of course, this model is only useful if it represents your data well. To find out if it does, you need a coefficient of determination, or R². This value, which ranges from 0 to 1, represents the percent of the data that fits the model well, i.e. the higher, the better. I have found one little problem with it, though. If the data does not ascend of descend, but tends toward a horizontal line, like in Figure 2, then R² is very low, even when the data is close to the regression line. I've put two toy examples of data and their corresponding linear regression in two figures below. You can see that in both cases the data is pretty close to the regression line, but while R² for Fig.1 is high as expected, Fig.2 has a very low one. Not quite sure why that happens, but it seems like coefficient of determination is not a good measure when linear regression line is parallel to x axis.

Figure 1. Linear regression on sample data. Coefficient of determination (R²) is 0.903

Figure 2. Linear regression on sample data. Coefficient of determination (R²) is 0.007

Nitty-gritty

Images are generated in R statistical environment. The regression formula and coefficient of determination are calculated in my java implementation of linear regression, found in Git repo.

Tuesday, December 4, 2012

Exploring Data :: association rules and frequent item sets

If I had to choose a propositional method (i.e. attribute based) that is close to inductive logic programming, it would be association rules. Both methods produce a set of rules that describe the data, although ILP, being logic based, is more powerful, since association rules are of the form of conjunction of has_a(_) type predicates, while ILP can have any kind of predicates that you define, such as before(_,_), distance_between(_,_), etc. One great thing about association rules is that you don't need any negative examples to contrast learning, and you don't need your data to be labeled. All you need is your data, which consists of a set of items (i.e. this is not a numeric prediction method) and some idea of the minimum accuracy for the resulting rules (this could be experimented with). So this method is a great exploratory tool, frequently used for basket analysis (ex. when you need to find out if buying milk leads to the purchase of bread and peanut butter).

Since when we are mining for association rules we have no idea of what attribute we are trying to predict and which attributes and their values it will depend on, bluntly going through all data looking for all possible combination of attributes and their values is not feasible. So the first step in finding association rules is finding frequent item sets. In other words, we are just looking for the items that frequently appear together, without inferring any relationship between the items. Taking a grocery store example, we'll be looking for an item set {milk, bread, peanut butter}. There are several algorithms that find frequent item sets, the most common is Apriori (Agrawal, Srikant; VLDB 1994). I'm not going to go through the details of the algo, since it is described pretty well in the paper and on wiki, but generally all you need besides your data is minimum support for the set (i.e. minimum number of time that an item set appears in data). The algo starts with one item item-sets, i.e. looking at each individual item, rejecting those that don't meet minimum support. With every iteration, it adds one more item to the item sets, again rejecting those that don't meet minimum support and stops when no larger item sets that meet min support can be found.

Once the frequent datasets are found, each one is turned into one or more association rule with a specified minimum accuracy. But I think that in most cases finding frequent datasets already solves the data analysis problem, and there is no need to turn sets into rules.

Nitty-gritty

For a java implementation of the Apriori with a small example in main, see my GitHub repo.

Thursday, November 29, 2012

Inductive Logic Programming :: applying logic to discover patterns in data

Inductive Logic Programming (or ILP) is an interesting, if not unusual technique to computationally analyze data. To get a feel of what it is, let’s start with name. Inductive, or induction is reasoning from specific to general. The way I think of it is that it is the opposite of deduction. Remember Sherlock? He solved the crimes by knowing the "rules of the game" (i.e. how people think, etc.) and getting his facts in order. Then the solution, i.e. the outcome of an action is logically deduced. With induction, you know the outcome, you know the facts, what you need to find are the rules that got you from the facts to the outcome.

As to the Logic Programming part, the good example of it is Prolog programming language. It differs from the object oriented or procedural languages like Java or C in the way that you don't specify step by step instructions of how the program should be executed. You specify the facts and the rules, and then you "ask questions" about your newly created world. For instance, you can specify facts like:

parent(john,ann); parent(john, ben); ... - John is a parent of Ann; John is a parent of Ben

and then you define what sibling means:

sibling(X,Y) :- parent(A,X), parent(A,Y) - two people (X,Y) are siblings if they have the same parent

Then you can ask questions like:

Is Ben sibling of Ann?: sibling(ben,ann). And you will get TRUE as your answer.
List all siblings?: sibling(X,Y). And you will get all sibling combinations. I our case, only (ann,ben)

So Prolog (and logic programming in general) is deductive. When you put inductive together with logic programming, what you get is the situation when you have facts (parent(john,ann);...) and you have examples (i.e. answers to the questions like sibling(ben,ann)), what you need to find are the rules (i.e. definition of the sibling().).

When is ILP useful? When you have data, some general rules that apply to it (i.e. basic knowledge) and you need to find out a theory that describes your data. For instance, let's say you are conducting a cancer study, and as a result you have a set of DNA sequences from the cancer cohort and from the healthy cohort. Each sequence contains some markers that you are interested in (transcription factor binding sites, etc.). You also know some general rules that apply to markers/sequence, i.e. two markers can be at some distance from each other; one marker can be located before another; sequence can be on some chromosome, etc. What you'd love to find out is what's so special about the cancer cohort, compared to the healthy cohort. That's where ILP comes in. Machine learning people may recognize this as a classification problem. ILP is a machine learning method, but what is different it compared to, for instance, Decision Trees or Support Vector Machines is that data is represented as logic programs, as opposed to a set of attributes. This could be beneficial when the relationship between the attributes is significant. Another difference or I would even say advantage of ILP is that it is able to describe your data and discover new knowledge about it. I'll illustrate this in an example.

Let's take the cancer study and see how we would use ILP. First, we represent our data as logic programs like these:

sequence(<name of the sequence>, <chromosome it is located on>) - we describe all sequences we have in this form
has_marker(<sequence name>, <marker name>, <location of marker in the sequence>) - describe each marker in the sequence

We also specify which sequences are cancer and which are not (positive and negative examples).

Then we give our ILP engine a set of basic rules of interaction that can describe our data, like:

has_marker(<marker name>)
before(<marker 1>, <marker 2>)
distance_interval(<marker1>, <marker2>, <distance>, <offset>)

When the ILP executes, we might get the following theory:

has_feature(m1), before(m9,m16), distance_interval(m9, m16, 60, 10).

What this means is that a cancer sequence should contain marker m1 AND marker m9 should come before m16 AND the distance between m9 and m16 should be 60 +/- 10.

You can see that the new knowledge is discovered by concatenation of basic rules. It's unlikely that you would be looking for such a specific rule after the experiment, while ILP may discover it for you.

The advantages of ILP is the ability to represent input data more expressively using logic (granted, it may not always be needed) and the ability to not only classify, but describe the data in human readable form (I've seen people who are not familiar with programming learn how to read the rules in 15 minutes). The disadvantage, as you might have guesses, is the speed of execution. So if you have large amount of data and you do have your own theories about it, you're better off testing them first. However, if you have exhausted all the possibilities and still have no clue, ILP can provide you with possible theories. If you consider that some biological experiments take months or even years (not to mention the cost), then a few days of computer processing is peanuts.

Nitty-gritty

To try to use this method, start with picking and installing an ILP engine. I've used Aleph, but there are lots of others available.