Thursday, January 24, 2013

Hierarchical Agglomerative Clustering: growing a tree

With k-means clustering, as mentioned in the previous post, the big disadvantage is that you have to specify the number of clusters to divide the data into. One way to avoid this limitation is hierarchical clustering. As the name suggests, the method creates a hierarchical structure of the data, thus subdividing it into clusters naturally. The result can be presented as a tree graph, called a dendrogram (e.g. a phylogenetic tree). There are two types of hierarchical clustering. The top-down (or divisive) approach starts with the whole data set in one cluster and then divides it into smaller and smaller clusters until each instance ends up in its own cluster (or some stop condition is met). Bottom-up, or agglomerative, clustering starts by treating every data instance as its own cluster and then merges them into larger and larger clusters. Both methods are computationally expensive (O(2^n) for divisive and O(n^3) for agglomerative), but since agglomerative is faster, I'll talk about it in this post.

The general algorithm is quite simple.
  1. First you need to define the distance (or measure of similarity) between instances of the data. As usual with clustering, this is the step that influences performance the most, both in terms of the result produced and the time taken to produce it (mostly the former, not so much the latter).
  2. Find the distances between all the instances in the data.
  3. Pick the smallest distance between two instances and form a cluster from them (it's like finding the two organisms that are most closely related and connecting them by the shortest branch in a phylogenetic tree).
  4. Re-calculate the distances from all remaining instances to the newly formed cluster (usually the average of distances between an instance and each of the two cluster members).
  5. Repeat from step 3 until all data is merged or some exit condition is met.

Saving the distances at each merging step will give weights to the dendrogram splits.
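To make the steps concrete, here is a minimal sketch of the bottom-up procedure in Java. It is not the code from my repository, just an illustration: it assumes 2D points, Euclidean distance, and average linkage, and the data and class names are made up.

import java.util.ArrayList;
import java.util.List;

// A minimal sketch of agglomerative clustering with average linkage,
// assuming 2D points and Euclidean distance (illustration only).
public class AgglomerativeSketch {

    static class Cluster {
        final String label;          // e.g. "(P0,P1)" after merging P0 and P1
        final List<double[]> points; // all original points in this cluster

        Cluster(String label, List<double[]> points) {
            this.label = label;
            this.points = points;
        }
    }

    // Step 1: define the distance between instances (Euclidean here).
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Average linkage: mean distance over all pairs of points from the two clusters.
    static double clusterDistance(Cluster c1, Cluster c2) {
        double sum = 0;
        for (double[] p : c1.points)
            for (double[] q : c2.points)
                sum += distance(p, q);
        return sum / (c1.points.size() * c2.points.size());
    }

    public static void main(String[] args) {
        // Start with every instance in its own cluster.
        double[][] data = { {1, 1}, {1.5, 1}, {5, 5}, {5, 5.5}, {9, 1} };
        List<Cluster> clusters = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            List<double[]> pts = new ArrayList<>();
            pts.add(data[i]);
            clusters.add(new Cluster("P" + i, pts));
        }

        // Steps 2-5: repeatedly find and merge the two closest clusters.
        while (clusters.size() > 1) {
            int bestI = 0, bestJ = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = clusterDistance(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bestI = i; bestJ = j; }
                }
            }
            Cluster a = clusters.get(bestI), b = clusters.get(bestJ);
            List<double[]> merged = new ArrayList<>(a.points);
            merged.addAll(b.points);
            // 'best' is the value you would save to weight the dendrogram split.
            System.out.printf("merge %s + %s at distance %.3f%n", a.label, b.label, best);
            clusters.remove(bestJ); // remove the higher index first
            clusters.remove(bestI);
            clusters.add(new Cluster("(" + a.label + "," + b.label + ")", merged));
        }
    }
}

The merge distance printed at each step is exactly the value you would keep to weight the corresponding split in the dendrogram.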

Nitty-gritty


I had my own little experiment with agglomerative clustering. Implementing it in Java, I first went the straightforward route of keeping the distances in a 2D array. However, this turned out not to be feasible, especially since the array is symmetrical and only half of it is filled with distances (i.e. the distance between A and B is the same as between B and A). Dragging the whole big structure from one recursive step to another wasted a lot of resources, not to mention the mess with indexes and the additional array for sequence names. So I implemented a different version using a ConcurrentHashMap, which allows modifying the structure while iterating through it. This avoids duplicating the structure at each iteration step, which saves considerable space and a bit of time. Both implementations are in my git repository: array version and hash map version. To try them out, see the UnitTest. It's still easier to understand what's going on by stepping through the array version, but the hash map is definitely the winner.
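For illustration, here is a rough sketch of the map-based bookkeeping (again, not the repository code): distances are stored once per pair under a canonical "nameA|nameB" key, and because ConcurrentHashMap's iterators are weakly consistent, entries can be removed and added during a single pass when two clusters are merged. The names and numbers are made up.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A rough sketch of the hash-map bookkeeping: distances keyed by a canonical
// "nameA|nameB" string, so only one half of the symmetric matrix is stored.
public class DistanceMapSketch {

    // Canonical key so dist(A,B) and dist(B,A) share one entry.
    static String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }

    public static void main(String[] args) {
        Map<String, Double> dist = new ConcurrentHashMap<>();
        dist.put(key("A", "B"), 0.5);
        dist.put(key("A", "C"), 2.0);
        dist.put(key("B", "C"), 1.5);
        dist.put(key("A", "D"), 4.0);
        dist.put(key("B", "D"), 3.0);
        dist.put(key("C", "D"), 2.5);

        // Merge the closest pair, A and B, into a new cluster "(A,B)".
        // ConcurrentHashMap tolerates put/remove while iterating, so the
        // distances to the new cluster can be rebuilt in a single pass.
        String merged = "(A,B)";
        dist.remove(key("A", "B"));
        for (String k : dist.keySet()) {
            String[] pair = k.split("\\|");
            String other = null;
            if (pair[0].equals("A") || pair[0].equals("B")) other = pair[1];
            else if (pair[1].equals("A") || pair[1].equals("B")) other = pair[0];
            if (other == null) continue;            // entry does not involve A or B
            Double dA = dist.remove(key("A", other));
            Double dB = dist.remove(key("B", other));
            if (dA == null || dB == null) continue; // already handled via the twin entry
            dist.put(key(merged, other), (dA + dB) / 2); // average of the two distances
        }
        System.out.println(dist); // e.g. {(A,B)|C=1.75, C|D=2.5, (A,B)|D=3.5} (order may vary)
    }
}

Because the key is canonical, each pair appears only once, and the weakly consistent iterator means no copy of the map is needed while merging.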

Wednesday, January 16, 2013

K-Means Clustering :: the sense of community

In data analysis, we are often faced with data that we know little about. We can't contrast it with a negative dataset, as with ILP; we don't know which class the instances of the data belong to (unlabelled data). In other words, we don't know what we don't know, and we need to find something in this data without even knowing what we are looking for. One technique to find order in such a mess is clustering. Very generally, clustering is dividing the data into groups (clusters) based on the values of its attributes. For example, if we have a dataset of fruits, we may divide it into clusters based on color and size, thus potentially determining the kinds of fruits (apples, oranges, etc.) we have in our dataset.

One of the simplest types of clustering is k-means. The prerequisites for this algorithm are:
  1. your data (obviously), represented as a set of attribute values; 
  2. the number of clusters you want the data to be divided into (i.e. the number k). 
The second prerequisite can be very restrictive, as in many cases we don't know how many clusters the data can be divided into. There are other clustering methods that do not require this; however, even with this method, you can experiment with the number of clusters to see which one produces the best result.

Onto the algorithm now. Here are the basic steps:
  1. Pick the initial cluster centers (centroids, or means). Usually these are k randomly picked instances from the data, although using more intelligent ways to pick the initial centroids will pay off later. 
  2. Calculate the "distance" from each instance to each centroid. Here is a place to be creative. The distance could be the Euclidean distance for numeric attributes (refresher: for two points (x1,y1) and (x2,y2) the Euclidean distance is the square root of (x1-x2)^2 + (y1-y2)^2), the Hamming distance for strings, or an arbitrary distance scale (as in the fruits example, you can define that the color orange is closer to red than to blue, etc.). 
  3. Assign the instances to clusters. Based on the distances calculated in the previous step, assign each instance to the cluster to which it has the smallest distance.
  4. Re-evaluate the centroids. Now that the instances are assigned to clusters, calculate the new cluster centers (centroids) by averaging the attributes of all instances in each cluster. I.e. if a cluster has 3 instances, (x1,y1,z1), (x2,y2,z2), and (x3,y3,z3), the new centroid will be ((x1+x2+x3)/3, (y1+y2+y3)/3, (z1+z2+z3)/3). 
  5. Repeat from step 2 and stop when clusters stabilize (i.e. no instances are re-assigned to other clusters, compared to the previous iteration). 
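To tie the steps together, here is a minimal k-means sketch in Java for 2D points with Euclidean distance. It is only an illustration (the data, k, and class name are made up), and it takes the first k instances as initial centroids, which, as discussed below, is the naive choice.

import java.util.Arrays;

// A minimal k-means sketch for 2D points with Euclidean distance
// (illustration only; data and k are made up).
public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.5, 2}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5} };
        int k = 2;

        // Step 1: initial centroids (here simply the first k instances).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();

        int[] assignment = new int[data.length];
        boolean changed = true;

        while (changed) {
            changed = false;

            // Steps 2-3: assign each instance to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (distance(data[i], centroids[c]) < distance(data[i], centroids[best]))
                        best = c;
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }

            // Step 4: recompute each centroid as the mean of its assigned instances.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int count = 0;
                for (int i = 0; i < data.length; i++)
                    if (assignment[i] == c) { sx += data[i][0]; sy += data[i][1]; count++; }
                if (count > 0) centroids[c] = new double[] { sx / count, sy / count };
            }
            // Step 5: loop until no instance changes cluster.
        }

        System.out.println("Assignments: " + Arrays.toString(assignment));
        System.out.println("Centroids: " + Arrays.deepToString(centroids));
    }
}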

This method is usually quite effective. However, there is no guarantee of finding the global minimum, so the step of picking the initial cluster centers is very important. For instance, look at the two graphs below. The clusters, shown by red ovals, were identified by the same algorithm, which picks the first k instances as the initial centroids. The two different results were achieved simply by re-ordering the input set.
A better way to select the initial cluster centers would be to identify the minimum and maximum values for each attribute and then pick k values that are evenly spread in between. Having some initial knowledge of the clusters would also help.
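A sketch of that idea, meant to slot into the k-means example above in place of the first-k initialization (again, the data are made up):

import java.util.Arrays;

// A sketch of spread-out initialization: for each attribute, find its min and
// max over the data, then place the k initial centroids evenly along that range.
public class SpreadInitSketch {

    static double[][] spreadCentroids(double[][] data, int k) {
        int dims = data[0].length;
        double[] min = data[0].clone(), max = data[0].clone();
        for (double[] row : data)
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        double[][] centroids = new double[k][dims];
        for (int c = 0; c < k; c++)
            for (int d = 0; d < dims; d++)
                // Evenly spaced: centroid 0 lands on the min, centroid k-1 on the max.
                centroids[c][d] = min[d] + (max[d] - min[d]) * c / Math.max(1, k - 1);
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.5, 2}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5} };
        System.out.println(Arrays.deepToString(spreadCentroids(data, 3)));
        // [[1.0, 1.0], [3.0, 4.0], [5.0, 7.0]]
    }
}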

As an afterthought, I just want to add that clustering, apart from being a useful method for investigating unlabelled data, can also be used as an additional technique in classification, for instance as a way of selecting the most representative subset of attributes for classification, which can substantially speed up the algorithm.

Nitty-gritty 

Java code mentioned in this post is available in my Git repo