Thursday, January 24, 2013

Hierarchical Agglomerative Clustering: growing a tree

With k-means clustering, as mentioned in the previous post, the big disadvantage is that you have to specify the number of clusters to divide the data into. One way around this limitation is hierarchical clustering. As the name suggests, the method builds a hierarchy of clusters, so the data subdivides naturally. The result can be presented as a tree graph, called a dendrogram (e.g. a phylogenetic tree). There are 2 types of hierarchical clustering. The top-down (or divisive) approach starts with the whole data in one cluster and then divides it into smaller and smaller clusters until every instance stands alone (or some stop condition is met). The bottom-up, or agglomerative, approach starts by treating every data instance as its own cluster and then merges clusters into larger and larger ones. Both methods are computationally expensive (O(2^n) for divisive and O(n^3) for agglomerative), but since agglomerative is the faster of the two, it's the one I'll talk about in this post.

The general algorithm is quite simple.
  1. First, you need to define the distance (or measure of similarity) between instances of the data. As usual with clustering, this is the step that influences performance the most, both in terms of the result produced and the time taken to produce it (mostly the former, not so much the latter).
  2. Find the distances between all the instances in the data.
  3. Pick the two instances with the smallest distance between them and form a cluster from them (like finding the two most closely related organisms and joining them by the shortest branch in a phylogenetic tree).
  4. Re-calculate the distances from all remaining instances to the newly formed cluster (usually as the average of the distances between an instance and each of the two cluster members, i.e. average linkage; single and complete linkage take the minimum or maximum instead).
  5. Repeat from step 3 until all data is merged or some exit condition is met (a minimal sketch of the whole loop follows below).

Saving the distances at each merging step will give weights to the dendrogram splits.
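To make the steps concrete, here is a minimal, self-contained Java sketch of the whole loop (a toy example of my own, not the repository code discussed below): Euclidean distance for step 1, a naive scan for the closest pair in step 3, and averaging for step 4. It recomputes the linkage from scratch on every pass, so it is written for readability, not speed.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal agglomerative clustering sketch: Euclidean distance, average linkage.
    public class AgglomerativeSketch {

        // Step 1: the distance measure between two instances (here: Euclidean).
        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // Step 4: average linkage - the mean pairwise distance between two clusters.
        static double linkage(List<Integer> c1, List<Integer> c2, double[][] data) {
            double sum = 0;
            for (int i : c1)
                for (int j : c2)
                    sum += distance(data[i], data[j]);
            return sum / (c1.size() * c2.size());
        }

        public static void main(String[] args) {
            double[][] data = { {1, 1}, {1.5, 1}, {8, 8}, {8, 8.5}, {4, 5} };

            // Step 2 setup: every instance starts as its own cluster (of row indexes).
            List<List<Integer>> clusters = new ArrayList<>();
            for (int i = 0; i < data.length; i++) {
                List<Integer> c = new ArrayList<>();
                c.add(i);
                clusters.add(c);
            }

            // Steps 3-5: repeatedly merge the two closest clusters.
            while (clusters.size() > 1) {
                int bestI = 0, bestJ = 1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++)
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = linkage(clusters.get(i), clusters.get(j), data);
                        if (d < best) { best = d; bestI = i; bestJ = j; }
                    }
                // 'best' is the distance you'd save as the weight of this dendrogram split.
                System.out.printf("merge %s + %s at %.3f%n",
                        clusters.get(bestI), clusters.get(bestJ), best);
                clusters.get(bestI).addAll(clusters.remove(bestJ));
            }
        }
    }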

Nitty-gritty

I had my own little experiment with agglomerative clustering. Implementing it in Java, I first went the straightforward route of keeping the distances in a 2D array. However, that is not really feasible: the array is symmetrical, so only half of it holds useful distances (the distance between A and B is the same as between B and A), and dragging the whole big structure from one recursive step to another wasted a lot of resources. Not to mention the mess with indexes and an additional array for sequence names. So I implemented a different version using a ConcurrentHashMap, which allows modifying the structure while iterating through it. This avoids duplicating the structure at each iteration step, which saves considerable space and a bit of time. Both implementations are in my git repository: the array version and the hash map version. To try them out, see the UnitTest. It's still easier to understand what's going on by stepping through the array version, but the hash map is definitely the winner.
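I won't paste the repository code here, but the hash map idea can be sketched roughly like this (the "x|y" pair keys, the class and method names, and the averaging trick are my own illustration, not necessarily how the repo does it):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: pairwise distances keyed by a canonical "x|y" string, one entry
    // per unordered pair, so the symmetric half of the matrix is never stored.
    public class HashMapMergeSketch {

        public static void main(String[] args) {
            Map<String, Double> dist = new ConcurrentHashMap<>();
            dist.put("A|B", 1.0);
            dist.put("A|C", 4.0);
            dist.put("B|C", 5.0);
            dist.put("A|D", 9.0);
            dist.put("B|D", 8.0);
            dist.put("C|D", 3.0);

            merge(dist, "A", "B");    // A and B are the closest pair
            System.out.println(dist); // e.g. {(A,B)|C=4.5, C|D=3.0, (A,B)|D=8.5}
        }

        // Replace clusters a and b with a merged one; the distance from any other
        // cluster to the merged one is the average of its distances to a and to b.
        static void merge(Map<String, Double> dist, String a, String b) {
            String merged = "(" + a + "," + b + ")";
            dist.remove(key(a, b));
            // ConcurrentHashMap's iterator is weakly consistent, so stale entries
            // can be removed and new ones added during the walk - no copy of the
            // structure, no ConcurrentModificationException.
            for (String k : dist.keySet()) {
                String[] pair = k.split("\\|");
                String other = pair[0].equals(a) || pair[0].equals(b) ? pair[1]
                             : pair[1].equals(a) || pair[1].equals(b) ? pair[0]
                             : null;
                if (other == null) continue; // pair doesn't involve a or b
                double d = dist.remove(k);
                // First stale entry seen for 'other': stash it; second: average.
                Double partial = dist.get(key(merged, other));
                dist.put(key(merged, other), partial == null ? d : (partial + d) / 2);
            }
        }

        // Canonical key for an unordered pair.
        static String key(String x, String y) {
            return x.compareTo(y) < 0 ? x + "|" + y : y + "|" + x;
        }
    }

The point is that merging becomes a single pass over the same map: the stale pair entries vanish and the merged cluster's entries appear in place, instead of rebuilding a shrunken matrix (plus its name index) on every recursive step.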
