The general algorithm is quite simple.
1. First you need to define the distance (or measure of similarity) between instances of the data. As usual with clustering, this is the step that influences performance the most, both in terms of the result produced and the time taken to produce it (mostly the former, not so much the latter).
2. Find the distances between all pairs of instances in the data.
3. Pick the two instances with the smallest distance between them and form a cluster from them (it's like finding the two organisms that are most closely related and connecting them by the shortest branch in a phylogenetic tree).
4. Re-calculate the distances from all remaining instances to the newly formed cluster (usually as the average of the distances from an instance to each of the two cluster members, i.e. average linkage).
5. Repeat from step 3 until all data is merged or some exit condition is met.
Saving the distance at each merging step gives the dendrogram splits their weights (branch heights); the sketch below records exactly that.
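To make the steps concrete, here's a minimal Java sketch of the algorithm on one-dimensional points with average linkage. It's a toy version for illustration (the names `Cluster`, `linkage`, `agglomerate` are mine), not the code from my repository, which is discussed below. For brevity it recomputes the linkage from raw data instead of using the running-average update from step 4.

```java
import java.util.ArrayList;
import java.util.List;

public class AgglomerativeSketch {

    // A cluster remembers its member indices and the distance at which it was
    // formed, which becomes the weight of the corresponding dendrogram split.
    static class Cluster {
        final List<Integer> members = new ArrayList<>();
        final double mergeDistance;

        Cluster(int member) { members.add(member); mergeDistance = 0; }

        Cluster(Cluster a, Cluster b, double d) {
            members.addAll(a.members);
            members.addAll(b.members);
            mergeDistance = d;
        }
    }

    // Step 1: the distance between two instances; absolute difference here.
    static double distance(double[] data, int i, int j) {
        return Math.abs(data[i] - data[j]);
    }

    // Average linkage: mean of all pairwise distances between the two clusters.
    static double linkage(double[] data, Cluster a, Cluster b) {
        double sum = 0;
        for (int i : a.members)
            for (int j : b.members)
                sum += distance(data, i, j);
        return sum / (a.members.size() * b.members.size());
    }

    static List<Cluster> agglomerate(double[] data) {
        List<Cluster> clusters = new ArrayList<>();
        for (int i = 0; i < data.length; i++) clusters.add(new Cluster(i));
        List<Cluster> merges = new ArrayList<>();

        // Steps 2-5: repeatedly find and merge the closest pair of clusters.
        while (clusters.size() > 1) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            for (int a = 0; a < clusters.size(); a++)
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(data, clusters.get(a), clusters.get(b));
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            Cluster merged = new Cluster(clusters.get(bestA), clusters.get(bestB), best);
            clusters.remove(bestB);   // remove the higher index first
            clusters.remove(bestA);
            clusters.add(merged);
            merges.add(merged);       // saved distances = dendrogram weights
        }
        return merges;
    }

    public static void main(String[] args) {
        for (Cluster c : agglomerate(new double[]{1.0, 1.1, 5.0, 5.2, 9.0}))
            System.out.println(c.members + " merged at " + c.mergeDistance);
    }
}
```

Running it prints the merge order, with [0, 1] and [2, 3] joining at small distances before the outlier 9.0 is absorbed last.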
Nitty-gritty
I ran my own little experiment with agglomerative clustering. Implementing it in Java, I first went the straightforward route of keeping the distances in a 2D array. That turned out to be wasteful: the array is symmetrical, so only half of it actually holds distances (the distance between A and B is the same as between B and A), and dragging the whole big structure from one recursive step to the next burned a lot of resources. Not to mention the mess with indexes and the additional array for sequence names. So I implemented a different version using a ConcurrentHashMap, which allows modifying the structure while iterating through it. That avoids duplicating the structure at each iteration step, which saves considerable space and a bit of time. Both implementations are in my git repository: array version and hash map version. To try them out, see the UnitTest. It's still easier to follow what's going on by stepping through the array version, but the hash map version is definitely the winner.
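Stripped to its core, the hash-map trick looks roughly like this. This is a hypothetical sketch rather than my repository code, and it assumes cluster names never contain the `|` separator: each unordered pair gets one canonical key (so only "half the array" is ever stored), and a merge step removes and adds entries in place while iterating, which is exactly what ConcurrentHashMap's weakly consistent iterators permit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashMapDistances {

    // One entry per unordered pair of cluster names, keyed canonically,
    // so d(A,B) and d(B,A) share a single slot.
    final Map<String, Double> dist = new ConcurrentHashMap<>();

    static String key(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }

    // Merge clusters a and b into one named `merged`, updating the map in
    // place instead of rebuilding it. Removing entries mid-iteration is safe
    // because ConcurrentHashMap's iterators are weakly consistent.
    void merge(String a, String b, String merged) {
        Map<String, Double> half = new HashMap<>(); // first-seen distances to the pair
        for (Map.Entry<String, Double> e : dist.entrySet()) {
            String[] p = e.getKey().split("\\|");
            String other = p[0].equals(a) || p[0].equals(b) ? p[1]
                         : p[1].equals(a) || p[1].equals(b) ? p[0] : null;
            if (other == null) continue;              // involves neither a nor b
            double d = e.getValue();
            dist.remove(e.getKey());
            if (other.equals(a) || other.equals(b)) continue; // the a-b entry itself
            Double seen = half.remove(other);         // average-linkage update: wait
            if (seen == null) half.put(other, d);     // for both d(a,c) and d(b,c)
            else dist.put(key(merged, other), (seen + d) / 2);
        }
    }
}
```

The point of the design is that nothing gets copied between steps: each merge deletes the rows and columns of the two old clusters and inserts one new "row", so the map strictly shrinks as clustering proceeds.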