Wednesday, January 16, 2013

K-Means Clustering :: the sense of community

In data analysis, we are often faced with data that we know little about. We can't contrast it with a negative dataset, as with ILP; we don't know which class the instances of the data belong to (the data is unlabelled). In other words, we don't know what we don't know, and we need to find something in this data without even knowing what we are looking for. One technique for finding order in such a mess is clustering. Very generally, clustering divides the data into groups (clusters) based on the values of its attributes. For example, if we have a dataset of fruits, we may divide it into clusters based on color and size, thus potentially determining what kinds of fruit (apples, oranges, etc.) we have in our dataset.

One of the simplest types of clustering is k-means. The prerequisites for this algorithm are:
  1. your data (obviously), represented as a set of attribute values; 
  2. the number of clusters you want the data to be divided into (i.e. the number k). 
The second prerequisite can be very restrictive, as in many cases we don't know how many clusters the data can be divided into. There are other clustering methods that do not require this; however, even with k-means you can experiment with the number of clusters to see which one produces the best result.
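
For purely numeric attributes, a minimal Java representation of these two inputs might look like the snippet below; the array layout and names are just assumptions for illustration, not taken from the repository code linked at the end of the post.

double[][] data = {       // each row is one instance, each column one attribute
    {1.0, 1.1},
    {0.9, 1.0},
    {8.0, 8.1},
    {7.9, 8.3}
};
int k = 2;                // the number of clusters we ask the algorithm to produce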

Onto the algorithm now. Here are the basic steps (a rough Java sketch follows the list):
  1. Pick the initial cluster centers (centroids, or means). Usually this is done by randomly picking k instances from the data, although using a more intelligent way to pick the initial centroids will pay off later. 
  2. Calculate the "distance" from each instance to each centroid. Here is a place to be creative. The distance could be Euclidean distance for numeric attributes (refresher: for two points (x1, y1) and (x2, y2), the Euclidean distance is the square root of (x1 - x2)² + (y1 - y2)²), Hamming distance for strings, or an arbitrary distance scale (as in the fruits example, you could define the color orange to be closer to red than to blue, etc.). 
  3. Assign the instances to clusters. Based on the distances calculated in the previous step, assign each instance to the cluster to which it has the smallest distance.
  4. Re-evaluate the centroids. Now that the instances are assigned to clusters, calculate the new cluster centers (centroids) by averaging the attributes of all instances in each cluster. For example, if a cluster has 3 instances (x1, y1, z1), (x2, y2, z2), and (x3, y3, z3), the new centroid will be ((x1 + x2 + x3)/3, (y1 + y2 + y3)/3, (z1 + z2 + z3)/3). 
  5. Repeat from step 2 and stop when clusters stabilize (i.e. no instances are re-assigned to other clusters, compared to the previous iteration). 
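
To make these steps concrete, here is the rough Java sketch mentioned above: a minimal, self-contained illustration for numeric data, using Euclidean distance and random initial centroids. It is not the code from the repository linked at the end of the post; the class and method names are made up for this example.

import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

// A minimal sketch of the steps above for numeric data, using Euclidean
// distance and randomly chosen initial centroids.
public class KMeansSketch {

    // Step 2: Euclidean distance between two points of the same dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Runs k-means and returns the cluster index assigned to each instance.
    static int[] cluster(double[][] data, int k, Random rnd) {
        int n = data.length;
        int dim = data[0].length;

        // Step 1: pick k distinct instances as the initial centroids
        // (shuffling the indices avoids picking the same instance twice).
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Collections.shuffle(Arrays.asList(idx), rnd);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[idx[c]].clone();

        int[] assignment = new int[n];
        boolean changed = true;
        while (changed) {                       // Step 5: repeat until stable
            changed = false;

            // Steps 2 and 3: assign each instance to its nearest centroid.
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = distance(data[i], centroids[0]);
                for (int c = 1; c < k; c++) {
                    double d = distance(data[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }

            // Step 4: recompute each centroid as the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[assignment[i]][j] += data[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {            // keep an empty cluster's old centroid
                    for (int j = 0; j < dim; j++) {
                        centroids[c][j] = sums[c][j] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two small, well-separated blobs of points.
        double[][] data = {
            {1.0, 1.1}, {0.9, 1.0}, {1.2, 0.8},
            {8.0, 8.1}, {7.9, 8.3}, {8.2, 7.9}
        };
        System.out.println(Arrays.toString(cluster(data, 2, new Random())));
    }
}

With this toy data, the two blobs should end up in different clusters; which blob gets index 0 or 1 depends on the random initialization.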

This method is usually quite effective. However, it is not guaranteed to find the global minimum, so the step of picking the initial cluster centers is very important. For instance, look at the two graphs below. The clusters, shown by red ovals, were identified by the same algorithm, which picks the first k instances as the initial centroids; the two different results were achieved simply by re-ordering the input set.
A better way to select the initial cluster centers would be to identify the minimum and maximum values for each attribute and then pick k values that are evenly spread in between. Having some initial knowledge of the clusters would also help.
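
As a sketch of that idea, the random pick in step 1 of the example above could be replaced with something along the lines of the method below, which spreads the k initial centroids evenly between each attribute's minimum and maximum (the method name is made up for this illustration):

// Spread k initial centroids evenly between each attribute's min and max.
static double[][] spreadCentroids(double[][] data, int k) {
    int dim = data[0].length;
    double[] min = data[0].clone();
    double[] max = data[0].clone();
    for (double[] row : data) {
        for (int j = 0; j < dim; j++) {
            min[j] = Math.min(min[j], row[j]);
            max[j] = Math.max(max[j], row[j]);
        }
    }
    double[][] centroids = new double[k][dim];
    for (int c = 0; c < k; c++) {
        // Fraction of the way from min to max for this centroid (0 to 1).
        double t = (k == 1) ? 0.5 : (double) c / (k - 1);
        for (int j = 0; j < dim; j++) {
            centroids[c][j] = min[j] + t * (max[j] - min[j]);
        }
    }
    return centroids;
}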

As an afterthought, I just want to add that clustering, apart from being a useful method for investigating unlabelled data, can also be used as a supporting technique in classification, for instance as a way of selecting the most representative subset of attributes, which can substantially speed up the learning algorithm.

Nitty-gritty 

The Java code mentioned in this post is available in my Git repo.
