One of the simplest types of clustering is k-means. The prerequisites for this algorithm are:
- your data (obviously), represented as a set of attribute values;
- the number of clusters you want the data to be divided into (i.e. the number k).
Onto the algorithm now. Here are the basic steps:
- Pick the initial cluster centers (centroids, or means). Usually these are k randomly picked instances from the data, although more intelligent ways of picking the initial centroids will pay off later.
- Calculate the "distance" from each instance to each centroid. Here is a place to be creative. The distance could be Euclidean distance for numeric attributes (refresher: for two points (x1,y1) and (x2,y2), the Euclidean distance is the square root of (x1-x2)² + (y1-y2)²), Hamming distance for strings, or an arbitrary distance scale (as in the fruits example, you can define that the color orange is closer to red than to blue, etc.).
- Assign the instances to clusters. Based on the distances calculated in the previous step, assign each instance to the cluster whose centroid is closest to it.
- Re-evaluate the centroids. Now that the instances are assigned to the clusters, calculate new cluster centers (centroids) by averaging the attributes of all instances in each cluster. E.g. if a cluster has 3 instances (x1,y1,z1), (x2,y2,z2), and (x3,y3,z3), the new centroid will be ( (x1+x2+x3)/3, (y1+y2+y3)/3, (z1+z2+z3)/3 ).
- Repeat from step 2, and stop when the clusters stabilize (i.e. no instances are re-assigned to other clusters compared to the previous iteration).
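The steps above can be sketched in Python. This is a minimal illustration, not a production implementation: the `kmeans` and `euclidean` names are my own, the distance function is pluggable (as suggested above), and in practice you would reach for a library such as scikit-learn.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, distance=euclidean, max_iter=100):
    """Cluster `points` into k groups; returns (assignments, centroids)."""
    # Step 1: pick k random instances as the initial centroids.
    centroids = [list(p) for p in random.sample(points, k)]
    assignments = [-1] * len(points)
    for _ in range(max_iter):
        # Steps 2-3: assign each instance to its nearest centroid.
        new = [min(range(k), key=lambda c: distance(p, centroids[c]))
               for p in points]
        # Step 5: stop when no instance changes cluster.
        if new == assignments:
            break
        assignments = new
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assignments, centroids

# Two well-separated groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
random.seed(42)
labels, centers = kmeans(data, k=2)
```

Swapping `distance` for a Hamming distance (plus a mode instead of a mean in step 4) gives the string variant mentioned above.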
This method is usually quite effective. However, it is not guaranteed to find the global minimum, so the step of picking the initial cluster centers is very important. For instance, look at the two graphs below. The clusters, shown by red ovals, were identified by the same algorithm, which picks the first k instances as the initial centroids. The two different results were achieved simply by re-ordering the input set.
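This sensitivity is easy to reproduce in code. The sketch below (the helper names are my own) uses the same first-k initialization and clusters the same six points twice, differing only in input order, yet arrives at two different partitions:

```python
import math

def kmeans_first_k(points, k, max_iter=100):
    """k-means that uses the first k instances as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assign each point to the nearest centroid.
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
               for p in points]
        if new == labels:  # stable: no point changed cluster
            break
        labels = new
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, labels) if a == c]
            if members:
                centroids[c] = [sum(d) / len(members)
                                for d in zip(*members)]
    return labels

def partition(points, labels):
    """A clustering as a set of point-sets, so label order is ignored."""
    return {frozenset(p for p, a in zip(points, labels) if a == c)
            for c in set(labels)}

# Three tight pairs of points, clustered with k = 3.
ordering_a = [(0, 0), (0, 0.1), (5, 0), (5, 0.1), (10, 0), (10, 0.1)]
ordering_b = [(0, 0), (5, 0), (10, 0), (0, 0.1), (5, 0.1), (10, 0.1)]

part_a = partition(ordering_a, kmeans_first_k(ordering_a, 3))
part_b = partition(ordering_b, kmeans_first_k(ordering_b, 3))
```

With ordering_b the first three centroids land in three different pairs and the natural clustering is found; with ordering_a two centroids start inside the same pair and the algorithm gets stuck in a worse local minimum.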
As an afterthought, I just want to add that clustering, apart from being a useful method for investigating unlabelled data, can also be used as an additional technique in classification. For instance, it can serve as a way of selecting the most representative subset of attributes for classification, which can substantially speed up the algorithm.