k-Meas
- Minimize sum of squared errors (against representative point in cluster)
- Representative points
- Centroid = average of points in cluster
- Medoid = similar to centroid but exists in the dataset
- Median (take median on different axis to create new object)
- Mode (similar to above)
Algorithm
- Algorithms
- Global optimum: Minimize sum of squared errors (SSE)
- k-Means, Medoids, Medians, Modes
- Steps
- Randomly select k points as initial centroids
- Form k clusters by assigning points to nearest centroid
- Move each centroid to the mean of the cluster
- Repeat until convergence
Validation
- The Elbow Criterion - determine number of clusters
- Run k-means with different k
- Choose the smallest k with a low SSE
- Plotting the graph and find the elbow
Pros and Cons
- Pros
- Easy to compute
- Easy to interpret
- Efficient - O(tkN), where t is num of iterations
- Cons
- Results sensitive to distance measures
- Sensitive to initialized centroids
- Need to specify num of clusters
- Results sensitive to outliers and noise
- Not suitable to discover clusters with arbitrary shapes