Skip to content

K-Means Clustering

Concept

K means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (low intra-cluster variation), whereas objects from different clusters are as dissimilar as possible (i.e., high inter-cluster variation).
In k-means clustering, each cluster is represented by its centre (i.e. centroid).

Steps

In K-means clustering, the observations in the sample are assigned to one of the clusters by using the following steps:
1. Choose K observations from the data that are likely to be in different clusters
2. The K observations chosen in step 1 are the centroids of those clusters
3. For remaining observations, find the cluster closest to the centroid. Add the new observation (say observation j) to the cluster with the closest centroid
4. Adjust the centroid after adding a new observation to the cluster. The closest centroid is chosen based on an appropriate distance measure
5. Repeat step 3 and 4 till all observations are assigned to a cluster

The centroids keep moving when new observations are added.

Data

The data used in this blog is taken from kaggle. It is a customer segmentation problem to define market strategy. The sample Dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioural variables. Visit this link to know more about the data. Sample data is given below.

Credit card dataset
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
C16568 3948.7769 1.000000 0.00 0.00 0.00 5291.215 0.000000 0.000000 0.000000 0.500000 17 0 4000 4576.5676 1819.1397 0.083333 12
C13598 2104.9617 0.727273 4115.95 230.50 3885.45 9358.071 0.833333 0.166667 0.666667 0.916667 20 59 8000 15172.5853 778.2392 0.000000 12
C13190 2089.4517 1.000000 1256.49 879.61 376.88 1381.853 0.666667 0.583333 0.500000 0.166667 2 32 2500 637.0447 1072.3948 0.090909 12
C16080 912.9127 0.857143 0.00 0.00 0.00 1563.921 0.000000 0.000000 0.000000 0.285714 9 0 1500 212.0849 331.4871 0.000000 7
C11056 380.3848 0.545455 1188.00 1188.00 0.00 0.000 0.083333 0.083333 0.000000 0.000000 0 2 9000 2461.3203 158.3398 0.000000 12

Optimal number of clusters

Identifying the optimal number of clusters is the first step in K-Means clustering. This can be done using many ways, some of which are:
1. Business decision
2. Calinski and Harabasz index
3. Silhouette statistic
4. Elbow curve

Calinski and Harabasz index

Calinhara index is an F-statistic kind of index which compares the between and within cluster variances. It is given by:
$$ CH(k) = \frac{B(k)/k-1}{W(k)/(n-k)} $$ Where k is number of clusters and B(k) and W(k) are between and within cluster variances respectively.

From the above plot, the ideal k value to be selected is 2.

Silhouette statistic

Silhouette statistic is the ratio of average distance within the cluster with average distance outside the cluster. If a(i) is the average distance between an observation i and other points in the cluster to which observation i belongs and b(i) is the minimum average distance between observation i and observations in other clusters. Then the Silhouette statistic is defined by

$$ S(i) = (\frac{b(i)-a(i)}{Max[a(i), b(i)]})$$

The ideal k using silhouette 2 to 6 as there is not much reduction in silhouette metric.

Elbow curve

Elbow curve is based on within-cluster variation. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters. (see blog on hierarchical clustering for more)

fviz_nbclust(cluster_data, kmeans, method = "wss")

From the above three metrics, the optimum value of k is 3.

Analysing the cluster

K-means can be computed in R with the kmeans function. We will group the data into three clusters (centres = 3). The kmeans function also has a nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart = 25 will generate 25 initial configurations. This approach is often recommended.

final_cluster <- kmeans(cluster_data, centers = 3, nstart = 25)

The final clusters can be visualized below. In the below figure, all the columns are reduced to two columns for visualization using PCA.

In each column, the variation between the three clusters can be analysed using the following plots.

From the comparison plot, we can observe the properties of the three clusters. They are:

Cluster 1

They are customers who purchase lesser amounts using the card but have reasonably high balance in their account.

Cluster 2

They are customers who have a low balance in their account and also purchase less using the account.

Cluster 3

They are customers who purchase in higher value using their card.


References

  1. Business Analytics: The Science of Data-Driven Decision Making Available
  2. UC Business Analytics R Programming Guide - University of Cincinnati - Online
  3. Alboukadel Kassambara - factoextra package - git
Back to top