K-Means Algorithm
Unsupervised learning looks for previously undetected patterns or insights, with minimal human supervision, in a dataset that has no pre-existing labels (for example, unstructured data).
K-means clustering is a type of unsupervised learning.
The goal of the K-means algorithm is to find groups in the data, with the number of groups represented by the variable K.
The algorithm works iteratively to assign each data point to one of K groups based on the features provided. Data points are clustered based on feature similarity. The results of the K-means algorithm are:
- The centroids of the K clusters, which can be used to label new data
- A cluster label for each data point, since each point is assigned to exactly one cluster
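
As a small illustration of how centroids can be used to label new data, the sketch below assigns a new point to the cluster whose centroid is nearest. The function name `nearest_centroid` and the centroid values are made up for this example, and squared Euclidean distance is assumed as the similarity measure.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


# Hypothetical centroids, as if produced by an earlier K-means run with K=2.
centroids = [(1.0, 1.5), (8.5, 8.5)]

print(nearest_centroid((2.0, 1.0), centroids))  # prints 0
print(nearest_centroid((7.0, 9.0), centroids))  # prints 1
```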

K-Means algorithm step summary:
- Specify the number of clusters, K, that the algorithm should generate
- Randomly select K data points as the initial centroids and assign each remaining data point to the cluster of its nearest centroid
- Compute the centroid of each cluster
- Keep iterating the following steps until the centroids are stable, that is, until the assignment of data points to clusters no longer changes
- 4.1. Compute the squared distance between each data point and every centroid.
- 4.2. Assign each data point to the cluster whose centroid is closest.
- 4.3. Recompute the centroid of each cluster by taking the average of all data points in that cluster.
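
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the `max_iter` and `seed` parameters, and the toy data are all assumptions made for this example. Points are represented as tuples of numbers, and a cluster that ends up empty simply keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialisation: pick K data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 4.1 and 4.2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4.3: move each centroid to the average of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Toy data: two visibly separate groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

Because the initial centroids are chosen at random, which cluster receives index 0 or 1 depends on the seed, and on harder data different seeds can produce different final clusters.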

Choosing the right K
The Elbow Method
WCSS (Within-Cluster Sum of Squares) is a metric used to determine the right K. The aim is to find the number of clusters beyond which WCSS no longer decreases significantly.
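The WCSS formula when K=3 is represented as:
$$\mathrm{WCSS} = \sum_{p \in \text{cluster 1}} \mathrm{distance}(p, c_1)^2 + \sum_{p \in \text{cluster 2}} \mathrm{distance}(p, c_2)^2 + \sum_{p \in \text{cluster 3}} \mathrm{distance}(p, c_3)^2$$
where each summation over distance(p, c) is the sum of squared distances of the points p in a cluster from that cluster's centroid c.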
In the picture below, WCSS shows no significant decrease after 3 clusters, and the curve forms an elbow shape around number of clusters = 3. In this particular case, K=3 is the best choice according to the elbow method.
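
To make the elbow method concrete, the sketch below computes WCSS for K = 1 to 6 on made-up data drawn around three centers. Everything here is an assumption for illustration: the function names (`k_means`, `wcss`, `best_wcss`), the synthetic data, and the choice to rerun K-means from several random starts and keep the lowest WCSS, which reduces the effect of an unlucky initialisation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed, max_iter=100):
    """Plain K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points as initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


def best_wcss(points, k, restarts=10):
    """Lowest WCSS found over several random initialisations."""
    return min(wcss(points, *k_means(points, k, seed)) for seed in range(restarts))


# Synthetic data: 30 points scattered around each of three made-up centers.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in centers for _ in range(30)]

wcss_by_k = {k: best_wcss(data, k) for k in range(1, 7)}
for k, value in wcss_by_k.items():
    print(k, round(value, 1))
```

Plotting these values against K gives the elbow curve. On data like this, one would expect the drop in WCSS to flatten after K=3, although the exact numbers depend on the random data and initialisations.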