Explaining k-means clustering and its use cases
Every machine learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types: supervised and unsupervised. K-means clustering is an unsupervised algorithm, used when the available input data does not have a labeled response.
Let’s understand the algorithm in depth.
What is K-means clustering?
K-Means clustering is an unsupervised learning algorithm: unlike supervised learning, it works without labeled data. K-Means divides objects into clusters so that the objects in one cluster are similar to each other and dissimilar to the objects in the other clusters.
The term ‘K’ is a number: it tells the system how many clusters you need to create. For example, K = 2 refers to two clusters. There are ways of estimating the best value of K for a given data set. A common heuristic is the elbow method, which runs the algorithm for several values of K and looks for the point where adding more clusters stops reducing the within-cluster variation by much.
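To make the choice of K a little more concrete, here is a minimal sketch of the quantity that heuristics such as the elbow method usually track: the within-cluster sum of squared distances (WCSS). The function name `wcss` and the four toy points are assumptions made up for illustration; they are not data from this article.

```python
def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances for a given clustering."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centroids[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Hypothetical toy data: two tight groups of 2-D points.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

# K = 1: a single centroid at the overall mean.
k1 = wcss(points, [(5.0, 2.0)], [0, 0, 0, 0])
# K = 2: one centroid at the mean of each group.
k2 = wcss(points, [(1.0, 2.0), (9.0, 2.0)], [0, 0, 1, 1])

print(k1, k2)  # 68.0 4.0
```

On this toy data, going from K = 1 to K = 2 cuts the WCSS from 68.0 to 4.0. The elbow heuristic looks for the value of K after which such drops become small.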
For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, giving the runs scored and the wickets taken by each player in their last ten matches. Based on this information, we need to group the players into two clusters, namely batsmen and bowlers.
Solution:
Assign data points
Here, we have our data set plotted on x and y coordinates. The y-axis shows the runs scored, and the x-axis shows the wickets taken by the players.
If we plot the data, this is how it would look:
Perform Clustering
We need to create the clusters, as shown below:
Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).
The first step in k-means clustering is to allocate two centroids randomly (as K = 2). Note that these points can be anywhere, as they are random points. They are called centroids, but initially they are not the central points of any group in the data set.
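As a rough sketch of this first step, the snippet below places K random points inside the range covered by the data. The `players` list is made-up (wickets, runs) data, and `random_centroids` is a hypothetical helper name; neither comes from a real data set or library.

```python
import random

# Made-up (wickets, runs) pairs for six hypothetical players.
players = [(0, 310), (1, 280), (2, 350), (18, 40), (21, 25), (16, 60)]

def random_centroids(points, k, seed=None):
    """Place k centroids at random positions inside the data's bounding box."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [
        (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        for _ in range(k)
    ]

centroids = random_centroids(players, k=2, seed=42)
print(centroids)  # two (wickets, runs) positions; the values depend on the seed
```

Another common choice is to pick K of the data points themselves as the starting centroids. Either way, the starting positions are only a first guess that the later steps correct.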
The next step is to determine the distance between each data point and each of the randomly assigned centroids. For every point, the distance is measured from both centroids, and the point is assigned to whichever centroid is closer. You can see the data points attached to the centroids and represented here in blue and yellow.
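The assignment step could look something like the sketch below. It assumes straight-line (Euclidean) distance, which the text does not name explicitly, and it uses made-up (wickets, runs) points and starting centroids.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical (wickets, runs) data and two arbitrary starting centroids.
points = [(1, 300), (2, 320), (20, 30), (18, 50)]
centroids = [(5, 250), (15, 100)]

labels = assign(points, centroids)
print(labels)  # [0, 0, 1, 1]
```

Because runs are on a much larger scale than wickets here, the distance is dominated by the runs. In practice, features are often rescaled before clustering so that each one contributes comparably.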
The next step is to determine the actual centroid of each of these two clusters, that is, the mean position of the points assigned to it. Each randomly allocated centroid is then repositioned to the actual centroid of its cluster.
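Repositioning a centroid amounts to taking the mean of the points currently assigned to it. Here is a minimal sketch, again with made-up points and a hypothetical helper name `update_centroids`. Keeping the old position for a cluster that has no points is just one simple convention chosen for this example.

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # No points in this cluster: leave the centroid where it was.
            new_centroids.append(old)
            continue
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids

# Hypothetical (wickets, runs) data with an existing assignment.
points = [(1, 300), (2, 320), (20, 30), (18, 50)]
labels = [0, 0, 1, 1]
centroids = [(5, 250), (15, 100)]

new_centroids = update_centroids(points, labels, centroids)
print(new_centroids)  # [(1.5, 310.0), (19.0, 40.0)]
```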
This process of calculating the distances, reassigning the points, and repositioning the centroids continues until the clusters no longer change. Then the centroid repositioning stops.
As seen above, the centroids don’t need any more repositioning, which means the algorithm has converged, and we have two clusters, each with its own centroid.
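One simple way to detect that the centroids no longer need repositioning is to compare their positions before and after an update. The helper name `has_converged` and the tolerance `tol` are assumptions for this sketch. Checking whether any point changed cluster is an equally common stopping rule.

```python
import math

def has_converged(old_centroids, new_centroids, tol=1e-9):
    """True when no centroid moved farther than tol in the last update."""
    return all(
        math.dist(old, new) <= tol
        for old, new in zip(old_centroids, new_centroids)
    )

# Made-up centroid positions from two consecutive iterations.
before = [(5.0, 250.0), (15.0, 100.0)]
after = [(1.5, 310.0), (19.0, 40.0)]

still_moving = has_converged(before, after)   # False: the centroids moved
settled = has_converged(after, after)         # True: nothing changed
print(still_moving, settled)
```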
Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life business cases, like:
- Academic performance
- Diagnostic systems
- Search engines
- Wireless sensor networks
Academic performance
Based on their scores, students are categorized into grades like A, B, or C.
Diagnostic systems
The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.
Search engines
Clustering forms the backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this.
Wireless sensor networks
The clustering algorithm plays the role of finding the cluster heads, each of which collects all the data in its respective cluster.
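As a purely illustrative sketch, one plausible rule is to pick as cluster head the sensor node that lies closest to its cluster's centroid. This rule, the helper name `pick_cluster_head`, and the node coordinates are all assumptions; real sensor network protocols may also weigh factors such as remaining battery energy.

```python
import math

def pick_cluster_head(nodes):
    """Return the node closest to the centroid of its cluster."""
    cx = sum(x for x, _ in nodes) / len(nodes)
    cy = sum(y for _, y in nodes) / len(nodes)
    return min(nodes, key=lambda node: math.dist(node, (cx, cy)))

# Made-up (x, y) positions of the sensor nodes in one cluster.
cluster = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, 5.0)]

head = pick_cluster_head(cluster)
print(head)  # (2.0, 1.0)
```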
Use cases:
1. Problem Statement: Walmart wants to open a chain of stores across the state of Florida, and it wants to find the optimal store locations to maximize revenue.
Challenge: If they open too many stores close to each other, the stores will not make a profit. But if the stores are too far apart, they will not have enough sales coverage.
Solution: An organization like Walmart is also an e-commerce giant, so it already has the addresses of its customers in its database. It can use this information and perform K-Means clustering to find the optimal locations.
2. Problem Statement: A pizza chain wants to open its delivery centers across a city. What do you think would be the possible challenges?
- They need to analyze the areas from which pizza is ordered frequently.
- They need to understand how many pizza stores have to be opened to cover delivery in the area.
- They need to figure out the locations for the pizza stores within all these areas in order to keep the distance between each store and its delivery points to a minimum.
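To get a feel for the last two challenges, the sketch below scores a candidate set of store locations by the average straight-line distance from each order to its nearest store. The coordinates are made up, `mean_delivery_distance` is a hypothetical helper, and real delivery distances would follow roads rather than straight lines.

```python
import math

def mean_delivery_distance(orders, stores):
    """Average straight-line distance from each order to its nearest store."""
    nearest = [min(math.dist(order, store) for store in stores) for order in orders]
    return sum(nearest) / len(nearest)

# Made-up order locations on a city grid, in kilometres.
orders = [(0, 0), (0, 2), (10, 0), (10, 2)]

one_store = mean_delivery_distance(orders, [(5, 1)])
two_stores = mean_delivery_distance(orders, [(0, 1), (10, 1)])

print(round(one_store, 2), two_stores)  # 5.1 1.0
```

The two store positions used here are the means of the two groups of orders, which is where k-means with K = 2 would place its centroids on this data.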
Resolving these challenges involves a lot of analysis and mathematics. Clustering provides a meaningful and easy method of sorting out such real-life challenges.
K-means Clustering Method:
If k is given, the K-means algorithm can be executed in the following steps:
- Partition the objects into k non-empty subsets.
- Identify the cluster centroids (mean points) of the current partition.
- Compute the distance from each point to every centroid, and allot each point to the cluster whose centroid is at the minimum distance.
- After re-allotting the points, find the centroids of the new clusters formed, and repeat until the points stop changing clusters.
The step by step process:
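The steps listed above can be put together in a short from-scratch sketch. This is one simple variant rather than a definitive implementation: it starts from K randomly chosen data points, uses Euclidean distance, and stops when no point changes cluster. The `players` data is made up, and `k_means`, `max_iter`, and `seed` are names chosen for this example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Made-up (wickets, runs) pairs: three batsmen followed by three bowlers.
players = [(0, 310), (1, 280), (2, 350), (18, 40), (21, 25), (16, 60)]

centroids, labels = k_means(players, k=2)
print(labels)     # the first three players share one label, the last three the other
print(centroids)
```

Which cluster ends up with label 0 depends on the random start, so only the grouping is meaningful, not the label numbers.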
So this is k-means clustering and its use cases. Please give a clap 👏👏