Clustering is essentially the process of breaking a large heterogeneous population down into smaller homogeneous groups.
We have taken one imaginary point in each of the two groups above and drawn a circle around it. All the observations that fall within that circle are grouped together into one cluster.
K-means clustering:
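The name means that, when this algorithm is applied, it breaks the data into K different clusters. If one of the clusters ends up with no observations assigned to it, the result can contain fewer than K clusters (for example K-1), depending on how the implementation handles empty clusters.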
The value of K has to be specified before the algorithm starts. Suppose we take K as 3; the algorithm will then place 3 seeds at different random points.
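The seeding step can be sketched in a few lines of Python. This is only an illustration: it assumes the seeds are picked from among the observations themselves (one common choice; other seeding schemes exist), and the names pick_seeds and rng_seed are made up for this example.

```python
import random

def pick_seeds(points, k, rng_seed=None):
    """Pick k distinct observations to act as the initial seeds (illustrative helper)."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of observations")
    rng = random.Random(rng_seed)  # a fixed rng_seed makes the choice repeatable
    return rng.sample(points, k)   # sampling without replacement

# Toy 2-D observations, invented for this example
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
seeds = pick_seeds(points, k=3, rng_seed=0)
print(seeds)
```

Because the seeds are random, two runs with different values of rng_seed can start from different points and may end with different clusters.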
Now the points are assigned to these seeds based on their proximity, or closeness. To check which seed a point is closer to, we draw a straight line between two seeds and then draw the perpendicular bisector through its midpoint.
Points on the left of the bisector are closer to the left seed, and points on the right are closer to the right seed. This gives the first set of clusters. After this first iteration, the algorithm calculates the centroid of each cluster, that is, it identifies the center of each cluster, and uses that centroid as the seed for the next round.
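Checking which side of the perpendicular bisector a point falls on is the same as asking which seed it is closer to, so in code the assignment step is usually written as a distance comparison. Here is a minimal sketch, assuming straight-line (Euclidean) distance and Python 3.8 or later for math.dist; the function name assign_to_nearest and the sample data are invented for illustration.

```python
import math

def assign_to_nearest(points, seeds):
    """Return, for each point, the index of the closest seed (illustrative helper)."""
    labels = []
    for p in points:
        distances = [math.dist(p, s) for s in seeds]  # Euclidean distance to every seed
        labels.append(distances.index(min(distances)))
    return labels

# Toy data: two points near the first seed, two near the second
points = [(1.0, 1.0), (2.0, 1.5), (8.0, 8.0), (9.0, 7.5)]
seeds = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_to_nearest(points, seeds)
print(labels)  # [0, 0, 1, 1]
```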
The centroid of a cluster is its center: the mean position of all the points in that cluster. It represents the cluster better than the initial seed did.
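As a small worked example, the centroid can be computed by averaging each coordinate over the points in the cluster. The helper name centroid and the three sample points below are assumptions made for this sketch.

```python
def centroid(cluster):
    """Average each coordinate over all points in one (non-empty) cluster."""
    n = len(cluster)
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dims))

# Toy cluster of three 2-D points
cluster = [(1.0, 1.0), (2.0, 3.0), (3.0, 5.0)]
new_seed = centroid(cluster)
print(new_seed)  # (2.0, 3.0)
```

Here the x values average to (1 + 2 + 3) / 3 = 2 and the y values to (1 + 3 + 5) / 3 = 3, so the new seed for this cluster is (2.0, 3.0).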
We calculate the centroids and assign them as the new seeds, and then restart the process of assigning observations to the seeds. We again draw a straight line between each pair of the 3 seeds and draw the perpendicular bisectors. In this way, each observation is assigned to one of the seeds.
Clustering algorithm working (in short)
It starts off by taking random seeds and creating clusters around them; it then calculates the centroids of those clusters and assigns them as the new seeds. After that, it creates new clusters around these centroids.
This process continues until the boundaries of the clusters cease to change. Once the boundaries cease to change, the algorithm has found a stable solution. That is, we have found the best clusters reachable from the chosen starting seeds, and that is where the algorithm ends. Normally, in real life, the algorithm takes 5 to 25 iterations to arrive at a solution.
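Putting the steps together, the whole loop can be sketched as below. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, seeds drawn from the observations, and Python 3.8 or later; the names kmeans, max_iter and rng_seed are invented here, and keeping the old seed when a cluster goes empty is just one possible choice.

```python
import math
import random

def kmeans(points, k, max_iter=100, rng_seed=None):
    """Illustrative K-means loop: assign, recompute centroids, repeat until stable."""
    rng = random.Random(rng_seed)
    seeds = rng.sample(points, k)  # random starting seeds taken from the data
    labels = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assignment step: each point goes to its closest seed
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, seeds[j])) for p in points
        ]
        if new_labels == labels:  # boundaries stopped changing: stable solution
            break
        labels = new_labels
        # Update step: each seed moves to the centroid of its cluster
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old seed
                seeds[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return seeds, labels, iterations

# Toy data: two visibly separate groups of points
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, iterations = kmeans(points, k=2, rng_seed=1)
print(centroids, labels, iterations)
```

The returned iterations value counts how many passes ran before the assignments stopped changing, which makes it easy to check how quickly a given data set settles.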