Understanding Core Data Science Algorithms: K-Means and K-Medoids Clustering
This article provides an overview of core data science algorithms used in statistical data analysis, specifically k-means and k-medoids clustering.
Join the DZone community and get the full member experience.Join For Free
Clustering is one of the major techniques used for statistical data analysis.
As the term suggests, “clustering” is defined as the process of gathering similar objects into different groups or distribution of datasets into subsets with a defined distance measure.
K-means clustering is touted as a foundational algorithm every data scientist ought to have in their toolbox. The popularity of the algorithm in the data science industry is due to its extraordinary features:
How Does it Work?
K-means and k-medoids are methods used in partitional clustering algorithms whose functionality works based on specifying an initial number of groups or, more precisely, iteratively by reallocation of objects among groups.
The algorithm works by first segregating all the points into an already selected number of clusters. The process is carried out by measuring the distance between the point and the center of each cluster. And because k-means can function only in the Euclidean space, the functionality of the algorithm is limited. Despite the drawbacks or shortcomings of algorithm possesses, k-means is still one of the most powerful tools used in clustering. The applications can be seen widely used in multiple fields – physical sciences, natural language processing (NLP), and healthcare.
The extension of the k-means algorithm involves smarter starting positions for its k-centers, which further allows for more variable cluster sizes. When this happens, the distance that gets created will be more than the Euclidean distance.
Furthermore, we will talk about different other methods such as CLARANS, CLARA, and PAM that helps integrate the distance measured beyond Euclidean distance.
However, before talking about the other methods let us address the shortcomings of k-means clustering.
Most often, outliers occur due to fraud behavior, human errors, and mechanical faults. And can be seen in k-means clustering too. First, the k-means clustering algorithm needs to be applied in a data set then you can start identifying outliers from each cluster. The distance-based method and cluster-based method to identify or detect outliers and anomalies in a data set.
The major objective is to first detect the outlier and then remove it making clustering more reliable.
Below are the points that project the failure of k-means clustering:
- Does not work properly when the clusters differ in size and density.
- Predicting the accurate number of centroids to divide the data gets difficult.
- Initial placement of the k-centroid tends to affect the result.
- The centroid is an imaginary point in the data set, this might be of less value.
- Sensitive to the scale of dimensions thus rescaling data can get difficult.
- Utilizes Euclidean distance while dividing points. However, it can become ineffective in a high dimensional setup since all points become equally distant from one another.
- The algorithm divides space even when partition makes no sense.
Partitioning Around Medoids (PAM) Algorithm
Besides the mean of the cluster, you can use medoid for partition or maybe the data point located right at the central point in the cluster. The medoid is said to have the least dissimilar point to all the points in a cluster. Medoid is less sensitive to outliers in a data set.
The clustering algorithm demonstrates itself under unsupervised learning in machine learning (ML). One of the major ideas behind k-means is that we want to add new points (k) to the data we already have – each of these points is called the centroid. The k-means algorithm is one of the simplest data science algorithms every data scientist must have in their toolbox.
Now, these partitions can use arbitrary distance while not always relying on Euclidean distance. This is one of the most critical points in PAM, CLARA, and CLARANS.
Below are the steps involved in PAM:
- Given k
- Now pick random k to be the initial medoid
- Each of these instances needs to be assigned to the nearest medoid (x)
- Then calculate the objective function i.e. sum pf dissimilarities of every instance to the nearest medoids
- Select any random instance (y)
- Replace x by y, when this happens and the swap or replacement reduces the function
- Then repeat (3-6) until no further change
CLARA (Clustering For Large Applications) is a faster version of PAM that helps enable the nested order of loops in the algorithm. We need a faster version of PAM just in case the time complexity of the PAM algorithm slows down as compared to the k-means algorithm.
Despite the multiple drawbacks of the k-means clustering algorithm like being susceptible to outliers, being reliant on Euclidean distance, and gathering centroids that do not represent the real data point - PAM, CLARA, and CLARANS played an important role in resolving the issue.
Published at DZone with permission of Niti Sharma. See the original article here.
Opinions expressed by DZone contributors are their own.