Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Leveraging PAM (Partitioning Around Medoids) Implementation in R

DZone's Guide to

Leveraging PAM (Partitioning Around Medoids) Implementation in R

An introductory level tutorial to the concept of Partitioning Around Medoids (PAM) and how to use this in R.

· Big Data Zone ·
Free Resource

How to Simplify Apache Kafka. Get eBook.

For those who are aware of K-means clustering, Partitioning Around Medoids (PAM) should be easier to understand and utilize.

Before I discuss and show the rapidity with which R can accomplish such partitioning in a given data set, it will be good to understand what PAM is and how it works algorithmically. Hopefully, this will serve as an intuitive and no-code introduction to the algorithm for readers who do not have a Computer Science or Data Science background.

Medoids, similar in concept to clusters, can be understood as the centers around which observations are expected to gather or position, based on satisfaction of certain similarity criterion. 

The process of selection (and also of un-selection), progresses either with a given set of Medoids or randomly assigned ones. 

So, we have two bins to collect any given set of observations per medoid.

The first one is the one for selection. It's easy to imagine picking your favorite fruit depending on how ripe or fresh it is. As you do that, you are also un-selecting fruits that were not up to your taste and preferences. That becomes the other bin, where you leave those that were not chosen.

Of course, when you make a selection, you could also have a change of mind, where you would prefer to swap the recently selected one with another (which now appears much better in comparison) from the leftovers.

If you do that, there is a cost associated with such swapping along with the initially perceived benefit.

That's the simplified explanation of PAM. You take a bin for selection, you start with your ideas or criteria for selection, you possibly have an ideal fruit in mind, including color, smell (indicating ripeness), weight, texture, and possibly taste. Such criteria become your initial medoid selection criteria, as you begin your search for the perfect collection for a given price. 

Apart from selection criteria, you will also need to have an evaluation mechanism which can measure the similarity or dissimilarity of a potential object (in this case, the fruit of your choosing) for selection. 

For instance, you can use any known distance measurement techniques such as Manhattan, Canberra, Minkowski or Euclidean. Or you could use a custom one. The point is that you will utilize the distance computation to estimate how far away or close your current instance of selection is from the instantiation of your ideal set of criteria.

And so, you apply it to each of your potential selections until you exhaust the consideration set.

This is your build phase, in which you begin making selections based on the computation of similarity or proximity to an ideal set.

In order to be meaningful, each of your selections must be the closest one possible to all of your previously selected items (assuming that the selection criteria for each selection remain identical).

We can use this process to answer questions like: Does it really belong here? Can it be assimilated with prior selections?

The PAM approach is used to gauge the value of each selection based on the aggregate magnitude of how many other potential items for selection it has managed to surpass, in terms of similarity.

That is, each selection must be looked at in the light of how many data points would belong, even though they may apparently resemble the current item closely (think about differentiating an orange from a tangerine when selecting oranges).

The essence of this is that you must justify each selection in the contrasting light of rejection/misfit of another competing object. The sharper the contrast, the better the selection.

Those observations that need to be re-evaluated in contrast with those yet un-selected items contribute to the "swap" phase.

The trade-offs are perceived benefits (to what extent the current swap will improve or worsen your overall selection and at what cost, in terms of time, cognitive effort, and impact on cost to be borne).

The extent to which you will have to swap items from the selected bin to the un-selected bin, and vice versa, is an indicator of the stability of your algorithm.

Ultimately, the objective of this algorithm is to minimize the disconnect or dissimilarity between the chosen ones, in contrast with those who were not chosen as per the given selection criteria.

Let's see the code in action. It is not overly complex due to the power R draws from being a functional language.

I have simulated the dataset as four groups of observations with varying mean and standard deviation. 

## method 3 Medoids  Kaufman and Rousseeuw(1990)
x <- rbind(cbind(rnorm(50, 4, 3), rnorm(50, 4, 3),rnorm(50,4,3),rnorm(50,4,3)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5),rnorm(15,5,0.5),rnorm(15,5,0.5)),
           cbind(rnorm(10,3,2), rnorm( 10,3,2),rnorm(10,3,2),rnorm(10,3,2)),
   cbind(rnorm(9,3,0.5), rnorm( 9,3,0.5),rnorm(9,3,0.5),rnorm(9,3,0.5)))

So, we have four sets of observations with corresponding central tendencies and deviations distributed normally. However, the observation sets themselves do not mean that they indeed belong there. 

We need to discover whether they belong by applying PAM.

pam(x,4,diss=FALSE,do.swap=TRUE,metric="euclidean")

So, here it is: we take the dataset, enable swapping, and specify the metric for discovering whether the similarity is Euclidean.

Here is the output:

Medoids:
     ID                                     
[1,] 64 5.715970 4.580717 4.908412 5.0204445
[2,]  2 2.414596 6.762871 2.135370 5.8222303
[3,] 45 4.313428 2.943159 5.955265 0.8609366
[4,] 80 2.831837 2.634825 2.999780 2.9415961
Clustering vector:
 [1] 1 2 2 3 2 1 4 2 4 1 3 3 3 4 3 3 1 2 1 4 1 1 1 4 1 4 4 2 1 2 1 3 4 1 2 2 1 1
[39] 1 3 1 2 3 4 3 1 4 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 3 4 4 4 4
[77] 4 4 4 4 4 4 4 4
Objective function:
   build     swap 
3.343045 3.319747 

Available components:
 [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
 [6] "clusinfo"   "silinfo"    "diss"       "call"       "data" 

You'll notice the Medoid formation. We got four, but those are not necessarily identical to the initial assignment present in the simulation dataset.

The clustering assignments can be seen for each observation (Clustering Vector). 

You probably noticed the "silinfo" bit. Leaving aside the phonetics, this piece of information captures the main theme in the algorithm, which is about how astute and stable the assignment of any given observation is.

Let us see what it is. For each observation, you can see the cluster to which it is currently assigned and the nearest neighbor to which it could have belonged.

$`widths`
   cluster neighbor   sil_width
64       1        4  0.44397266
62       1        4  0.43668946
61       1        4  0.43549370
55       1        4  0.43402252
51       1        2  0.42210648
65       1        4  0.41162922
54       1        2  0.40672139
57       1        2  0.39872645
58       1        4  0.38877098

A positive sign indicates that the assignment to the current cluster is stable and that its given observation is less dissimilar to other observations in the current cluster than it is from observations in the nearest neighboring cluster.

The magnitude of this metric indicates how accurately the partitioning and corresponding assignment has been done, hinting towards the stability of each observation.

Considering both aspects together, a large positive value of this metric indicates a sharp and stable partition of data among clusters/medoids and corresponding assignment of observations.

On the other hand, a large negative value of this metric indicates an unstable, fudgy partitioning, where the current assignment can easily change, if redone, for those observations.

Obviously, it is much easier to visualize this concept. R provides a built-in function for this too.

silplot<-silhouette(c4$clustering,dist(x,"euclidean"))
rainbow <- c("red", "forest green", "blue", "purple")
plot(silplot,col=rainbow[1:4])


PAM

You can easily notice that those partitions for which negative values of the silhouette's width exists are the ones susceptible to instability.

The references section below provide links where you can learn more about the theory and implementation of this algorithm in R.

12 Best Practices for Modern Data Ingestion. Download White Paper.

Topics:
clustering methods ,partitioning ,big data ,r tutorials for beginners

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}