Leveraging PAM (Partitioning Around Medoids) Implementation in R

An introductory-level tutorial on the concept of Partitioning Around Medoids (PAM) and how to use it in R.

By Abhijit Telang · Mar. 07, 19 · Tutorial

For those who are familiar with k-means clustering, Partitioning Around Medoids (PAM) should be easy to understand and utilize.

Before I show how quickly R can accomplish such partitioning on a given data set, it is worth understanding what PAM is and how it works algorithmically. Hopefully, this will serve as an intuitive, no-code introduction to the algorithm for readers who do not have a computer science or data science background.

Medoids, similar in concept to cluster centroids, can be understood as the centers around which observations are expected to gather, based on how well they satisfy certain similarity criteria. Unlike centroids, however, a medoid is always an actual observation from the data set.

The process of selection (and also of un-selection) progresses either from a given set of medoids or from randomly assigned ones.

So, we have two bins to collect any given set of observations per medoid.

The first bin is for selection. It's easy to imagine picking your favorite fruit depending on how ripe or fresh it is. As you do that, you are also un-selecting fruits that are not up to your taste and preferences. Those go into the other bin, where you leave the ones that were not chosen.

Of course, when you make a selection, you could also have a change of mind, where you would prefer to swap the recently selected one with another (which now appears much better in comparison) from the leftovers.

If you do that, there is a cost associated with such swapping along with the initially perceived benefit.

That's the simplified explanation of PAM. You take a bin for selection and start with your criteria: you probably have an ideal fruit in mind, including its color, smell (indicating ripeness), weight, texture, and possibly taste. Such criteria become your initial medoid selection criteria as you begin your search for the perfect collection at a given price.

Apart from selection criteria, you will also need an evaluation mechanism that can measure the similarity or dissimilarity of a potential object (in this case, the fruit of your choosing).

For instance, you can use any of the well-known distance measures, such as Manhattan, Canberra, Minkowski, or Euclidean, or you could use a custom one. The point is that you will use the distance computation to estimate how far away from, or close to, the instantiation of your ideal set of criteria your current candidate is.
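
As a minimal sketch of what such a computation looks like in R, the base dist() function supports several of these metrics directly. The tiny matrix of hypothetical "fruits" below is just a placeholder for illustration:

## Pairwise dissimilarities between three hypothetical fruits described by
## two numeric features; only the method argument changes between calls.
fruit <- rbind(c(7.1, 150), c(6.8, 140), c(3.2, 90))

dist(fruit, method = "euclidean")          # straight-line distance
dist(fruit, method = "manhattan")          # sum of absolute differences
dist(fruit, method = "canberra")           # weighted absolute differences
dist(fruit, method = "minkowski", p = 3)   # generalized metric of order p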

And so, you apply it to each of your potential selections until you exhaust the consideration set.

This is your build phase, in which you begin making selections based on the computation of similarity or proximity to an ideal set.

In order to be meaningful, each of your selections must be the closest one possible to all of your previously selected items (assuming that the selection criteria for each selection remain identical).

We can use this process to answer questions like: Does it really belong here? Can it be assimilated with prior selections?

The PAM approach gauges the value of each selection based on how many other potential candidates it has managed to surpass in terms of similarity.

That is, each selection must be looked at in the light of which data points would belong with it and which would not, even though they may appear to resemble the current item closely (think about differentiating an orange from a tangerine when selecting oranges).

The essence of this is that you must justify each selection in the contrasting light of rejection/misfit of another competing object. The sharper the contrast, the better the selection.

Those observations that need to be re-evaluated against the items that remain un-selected contribute to the "swap" phase.

The trade-off is between the perceived benefit (to what extent the current swap will improve or worsen your overall selection) and the cost of making it, in terms of time, cognitive effort, and the price to be borne.

The extent to which you have to swap items from the selected bin to the un-selected bin, and vice versa, is an indicator of the stability of your clustering.

Ultimately, the objective of the algorithm is to minimize the overall dissimilarity between the observations and the medoids they are assigned to, sharpening the contrast with those that were not chosen under the given selection criteria.

Let's see the code in action. It is not overly complex due to the power R draws from being a functional language.

I have simulated the dataset as four groups of observations with varying means and standard deviations.

## Simulate four groups of observations, each drawn from a normal
## distribution with its own mean and standard deviation
## (cf. Kaufman and Rousseeuw, 1990)
library(cluster)  # provides pam() and silhouette()

x <- rbind(cbind(rnorm(50, 4, 3),   rnorm(50, 4, 3),   rnorm(50, 4, 3),   rnorm(50, 4, 3)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5), rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)),
           cbind(rnorm(10, 3, 2),   rnorm(10, 3, 2),   rnorm(10, 3, 2),   rnorm(10, 3, 2)),
           cbind(rnorm(9, 3, 0.5),  rnorm(9, 3, 0.5),  rnorm(9, 3, 0.5),  rnorm(9, 3, 0.5)))

So, we have four sets of observations, each normally distributed around its own central tendency with its own spread. However, the way the observations were generated does not mean that they indeed belong together in those groups.

We need to discover whether they belong by applying PAM.

c4 <- pam(x, 4, diss = FALSE, do.swap = TRUE, metric = "euclidean")  # store the result for later use
c4

So, here it is: we ask for four medoids, enable the swap phase, and specify Euclidean distance as the dissimilarity metric. Setting diss=FALSE tells pam() that x is a data matrix rather than a precomputed dissimilarity matrix.

Here is the output:

Medoids:
     ID                                     
[1,] 64 5.715970 4.580717 4.908412 5.0204445
[2,]  2 2.414596 6.762871 2.135370 5.8222303
[3,] 45 4.313428 2.943159 5.955265 0.8609366
[4,] 80 2.831837 2.634825 2.999780 2.9415961
Clustering vector:
 [1] 1 2 2 3 2 1 4 2 4 1 3 3 3 4 3 3 1 2 1 4 1 1 1 4 1 4 4 2 1 2 1 3 4 1 2 2 1 1
[39] 1 3 1 2 3 4 3 1 4 2 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 3 4 4 4 4
[77] 4 4 4 4 4 4 4 4
Objective function:
   build     swap 
3.343045 3.319747 

Available components:
 [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
 [6] "clusinfo"   "silinfo"    "diss"       "call"       "data" 

You'll notice the medoid formation. We got four medoids, but they are not necessarily aligned with the four groups built into the simulated dataset.
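
The Objective function entries summarize the quality of the solution after the build phase and after the swap phase; in the cluster package this is the average dissimilarity of each observation to its closest medoid. As a rough sanity check of that reading (assuming, as above, that the fitted object is stored in c4), the swap-phase value can be recomputed by hand:

## Average Euclidean distance of each observation to its assigned medoid;
## if the reading above is correct, this should match the "swap" value.
d <- as.matrix(dist(x, method = "euclidean"))
mean(d[cbind(seq_len(nrow(x)), c4$id.med[c4$clustering])])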

The clustering assignments can be seen for each observation (Clustering Vector). 
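
Since the result was stored in c4, the assignments are also available programmatically; for example, you can tally how many observations landed in each cluster:

## Cluster assignment of each observation, and the size of each cluster
c4$clustering
table(c4$clustering)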

You probably noticed the "silinfo" bit. Leaving aside the phonetics, this piece of information captures the main theme in the algorithm, which is about how astute and stable the assignment of any given observation is.

Let us see what it is. For each observation, you can see the cluster to which it is currently assigned and the nearest neighboring cluster to which it could otherwise have belonged.
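
This per-observation detail can be pulled straight from the fitted object; a minimal sketch, assuming the result is stored in c4 as above (the output that follows is truncated to the first few rows):

## Silhouette details: assigned cluster, nearest neighboring cluster,
## and silhouette width for every observation
c4$silinfo$widths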

$`widths`
   cluster neighbor   sil_width
64       1        4  0.44397266
62       1        4  0.43668946
61       1        4  0.43549370
55       1        4  0.43402252
51       1        2  0.42210648
65       1        4  0.41162922
54       1        2  0.40672139
57       1        2  0.39872645
58       1        4  0.38877098

A positive sign indicates that the assignment to the current cluster is stable and that the given observation is less dissimilar to the other observations in its current cluster than it is to the observations in the nearest neighboring cluster.

The magnitude of this metric indicates how accurately the partitioning and the corresponding assignment have been done, hinting at the stability of each observation's assignment.

Considering both aspects together, a large positive value of this metric indicates a sharp and stable partition of data among clusters/medoids and corresponding assignment of observations.

On the other hand, a large negative value of this metric indicates an unstable, fuzzy partitioning, where the current assignment of those observations could easily change if the clustering were redone.
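
The silinfo component also holds per-cluster and overall averages of the silhouette width, which condense this stability assessment into a single number per cluster (again assuming the fitted object is c4):

## Average silhouette width per cluster, and across the whole dataset
c4$silinfo$clus.avg.widths
c4$silinfo$avg.width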

Obviously, it is much easier to visualize this concept. The cluster package provides a function for this, too.

## Build the silhouette object from the PAM clustering and the Euclidean distances
silplot <- silhouette(c4$clustering, dist(x, method = "euclidean"))
rainbow <- c("red", "forestgreen", "blue", "purple")  # one color per cluster
plot(silplot, col = rainbow[1:4])


[Figure: silhouette plot of the four PAM clusters]

You can easily notice that the partitions in which negative silhouette widths exist are the ones susceptible to instability.
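
To pin down exactly which observations those are, the silhouette object built above can be filtered directly; a small sketch reusing silplot:

## Observations with negative silhouette width, i.e. candidates for
## reassignment to their neighboring cluster
unstable <- which(silplot[, "sil_width"] < 0)
silplot[unstable, , drop = FALSE]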

References such as Kaufman and Rousseeuw (1990) and the documentation of R's cluster package are good places to learn more about the theory and the implementation of this algorithm in R.
