Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

When k-means Clustering Fails

DZone's Guide to

When k-means Clustering Fails

A list of several examples of what happens when the k-means clustering algorithm has to deal with groups that have very different sizes.

· Big Data Zone
Free Resource

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Letting the computer automatically find groupings in data is incredibly powerful and is at the heart of “data mining” and “machine learning”. One of the most widely used methods for clustering data is k-means clustering. Unfortunately, k-means clustering can fail spectacularly as in the example below.

Centroid-based clustering algorithms work on multi-dimensional data by partitioning data points into k clusters such that the sum of squares from points to the assigned cluster centers is minimized. In simple terms, clusters contain all of the data points that are “close” to the cluster center.

Many of the simple examples online demonstrate what happens when a clustering algorithm partitions data into roughly equal groups. But what happens when we know ahead of time that the groups we are looking for have very different sizes?

In our case, we are working with pollution monitoring data. Each file is associated with a particular monitor and each record within the file contains the latitude and longitude associated with an hourly measurement made by the monitor. These files give one insight into the life of one of these monitors:

  1. Turned on in the lab for a few hours for testing
  2. Turned on in the parking lot across from the lab for a day of outside testing
  3. Field tested a few miles away
  4. Field tested again after repositioning
  5. Shipped across the country to be deployed
  6. Deployed at site A
  7. Moved to site B
  8. Repositioned at site B

So we have a few records in one or two locations very close to each other. Then a few more records at a couple of sites nearby. Then some longer time series at one or more deployment sites with minor repositioning at the sites. Our goal is to cluster the latitude-longitude paris into groups that differentiate between deployments that are hundreds of miles apart but that group together small movements associated with “repositioning”.

A simple mockup of this situation would be the following:

x<-y<-c(rep(0,3),rep(1,3),rep(10,20),rep(11,20),rep(100,50),rep(101,50))

m<-cbind(x,y)

plot(m,cex=4,lwd=2)

clustersThe most powerful statistical tools we will ever use, our eyes, clearly identify three groups. Lets see how R’s kmeans function does.

layout(matrix(seq(8),ncol=2))

par(mar=c(2,2,2,2))

for(iin1:8){

clus<-stats::kmeans(m,3)

plot(x,y,xlab='',ylab='',axes=FALSE,xpd=NA,

cex=4,pch=as.character(clus$cluster))

# Test the within cluster sum of squares

col<-ifelse(clus$tot.withinss<500,'black','red')

box(col=col)

text(50,50,'kmeans',cex=2,font=2)

kmeansWhen you run the code you may get different results because kmeans starts by picking random numbers. With our experimental dataset, the R implementation of kmeans generates bogus clusters for this dataset about 20% of the time. And the cluster identifiers change between runs. Yikes!

Luckily for those of us who aren’t statisticians, the CRAN Cluster Analysis Task View has advice which quickly leads to the cluster package which has LOTS of clustering functions.

Cutting to the chase, for our very simple use of clustering, the sister functions pam and clara worked well. Partitioning Around Medoids (pam) is a k-medoids function that you can read more about if you’re really interested in why it works better than k-means. For large datasets pam can be very slow and clara is recommended.

Here are the results of running pam on our dataset.

for(iin1:8){

clus<-cluster::pam(m,3)

plot(x,y,xlab='',ylab='',axes=FALSE,xpd=NA,

cex=4,pch=as.character(clus$cluster))

box()

text(50,50,'pam',cex=2,font=2)

pam

Clusters where we expect them and consistent cluster identifiers between runs.

For clara we will try a slightly larger dataset:

x<-y<-c(rep(0,30),rep(1,30),rep(10,200),rep(11,200),rep(100,500),rep(101,500))

m<-cbind(x,y)

for(iin1:8){

clus<-cluster::clara(m,3,samples=10)

plot(x,y,xlab='',ylab='',axes=FALSE,xpd=NA,

cex=4,pch=as.character(clus$cluster))

box()

text(50,50,'clara',cex=2,font=2)

clara

Excellent results again.

While this may not be news to long-time clustering gurus it was a little sobering for us. It taught us once again that just because you’ve heard of some fancy algorithm and are using R’s implementation of that algorithm does not guarantee that you’ll get the results you expect. You always have to inspect results visually.

In the end, “The eyes have it.”

Best of luck creating meaningful clusters!

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:
r language ,statistical analysis

Published at DZone with permission of Jonathan Callahan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}