Over a million developers have joined DZone.

K-means Clustering and Voronoi Sets

DZone's Guide to

K-means Clustering and Voronoi Sets

· Big Data Zone
Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

In the context of http://latex.codecogs.com/gif.latex?k-means, we want to partition the space of our observations into http://latex.codecogs.com/gif.latex?k classes. Each observation belongs to the cluster with the nearest mean. Here “nearest” is in the sense of some norm, usually the http://latex.codecogs.com/gif.latex?\ell_2 (Euclidean) norm.

Consider the case where we have 2 classes. The means being respectively the 2 black dots. If we partition based on the nearest mean, with the http://latex.codecogs.com/gif.latex?\ell_2 (Euclidean) norm we get the graph on the left, and with the http://latex.codecogs.com/gif.latex?\ell_1 (Manhattan) norm, the one on the right,

Points in the red region are closer to the mean in the upper part, while points in the blue region are closer to the mean in the lower part. Here, we will always use the standard http://latex.codecogs.com/gif.latex?\ell_2 (Euclidean) norm. Note that the graph above is related to Voronoi diagrams (or Voronoy, from Вороний in Ukrainian, or Вороно́й in Russian) with 2 points, the 2 means.

In order to illustrate the http://latex.codecogs.com/gif.latex?k-means clustering algorithm (here Lloyd’s algorithm) consider the following dataset

pts <- cbind(X=rnorm(500,rep(seq(1,9,by=2)/10,100),.022),Y=rnorm(500,.5,.15))

Here, we have 5 groups.  So let us run a 5-means algorithm here.

  • We draw randomly 5 points in the space (intial values for the means), http://latex.codecogs.com/gif.latex?\boldsymbol{\mu}_1^{(1)},\cdots,\boldsymbol{\mu}_k^{(1)}
  • In the assignment step, we assign each point to the nearest mean


  • In the update step, we compute the new centroids of the clusters


To visualize it, see:

The code the get the clusters is:

kmeans(pts, centers=5, nstart = 1, algorithm = "Lloyd")

Observe that the assignment step is based on computations of Voronoi sets. This can be done in R using:

V <- voronoi.mosaic(means[,1],means[,2])
P <- voronoi.polygons(V)

This is what we can visualize below:

The code to visualize the http://latex.codecogs.com/gif.latex?k means, and the http://latex.codecogs.com/gif.latex?k clusters (or regions), use:

km1 <- kmeans(pts, centers=5, nstart = 1, algorithm = "Lloyd")
CL5 <- brewer.pal(5, "Pastel1")
V <- voronoi.mosaic(km1$centers[,1],km1$centers[,2])
P <- voronoi.polygons(V)

Here, starting points are draw randomly. If we run it again, we might get:


On that dataset, it is difficult to get cluster that are the five groups we can actually see. If we use:

A <- c(rep(.2,100),rep(.2,100),rep(.5,100),rep(.8,100),rep(.8,100))
B <- c(rep(.2,100),rep(.8,100),rep(.5,100),rep(.2,100),rep(.8,100))
pts <- cbind(X=rnorm(500,A,.075),Y=rnorm(500,B,.075))

we usually get something better.

Colors are obtained from clusters of the http://latex.codecogs.com/gif.latex?k-means function, but additional lines are obtained using as outputs of Voronoi diagrams functions.

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.


{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}