
Clusters of (French) Regions

Here's a neat data science example with functions that illustrate cluster analysis, using data from the 2012 French elections.

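For tomorrow's data science course, I wanted to post some functions to illustrate cluster analysis. Consider the dataset of the first round of the 2012 French presidential election. We keep the first 96 rows and the columns whose names start with X..Voix.Exp, which hold each candidate's share of the valid votes (in percent):

> elections2012=read.table(
+ "http://freakonometrics.free.fr/elections_2012_T1.csv",sep=";",dec=",",header=TRUE)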
> voix=which(substr(names(
+ elections2012),1,11)=="X..Voix.Exp")
> elections2012=elections2012[1:96,]
> X=as.matrix(elections2012[,voix])
> colnames(X)=c("JOLY","LE PEN","SARKOZY","MÉLENCHON","POUTOU","ARTHAUD","CHEMINADE","BAYROU","DUPONT-AIGNAN","HOLLANDE")
> rownames(X)=elections2012[,1]
> cah=hclust(dist(X))
> plot(cah,cex=.6)
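
To see what hclust(dist(X)) does on a smaller scale, here is a minimal sketch on a made-up matrix of vote shares. The four row labels and all the numbers are hypothetical and are not taken from the elections file. With their default arguments, dist computes Euclidean distances between rows and hclust uses complete linkage.

```r
# Toy matrix: 4 hypothetical units, 3 hypothetical candidates (shares in %)
toy <- matrix(c(30, 25, 10,
                29, 26, 11,
                15, 35, 20,
                16, 33, 22),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"),
                              c("cand1", "cand2", "cand3")))

# Pairwise Euclidean distances between rows (the default method of dist)
d <- dist(toy)

# Agglomerative clustering with complete linkage (the default method of hclust)
toy_cah <- hclust(d)

# Each row of $merge is one merging step;
# a negative entry -j means that observation j joined at that step
print(toy_cah$merge)

# Heights (distances) at which the successive merges happen
print(toy_cah$height)
```

Here A and B are merged first, then C and D, and the two pairs are joined last, at a height equal to the largest distance between a point of one pair and a point of the other.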

To get five groups, we have to cut the tree at the appropriate height:

> rect.hclust(cah,k=5)
> groups.5 <- cutree(cah,5)
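
As a small illustration of this pruning step, here is a sketch with cutree on simulated data. The matrix sim and its row labels are invented for the example. cutree returns one integer label per observation, named after the rows of the data, and cutting the same tree at different values of k gives nested partitions.

```r
set.seed(1)

# Simulated data: 12 hypothetical units described by 3 numeric variables
sim <- matrix(rnorm(36), nrow = 12,
              dimnames = list(paste0("unit", 1:12), c("v1", "v2", "v3")))
sim_cah <- hclust(dist(sim))

# Cut the tree into 3 groups: a named integer vector, one label per row
g3 <- cutree(sim_cah, k = 3)
print(g3)

# Number of units in each group
print(table(g3))

# Cut the same tree into 4 groups and cross-tabulate:
# every group obtained with k = 4 lies inside a single group obtained with k = 3
g4 <- cutree(sim_cah, k = 4)
print(table(g3, g4))
```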

We have to zoom in to visualize the French regions.

It is also possible to use the dendroextras package to colour the branches of the tree by cluster:

> library(dendroextras)
> plot(colour_clusters(cah,k=5))

And again, we can zoom in to see which regions fall in each cluster.

To interpret the clusters, we can compute the average score of each candidate within each group:

> aggregate(X,list(groups.5),mean)
  Group.1     JOLY   LE PEN  SARKOZY
1       1 2.185000 18.00042 28.74042
2       2 1.943824 23.22324 25.78029
3       3 2.240667 15.34267 23.45933
4       4 2.620000 21.90600 34.32200
5       5 3.140000  9.05000 33.80000
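
The same aggregate call on a tiny made-up matrix shows how to read this kind of output: each row is one cluster, and each column is the average share of a candidate within that cluster. All labels and values below are hypothetical.

```r
# Hypothetical vote shares (in %) for 4 units and 2 candidates
shares <- matrix(c(10, 40,
                   12, 38,
                   30, 20,
                   34, 18),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("unit", 1:4), c("candA", "candB")))

# Hypothetical cluster labels, in the format returned by cutree
grp <- c(unit1 = 1, unit2 = 1, unit3 = 2, unit4 = 2)

# Mean of each column within each cluster
centers <- aggregate(shares, list(grp), mean)
print(centers)
```

Note that these are unweighted means: every unit counts equally, whatever its number of voters, so the averages describe a typical member of the cluster rather than the cluster's overall vote.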

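It is also possible to visualize those clusters on a map, using the maps package:

> library(RColorBrewer)
> library(stringr)
> library(maps)
> CL=brewer.pal(8,"Set3")
> carte_classe <- function(groupes){
+ # normalize the names from the elections file: lower case, no separators
+ elections2012$dep <- elections2012[,2]
+ elections2012$dep <- tolower(elections2012$dep)
+ elections2012$dep <- str_replace_all(elections2012$dep, pattern = " |-|'|/", replacement = "")
+ # same normalization for the polygon names of the map (plot=FALSE: nothing is drawn yet)
+ france<-map(database="france", plot=FALSE)
+ france$dep <- france$names
+ france$dep <- tolower(france$dep)
+ france$dep <- str_replace_all(france$dep, pattern = " |-|'|/", replacement = "")
+ # correspondence between the row labels (column 1) and the normalized names
+ corresp_noms <- elections2012[, c(1,2, ncol(elections2012))]
+ # the two sources spell Corse-du-Sud differently
+ corresp_noms$dep[which(corresp_noms$dep %in% "corsesud")] <- "corsedusud"
+ # shift the group labels by one to use colours 2, 3, ... of the palette
+ col2001<-groupes+1
+ names(col2001) <- corresp_noms$dep[match(names(col2001), corresp_noms[,1])]
+ # one colour index per polygon of the map, then draw the filled map
+ color <- col2001[match(france$dep, names(col2001))]
+ map(database="france", fill=TRUE, col=CL[color])
+ }
> carte_classe(cutree(cah,5))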

or, if we simply want four clusters:

> carte_classe(cutree(cah,4))
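
Most of carte_classe is about reconciling two spellings of the same names: those in the elections file and those in the maps database. The core of that step can be sketched in base R with gsub and match. The helper normalize, the two name vectors and the group labels below are invented for the example and are not the actual contents of either source.

```r
# Hypothetical helper: lower case, then drop spaces, hyphens, apostrophes and slashes
normalize <- function(x) gsub(" |-|'|/", "", tolower(x))

# Hypothetical spellings of the same three names in two different sources
names_file <- c("Pas-de-Calais", "Val-d'Oise", "Seine-et-Marne")
names_map  <- c("Seine et Marne", "Pas de Calais", "Val d'Oise")

key_file <- normalize(names_file)
key_map  <- normalize(names_map)
print(key_file)

# Hypothetical cluster labels, given in the order of the file
groups <- c(1, 2, 3)
names(groups) <- key_file

# For each polygon of the map, look up the cluster of the matching row of the file
color_index <- groups[match(key_map, names(groups))]
print(color_index)
```

A name that still differs after normalization gives NA in match, which is why the function has to patch such cases by hand, as it does for Corse-du-Sud.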

Topics:
data science, presidential elections, visualization

Published at DZone with permission of
