Heuristics on Correspondence Analysis

Check out this data science example of the heuristics behind correspondence analysis, illustrated with data from the 2002 French presidential election.

In the course on unsupervised techniques for data science, we have been using a dataset with the candidates for the 2002 French presidential election as rows and newspapers as columns. In order to visualize that dataset, consider three candidates and three newspapers:

> base=read.table(
> sb=base[,c(2,3,4)]   # keep three newspapers
> sb=sb[c(4,12,7),]    # keep three candidates
> (N=sb)
 LeFigaro Liberation LeMonde
Jospin 7 41 26
Chirac 35 9 18
Mamere 1 10 7

The first part is based on a description of the rows. Here, each row is seen as a conditional probability distribution over the set of newspapers,

> (L=N/apply(N,1,sum))
 LeFigaro Liberation LeMonde
Jospin 0.09459459 0.5540541 0.3513514
Chirac 0.56451613 0.1451613 0.2903226
Mamere 0.05555556 0.5555556 0.3888889
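
As a quick sanity check (a small addition, not in the original code), each row of L is indeed a probability distribution, so every row should sum to one:

> apply(L,1,sum)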

The “average row” is the marginal distribution of the newspapers,

> (Lbar=apply(N,2,sum)/sum(N))
 LeFigaro Liberation LeMonde 
 0.2792208 0.3896104 0.3311688
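
To make the word “average” precise (a small check added here, not in the original post), Lbar is the weighted mean of the row profiles, with weights given by the candidates' shares of the total counts:

> w=apply(N,1,sum)/sum(N)   # weight of each candidate
> colSums(L*w)              # coincides with Lbar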

If we visualize those candidates in the newspaper simplex (the simplex of the newspaper space), we get a picture like the one sketched below.
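
Here is one possible sketch of that picture (an addition, not the original code): the three row profiles are plotted in the 2-simplex, using barycentric coordinates with the newspapers as vertices,

> V=rbind(c(0,0),c(1,0),c(.5,sqrt(3)/2))   # vertices of the simplex, one per newspaper
> rownames(V)=colnames(N)
> P=as.matrix(L)%*%V                       # barycentric coordinates of the candidates
> plot(V,type="n",xlim=c(-.1,1.1),ylim=c(-.1,1),xlab="",ylab="",axes=FALSE)
> polygon(V)
> text(V,labels=rownames(V),pos=3)
> points(P,pch=19)
> text(P,labels=rownames(N),pos=1)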

But we will not actually stay in the simplex. A PCA is considered instead, with weights on the individuals that take into account the importance of the different candidates, and weights in the scalar product chosen so that the induced distance is related to the chi-square distance rather than the standard Euclidean distance:

> matL0=t(t(L)-Lbar)                # center each row profile at the average row
> library(FactoMineR)
> acpL=PCA(matL0,scale.unit=FALSE,
+   row.w=(apply(N,1,sum)),         # row weights: the candidates' total counts
+   col.w=1/(apply(N,2,sum)))       # column weights: chi-square-type metric
> plot.PCA(acpL,choix="ind",ylim=c(-.02,.02))
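
The column weights 1/colSums(N) are proportional to 1/Lbar (the factor being sum(N)), so the scalar product above induces, up to that constant factor, the chi-square distance between row profiles. A small hand check (an addition, not in the original post):

> dJ=as.numeric(L["Jospin",])
> dC=as.numeric(L["Chirac",])
> sqrt(sum((dJ-dC)^2/Lbar))            # chi-square distance between the two profiles
> sqrt(sum((dJ-dC)^2/apply(N,2,sum)))  # distance used in the PCA: same, divided by sqrt(sum(N))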

The second part is based on a description of the columns. Here, each column is seen as a conditional probability distribution over the set of candidates,

> (C=t(t(N)/apply(N,2,sum)))
 LeFigaro Liberation LeMonde
Jospin 0.16279070 0.6833333 0.5098039
Chirac 0.81395349 0.1500000 0.3529412
Mamere 0.02325581 0.1666667 0.1372549
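
Here again, as a quick sanity check (an addition), each column of C should sum to one, since each newspaper's column is a distribution over the candidates:

> apply(C,2,sum)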

Here again, we can compute the “average column”

> (Cbar=apply(N,1,sum)/sum(N))
 Jospin Chirac Mamere 
0.4805195 0.4025974 0.1168831

In the candidate simplex, the points (the three newspapers) can be drawn in the same way.

But here again, we will not use that simplex. We consider a PCA with two vectors of weights: one to take into account the weights of the newspapers, and one to obtain a chi-square-type distance:

> Cbar=apply(N,1,sum)/sum(N)
> matC0=C-Cbar                       # center each column profile at the average column
> acpC=PCA(t(matC0),scale.unit=FALSE,
+          row.w=(apply(N,2,sum)),   # row weights: the newspapers' total counts
+          col.w=1/(apply(N,1,sum))) # column weights: chi-square-type metric
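
The column projection can then be plotted exactly like the row projection above (this call is an addition, mirroring the earlier plot.PCA line):

> plot.PCA(acpC,choix="ind")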

Now, we can almost overlap the two projections. Almost, because we might sometimes have to switch right and left, or top and bottom: if u is a (unit) eigenvector, then so is -u. Here, for instance, we should switch them.
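
One possible sketch of that overlay (an addition, not the original code; whether the sign switch is needed depends on the signs returned by PCA):

> coordL=acpL$ind$coord[,1:2]   # the candidates, from the analysis of the rows
> coordC=acpC$ind$coord[,1:2]   # the newspapers, from the analysis of the columns
> rng=range(rbind(coordL,-coordC))
> plot(coordL,pch=19,xlim=rng,ylim=rng,xlab="Dim 1",ylab="Dim 2")
> text(coordL,labels=rownames(coordL),pos=3)
> points(-coordC,pch=17,col="red")                # sign switched here
> text(-coordC,labels=rownames(coordC),pos=3,col="red")

This combined picture is, up to those possible sign switches, what correspondence analysis gives directly: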

> CA(N)   # correspondence analysis of the contingency table, with FactoMineR

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
