Over a million developers have joined DZone.

Heuristics on Correspondence Analysis

DZone's Guide to

Heuristics on Correspondence Analysis

Check out this cool data science example of heuristics on correspondence analysis with presidential election examples.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In the course on non-supervised techniques for data science, we’ve been using a dataset, with a candidate for the presidential elections in 2002 (per row) and newspapers (per column). In order to visualize that dataset, consider three candidates, and three newspapers

> base=read.table(
> sb=base[,c(2,3,4)]
> sb=sb[c(4,12,7),]
> (N=sb)
 LeFigaro Liberation LeMonde
Jospin 7 41 26
Chirac 35 9 18
Mamere 1 10 7

The first part is based on a description of rows. Consider here rows are conditional probabilities, in the set of newspapers,

> (L=N/apply(N,1,sum))
 LeFigaro Liberation LeMonde
Jospin 0.09459459 0.5540541 0.3513514
Chirac 0.56451613 0.1451613 0.2903226
Mamere 0.05555556 0.5555556 0.3888889

The “average row” is the marginal distribution of newspapers

> (Lbar=apply(N,2,sum)/sum(N))
 LeFigaro Liberation LeMonde 
 0.2792208 0.3896104 0.3311688

If we visualize those individuals, in the set of newspapers (in the simplexe in the newspapers space), we have

Here it is,

But actually, we will not stay in the simplexe. A PCA is considered, with weights on individuals, that take into account the importance of the different candidates, and weights for the scalar product (in order to have a distance related to the chi-square distance, and not a standard Euclidean distance)

> matL0=t(t(L)-Lbar)
> library(FactoMineR)
> acpL=PCA(matL0,scale.unit=FALSE,
+   row.w=(apply(N,1,sum)),
+   col.w=1/(apply(N,2,sum)))
> plot.PCA(acpL,choix="ind",ylim=c(-.02,.02))

The second part is based on a description of columns. Here Columns are conditional probabilities, in the set of candidates,

> (C=t(t(N)/apply(N,2,sum)))
 LeFigaro Liberation LeMonde
Jospin 0.16279070 0.6833333 0.5098039
Chirac 0.81395349 0.1500000 0.3529412
Mamere 0.02325581 0.1666667 0.1372549

Here again, we can compute the “average column”

> (Cbar=apply(N,1,sum)/sum(N))
 Jospin Chirac Mamere 
0.4805195 0.4025974 0.1168831

In the simplex, points are


But here again, we won’t use that simplexe. We consider a PCA, with two vectors of weights, some to take into account the weights of the newspapers, and some to get a chi-square distance

> Cbar=apply(N,1,sum)/sum(N)
> matC0=C-Cbar
> acpC=PCA(t(matC0),scale.unit=FALSE,
+          row.w=(apply(N,2,sum)),
+          col.w=1/(apply(N,1,sum)))

Now, we can almost overlap the two projections. Almost because we might, sometimes switch right and left, top and bottom. Because if

 is a (unit) eigenvector, so is . Here, for instance, we should switch them

> CA(N)

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

data science ,presidential elections ,newspapers ,big data

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}