# Heuristics on Correspondence Analysis

### A data science walk-through of the heuristics behind correspondence analysis, using presidential election data.


In the course on unsupervised techniques for data science, we’ve been using a dataset with one candidate in the 2002 French presidential election per row and one newspaper per column. To visualize that dataset, consider three candidates and three newspapers:

``````> base=read.table(
> sb=base[,c(2,3,4)]
> sb=sb[c(4,12,7),]
> (N=sb)
LeFigaro Liberation LeMonde
Jospin 7 41 26
Chirac 35 9 18
Mamere 1 10 7``````
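Because the `read.table` call is cut off in the listing above, a reader who wants to follow along can rebuild the 3×3 table by hand. The sketch below only assumes the layout: the counts are copied from the printed output, and `N` is a plain matrix rather than the data frame that `read.table` would return.

```r
# Counts copied from the printed table (candidates in rows, newspapers in columns).
# Assumption: a plain matrix stands in for the data frame read from the file.
N <- matrix(c( 7, 41, 26,
              35,  9, 18,
               1, 10,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

rowSums(N)  # candidate totals: 74, 62, 18
colSums(N)  # newspaper totals: 43, 60, 51
sum(N)      # grand total: 154
```

The base R calls used in the rest of the post (`apply`, `t`, `sum`) accept a matrix as well as a data frame.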

The first part is based on a description of the rows. Here each row is read as a conditional probability distribution over the set of newspapers (a row profile):

``````> (L=N/apply(N,1,sum))
LeFigaro Liberation LeMonde
Jospin 0.09459459 0.5540541 0.3513514
Chirac 0.56451613 0.1451613 0.2903226
Mamere 0.05555556 0.5555556 0.3888889``````

The “average row” is the marginal distribution of the newspapers:

``````> (Lbar=apply(N,2,sum)/sum(N))
LeFigaro Liberation LeMonde
0.2792208 0.3896104 0.3311688``````
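As a quick check, here is a minimal base R sketch showing that the average row is the mean of the row profiles when each candidate is weighted by its share of the grand total. The table is re-entered by hand from the printed counts, and the names `r` and `weighted_mean` are introduced here for illustration only.

```r
# Table re-entered from the printed counts (assumption: plain matrix).
N <- matrix(c(7, 41, 26, 35, 9, 18, 1, 10, 7), nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

L    <- prop.table(N, 1)       # row profiles: each row sums to 1
r    <- rowSums(N) / sum(N)    # row masses: share of each candidate in the total
Lbar <- colSums(N) / sum(N)    # the "average row"

# Mean of the row profiles, weighted by the row masses
weighted_mean <- drop(r %*% L)

all.equal(weighted_mean, Lbar)
```

This is why centering the profiles on `Lbar` amounts to centering a weighted cloud of points on its center of gravity.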

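If we visualize those individuals in the set of newspapers, that is, in the simplex of the newspaper space, each candidate is a point whose three coordinates are nonnegative and sum to one.

[Figures: the three candidates as points in the newspaper simplex]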

But actually, we will not stay in the simplex. A PCA is run instead, with weights on the individuals that reflect the importance of the different candidates, and weights in the scalar product so that distances between profiles are chi-square distances rather than standard Euclidean distances:

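``````> matL0=t(t(L)-Lbar)  # center each row profile on the average row
> library(FactoMineR)
> acpL=PCA(matL0,scale.unit=FALSE,
+   row.w=(apply(N,1,sum)),    # row weights: candidate totals
+   col.w=1/(apply(N,2,sum)))  # column weights: inverse newspaper totals (chi-square metric)
> plot.PCA(acpL,choix="ind",ylim=c(-.02,.02))``````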
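To make the role of the two weight vectors concrete, the weighted PCA can be sketched by hand in base R with a singular value decomposition. This is an illustration under stated assumptions, not the internals of `PCA`: the table is re-entered from the printed counts, the column metric uses the column masses (which differs from the `col.w` above only by the constant factor `sum(N)`, so the overall scale changes but not the shape of the configuration), and `chi2_dist` is a hypothetical helper defined here.

```r
# Table re-entered from the printed counts (assumption: plain matrix).
N <- matrix(c(7, 41, 26, 35, 9, 18, 1, 10, 7), nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

P  <- N / sum(N)
r  <- rowSums(P)          # row masses (weights on the individuals)
cm <- colSums(P)          # column masses (their inverses define the metric)
L  <- P / r               # row profiles
X  <- sweep(L, 2, cm)     # centered row profiles: L[i, ] - Lbar

# Chi-square distance between two profiles (hypothetical helper)
chi2_dist <- function(a, b) sqrt(sum((a - b)^2 / cm))

# Weighted PCA: fold the weights into the matrix, then take an ordinary SVD
S  <- sqrt(r) * sweep(X, 2, sqrt(cm), "/")
sv <- svd(S)
eig <- sv$d^2                                   # eigenvalues (inertia per axis)
row_coord <- (sv$u / sqrt(r)) %*% diag(sv$d)    # principal coordinates of the rows
rownames(row_coord) <- rownames(N)

# Euclidean distances between the coordinates equal chi-square distances
# between the original profiles
dist(row_coord)
chi2_dist(L["Jospin", ], L["Chirac", ])
```

The last two lines show the point of the column weights: once the profiles are projected, ordinary distances on the plot can be read as chi-square distances between candidates.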

The second part is based on a description of the columns. Here each column is read as a conditional probability distribution over the set of candidates (a column profile):

``````> (C=t(t(N)/apply(N,2,sum)))
LeFigaro Liberation LeMonde
Jospin 0.16279070 0.6833333 0.5098039
Chirac 0.81395349 0.1500000 0.3529412
Mamere 0.02325581 0.1666667 0.1372549``````

Here again, we can compute the “average column”:

``````> (Cbar=apply(N,1,sum)/sum(N))
Jospin Chirac Mamere
0.4805195 0.4025974 0.1168831``````
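The same weighted-mean reading holds for the columns. Write $n_{ij}$ for the counts, $n_{i\cdot}$ and $n_{\cdot j}$ for the row and column totals, and $n$ for the grand total (this notation is introduced here and does not appear in the original post). The average column is then the mean of the column profiles, weighted by the newspaper masses:

$$\bar{C}_i \;=\; \sum_{j} \frac{n_{\cdot j}}{n}\,\frac{n_{ij}}{n_{\cdot j}} \;=\; \frac{n_{i\cdot}}{n}.$$

For Jospin, this gives $(7+41+26)/154 = 74/154 \approx 0.4805$, which matches the first entry of `Cbar`.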

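In the simplex, this time in the candidate space, the newspapers are represented as points.

[Figures: the three newspapers as points in the candidate simplex]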

But here again, we won’t use that simplex. We consider a PCA with two vectors of weights: one to account for the weights of the newspapers, and one to obtain a chi-square distance:

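``````> Cbar=apply(N,1,sum)/sum(N)
> matC0=C-Cbar  # center each column profile on the average column
> acpC=PCA(t(matC0),scale.unit=FALSE,  # transposed: newspapers become the individuals
+          row.w=(apply(N,2,sum)),     # row weights: newspaper totals
+          col.w=1/(apply(N,1,sum)))   # column weights: inverse candidate totals``````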
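The row and column analyses are two views of the same matrix, which is why their projections can be superimposed. As a minimal sketch in base R (again with the table re-entered from the printed counts and with masses used for both weight vectors, an assumption that only rescales the results), the weighted matrix of the column analysis is the transpose of the one from the row analysis, so both share the same eigenvalues. The names `Sr`, `Sc`, `eig_rows` and `eig_cols` are introduced here for illustration.

```r
# Table re-entered from the printed counts (assumption: plain matrix).
N <- matrix(c(7, 41, 26, 35, 9, 18, 1, 10, 7), nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

P  <- N / sum(N)
r  <- rowSums(P)    # candidate masses
cm <- colSums(P)    # newspaper masses

# Row analysis: centered row profiles, weighted by r, metric 1/cm
Xr <- sweep(P / r, 2, cm)
Sr <- sqrt(r) * sweep(Xr, 2, sqrt(cm), "/")

# Column analysis: centered column profiles (newspapers in rows after t()),
# weighted by cm, metric 1/r
Cp <- sweep(P, 2, cm, "/")
Xc <- t(Cp - r)
Sc <- sqrt(cm) * sweep(Xc, 2, sqrt(r), "/")

eig_rows <- svd(Sr)$d^2
eig_cols <- svd(Sc)$d^2

all.equal(eig_rows, eig_cols)   # same inertia on each axis
max(abs(Sc - t(Sr)))            # the two weighted matrices are transposes
```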

Now, we can almost overlap the two projections. “Almost” because the axes may come out flipped between the two analyses (left and right, or top and bottom): if $\boldsymbol{u}$ is a (unit) eigenvector, so is $-\boldsymbol{u}$. Here, for instance, we should switch them. Correspondence analysis produces the joint display directly:

``> CA(N)``
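One way to see why a single correspondence analysis avoids the sign problem is to take both sets of coordinates from one decomposition. The sketch below is a base R illustration under stated assumptions, not the code behind `CA`: the table is re-entered from the printed counts, the names `row_pc`, `col_pc` and `row_from_cols` are hypothetical, and the signs it returns are not guaranteed to match those of any particular package.

```r
# Table re-entered from the printed counts (assumption: plain matrix).
N <- matrix(c(7, 41, 26, 35, 9, 18, 1, 10, 7), nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

P  <- N / sum(N)
r  <- rowSums(P)
cm <- colSums(P)
E  <- r %o% cm                 # proportions expected under independence
S  <- (P - E) / sqrt(E)        # standardized residuals
sv <- svd(S)
k  <- 1:2                      # the third singular value is numerically zero

# Row and column principal coordinates from the same SVD
row_pc <- sweep(sv$u[, k], 1, sqrt(r),  "/") %*% diag(sv$d[k])
col_pc <- sweep(sv$v[, k], 1, sqrt(cm), "/") %*% diag(sv$d[k])
dimnames(row_pc) <- list(rownames(N), c("Dim1", "Dim2"))
dimnames(col_pc) <- list(colnames(N), c("Dim1", "Dim2"))

# Transition formula: each candidate sits at the barycenter of the newspapers
# (weighted by its row profile), rescaled by the singular value of the axis
L <- P / r
row_from_cols <- sweep(L %*% col_pc, 2, sv$d[k], "/")
all.equal(row_from_cols, row_pc)

# Flipping an axis in both clouds at once gives an equally valid solution
flip <- diag(c(-1, 1))
all.equal(sweep(L %*% (col_pc %*% flip), 2, sv$d[k], "/"),
          row_pc %*% flip, check.attributes = FALSE)
```

Because the rows are tied to the columns through the transition formula, choosing the sign of an axis for one cloud fixes it for the other. Plotting `rbind(row_pc, col_pc)` then gives a joint map of candidates and newspapers.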



Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
