Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Heuristics on Correspondence Analysis

DZone's Guide to

Heuristics on Correspondence Analysis

Check out this cool data science example of heuristics on correspondence analysis with presidential election examples.

· Big Data Zone
Free Resource

Free O'Reilly eBook: Learn how to architect always-on apps that scale. Brought to you by Mesosphere DC/OS–the premier platform for containers and big data.

In the course on non-supervised techniques for data science, we’ve been using a dataset, with a candidate for the presidential elections in 2002 (per row) and newspapers (per column). In order to visualize that dataset, consider three candidates, and three newspapers

> base=read.table(
"http://freakonometrics.free.fr/election2002.txt",header=TRUE)
> sb=base[,c(2,3,4)]
> sb=sb[c(4,12,7),]
> (N=sb)
 LeFigaro Liberation LeMonde
Jospin 7 41 26
Chirac 35 9 18
Mamere 1 10 7

The first part is based on a description of rows. Consider here rows are conditional probabilities, in the set of newspapers,

> (L=N/apply(N,1,sum))
 LeFigaro Liberation LeMonde
Jospin 0.09459459 0.5540541 0.3513514
Chirac 0.56451613 0.1451613 0.2903226
Mamere 0.05555556 0.5555556 0.3888889

The “average row” is the marginal distribution of newspapers

> (Lbar=apply(N,2,sum)/sum(N))
 LeFigaro Liberation LeMonde 
 0.2792208 0.3896104 0.3311688

If we visualize those individuals, in the set of newspapers (in the simplexe in the newspapers space), we have

Here it is,

But actually, we will not stay in the simplexe. A PCA is considered, with weights on individuals, that take into account the importance of the different candidates, and weights for the scalar product (in order to have a distance related to the chi-square distance, and not a standard Euclidean distance)

> matL0=t(t(L)-Lbar)
> library(FactoMineR)
> acpL=PCA(matL0,scale.unit=FALSE,
+   row.w=(apply(N,1,sum)),
+   col.w=1/(apply(N,2,sum)))
> plot.PCA(acpL,choix="ind",ylim=c(-.02,.02))

The second part is based on a description of columns. Here Columns are conditional probabilities, in the set of candidates,

> (C=t(t(N)/apply(N,2,sum)))
 LeFigaro Liberation LeMonde
Jospin 0.16279070 0.6833333 0.5098039
Chirac 0.81395349 0.1500000 0.3529412
Mamere 0.02325581 0.1666667 0.1372549

Here again, we can compute the “average column”

> (Cbar=apply(N,1,sum)/sum(N))
 Jospin Chirac Mamere 
0.4805195 0.4025974 0.1168831

In the simplex, points are

i.e.

But here again, we won’t use that simplexe. We consider a PCA, with two vectors of weights, some to take into account the weights of the newspapers, and some to get a chi-square distance

> Cbar=apply(N,1,sum)/sum(N)
> matC0=C-Cbar
> acpC=PCA(t(matC0),scale.unit=FALSE,
+          row.w=(apply(N,2,sum)),
+          col.w=1/(apply(N,1,sum)))

Now, we can almost overlap the two projections. Almost because we might, sometimes switch right and left, top and bottom. Because if

 is a (unit) eigenvector, so is . Here, for instance, we should switch them

> CA(N)

Easily deploy & scale your data pipelines in clicks. Run Spark, Kafka, Cassandra + more on shared infrastructure and blow away your data silos. Learn how with Mesosphere DC/OS.

Topics:
data science ,presidential elections ,newspapers ,big data

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}