Heuristics on Correspondence Analysis
Check out this cool data science example of heuristics on correspondence analysis with presidential election examples.
In the course on unsupervised techniques for data science, we have been using a dataset with one candidate in the 2002 French presidential election per row and one newspaper per column. To visualize that dataset, consider three candidates and three newspapers:
> base=read.table("http://freakonometrics.free.fr/election2002.txt",header=TRUE)
> sb=base[,c(2,3,4)]
> sb=sb[c(4,12,7),]
> (N=sb)
       LeFigaro Liberation LeMonde
Jospin        7         41      26
Chirac       35          9      18
Mamere        1         10       7
The first part is based on a description of the rows. Here, each row is seen as a conditional probability distribution over the set of newspapers:
> (L=N/apply(N,1,sum))
         LeFigaro Liberation   LeMonde
Jospin 0.09459459  0.5540541 0.3513514
Chirac 0.56451613  0.1451613 0.2903226
Mamere 0.05555556  0.5555556 0.3888889
The “average row” is the marginal distribution of newspapers
> (Lbar=apply(N,2,sum)/sum(N))
  LeFigaro Liberation    LeMonde
 0.2792208  0.3896104  0.3311688
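As a quick sanity check, the "average row" can also be recovered as the weighted mean of the row profiles, with each candidate weighted by its share of the grand total. The sketch below types the table in by hand (so it does not depend on the download), and the names `w` and `avg_row` are introduced here for illustration only.

```r
# Contingency table copied from the output above, entered by hand
N <- matrix(c( 7, 41, 26,
              35,  9, 18,
               1, 10,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

L    <- N / rowSums(N)        # row profiles: each row sums to 1
Lbar <- colSums(N) / sum(N)   # marginal distribution of the newspapers
w    <- rowSums(N) / sum(N)   # weight of each candidate (illustrative name)

# Weighted mean of the row profiles: row i of L is multiplied by w[i],
# then the columns are summed
avg_row <- colSums(L * w)

print(rbind(Lbar, avg_row))   # the two rows should agree
```

This is why the candidates are later given weights proportional to their row totals: with those weights, the cloud of row profiles is centered exactly on `Lbar`.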
If we visualize those individuals in the set of newspapers (in the simplex of the newspaper space), each candidate is a point whose three coordinates sum to one.
But actually, we will not stay in the simplex. We run a PCA with weights on the individuals, which account for the importance of the different candidates, and weights in the scalar product, so that distances are related to the chi-square distance rather than the standard Euclidean distance:
> matL0=t(t(L)-Lbar)
> library(FactoMineR)
> acpL=PCA(matL0,scale.unit=FALSE,
+ row.w=(apply(N,1,sum)),
+ col.w=1/(apply(N,2,sum)))
> plot.PCA(acpL,choix="ind",ylim=c(-.02,.02))
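To see what the two weight vectors do, the weighted PCA can be unpacked with a plain SVD in base R. This is only a sketch, not what FactoMineR does internally: it assumes both weight vectors are normalized to sum to one, so the coordinates may differ from the `PCA` call above by a constant scale factor. The names `Z`, `eig` and `row_coord` are chosen here for illustration.

```r
N <- matrix(c( 7, 41, 26,
              35,  9, 18,
               1, 10,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

n  <- sum(N)
w  <- rowSums(N) / n            # candidate weights (assumed normalized)
cw <- colSums(N) / n            # newspaper masses, i.e. the average row
L  <- N / rowSums(N)            # row profiles
L0 <- sweep(L, 2, cw)           # centered row profiles

# Fold the row weights and the chi-square metric into one matrix,
# then use an ordinary SVD
Z <- diag(sqrt(w)) %*% L0 %*% diag(1 / sqrt(cw))
s <- svd(Z)

eig       <- s$d^2                                   # inertia carried by each axis
row_coord <- diag(1 / sqrt(w)) %*% s$u %*% diag(s$d) # candidate coordinates
rownames(row_coord) <- rownames(N)

# Chi-square statistic of the table, computed by hand
E    <- outer(rowSums(N), colSums(N)) / n
chi2 <- sum((N - E)^2 / E)

# Squared chi-square distance between two row profiles
d_chi2  <- sum((L["Jospin", ] - L["Chirac", ])^2 / cw)
d_coord <- sum((row_coord["Jospin", ] - row_coord["Chirac", ])^2)

print(c(total_inertia = sum(eig), chi2_over_n = chi2 / n))
print(c(d_chi2 = d_chi2, d_coord = d_coord))
```

With these conventions, the total inertia equals the chi-square statistic divided by the grand total, and Euclidean distances between the rows of `row_coord` reproduce chi-square distances between profiles, which is the point of the column weights.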
The second part is based on a description of the columns. Here, each column is seen as a conditional probability distribution over the set of candidates:
> (C=t(t(N)/apply(N,2,sum)))
         LeFigaro Liberation   LeMonde
Jospin 0.16279070  0.6833333 0.5098039
Chirac 0.81395349  0.1500000 0.3529412
Mamere 0.02325581  0.1666667 0.1372549
Here again, we can compute the “average column”
> (Cbar=apply(N,1,sum)/sum(N))
   Jospin    Chirac    Mamere
0.4805195 0.4025974 0.1168831
In the simplex of the candidate space, the newspapers can be visualized as points in the same way.
But here again, we won’t use that simplex. We consider a PCA with two vectors of weights: one to take into account the weights of the newspapers, and one to get a chi-square distance:
> Cbar=apply(N,1,sum)/sum(N)
> matC0=C-Cbar
> acpC=PCA(t(matC0),scale.unit=FALSE,
+ row.w=(apply(N,2,sum)),
+ col.w=1/(apply(N,1,sum)))
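The reason the row analysis and the column analysis can be put on the same picture is that, once weights and metric are folded in, the two problems rest on the same matrix up to a transpose, so they share their eigenvalues. Here is a base R sketch of that check, again assuming weights normalized to sum to one; `Zrow` and `Zcol` are names introduced for this example.

```r
N <- matrix(c( 7, 41, 26,
              35,  9, 18,
               1, 10,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Jospin", "Chirac", "Mamere"),
                            c("LeFigaro", "Liberation", "LeMonde")))

n  <- sum(N)
rw <- rowSums(N) / n    # candidate weights (the "average column")
cw <- colSums(N) / n    # newspaper weights (the "average row")

# Centered profiles: candidates in rows, then newspapers in rows
L0 <- sweep(N / rowSums(N), 2, cw)
C0 <- sweep(t(N) / colSums(N), 2, rw)

# Weighted and rescaled versions used by the two analyses
Zrow <- diag(sqrt(rw)) %*% L0 %*% diag(1 / sqrt(cw))
Zcol <- diag(sqrt(cw)) %*% C0 %*% diag(1 / sqrt(rw))

eig_row <- svd(Zrow)$d^2
eig_col <- svd(Zcol)$d^2

print(max(abs(Zcol - t(Zrow))))   # numerically zero
print(rbind(eig_row, eig_col))    # same inertia on each axis
```

Since `Zcol` is the transpose of `Zrow`, the left singular vectors of one are the right singular vectors of the other, which is what ties the two sets of axes together.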
Now we can almost overlay the two projections. Almost, because the axes may come out flipped (left and right, or top and bottom): if u is a (unit) eigenvector, so is -u. Here, for instance, we should switch them.
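The sign ambiguity is easy to reproduce on a toy example. The sketch below uses a small symmetric matrix `S` and a made-up coordinate matrix `coord`, both hypothetical and unrelated to the election data. It checks that a unit eigenvector and its opposite satisfy the same eigen-equation, and that flipping an axis leaves all pairwise distances unchanged.

```r
# Toy symmetric matrix (hypothetical, for illustration only)
S <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e      <- eigen(S)
u      <- e$vectors[, 1]   # unit eigenvector for the largest eigenvalue
lambda <- e$values[1]

# Both u and -u satisfy S v = lambda v
ok_plus  <- all.equal(as.vector(S %*% u),    lambda * u)
ok_minus <- all.equal(as.vector(S %*% (-u)), lambda * (-u))

# Made-up coordinates of three points on two axes
coord <- matrix(c( 0.5,  0.1,
                  -0.7,  0.0,
                   0.3, -0.4),
                ncol = 2, byrow = TRUE)

# Flip the first axis only: multiply column 1 by -1, keep column 2
flipped <- sweep(coord, 2, c(-1, 1), "*")

print(c(ok_plus = isTRUE(ok_plus), ok_minus = isTRUE(ok_minus)))
print(all.equal(dist(coord), dist(flipped), check.attributes = FALSE))
```

Flipping an axis is therefore harmless for interpretation: it mirrors the picture without changing which points are close to each other.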
Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.