Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Extracting Information from a Picture, Round 2

DZone's Guide to

Extracting Information from a Picture, Round 2

We'll continue to extract the original dataset from a map of France using a simple R function.

· Big Data Zone ·
Free Resource

The Architect’s Guide to Big Data Application Performance. Get the Guide.

Yesterday, I published a post on extracting information from a picture, but it did not work as expected. I claimed that it was because of the original graph I had. More precisely, it was based on some weird projection, and I could not reconcile it. So I decide to cheat a little bit by creating my own map.

The colors are ugly, I know. But I got them using: 

u = seq(0,1,length=30)
couleurs = rgb(u,rev(u),0,1)

The picture is:

url = "https://freakonometrics.hypotheses.org/files/2018/12/chomage3.png"
library(pixmap)
library(png)
IMG = readPNG(url)

I used those colors because it would make things easy when extracting reds and greens.

ROUGE=t(IMG[,,1])[x1:x2,]
ROUGE=ROUGE[,y2:y1]
library(scales)
image(x1:x2,y1:y2,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
VERT=t(IMG[,,2])[x1:x2,]
VERT=VERT[,y2:y1]
image(x1:x2,y1:y2,VERT,col=alpha(colour=rgb(0,1,0,1), alpha = seq(0,1,by=.01)))

Let's see if the contour of France can be overlaid.

library(maptools)
library(PBSmapping)
download.file("http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm0.rds","FRA_adm0.rds")
FR=readRDS("FRA_adm0.rds")
library(maptools)
PP = SpatialPolygons2PolySet(FR)
par(mfrow=c(1,1))
PP=PP[(PP$X<=8.25)&(PP$Y>=42.2),]
u=(x1:x2)-x1
v=(y1:y2)-y1
ax=min(PP$X)
bx=max(PP$X)-min(PP$X)
ay=min(PP$Y)
by=max(PP$Y)-min(PP$Y)
PP$X=(PP$X-ax)/bx*max(u)
PP$Y=(PP$Y-ay)/by*max(v)
image(u,v,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
points(PP$X,PP$Y)

We have a perfect match, don't we?

Let us now use a shapefile based on départements.

download.file("http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm2.rds","FRA_adm2.rds")
FR2=readRDS("FRA_adm2.rds")
library(maptools)
PP = SpatialPolygons2PolySet(FR2)
image(u,v,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
k=35
pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
points(pX,pY)nge(pX)


For instance, the thirty-fifth polygon is the following:

Let's extract the color inside that polygon.

u=1:nrow(ROUGE)
v=1:ncol(ROUGE)

The code would be:

pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
E=expand.grid(u,v)
M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
image(u,v,ROUGE*M,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
points(pX,pY)

Now, for each département, I extract the average value of red and the average value of green:

extract_info = function(k){
  pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
  pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
  E=expand.grid(u,v)
  M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
nom=FR2[FR2$OBJECTID ==k,c("NAME_2","CCA_2")]
return(c(as.numeric(nom$CCA_2),sum(ROUGE[M==1])/sum(M),sum(VERT[M==1])/sum(M)))
}
donnees = Vectorize(extract_info)(1:95)
x2=donnees[1,]
y2=donnees[2,]/(donnees[2,]+donnees[3,])
df2=data.frame(dpt=x2,extract=y2)
x1=as.numeric(as.character(baseChomage$no))
y1=baseChomage$chomagePremierTrimestre2017
df1=data.frame(dpt=x1,obs=y1)
df=merge(df1,df2)
plot(df$obs,df$extract)


On the graph below, we have the original values on the x-axis (unemployment, as a percentage) and the "average value of red." Note that points are almost perfectly correlated... The accumulation can be explained because on the original map, different values could have the same color.

So far, I can claim that we've been able to extract useful information from the original picture.

Consider the case now that the original map was the following one:

The picture can be downloaded using the following code:

url = "https://freakonometrics.hypotheses.org/files/2018/12/chomage5.png"
library(pixmap)
library(png)
IMG = readPNG(url)

Here, the colors are obtained from a standard palette:

library(pals)
couleurs = rev(brewer.rdylgn(30))

Here again, we use our previous code to extract reds and greens.

And if we use our function:

extract_info = function(k){
  pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
  pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
  E=expand.grid(u,v)
  M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
nom=FR2[FR2$OBJECTID ==k,c("NAME_2","CCA_2")]
return(c(as.numeric(nom$CCA_2),sum(ROUGE[M==1])/sum(M),sum(VERT[M==1])/sum(M)))
}
donnees = Vectorize(extract_info)(1:95)
x2=donnees[1,]
y2=donnees[2,]/(donnees[2,]+donnees[3,])
df2=data.frame(dpt=x2,extract=y2)
x1=as.numeric(as.character(baseChomage$no))
y1=baseChomage$chomagePremierTrimestre2017
df1=data.frame(dpt=x1,obs=y1)
df=merge(df1,df2)
plot(df$obs,df$extract)


We obtain the following graph:

Here again, we have a strong correlation, not to say comonotonic variables (in the sense that ranks are identical). Nice, isn't it?

Learn how taking a DataOps approach will help you speed up processes and increase data quality by providing streamlined analytics pipelines via automation and testing. Learn More.

Topics:
big data ,data extraction ,tutorial ,r

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}