Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Extracting Information from a Picture, Round 2

DZone's Guide to

Extracting Information from a Picture, Round 2

We'll continue to extract the original dataset from a map of France using a simple R function.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Yesterday, I published a post on extracting information from a picture, but it did not work as expected. I claimed that it was because of the original graph I had. More precisely, it was based on some weird projection, and I could not reconcile it. So I decide to cheat a little bit by creating my own map.

The colors are ugly, I know. But I got them using: 

u = seq(0,1,length=30)
couleurs = rgb(u,rev(u),0,1)

The picture is:

url = "https://freakonometrics.hypotheses.org/files/2018/12/chomage3.png"
library(pixmap)
library(png)
IMG = readPNG(url)

I used those colors because it would make things easy when extracting reds and greens.

ROUGE=t(IMG[,,1])[x1:x2,]
ROUGE=ROUGE[,y2:y1]
library(scales)
image(x1:x2,y1:y2,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
VERT=t(IMG[,,2])[x1:x2,]
VERT=VERT[,y2:y1]
image(x1:x2,y1:y2,VERT,col=alpha(colour=rgb(0,1,0,1), alpha = seq(0,1,by=.01)))

Let's see if the contour of France can be overlaid.

library(maptools)
library(PBSmapping)
download.file("http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm0.rds","FRA_adm0.rds")
FR=readRDS("FRA_adm0.rds")
library(maptools)
PP = SpatialPolygons2PolySet(FR)
par(mfrow=c(1,1))
PP=PP[(PP$X<=8.25)&(PP$Y>=42.2),]
u=(x1:x2)-x1
v=(y1:y2)-y1
ax=min(PP$X)
bx=max(PP$X)-min(PP$X)
ay=min(PP$Y)
by=max(PP$Y)-min(PP$Y)
PP$X=(PP$X-ax)/bx*max(u)
PP$Y=(PP$Y-ay)/by*max(v)
image(u,v,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
points(PP$X,PP$Y)

We have a perfect match, don't we?

Let us now use a shapefile based on départements.

download.file("http://biogeo.ucdavis.edu/data/gadm2.8/rds/FRA_adm2.rds","FRA_adm2.rds")
FR2=readRDS("FRA_adm2.rds")
library(maptools)
PP = SpatialPolygons2PolySet(FR2)
image(u,v,ROUGE,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
k=35
pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
points(pX,pY)nge(pX)


For instance, the thirty-fifth polygon is the following:

Let's extract the color inside that polygon.

u=1:nrow(ROUGE)
v=1:ncol(ROUGE)

The code would be:

pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
E=expand.grid(u,v)
M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
image(u,v,ROUGE*M,col=alpha(colour=rgb(1,0,0,1), alpha = seq(0,1,by=.01)))
points(pX,pY)

Now, for each département, I extract the average value of red and the average value of green:

extract_info = function(k){
  pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
  pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
  E=expand.grid(u,v)
  M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
nom=FR2[FR2$OBJECTID ==k,c("NAME_2","CCA_2")]
return(c(as.numeric(nom$CCA_2),sum(ROUGE[M==1])/sum(M),sum(VERT[M==1])/sum(M)))
}
donnees = Vectorize(extract_info)(1:95)
x2=donnees[1,]
y2=donnees[2,]/(donnees[2,]+donnees[3,])
df2=data.frame(dpt=x2,extract=y2)
x1=as.numeric(as.character(baseChomage$no))
y1=baseChomage$chomagePremierTrimestre2017
df1=data.frame(dpt=x1,obs=y1)
df=merge(df1,df2)
plot(df$obs,df$extract)


On the graph below, we have the original values on the x-axis (unemployment, as a percentage) and the "average value of red." Note that points are almost perfectly correlated... The accumulation can be explained because on the original map, different values could have the same color.

So far, I can claim that we've been able to extract useful information from the original picture.

Consider the case now that the original map was the following one:

The picture can be downloaded using the following code:

url = "https://freakonometrics.hypotheses.org/files/2018/12/chomage5.png"
library(pixmap)
library(png)
IMG = readPNG(url)

Here, the colors are obtained from a standard palette:

library(pals)
couleurs = rev(brewer.rdylgn(30))

Here again, we use our previous code to extract reds and greens.

And if we use our function:

extract_info = function(k){
  pX=(PP$X[PP$PID==k]-ax)/bx*max(u)
  pY=(PP$Y[PP$PID==k]-ay)/by*max(v)
  E=expand.grid(u,v)
  M=matrix(point.in.polygon(E[,1],E[,2],pX,pY)>0,length(u),length(v))
nom=FR2[FR2$OBJECTID ==k,c("NAME_2","CCA_2")]
return(c(as.numeric(nom$CCA_2),sum(ROUGE[M==1])/sum(M),sum(VERT[M==1])/sum(M)))
}
donnees = Vectorize(extract_info)(1:95)
x2=donnees[1,]
y2=donnees[2,]/(donnees[2,]+donnees[3,])
df2=data.frame(dpt=x2,extract=y2)
x1=as.numeric(as.character(baseChomage$no))
y1=baseChomage$chomagePremierTrimestre2017
df1=data.frame(dpt=x1,obs=y1)
df=merge(df1,df2)
plot(df$obs,df$extract)


We obtain the following graph:

Here again, we have a strong correlation, not to say comonotonic variables (in the sense that ranks are identical). Nice, isn't it?

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
big data ,data extraction ,tutorial ,r

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}