
Subjective Ways of Cutting a Continuous Variable


You have probably seen @coulmont's maps. If you haven't, you should probably go and spend some time on his blog (but please, come back afterwards, I still have my story to tell you). Consider, for instance, the maps we obtained for a post published in Monkey Cage, a few months ago:

The code was discussed in a blog post (at that time, I spent more effort on the econometric model than on the map).

My mentor in cartography, Reka (aka @visionscarto), taught me that maps are always subjective. And indeed, they are.

Consider the population below 24 years old in Paris. Or, to be more specific, the proportion of each quartier's population that is below 24.

> Young=((df$POP0017+df$POP1824)/df$POP)*100

There is a nice package to properly cut a continuous variable into classes:

> library(classInt)

And there are many possible options. Breaks can be at equal distances:

> class_e=classIntervals(Young,7,style="equal")

Or, based on quantiles (here probabilities are at equal distances):

> class_q=classIntervals(Young,7,style="quantile")
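To make the difference between the two rules concrete, here is a minimal base R sketch. It does not call classInt; it recomputes the breaks by hand, assuming that "equal" means equally spaced breaks over the range and that "quantile" means quantiles at equally spaced probabilities. The vector `young_sim` is simulated and purely hypothetical; it is not the Paris data.

```r
# Simulated, skewed stand-in for the 'Young' variable (NOT the Paris data)
set.seed(1)
young_sim <- 100 * rbeta(500, 2, 5)

k <- 7  # number of classes, as in the classIntervals() calls above

# "equal": split the range of the variable into k intervals of equal width
breaks_equal <- seq(min(young_sim), max(young_sim), length.out = k + 1)

# "quantile": split the probability scale into k equal parts
breaks_quant <- quantile(young_sim, probs = seq(0, 1, length.out = k + 1),
                         names = FALSE)

# Count observations falling in each class under each rule
counts_equal <- table(cut(young_sim, breaks_equal, include.lowest = TRUE))
counts_quant <- table(cut(young_sim, breaks_quant, include.lowest = TRUE))

print(round(diff(breaks_equal), 2))  # class widths: all the same
print(as.vector(counts_equal))       # class counts: uneven
print(round(diff(breaks_quant), 2))  # class widths: uneven
print(as.vector(counts_quant))       # class counts: nearly the same
```

With equal breaks the classes have the same width but very different sizes; with quantile breaks the classes have nearly the same size but very different widths.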

So, what could be the impact on a map? Here, we consider a gradient of colors, with 200 values:

> library(RColorBrewer)
> plotclr=colorRampPalette(brewer.pal(7,"RdYlBu")[7:1])(200)

With the so-called "equal" option (which divides the range of the variable into 200 parts), the breaks are shown on the right of the legend. With the "quantile" option (which divides the range of probabilities into 200 parts and takes the corresponding quantiles), the breaks are shown on the left of the legend. In terms of the graph of the cumulative distribution function, above, the first case splits the range of the variable (on the x-axis) equally, while the second case splits the range of the probability (on the y-axis) equally.
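The x-axis versus y-axis reading can be checked numerically with the empirical cumulative distribution function. The following is only a sketch on simulated data (`young_sim` is hypothetical, not the Paris proportions), using base R's `ecdf()` and hand-computed breaks rather than classInt:

```r
# Simulated, skewed stand-in for the 'Young' variable (NOT the Paris data)
set.seed(1)
young_sim <- 100 * rbeta(500, 2, 5)
k <- 7

Fn <- ecdf(young_sim)  # empirical cumulative distribution function

# Equal steps on the x-axis (values) versus equal steps on the y-axis (probabilities)
breaks_equal <- seq(min(young_sim), max(young_sim), length.out = k + 1)
breaks_quant <- quantile(young_sim, probs = seq(0, 1, length.out = k + 1),
                         names = FALSE)

# Probability reached at each break
p_equal <- Fn(breaks_equal)
p_quant <- Fn(breaks_quant)

print(round(diff(p_equal), 3))  # uneven probability steps
print(round(diff(p_quant), 3))  # probability steps all close to 1/k
```

Equal breaks move in constant steps along the values, so the probability steps follow the shape of the distribution. Quantile breaks do the opposite: the probability steps are constant and the value steps adapt.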

The breaks are very different with those two techniques. Now, to visualize on a map where the young population is located, we use the following code:

> colcode=findColours(class_e, plotclr) 
> plot(paris,col=colcode,border=colcode)

Here, with the equal option, we have the following map:

While with the quantile option, we get:

> colcode=findColours(class_q, plotclr) 
> plot(paris,col=colcode,border=colcode)
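To get a rough idea of how many areas change color when only the break rule changes, the color assignment can be imitated in base R. This is a sketch under stated assumptions, not what `findColours` does internally: it uses `findInterval()` for the class lookup, a hand-made blue-to-red ramp instead of the RColorBrewer palette, and a simulated vector `young_sim` in place of the Paris data.

```r
# Simulated, skewed stand-in for the 'Young' variable (NOT the Paris data)
set.seed(1)
young_sim <- 100 * rbeta(500, 2, 5)
k <- 7

breaks_equal <- seq(min(young_sim), max(young_sim), length.out = k + 1)
breaks_quant <- quantile(young_sim, probs = seq(0, 1, length.out = k + 1),
                         names = FALSE)

# One color per class, from blue (low) to red (high); base grDevices only
pal <- colorRampPalette(c("blue", "lightyellow", "red"))(k)

# Class index of each observation (1..k), then the matching color
class_equal <- findInterval(young_sim, breaks_equal, all.inside = TRUE)
class_quant <- findInterval(young_sim, breaks_quant, all.inside = TRUE)
col_equal <- pal[class_equal]
col_quant <- pal[class_quant]

# Share of observations whose color differs between the two rules
share_changed <- mean(col_equal != col_quant)
print(share_changed)
```

The data and the palette are identical in both cases; any difference in color comes only from where the breaks were placed.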

Those two maps are based on the same data. But I have the feeling that they do tell different stories...


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
