
Open Data and Ecological Fallacy

by Arthur Charpentier · Sep. 13, 12 · Interview

A couple of days ago, on Twitter, @alung mentioned an article I wrote (in French) about open data, explaining how difficult it was to get access to data in France. @alung wondered if it was still so difficult to access nice datasets. My first answer was that I now know more people willing to share their data. And amazing datasets can now be found very easily on the internet. In France, for instance, you can find detailed information about qualifications, houses and jobs, by small geographical areas, on http://www.recensement.insee.fr. And that's great for researchers.

But one should be aware that aggregate data of this kind may not be sufficient to build econometric models or to infer individual behavior. Supposing that relationships observed for groups necessarily hold for individuals is a common fallacy, the so-called ecological fallacy.

In a popular paper, Robinson (1950) discussed "ecological inference," stressing the difference between ecological correlations (on groups) and individual correlations (see also Thorndike). He considered two aggregated quantities, per American state: the percent of the population that was foreign-born, and the percent that was literate. One dataset used in the paper was the following:

> library(eco)
> data(forgnlit30)
> tail(forgnlit30)
             Y          X         W1          W2 ICPSR
43 0.076931986 0.03097168 0.06834300 0.077206504    66
44 0.006617641 0.11479052 0.03568792 0.002847920    67
45 0.006991899 0.11459207 0.04151310 0.002524065    68
46 0.012793782 0.18491515 0.05690731 0.002785916    71
47 0.007322475 0.13196654 0.03589512 0.002978594    72
48 0.007917342 0.18816461 0.02949187 0.002916866    73

The correlation between foreign-birth and literacy was

> cor(forgnlit30$X,1-forgnlit30$Y)
[1] 0.2069447

This suggests a positive correlation, so one quick interpretation could be that, in the 1930s, native-born Americans were illiterate while literate immigrants got the idea to come to the US. But here, as in Simpson's paradox, the sign is misleading: it should be negative, as found in studies on individual data. In the state-level data, the correlation is positive primarily because foreign-born people tended to live in states where the native-born were relatively literate.
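
To see how the two levels can disagree, it helps to decompose each state's illiteracy rate. Assuming that W1 and W2 are the illiteracy rates among the foreign-born and the native-born respectively (an interpretation of the column names, not something stated above, although the printed rows are consistent with it), each row satisfies an accounting identity, checked here on row 44:

```latex
Y_i = X_i \, W_{1,i} + (1 - X_i) \, W_{2,i}

0.11479 \times 0.03569 + 0.88521 \times 0.00285
  \approx 0.00410 + 0.00252 = 0.00662 \approx Y_{44}
```

In that row W1 is more than ten times W2: within the state, the foreign-born are less literate, even though across states a higher X goes with higher literacy. The state-level correlation only sees how Y moves with X from one state to another, and that also reflects how W2 varies across states.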

So the problem lies in the way that individuals were grouped. Consider the following set of individual observations:

> library(mnormt)  # provides rmnorm()
> n=1000
> r=-.5
> Z=rmnorm(n,c(0,0),matrix(c(1,r,r,1),2,2))
> X=Z[,1]
> E=Z[,2]
> Y=3+2*X+E
> cor(X,Y)
[1] 0.8636764
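
As a sanity check on the simulated value, the population correlation can be worked out from the model, using only Var(X) = Var(E) = 1 and Cov(X, E) = r as specified in the code:

```latex
\begin{aligned}
\operatorname{Cov}(X,Y) &= 2\operatorname{Var}(X) + \operatorname{Cov}(X,E) = 2 + r,\\
\operatorname{Var}(Y) &= 4\operatorname{Var}(X) + \operatorname{Var}(E) + 4\operatorname{Cov}(X,E) = 5 + 4r,\\
\operatorname{cor}(X,Y) &= \frac{2 + r}{\sqrt{5 + 4r}} = \frac{1.5}{\sqrt{3}} \approx 0.866 \quad (r = -0.5).
\end{aligned}
```

The simulated 0.8637 is within sampling error of that value.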

Consider now some regrouping, e.g.

> I=cut(Z[,2],qnorm(seq(0,1,by=.05)))
> Yg=tapply(Y,I,mean)
> Xg=tapply(X,I,mean)

Then the correlation is rather different:

>  cor(Xg,Yg)
[1] 0.1476422
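
The drop can be traced to the grouping variable. The groups are bins of E (the second column of Z), so group means track conditional expectations given E. For a bivariate normal pair with unit variances and correlation r:

```latex
\begin{aligned}
\mathbb{E}[X \mid E] &= rE,\\
\mathbb{E}[Y \mid E] &= 3 + 2\,\mathbb{E}[X \mid E] + E = 3 + (2r + 1)E = 3 \quad (r = -0.5).
\end{aligned}
```

With r = -0.5 the expected group mean of Y does not depend on the bin at all, so the variation left in Yg is within-group sampling noise, and the grouped correlation is small and changes noticeably from one simulation to the next.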

Here we have a strong positive individual correlation, and a small positive correlation on grouped data. But depending on how individuals are grouped, almost anything is possible.
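
To illustrate how much the answer depends on the grouping, here is a minimal sketch on simulated data. It is a variation on the code above, not the same computation: it uses base R only (the correlated pair is built by hand instead of with rmnorm), bins on empirical quantiles, and the helper `grouped_cor`, its argument `k` and the seed are names and values introduced just for this example.

```r
# Simulated data only; same model as in the text, built with base R.
set.seed(1)                              # arbitrary seed, for reproducibility
n <- 1000
r <- -0.5
E <- rnorm(n)
X <- r * E + sqrt(1 - r^2) * rnorm(n)    # unit variance, cor(X, E) close to r
Y <- 3 + 2 * X + E

# Correlation between group means of x and y, where the groups are
# k quantile bins of the variable `by` (hypothetical helper).
grouped_cor <- function(x, y, by, k = 20) {
  breaks <- quantile(by, probs = seq(0, 1, length.out = k + 1))
  g <- cut(by, breaks, include.lowest = TRUE)
  cor(tapply(x, g, mean), tapply(y, g, mean))
}

cor_individual <- cor(X, Y)              # individual-level correlation
cor_by_E <- grouped_cor(X, Y, by = E)    # groups defined by the noise term
cor_by_X <- grouped_cor(X, Y, by = X)    # groups defined by the regressor

print(c(individual = cor_individual, by_E = cor_by_E, by_X = cor_by_X))
```

Grouping on X averages out the noise and pushes the ecological correlation close to 1, while grouping on E leaves a much weaker one: the same individuals, two very different aggregate pictures.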

Models with random coefficients have been used to make ecological inferences. But that is a long story, and I will probably come back with a more detailed post on that topic, since I am still working on this with @coulmont (following some comments by @frbonnet on his post on recent French elections on http://coulmont.com/blog/).


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
