10 Open Data Sources You Need to Know
Finding the perfect dataset to round out your project or story is often the most challenging and time-consuming part of the process. Here are ten go-to datasets.
Think about when you completed your last significant data project. How much time did you spend collecting, curating, and engineering datasets?
I’ve found that finding the perfect dataset to complete your story or analysis can often be the most difficult part of the process. I recently spent a considerable amount of time researching specific US wildfire and forestry data to support a new analysis and visualization series. I was unsuccessful – until my colleague sent me to the California Forest Observatory and I found exactly what I needed.
Over the years, many of the datasets that I've used in my own projects were shared with me by colleagues. I decided to compile a list of searchable repositories, individual datasets of note, and emerging data platforms to help make these sources more easily accessible to others.
Open Data Sources to Keep in Mind for Your Next Project
1. California Forest Observatory
The California Forest Observatory is a data-driven forest monitoring system that maps wildfire hazard drivers across California, including forest structure, weather, topography, and infrastructure.
You can download canopy cover, canopy height, canopy base height, canopy bulk density, canopy layer count, ladder fuel density, and surface fuels geodata for the state by county, community, or watershed.
Additional Resources: Modeling & Monitoring Powerline Tree Strike Risk at Scale
2. OpenStreetMap
OpenStreetMap provides a broad range of map data maintained by a worldwide community of geographers and cartographers.
You can access roads, trails, points of interest, railways, and much more worldwide.
Geofabrik's OpenStreetMap Data Extracts are one of the easiest ways to download information for your area of interest quickly.
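As a rough sketch of fetching an extract programmatically, the helper below builds a download URL for a region. It assumes Geofabrik's "<region path>-latest.osm.pbf" naming on download.geofabrik.de; the function names and the example region are illustrative and not part of any official client.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

# Assumed base URL of the Geofabrik download server.
GEOFABRIK_BASE = "https://download.geofabrik.de"


def geofabrik_url(region_path: str) -> str:
    """Build the URL of the latest .osm.pbf extract for a region path
    such as "europe/monaco" (the naming convention is an assumption)."""
    cleaned = region_path.strip("/").lower()
    return f"{GEOFABRIK_BASE}/{quote(cleaned)}-latest.osm.pbf"


def download_extract(region_path: str, destination: str) -> str:
    """Download the extract to `destination` and return the local path."""
    local_path, _headers = urlretrieve(geofabrik_url(region_path), destination)
    return local_path


if __name__ == "__main__":
    # Only prints the URL; call download_extract() to actually fetch the file.
    print(geofabrik_url("europe/monaco"))
```

Small regions are a sensible first test, since country-level extracts can be large.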
3. Registry of Open Data on AWS
The Registry of Open Data on AWS has empowered laboratories, research institutions, and various other organizations to deliver open datasets to developers, startups, and enterprises worldwide since its launch in 2018.
Anyone can easily access the registry through a web interface and search for datasets with keywords or tags like flood risk, remote sensing, imagery, or human genome.
Users are encouraged to grow the adoption of the registry by contributing datasets of their own, usage examples, tutorials, or applications built on data from the registry.
4. NASA Earth Observations
About: NASA Earth Observations offers climate and environmental data for the globe. You can browse and download the satellite data from NASA's constellation of Earth Observing System satellites. Over 50 different global datasets are represented with daily, weekly, and monthly images available in various formats.
Additional Resources: During last year's #30DayMapChallenge, we used NASA Earth Observations' Chlorophyll Concentration product on day 13.
5. Google BigQuery Public Datasets
A Google BigQuery public dataset is any dataset made available to the general public through the Google Cloud Public Dataset Program.
Google hosts the data, covers the costs of storage, and offers public access to the data for use in any project.
Of all the data sources we highlight in this post, this is the only one with a catch. You must sign up for a Google Cloud Platform account to access the data, and only the first 1 TB of data processed per month is free; after that, you are subject to query pricing.
Just be aware of the volumes you are extracting and you should be fine. There are some awesome sources to choose from including cryptocurrency exchanges, the American Community Survey, international real estate listings, and much more.
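To make the free-tier arithmetic concrete, here is a minimal sketch that estimates a month's query charge from the volume scanned. The 1 TB allowance comes from the text above; the per-TB rate is a placeholder assumption, so check Google's current pricing before relying on the result.

```python
FREE_TB_PER_MONTH = 1.0  # free allowance stated in the article


def monthly_query_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Estimate the charge for one month of queries.

    tb_scanned   -- total terabytes processed by queries this month
    price_per_tb -- assumed rate per TB beyond the free allowance
    """
    if tb_scanned < 0 or price_per_tb < 0:
        raise ValueError("inputs must be non-negative")
    # Only the volume above the free allowance is billable.
    billable_tb = max(0.0, tb_scanned - FREE_TB_PER_MONTH)
    return billable_tb * price_per_tb


if __name__ == "__main__":
    # 5.0 is a placeholder rate, not a quoted Google price.
    for scanned in (0.4, 1.0, 3.5):
        print(scanned, "TB ->", monthly_query_cost(scanned, price_per_tb=5.0))
```

Selecting only the columns you need is usually the simplest way to keep the scanned volume down.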
Additional Resources: OmniSci CTO Todd Mostak highly recommends the Hacker News dataset! The BigQuery subreddit is an easy way to see what data is publicly available. Other open data sources available through Google include Google Dataset Search and Google Public Data.
6. Koordinates
Koordinates is an emerging geospatial data management platform where you can host, manage, share, publish, and access geodata.
While Koordinates' primary product is their geodata management software, they give users the opportunity to share and access open geospatial datasets.
You can browse through thousands of geospatial data layers from around the world, from New Zealand property parcels to United States hazmat routes, using regional and publisher filters or classic search.
7. Natural Earth
Natural Earth is a collection of public domain map datasets, available in vector or raster formats and at various scales, that I've trusted since graduate school.
Data comes in cultural, physical, and raster categories, and users benefit from solid metadata, attribution, neatness, and overall convenience.
Natural Earth is a collaboration that involves members of the North American Cartographic Information Society (NACIS) and cartographers worldwide.
8. Kaggle
About: Kaggle is a wicked cool platform for new and experienced data scientists and explorers.
You can search their massive library of open datasets, grab sample code, ask their burgeoning community questions, take part in a data competition, and learn as you go.
They have over 95,000 datasets you can browse and download on just about any topic you can conjure. You may have to sift through data with varying levels of quality, but more than likely, you'll find a gem.
My only suggestion is to be aware of the original source of the data you plan on downloading, its collection date, and overall fidelity before using it in your project.
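One way to act on that advice is a quick profile of a downloaded CSV before building on it. The sketch below counts rows, empty cells per column, and the span of an ISO-formatted date column; the function name, the column names, and the sample data are all made up for illustration.

```python
import csv
import io
from datetime import date


def profile_csv(text, date_column=None):
    """Summarize CSV text: row count, empty cells per column, and the
    min/max of an optional ISO-formatted (YYYY-MM-DD) date column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    dates = []
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1
            elif name == date_column:
                # Raises ValueError if the value is not an ISO date.
                dates.append(date.fromisoformat(value.strip()))
    return {
        "rows": rows,
        "missing": missing,
        "date_range": (min(dates), max(dates)) if dates else None,
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real dataset.
    sample = "city,recorded_on,pm25\nOakland,2020-08-19,55\nFresno,2020-09-09,\nReno,,31\n"
    print(profile_csv(sample, date_column="recorded_on"))
```

A profile like this will not tell you where the data came from, but it does surface stale collection dates and sparse columns early.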
9. SafeGraph Open Census Data & Neighborhood Demographics
About: SafeGraph has made an impressive name for itself in the data space these past few years. And while the majority of their data comes at a price, they do offer a spread of open census data and neighborhood demographics.
The datasets they offer for free have a clean schema, are joined with Census Block Group geometries, and include 7500+ demographic attributes (income, age, education, etc.).
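When joining on Census Block Groups, it helps to know that a block group GEOID is a 12-digit code: 2 digits for the state, 3 for the county, 6 for the tract, and 1 for the block group. The sketch below splits an ID into those parts and restores leading zeros, which are often lost when IDs are read as numbers. The names are illustrative and do not reflect SafeGraph's schema.

```python
from typing import NamedTuple


class BlockGroupId(NamedTuple):
    state_fips: str
    county_fips: str
    tract: str
    block_group: str


def parse_cbg(raw) -> BlockGroupId:
    """Split a Census Block Group GEOID (int or str) into its parts.

    The value is zero-padded to 12 digits first, because spreadsheet or
    numeric parsing commonly drops the leading zero of the state code.
    """
    geoid = str(raw).strip().zfill(12)
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"not a 12-digit block group GEOID: {raw!r}")
    return BlockGroupId(geoid[:2], geoid[2:5], geoid[5:11], geoid[11])


if __name__ == "__main__":
    # An ID read as an integer has lost its leading zero.
    print(parse_cbg(60750201001))
```

Keeping the ID as a zero-padded string on both sides of a join avoids silent mismatches for states with FIPS codes below 10.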
Additional Resources: If you are interested in some of their other datasets, check out our Retail Cross Promotion Opportunity demo!
10. Canada Open Government Data
About: OmniSci has a growing contingent of employees and users (perhaps yourself) in Canada who require reliable and accurate open data, and the Canadian government has you covered with its Open Data Portal.
You can search or browse through data categorized into the following:
- Economics and Industry
- Health and Safety
- Nature and Environment
- Science and Technology
- Society and Culture
- and more!
Additional Resources: This last suggestion comes from my colleague Tony Young, OmniSci's Director of Customer Success and a proud resident of the Great White North.