Building Big Data Prototypes
Unlike traditional approaches, cloud-based solutions can link a large number of datasets and make it easy to add your own datasets and benefit from the crosslinks.
Source: IBM Research PAIRS website
Note: Article updated March 2018 with IBM PAIRS Geoscope information. IBM PAIRS is moving from research to an enterprise service.
As a big data user, I often find myself on the hunt for large datasets. Each month, I look at large public datasets at places like Kaggle, Data.gov, USPTO Open Data Portal, HealthData.gov, U.S. Census Bureau, CIA World Factbook, Amazon Web Services public datasets (including the 1000 Genomes Project, NASA satellite imagery, and more), and DataPortals.org.
Here’s a traditional way of finding data sources on DataPortals.org, which lists 524 separate portals for data.
Surprisingly, the data format is not something out of the ordinary. Most datasets are downloaded as CSV files. Ultimately, it’s structured data made up of cells in rows and columns. Depending on the size, you can easily open the dataset files up in Excel or other spreadsheet programs and look through the data yourself.
A typical dataset on Kaggle is this Bitcoin data from 2012 to 2017. The uncompressed data size is 877MB.
The data looks like this:
Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price
1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
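Before loading a file like this anywhere, it helps to confirm that the columns parse cleanly. The following is a minimal sketch using only Python's standard library. It embeds the header and the two sample rows shown above, and it assumes that the Timestamp column holds Unix epoch seconds in UTC, which the dataset page would need to confirm. The function name `parse_rows` is my own.

```python
import csv
import io
from datetime import datetime, timezone

# Header and two rows copied from the sample above.
SAMPLE = """\
Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price
1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
"""


def parse_rows(text):
    """Yield one dict per CSV row, with numeric fields converted to float."""
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {
            name: float(value)
            for name, value in row.items()
            if name != "Timestamp"
        }
        # Assumption: Timestamp is Unix epoch seconds in UTC.
        parsed["Timestamp"] = datetime.fromtimestamp(
            int(row["Timestamp"]), tz=timezone.utc
        )
        yield parsed


rows = list(parse_rows(SAMPLE))
print(len(rows), rows[0]["Timestamp"].isoformat(), rows[0]["Close"])
```

Because `parse_rows` is a generator, the same function could be pointed at an open file handle's contents row by row instead of a string, which matters for a file of several hundred megabytes.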
Once I download the data, I need to import it into a database — either in-memory, GPU, or a traditional database — before visualizing the data.
Classic Approach: Start With Data Processing
In the example below, I focused on a single Twitter demo with 400 million tweets running on a hosted version of a GPU database. In order to run the demo locally and develop applications with the dataset, I downloaded a small set of 572,000 tweets. While it’s cool to work with a GPU database, this approach also illustrates the problem with building a visualization for a single dataset.
In the classic approach to building a new demo with a standard visualization system on top of a database, the process is:
Obtain the data source.
Load data into the database.
Connect the database to the processing system.
Connect the visualization front end to the cached processed data.
While this approach works, it is challenging to find and clone another person’s dashboard.
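The four steps can be sketched end to end on a small scale. The example below is only an illustration: it swaps in SQLite's in-memory mode for the GPU database, reuses two rows of the Bitcoin sample from earlier in place of the tweet dataset, and stops at the aggregated result that a chart would read. The table name, column names, and function names are my own.

```python
import csv
import io
import sqlite3

# Step 1 stand-in: the "obtained" data is two rows of the Bitcoin sample.
SAMPLE = """\
Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price
1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
"""


def load_prices(conn, text):
    """Step 2: load CSV rows into a table (three columns only, for brevity)."""
    conn.execute(
        "CREATE TABLE prices (ts INTEGER PRIMARY KEY, close REAL, volume_btc REAL)"
    )
    reader = csv.DictReader(io.StringIO(text))
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        (
            (int(r["Timestamp"]), float(r["Close"]), float(r["Volume_(BTC)"]))
            for r in reader
        ),
    )
    conn.commit()


def hourly_summary(conn):
    """Step 3: aggregate per hour; integer division truncates ts to the hour."""
    return conn.execute(
        "SELECT ts / 3600 * 3600 AS hour_start, AVG(close), SUM(volume_btc), COUNT(*) "
        "FROM prices GROUP BY hour_start ORDER BY hour_start"
    ).fetchall()


conn = sqlite3.connect(":memory:")
load_prices(conn, SAMPLE)
summary = hourly_summary(conn)  # Step 4 would pass this to the visualization layer.
print(summary)
```

Every piece here (schema, loader, aggregation query) lives on one machine and is specific to one dataset, which is what makes someone else's dashboard hard to reproduce.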
Collaborative Data Science Approach: Start With Visualization
Another approach is to start with pre-processed data and a visualization chart, clone it, then modify the existing chart. In order to do this, the database, data, and visualization system would have to be hosted in the cloud. I imagine that only big companies like Amazon, Microsoft, Google, Facebook, or IBM could host large petabyte stores of data.
After reviewing my options, I looked at IBM Research’s PAIRS Geoscope project, which provided a quick and easy way to first visualize, play with, then either clone or download the data. I probably shouldn’t say “play” with the data, but it’s a more accurate description than “experiment with different queries, layers, and layouts.”
The reality is that part of data analysis is the ability to find cool things by visual analysis.
In the data example below, we can search for atmospheric phenomena such as Mean Cloud Cover.
The layers can be easily sorted. You can also set up a layer for population density.
If you like the dashboard, you can clone it.
Or, you can download a subset of the data generated by a query.
There’s a good set of data included with PAIRS that people can easily explore.
Hosted Big Data Collaboration Architecture
As I wanted to understand the technology behind PAIRS, I read the PAIRS User Manual from IBM Research. PAIRS is an acronym for Physical Analytics Data Repository and Services. The original research focused on cloud-based big data analytics with a large data store of pre-processed and curated geospatial data.
Original research group diagram; may not represent IBM’s commercial service (source)
The system appears to be based on HBase and Hadoop, hosting petabytes of data.
The user manual indicates three types of inputs for new queries:
Spatial coverage is easy to understand. It’s basically a shape on a map.
A time-based data query is simply applying time intervals to multiple layers.
Data is selected based on filtering layers together.
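To make the three inputs concrete, here is a small sketch of how a query combining them could be modeled. This is not the PAIRS schema or API: the class names, field names, and layer names are all hypothetical, and spatial coverage is simplified to a bounding box.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BoundingBox:
    """Spatial coverage, simplified to a rectangle in degrees."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat, lon):
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


@dataclass
class TimeInterval:
    """Closed time interval applied to every selected layer."""
    start: datetime
    end: datetime

    def contains(self, moment):
        return self.start <= moment <= self.end


@dataclass
class LayerQuery:
    """Hypothetical container for the three inputs; not the PAIRS schema."""
    area: BoundingBox
    intervals: list
    layers: list = field(default_factory=list)

    def matches(self, layer, lat, lon, moment):
        """True if a data point falls inside the layers, area, and time range."""
        return (layer in self.layers
                and self.area.contains(lat, lon)
                and any(i.contains(moment) for i in self.intervals))


query = LayerQuery(
    area=BoundingBox(40.0, -75.0, 42.0, -73.0),
    intervals=[TimeInterval(datetime(2017, 1, 1, tzinfo=timezone.utc),
                            datetime(2017, 12, 31, tzinfo=timezone.utc))],
    layers=["mean_cloud_cover", "population_density"],
)
print(query.matches("mean_cloud_cover", 40.7, -74.0,
                    datetime(2017, 6, 1, tzinfo=timezone.utc)))
```

A data point is selected only when all three conditions hold at once, which is the sense in which the layers are filtered together.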
API and Data Upload
PAIRS provides an extensive API for developers to write scripts. Users can also upload their own geospatial data and have it cross-linked with the existing datasets. The combination of an API for customization specific to your project with the ability to share and cross-link data with other data scientists is the real value of PAIRS in my opinion.
In order to assess the public API, I logged into a free account from the IBM PAIRS Services site.
The API structure is standard, and there are numerous usage examples for different languages.
I browsed through the document IBM PAIRS Services REST API Specification Developer Guide Freemium Release 1.0.
To use the IBM PAIRS API, I needed to get my API keys. There’s a wonderful video on this process here.
Once you have your API keys, you can then use the documentation for real-time API queries.
The API key in the documentation is auto-populated.
You’ll need to input your IBM username and password. I like to use web-based clients for my API testing. Here’s an example using Restlet Client.
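The same kind of request can be prepared from a script instead of a browser client. The sketch below builds, but does not send, a JSON POST using Python's standard library. It assumes the service accepts HTTP Basic authentication with the username and password, which I have not confirmed here. The URL, path, and payload are placeholders, so take the real values from the official API guide.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: replace with the host and path from the official API guide.
BASE_URL = "https://pairs.example.com/v2/query"


def build_request(username, password, payload):
    """Build (but do not send) a JSON POST with an HTTP Basic Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder credentials and payload, for illustration only.
request = build_request("user@example.com", "secret", {"layers": []})
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Keeping request construction separate from sending makes it easy to inspect the headers and body first, much as a web-based client shows them before you press Send.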
While Kaggle is great for sharing data, it has limits when it comes to collaborating with other data scientists and linking together datasets from different people. Traditional approaches to data visualization with a database and data processing produce great visuals that can be shared, but they limit the ability to cross-link other datasets into the visualization.
Cloud-based solutions from companies with a large platform can link together a large number of datasets and make it easy to add your own datasets and benefit from the crosslinks. By exposing a cloud-based API, these services offer great flexibility for your own custom use while also providing a rapid way to test or use multiple datasets that are linked together.
Cloud-based solutions also offer scalability in addition to rapid prototyping. IBM PAIRS Geoscope is a good example of the new breed of collaborative big data solutions. It comes from IBM, a large company with a strong data science background and a scalable cloud platform.