Building Big Data Prototypes

Unlike traditional approaches, cloud-based solutions can link a large number of datasets and make it easy to add your own datasets and benefit from the crosslinks.

By Jesse Casman · Mar. 02, 18 · Tutorial

Source: IBM Research PAIRS website

Note: This article was updated in March 2018 with IBM PAIRS Geoscope information. IBM PAIRS is moving from a research project to an enterprise service.

As a big data user, I often find myself on the hunt for large datasets. Each month, I look at large public datasets at places like Kaggle, Data.gov, USPTO Open Data Portal, HealthData.gov, U.S. Census Bureau, CIA World Factbook, Amazon Web Services public datasets (including the 1000 Genomes Project, NASA satellite imagery, and more), and DataPortals.org.

Here’s a traditional way of finding data sources on DataPortals.org, which lists 524 separate portals for data.

Surprisingly, the data format is not something out of the ordinary. Most datasets are downloaded as CSV files. Ultimately, it’s structured data made up of cells in rows and columns. Depending on the size, you can easily open the dataset files up in Excel or other spreadsheet programs and look through the data yourself.

A typical dataset on Kaggle is this Bitcoin data from 2012 to 2017. The uncompressed data size is 877MB.

The data looks like this:

Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price

1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39

1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39

Once I download the data, I need to import it into a database (in-memory, GPU, or traditional) before visualizing it.
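
As a minimal sketch of what that import step starts with, the sample rows above can be parsed with Python's standard library before they go into any database. The column names come from the header shown above; treating `Timestamp` as Unix epoch seconds in UTC is an assumption based on the values, and `parse_rows` is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime, timezone

# The header and two rows shown above, inlined so the sketch is self-contained.
SAMPLE = """\
Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price
1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
"""


def parse_rows(text):
    """Yield one dict per CSV row with numeric fields converted to floats."""
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {name: float(value) for name, value in row.items() if name != "Timestamp"}
        # Assumption: Timestamp is Unix epoch seconds in UTC.
        parsed["Timestamp"] = datetime.fromtimestamp(int(row["Timestamp"]), tz=timezone.utc)
        yield parsed


rows = list(parse_rows(SAMPLE))
print(rows[0]["Timestamp"].isoformat(), rows[0]["Close"])
```

For the full 877MB file, the same generator could read from an open file handle instead of a string, so the rows stream through without being held in memory all at once.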

Classic Approach: Start With Data Processing

In the example below, I focused on a single Twitter demo with 400 million tweets running on a hosted version of a GPU database. In order to run the demo locally and develop applications with the dataset, I downloaded a small set of 572,000 tweets. While it’s cool to work with a GPU database, this approach also illustrates the problem with building a visualization for a single dataset.

In the classic approach, building a new demo with a standard database-backed visualization system follows this process:

  1. Obtain the data source.

  2. Load data into the database.

  3. Connect the database to the processing system.

  4. Connect the visualization front end to the cached processed data.
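
The four steps above can be sketched end to end with nothing but the Python standard library. This is an illustration under stated assumptions, not the setup used for the Twitter demo: SQLite stands in for the GPU database, two inline Bitcoin rows stand in for a downloaded file, and the table name, the `load` and `process` helpers, and the hourly aggregation are all hypothetical choices.

```python
import csv
import io
import json
import sqlite3

# Step 1: obtain the data source (two inline sample rows stand in for a download).
RAW_CSV = """\
Timestamp,Open,High,Low,Close,Volume_(BTC),Volume_(Currency),Weighted_Price
1325317920,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
1325317980,4.39,4.39,4.39,4.39,0.45558087,2.0000000193,4.39
"""


def load(conn, text):
    """Step 2: load the rows into a database table."""
    conn.execute("CREATE TABLE prices (ts INTEGER, close REAL, volume_btc REAL)")
    reader = csv.DictReader(io.StringIO(text))
    conn.executemany(
        "INSERT INTO prices VALUES (?, ?, ?)",
        [(int(r["Timestamp"]), float(r["Close"]), float(r["Volume_(BTC)"])) for r in reader],
    )


def process(conn):
    """Step 3: aggregate per hour so the front end reads a small result set."""
    cursor = conn.execute(
        "SELECT ts / 3600 * 3600 AS hour, AVG(close), SUM(volume_btc) "
        "FROM prices GROUP BY hour ORDER BY hour"
    )
    return [{"hour": h, "avg_close": c, "volume_btc": v} for h, c, v in cursor]


conn = sqlite3.connect(":memory:")
load(conn, RAW_CSV)
# Step 4: the cached, processed payload a visualization front end would consume.
cached = json.dumps(process(conn))
print(cached)
```

Every stage here is private to one machine and one dataset, which is why a dashboard built this way is hard for someone else to find and clone.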

While this approach works, it is challenging to find and clone another person’s dashboard.

Collaborative Data Science Approach: Start With Visualization

Another approach is to start with pre-processed data and a visualization chart, clone it, then modify the existing chart. In order to do this, the database, data, and visualization system would have to be hosted in the cloud. I imagine that only big companies like Amazon, Microsoft, Google, Facebook, or IBM could host petabyte-scale stores of data.

After reviewing my options, I looked at IBM Research’s PAIRS Geoscope project, which provided a quick and easy way to first visualize, play with, then either clone or download the data. I probably shouldn’t say “play” with the data, but it’s a more accurate description than saying, “experiment with different queries, layers, and layouts.”

The Developers Section of the IBM PAIRS Geoscope website includes the REST API and Python SDK.

The reality is that part of data analysis is the ability to find cool things by visual analysis.

In the data example below, we can search for an atmospheric phenomenon such as Mean Cloud Cover.

The layers can be easily sorted. You can also set up a layer for population density.

If you like the dashboard, you can clone it.

Or, you can download a subset of the data generated by a query.

There’s a good set of data included with PAIRS that people can easily explore.

Hosted Big Data Collaboration Architecture

As I wanted to understand the technology behind PAIRS, I read the PAIRS User Manual from IBM Research. PAIRS is an acronym for Physical Analytics Integrated Data Repository and Services. The original research focused on cloud-based big data analytics with a large data store of pre-processed and curated geospatial data.

Original research group diagram; may not represent IBM’s commercial service (source)

The system appears to be built on HBase and Hadoop to host petabytes of data.

Queries

The user manual indicates three types of inputs for new queries:

  1. Spatial coverage

  2. Temporal coverage

  3. Data selection

Spatial

Spatial coverage is easy to understand. It’s basically a shape on a map.

Temporal

A time-based query simply applies time intervals to multiple layers.

Data Selection

Data is selected by filtering and combining layers.
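
To make the three inputs concrete, here is a minimal sketch of how a query combining them might be assembled as JSON. The field names (`spatial`, `temporal`, `layers`), the bounding-box ordering, the layer identifiers, and the `GeoQuery` class are illustrative assumptions, not taken from the PAIRS User Manual; the actual schema should be checked against the API documentation.

```python
import json
from dataclasses import dataclass, field
from datetime import date


@dataclass
class GeoQuery:
    # Spatial coverage: a bounding box as (south, west, north, east) in degrees.
    bbox: tuple
    # Temporal coverage: one or more (start, end) date intervals.
    intervals: list
    # Data selection: hypothetical layer identifiers to include.
    layers: list = field(default_factory=list)

    def to_payload(self):
        """Return the query as a plain dict ready for JSON serialization."""
        south, west, north, east = self.bbox
        # Simplification: boxes that cross the antimeridian are not handled.
        if not (south < north and west < east):
            raise ValueError("bounding box must be (south, west, north, east)")
        return {
            "spatial": {"type": "square", "coordinates": [south, west, north, east]},
            "temporal": {
                "intervals": [
                    {"start": start.isoformat(), "end": end.isoformat()}
                    for start, end in self.intervals
                ]
            },
            "layers": [{"id": layer_id} for layer_id in self.layers],
        }


query = GeoQuery(
    bbox=(40.0, -75.0, 41.0, -73.0),
    intervals=[(date(2017, 1, 1), date(2017, 1, 31))],
    layers=["mean-cloud-cover", "population-density"],
)
print(json.dumps(query.to_payload(), indent=2))
```

Keeping the three inputs as separate fields mirrors the way the user manual describes them, and makes it easy to reuse one spatial shape across several time intervals or layer selections.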

API and Data Upload

PAIRS provides an extensive API for developers to write scripts. Users can also upload their own geospatial data and have it cross-linked with the existing datasets. The combination of an API for customization specific to your project with the ability to share and cross-link data with other data scientists is the real value of PAIRS in my opinion.

In order to assess the public API, I logged in to a free account on the IBM PAIRS Services site.

The API structure is standard, and there are numerous usage examples for different languages.

I browsed through the document IBM PAIRS Services REST API Specification Developer Guide Freemium Release 1.0.

To use the IBM PAIRS API, I needed to get my API keys. There’s a wonderful video on this process here.

Once you have your API keys, you can then use the documentation for real-time API queries.

The API key in the documentation is auto-populated.

You’ll need to input your IBM username and password. I like to use web-based clients for my API testing. Here’s an example using Restlet Client.
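
The same kind of request can also be prepared from a script. The sketch below only builds an HTTP request carrying a username and password as a Basic authentication header; it does not send anything. The endpoint URL, the JSON body, the credentials, and the `build_request` helper are hypothetical placeholders, and whether the service expects Basic authentication, an API key header, or both should be confirmed in the developer guide.

```python
import base64
import json
import urllib.request


def build_request(url, username, password, payload):
    """Prepare (but do not send) a JSON POST with HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical endpoint and credentials; replace with values from your own account.
request = build_request(
    "https://example.com/v2/query", "user@example.com", "secret", {"layers": []}
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

A web-based client does the same thing behind its form fields, so a request that works there should translate directly into a script like this one.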

Summary

While Kaggle is great for sharing data, it has limits for collaborating with other data scientists and linking together different datasets from other people. Traditional approaches to data visualization with a database and data processing produce great visuals that can be shared, but they limit the ability to cross-link other datasets into the visualization.

Cloud-based solutions from companies with a large platform can link together a large number of datasets and make it easy to add your own datasets and benefit from the crosslinks. By exposing a cloud-based API, these services offer great flexibility for your own custom use while also providing a rapid way to test or use multiple datasets that are linked together.

Cloud-based solutions also offer scalability in addition to rapid prototyping. IBM PAIRS Geoscope is a good example of the new breed of collaborative big data solutions. It comes from IBM, a large company with a strong data science background and a scalable cloud platform.

