
I Have All the Data in the World, Now What?


The Big Data Workshop at InterOp Las Vegas wrapped up the morning with a presentation on Big Data requirements by John West, CTO and Founder of Fabless Labs. John kicked off with the challenge of having your enormous data set all ready to work with when you discover any one of the following problems:

  • Your network is too slow to handle the extra load
  • Hadoop has created a crater where your virtualized storage array used to be
  • Map/Reduce programming is slow and hard
  • At a large scale, math is really hard
  • It takes two days to load your big data cluster each week
  • Tuning some of the queries is a bear
  • The data is corrupt
  • Your Hadoop queries seem to be very network-intensive
  • You failed your security audit
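The "Map/Reduce programming is slow and hard" point is easy to illustrate: even the canonical word-count job requires an explicit map, shuffle, and reduce phase. Here is a minimal sketch in plain Python (an illustration of the programming model, not tied to any particular Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is hard"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'hard': 1}
```

Three separate phases, plus the I/O and serialization a real cluster adds around each one, is a lot of machinery for a one-line question like "how often does each word occur?"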

These aren’t a stretch: John emphasized that he encounters one or more of these challenges on a regular basis when working on Big Data projects. The truth of the matter is that Big Data is quite complicated.

John went on to expand on some simple facts:

  • Analytics are amazing but action is usually required (OK, always)
  • Big data deployments have implications for the rest of your environments
  • Current big stacks have a lot of components
  • Hadoop is not generally a system of record. It is a data processing environment. Data has to be moved in and out to be useful.
  • Hadoop is not quite like your existing data warehousing cluster database

What is the deal with NoSQL?

He next gave the audience a great overview of NoSQL, Big Data’s response to unstructured and semi-structured data. West described NoSQL as “the data workhorse of Big Data.” He went on to describe it as:

  • Open source, horizontally scalable, distributed database system
  • Database platform that doesn’t adhere to the RDBMS standard
  • Designed for enormous datasets where retrieve and append operations are the norm
  • Data is sharded and replicated, no single point of failure
  • Key-value store
  • Hive, HBase, and similar tools provide database semantics
  • In-memory systems
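The sharding and replication points above can be sketched in a few lines. This is a toy in-memory key-value store, not any particular NoSQL product: keys are hashed to a primary shard and copied to a neighboring shard, so a read can succeed even if one copy is gone.

```python
import hashlib

class ShardedKVStore:
    """Toy key-value store: keys are hashed to a primary shard and
    replicated to neighboring shards, so there is no single point
    of failure for any one key."""

    def __init__(self, num_shards=4, replicas=2):
        self.replicas = replicas
        self.shards = [{} for _ in range(num_shards)]

    def _shard_ids(self, key):
        # Hash the key to a primary shard, then replicate to the next ones
        primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.shards)
        return [(primary + i) % len(self.shards) for i in range(self.replicas)]

    def put(self, key, value):
        # Append-style write: every replica gets the value
        for sid in self._shard_ids(key):
            self.shards[sid][key] = value

    def get(self, key):
        # Retrieve from the first replica that still holds the key
        for sid in self._shard_ids(key):
            if key in self.shards[sid]:
                return self.shards[sid][key]
        return None

store = ShardedKVStore()
store.put("user:42", "Ada")
print(store.get("user:42"))  # Ada
```

Real systems layer consistency protocols, failure detection, and rebalancing on top, but the core idea — hash to shard, write to several replicas, read from any — is exactly this.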

The number one issue of Big Data

In West’s view, the single biggest problem of Big Data is ‘data provenance’, which includes the following challenges:

  • How is data stored, and how are changes tracked over time?
  • How is the data secured?
  • How will data issues be investigated?
  • Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline
  • If one of your key products is to “crunch data” and derive or extract value from it then you should be concerned about data provenance
  • This is true whether you are crunching your own data or third-party data
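West’s point about recording information at the data’s “birth” and carrying it through the pipeline can be sketched simply: wrap each record with a provenance trail and have every transformation append to it. This is an illustrative pattern, not a specific tool the session named; the field and step names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    value: dict
    provenance: list = field(default_factory=list)  # history of what touched this data

def ingest(raw, source):
    # Record where the data came from at its "birth"
    return Record(value=raw, provenance=[f"ingested from {source}"])

def transform(record, step_name, fn):
    # Apply a transformation AND append it to the provenance trail,
    # so the history travels with the data through the pipeline
    return Record(value=fn(record.value),
                  provenance=record.provenance + [step_name])

rec = ingest({"temp_f": 98.6}, "sensor-feed")
rec = transform(rec, "convert to celsius",
                lambda v: {"temp_c": round((v["temp_f"] - 32) * 5 / 9, 1)})
print(rec.value)       # {'temp_c': 37.0}
print(rec.provenance)  # ['ingested from sensor-feed', 'convert to celsius']
```

With the trail attached, a data issue can be investigated by reading the record’s own history rather than reconstructing it after the fact — which is the difference between useful provenance and metadata that dies at ingestion.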

An excellent session with a great amount of detail around Big Data requirements. John can be reached at John@FablessLabs.com.

John’s sample Big Data architecture:

[Image: Fabless Labs Big Data Environment Example]


