I Have All the Data in the World, Now What?

The Big Data Workshop at InterOp Las Vegas wrapped up the morning with a presentation on Big Data requirements by John West, CTO and Founder of Fabless Labs. John kicked off with a familiar scenario: your enormous data set is finally ready to work with, and then you discover one or more of the following problems:

  • Your network is too slow to handle the extra load
  • Hadoop has created a crater where your virtualized storage array used to be
  • Map/Reduce programming is slow and hard (a minimal sketch of why appears below)
  • At a large scale, math is really hard
  • It takes two days to load your big data cluster each week
  • Tuning some of the queries is a bear
  • The data is corrupt
  • Your Hadoop queries seem to be very network-intensive
  • You failed your security audit

These aren’t a stretch: John emphasized that he has encountered one or several of these challenges on a regular basis when working on Big Data projects. The truth of the matter is that Big Data is quite complicated.
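The bullet about Map/Reduce programming being slow and hard is easy to appreciate with even the simplest job. Below is a minimal sketch, plain Python that simulates the map, shuffle, and reduce phases locally. It illustrates the programming model only; it is not John's architecture or any particular Hadoop API, and the sample input is made up for the example. The task, a word count, is a one-line GROUP BY in SQL:

    from collections import defaultdict

    # Word count expressed as explicit map / shuffle / reduce phases.
    # A local simulation for illustration only; on a real cluster each
    # phase runs distributed across many nodes.

    def map_phase(lines):
        # Mapper: emit (word, 1) for every word in every input line.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def shuffle_phase(pairs):
        # Shuffle: group all emitted values by key, as the framework does.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(grouped):
        # Reducer: sum the counts for each word.
        for key, values in grouped:
            yield key, sum(values)

    if __name__ == "__main__":
        lines = ["big data is big", "data needs provenance"]
        for word, count in reduce_phase(shuffle_phase(map_phase(lines))):
            print(word, count)

On a real cluster, job configuration, serialization, and retries add still more code around this, which is part of why higher-level tools like Hive exist.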

John continued by expanding on some simple facts:

  • Analytics are amazing, but action is usually required (OK, always)
  • Big data deployments have implications for the rest of your environments
  • Current Big Data stacks have a lot of components
  • Hadoop is not generally a system of record. It is a data processing environment. Data has to be moved in and out to be useful.
  • Hadoop is not quite like your existing data warehouse or clustered database

What is the deal with NoSQL?

He next gave the audience a great overview of NoSQL, Big Data’s response to unstructured and semi-structured data. West described NoSQL as “The data workhorse of Big Data.” He went on to describe it as:

  • Open source, horizontally scalable, distributed database system
  • Database platform that doesn’t adhere to the RDBMS standard
  • Designed for enormous datasets where retrieve and append operations are the norm
  • Data is sharded and replicated, no single point of failure
  • Key-value store
  • Hive, HBase, and similar tools give database semantics
  • In-memory systems
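To make the sharded, replicated key-value description concrete, here is a minimal sketch in plain Python, with in-memory dictionaries standing in for nodes. The shard count, replication factor, and hashing scheme are illustrative assumptions, not how any particular NoSQL engine works:

    import hashlib

    class TinyKVStore:
        # Toy key-value store: each key hashes to a primary shard and is
        # copied to the next shards, so losing one node does not lose data.

        def __init__(self, num_shards=4, replicas=2):
            self.num_shards = num_shards
            self.replicas = replicas
            self.shards = [dict() for _ in range(num_shards)]  # stand-ins for nodes

        def _shard_ids(self, key):
            primary = int(hashlib.md5(key.encode()).hexdigest(), 16) % self.num_shards
            return [(primary + i) % self.num_shards for i in range(self.replicas)]

        def put(self, key, value):
            for shard_id in self._shard_ids(key):
                self.shards[shard_id][key] = value

        def get(self, key):
            # Read from any replica that holds the key.
            for shard_id in self._shard_ids(key):
                if key in self.shards[shard_id]:
                    return self.shards[shard_id][key]
            return None

    store = TinyKVStore()
    store.put("user:42", {"name": "Ada", "visits": 17})
    print(store.get("user:42"))

Layers such as Hive and HBase then add table and query semantics on top of this kind of distributed key-value storage.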

The number one issue of Big Data

In West’s view, the single biggest problem of Big Data is ‘data provenance’, which includes the following challenges:

  • How is data stored, and how are changes tracked over time?
  • How is the data secured?
  • How will data issues be investigated?
  • Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline
  • If one of your key products is to “crunch data” and derive or extract value from it, then you should be concerned about data provenance
  • This is true whether you are crunching your own data or third-party data
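One way to picture carrying birth information along the pipeline is the sketch below, in plain Python. The field names and processing steps are assumptions made for illustration, not a provenance standard: each record carries a lineage list that every step appends to, so origin and transformation history stay attached to the data they describe.

    from datetime import datetime, timezone

    def ingest(raw_value, source):
        # Wrap a raw value with provenance recorded at its "birth".
        return {
            "value": raw_value,
            "provenance": [{
                "step": "ingest",
                "source": source,
                "at": datetime.now(timezone.utc).isoformat(),
            }],
        }

    def transform(record, step_name, fn):
        # Apply a transformation and append it to the record's lineage.
        return {
            "value": fn(record["value"]),
            "provenance": record["provenance"] + [{
                "step": step_name,
                "at": datetime.now(timezone.utc).isoformat(),
            }],
        }

    record = ingest(" 42 ", source="third-party-feed")
    record = transform(record, "strip_whitespace", str.strip)
    record = transform(record, "to_int", int)
    print(record["value"])       # 42
    print(record["provenance"])  # lineage: ingest -> strip_whitespace -> to_int

When a downstream number looks wrong, the lineage answers where the value came from and what touched it, whether the input was your own data or a third party's.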

It was an excellent session with a great amount of detail around Big Data requirements. John can be reached at John@FablessLabs.com.

John’s sample Big Data architecture:

(Figure: Fabless Labs Big Data environment example)

Published at DZone with permission of Christopher Taylor, DZone MVB. See the original article here.
