The Big Data Workshop at InterOp Las Vegas wrapped up the morning with a presentation on Big Data requirements by John West, CTO and founder of Fabless Labs. John kicked off with a familiar scenario: your enormous data set is finally ready to work with when you discover one of the following problems:
- Your network is too slow to handle the extra load
- Hadoop has created a crater where your virtualized storage array used to be
- Map/Reduce programming is slow and hard
- At a large scale, math is really hard
- It takes two days to load your big data cluster each week
- Tuning some of the queries is a bear
- The data is corrupt
- Your Hadoop queries seem to be very network-intensive
- You failed your security audit
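The "Map/Reduce programming is slow and hard" complaint is easier to appreciate with a concrete picture of the programming model. The following is a minimal, self-contained sketch of the word-count pattern (the canonical Map/Reduce example), not anything from John's talk: even this trivial job needs an explicit map phase, a shuffle/sort between phases, and a reduce phase.

```python
from itertools import groupby
from operator import itemgetter

# Toy word-count job, the "hello world" of Map/Reduce.
def map_phase(line):
    """Emit (word, 1) pairs for each word in a line of text."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Sum the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    # Map: flatten all (word, 1) pairs from every input line.
    pairs = [kv for line in lines for kv in map_phase(line)]
    # Shuffle/sort: group pairs by key, as the framework would between phases.
    pairs.sort(key=itemgetter(0))
    # Reduce: one reduce call per distinct key.
    return dict(
        reduce_phase(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

result = run_job(["big data is big", "data is hard"])
# result == {"big": 2, "data": 2, "hard": 1, "is": 2}
```

In a real cluster the map and reduce phases run on different machines and the shuffle moves data over the network, which is where much of the cost and complexity comes from.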
These aren’t a stretch: John emphasized that he has encountered one or more of these challenges on a regular basis when working on Big Data projects. The truth of the matter is that Big Data is quite complicated.
John continued to expand on the simple facts that:
- Analytics are amazing but action is usually required (OK, always)
- Big data deployments have implications for the rest of your environments
- Current big stacks have a lot of components
- Hadoop is not generally a system of record. It is a data processing environment. Data has to be moved in and out to be useful.
- Hadoop is not quite like your existing data warehousing cluster database
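The "data has to be moved in and out" point can be made concrete with a tiny extract-process-load round trip. This is purely illustrative (the "system of record" here is just a dict, and the table names are invented), but it shows the shape of the workflow: copy data out, derive value in a separate processing environment, and write the result back where it is usable.

```python
# Hypothetical system of record, represented as a plain dict for illustration.
system_of_record = {"orders": [120, 80, 200]}

def extract(store, table):
    """Pull a copy of the data out of the system of record."""
    return list(store[table])

def process(rows):
    """Stand-in for the heavy analysis done in the processing environment."""
    return {"total": sum(rows), "count": len(rows)}

def load(store, table, result):
    """Write the derived result back so it is usable downstream."""
    store[table] = result

staged = extract(system_of_record, "orders")
summary = process(staged)
load(system_of_record, "orders_summary", summary)
# system_of_record["orders_summary"] == {"total": 400, "count": 3}
```

At cluster scale each of these arrows is a real data-movement job, which is why the two-day load times and network pressure from the earlier list show up in practice.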
What is the deal with NoSQL?
He next gave the audience a great overview of NoSQL, Big Data’s response to unstructured and semi-structured data. West described NoSQL as “The data workhorse of Big Data.” He went on to describe it as:
- An open-source, horizontally scalable, distributed database system
- A database platform that doesn’t adhere to the RDBMS standard
- Designed for enormous datasets where retrieve and append operations are the norm
- Data is sharded and replicated, with no single point of failure
- Hive, HBase, and similar tools give database semantics
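The "sharded and replicated, no single point of failure" property can be sketched in a few lines. This is a toy illustration, not a real NoSQL client: the shard count, replica factor, and ring-style placement are all assumptions chosen to show the idea that every key deterministically maps to a primary shard plus replicas, so losing one shard never loses the only copy.

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count
REPLICAS = 2    # illustrative replication factor

def shards_for_key(key, num_shards=NUM_SHARDS, replicas=REPLICAS):
    """Return the primary shard plus replica shards for a key."""
    # Hash the key so data spreads evenly across shards.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = digest % num_shards
    # Place replicas on the next shards around the ring.
    return [(primary + i) % num_shards for i in range(replicas)]
```

Because placement is a pure function of the key, any node can compute where a record lives without consulting a central coordinator, which is one way such systems avoid a single point of failure.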
The number one issue of Big Data
In West’s view, the single biggest problem of Big Data is ‘data provenance’, which includes the following challenges:
- How is data stored, and how are changes tracked over time?
- How is the data secured?
- How will data issues be investigated?
- Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline
- If one of your key products is to “crunch data” and derive or extract value from it then you should be concerned about data provenance
- This is true whether you are crunching your own data or third-party data
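One way to make "carried along through the data analysis pipeline" concrete is to wrap each record with a provenance trail that every transformation appends to. This is a minimal sketch under my own assumptions (the field names and the `third_party_feed` source are invented), not a description of any particular tool West mentioned.

```python
# Toy sketch: carry provenance metadata with a value through a pipeline,
# so downstream consumers can answer "where did this come from and
# what touched it?"
def with_provenance(value, source):
    """Record the origin of a value at its 'birth'."""
    return {"value": value, "provenance": [{"step": "ingest", "source": source}]}

def transform(record, step_name, fn):
    """Apply fn to the value and append the step to the provenance trail."""
    return {
        "value": fn(record["value"]),
        "provenance": record["provenance"] + [{"step": step_name}],
    }

rec = with_provenance(10, source="third_party_feed")  # hypothetical source name
rec = transform(rec, "normalize", lambda v: v / 10)
# rec["value"] == 1.0, and rec["provenance"] lists the "ingest" and
# "normalize" steps in order.
```

The point of the pattern is exactly West's: information recorded at the data's birth is only useful if each stage of the pipeline preserves and extends it rather than discarding it.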
This was an excellent session with a great amount of detail on Big Data requirements. John can be reached at John@FablessLabs.com.
John’s sample Big Data architecture: