On Structured and Unstructured Big Data Analytics
Analysis of unstructured data is the hot topic these days – organizations are lured by the promise of deriving huge incremental value by crunching vast pools of seemingly random data to uncover patterns and trends. While visiting the incredible SuperNAP data center in Las Vegas, I was shown racks running a 50,000-node Hadoop cluster dedicated to making sense of the mass of data that eBay holds. Given exciting use cases like these, it’s not a huge surprise that structured data analytics sometimes gets forgotten.
At the recent HP Discover event in Las Vegas (disclosure – HP covered my T&E to attend Discover), it was easy to forget for a moment that structured data exists. In his keynote, newly minted Autonomy CEO Stouffer Egan told the audience an aspirational story about making sense of the ever-increasing quantities of unstructured data in existence – he pointed to the rise of social media, the impending emergence of the Internet of Things, and the already high proportion of existing data that is unstructured to sell attendees on his product.
But even before the hotly debated Autonomy acquisition, HP had a product that made sense of mass data: Vertica, largely forgotten (at least at Discover), which HP acquired in 2011. Vertica is an analytic database management platform whose company was founded in 2005 by database researcher Michael Stonebraker, someone who has gone on record on many occasions with strong thoughts about structured and unstructured data as they relate to databases.
This apparent shift of attention was a topic of discussion between me and some other Discover attendees – the closest we came to an explanation was that HP was caught up in two particular tendencies:
- Chasing the industry theme du jour. With the world talking about unstructured data ad infinitum, that focus has colored HP’s messaging about its own products.
- The “shiny new thing” syndrome. Autonomy was acquired just a year ago, and as the new kid on the block, HP is eager to talk it up (and earn a return on the price it paid).
Anyway – I digress.
Somewhat confused by the comments around the rise of unstructured data analysis, I began a LinkedIn conversation to gather practitioners’ views on the assertion. Their comments are illuminating, as they cut through much of the hand-waving and hype around the topic.
Phillip Jaenke provided a very coherent, and perhaps prescient, comment:
Structured data analysis is going to continue to trump unstructured data analysis for the simple reason of efficiency. People are all about the cargo cult of unstructured analysis; usually without any consideration for the real world cost. Unstructured analysis is inherently inefficient – you always have to perform at least one very painful additional extraction step.
Many folks these days forget or ignore the reason structured data analysis is done the way it is – efficiency. There is no sane or justifiable reason to throw structured data such as order histories or network analysis data into the big unstructured bucket. It’s outright insane.
That said, the future is not in unstructured storage + analysis in one package. Big buckets for everything are a bad idea, and eventually people will realize this. They’re great for some types of data and absolutely horrible for others. That’s just a fact. The real future is in software which can perform complex systems analysis on diverse data sets – e.g., taking ERP SQL, web logs, and unstructured data and creating coherent data from that combination.
The notion Jaenke raises is that of an analytics hub, if you will – an engine that can process data sets no matter where they come from (massive unstructured sets from social media, say, or vast structured tables of customer history) and crunch insights from the combination. This view would seem to gel with the most recent version of Vertica, which HP released just before Discover. Vertica’s new “FlexStore” architecture is designed to make it a true analytics hub: with it, Vertica can federate data from various data sets for analysis. These varied inputs include Hadoop and HP’s own Autonomy IDOL platform – thus bringing unstructured analytics into the fold.
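To make the analytics-hub idea concrete, here is a minimal sketch of what federating structured and unstructured data might look like in principle. This is purely illustrative – the customer names, order data, ticket text, and the crude keyword-based "extraction step" are all invented for this example and have nothing to do with Vertica's or IDOL's actual APIs.

```python
# Hypothetical "analytics hub" sketch: join a structured table
# (order history) with a signal extracted from unstructured text
# (support tickets), keyed by customer. All data is invented.

orders = [  # structured side: would normally live in an ERP/SQL store
    {"customer": "acme", "total": 1200.0},
    {"customer": "acme", "total": 300.0},
    {"customer": "globex", "total": 450.0},
]

tickets = [  # unstructured side: free-text support tickets
    ("acme", "Shipment arrived late and the invoice was wrong."),
    ("globex", "Great service, very happy with the new dashboard."),
]

NEGATIVE_WORDS = {"late", "wrong", "broken", "unhappy"}

def sentiment(text):
    """Crude extraction step: score free text by counting negative keywords."""
    words = {w.strip(".,!").lower() for w in text.split()}
    return -len(words & NEGATIVE_WORDS)

# Aggregate the structured side, then join on the customer key.
spend = {}
for order in orders:
    spend[order["customer"]] = spend.get(order["customer"], 0.0) + order["total"]

combined = {
    customer: {"total_spend": spend.get(customer, 0.0), "sentiment": sentiment(text)}
    for customer, text in tickets
}

print(combined)
# Flags acme as a high-spend customer with negative sentiment.
```

The point Jaenke makes is visible even at this toy scale: the structured aggregation is trivial, while the unstructured side requires an extra, lossy extraction step before the two can be combined at all.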
One could argue that a degree of rationalization needs to occur between Autonomy and Vertica – the messaging and use cases articulated for the two conflict and overlap to an extent. I’d expect that over the months ahead, as HP CEO Meg Whitman drives more efficiency and consistency, we’ll begin to see a much more unified message that looks at analytics holistically, without creating silos on either the structured or the unstructured side of the house.
Published at DZone with permission of Ben Kepes, DZone MVB.
Opinions expressed by DZone contributors are their own.