
Why Don't We Just Hadoopify It?



Matt Schumpert took the stage at the InterOp Big Data Workshop in Las Vegas yesterday to talk about the myths and realities of Big Data. Matt is Director of Solutions Engineering at Datameer, with customers that include Visa, Sears, and three of the world's five largest banks. He’s an expert on Big Data and brought the following insights:

Hadoop isn’t just a term

Matt had some fun with the hypesters and opportunists when he stated, “Hadoop integration is not a checkbox on your marketing material, your press release, or a skill to add to LinkedIn.” He described it instead as a completely new way of looking at an organization’s data pipeline. He explained that Hadoop requires new tools and architectural approaches, as well as new processes, people skills, and hardware.

To drive that point home, he put up the graphic that maps all of the companies that claim a piece of the big data pie and their relationships with other companies that make the same claim. It was a great example of data visualization that had the crowd pulling out their smartphones to snap the image.

You can’t just hadoopify it

His second point was that Hadoop is indeed transformative, but can’t be approached without a certain degree of commitment, rigor, and investment. “It is a completely new platform for managing data that represents the biggest shift since the introduction of the relational database.” He went on to say that people who think of simply bolting on Hadoop or writing custom code find they are:

  • “Shackled by the chains of the past,” with enterprise data warehouse report requests and IT-driven analytical processes
  • “Stuck in a quagmire of code,” written in an esoteric language only understood by data scientists whose longevity in the organization isn’t at all clear
  • No better off with new but fragile systems cast in stone using brittle methods
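The “quagmire of code” point stems from Hadoop’s low-level MapReduce programming model, where even trivial jobs are spelled out as explicit map, shuffle, and reduce phases. As a rough, framework-free sketch of that model (plain Python, not the actual Hadoop APIs — the function names here are illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, as a Hadoop mapper would
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle/sort step does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum each key's values, as a Hadoop reducer would
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big plans", "big pain"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'plans': 1, 'pain': 1}
```

In real Hadoop, each of these stages becomes a Java class plus job configuration, serialization, and cluster plumbing — which is why teams without that skill set end up dependent on a few specialists.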

Matt explained that the real challenge of simply ‘hadoopifying’ a problem or an organization is the missed opportunity it represents. From his perspective, BI and analytics should be the most fluid part of any company’s technology, because they support a full understanding of the marketplace and the ability to adapt quickly. They should reveal and answer the questions that allow a business to move to new opportunities.

Using Hadoop the right way

Schumpert said that the alternative to this bad-news scenario is to explore and communicate around your data using tools everyone knows, like straightforward spreadsheets and dashboards. The business, which is the true consumer of data, needs to be comfortable managing its own work and has to be put back in the driver’s seat. He explained that IT has taken over managing end users’ Big Data needs because Hadoop tools are limited, require a great deal of infrastructure, and are programming-intensive.
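The contrast Schumpert draws is between framework plumbing and declarative, business-friendly tooling. As a hedged illustration (plain Python standard library, not any vendor’s product), the same kind of aggregation that a hand-written Hadoop job spells out in pages of code collapses into a single expression of intent when a higher-level layer handles the mechanics:

```python
from collections import Counter

lines = ["big data big plans", "big pain"]

# One expression states the intent; the library handles
# tokenizing, grouping, and counting behind the scenes
counts = Counter(word.lower() for line in lines for word in line.split())
print(counts.most_common(1))  # [('big', 3)]
```

Spreadsheet-style and SQL-on-Hadoop tools make the same trade: business users express *what* they want, and the platform owns *how* it runs.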

Hadoop, he noted, is disruptive but incomplete in its raw form. While it has the advantages of economics, flexibility, and scalability, it has equal challenges in complexity, resource demands, and a lack of packaged applications.

The Big Data path forward

Rather than leave the audience with dire warnings, he offered the following advice on moving forward wisely with Big Data.

  • Clearly identify use cases around either fast results, big value or big pain points
  • Choose your project type, be it reporting, a purpose-built data application or data discovery
  • Figure out your total cost of ownership and expected return on investment early
  • Make sure the team involves all the right players, including Security, SMEs, Networking, and others
  • Know your data sources well based on what’s available, what’s missing, access, cost, frequency and security
  • Consider the hardware implications including fault tolerance, storage, compute, network and application servers
  • Figure out the software early and have skills aligned for Linux, Java, Hadoop, Analytics and monitoring
  • Decide what to build and what to buy, knowing the true cost of each
  • Get realistic on timeline and know the time from planning to acting on insights
  • Make sure you deploy hardware, software, network, monitoring, security and data integration wisely
  • Aggregate data where it makes sense
  • Take a hypothesis approach to using your Big Data
  • Have your analytics figured out from dirty data to self service
  • Ensure your data visualization gives you the perspectives and collaboration you need
  • Iterate on the work to make sure you sharpen your insights as you learn
  • Industrialize your solution to keep it from being fragile and losing value
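Several of the points above — total cost of ownership, expected return, build vs. buy — reduce to simple arithmetic that is worth doing before any hardware is ordered. A minimal sketch of a payback calculation (all figures below are hypothetical, chosen only for illustration, not from the talk):

```python
def payback_months(upfront_cost, monthly_run_cost, monthly_benefit):
    """Months until cumulative benefit covers upfront plus running costs.

    Returns None if the project never pays back.
    """
    net_monthly = monthly_benefit - monthly_run_cost
    if net_monthly <= 0:
        return None
    # Ceiling division: a partial month still costs a full month of waiting
    return -(-upfront_cost // net_monthly)

# Hypothetical figures: $500k for cluster and integration,
# $40k/month to run, $90k/month in expected analytics-driven value
print(payback_months(500_000, 40_000, 90_000))  # 10
```

Even a back-of-the-envelope model like this forces the conversation about whether the use case justifies the investment — which is exactly the discipline Schumpert is asking for.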

Matt’s presentation was a great reminder of the pitfalls of going into Big Data without understanding the myths and realities. He can be reached at mschumpert@datameer.com.

