
Top 5 Reasons Most Big Data Projects Never Go Into Production

Big data is changing the world. But, imagine how much more it could do if more projects made it into production. We look at how to make this possible.


Enterprises are struggling to deploy big data workloads into production. Gartner captured this well in a late 2016 press release, which stated, “Only 15 percent of businesses reported deploying their big data project to production.” Gartner was careful in its word selection. It didn’t mean there isn’t a lot of experimentation, or that data scientists haven’t found new insights by using big data techniques. It specifically stated that these projects didn’t make it into production. The problem isn’t with big data analytics, or even with most of the data science experiments. The challenge is the lack of big data automation to make it easy to promote initial experiments out of the sandbox and into a fully functional production environment.

And this isn’t completely surprising.

Most people think that getting analytics into production is just about tuning the cluster. Sure, you can write a sqoop script and bring a table in once. But it is another challenge to bring it in multiple times without affecting the source systems. Then you have to be sure that the data pipeline you’ve built delivers the data within the timeframe set by your service level agreements (SLAs). Additionally, the data models have to be optimized for consumption through the tools your users already rely on, like Tableau or Qlik, with the responsiveness they have grown to expect.

There has been a ton of effort and investment in using tools on top of Hadoop and Spark to do rapid prototyping against large datasets. But prototyping is one thing. It is a completely different challenge to get that prototype to create a data workflow that runs every day without failing, or to enable elegant recovery when the data flow job does fail.

With that in mind, here are our top 5 technical reasons that 85% of big data projects never make it to production.

  1. Can’t load data fast enough to meet SLAs. Tools like sqoop support parallelized ingest for getting data from legacy sources into a data lake, but you need an expert to make them work. How should you partition the data? How many containers should you run? If you can’t properly parallelize the ingest, a load that should take an hour can take 10 to 20 times longer, and most people simply don’t know how to tune it. (A partitioned parallel-read sketch follows this list.)
  2. Can’t incrementally load data to meet SLAs. Most organizations aren’t moving their entire operations onto a big data environment; they move data there from existing operational systems to perform new kinds of analysis or machine learning, which means they need to keep loading new data as it arrives. The problem is that these environments don’t natively support row-level inserts, updates, or deletes, so you either reload the entire dataset every time (see point 1 above) or code your way around this classic change data capture problem. (An incremental-merge sketch follows this list.)
  3. Can’t provide interactive reporting access to the data. Imagine you have 1,000 BI analysts and none of them want to use your data models because queries take too long. In truth, a single analyst is enough to make this unbearable. This is a classic problem with Hadoop, and it is why many companies use Hadoop only for preprocessing and for applying specific machine learning algorithms, then move the final dataset back to a traditional data warehouse for use by a BI tool. Either way, that adds one more step to the process and gets in the way of completing a big data project. (A pre-aggregation sketch follows this list.)
  4. Can’t migrate from test to production. Many organizations have identified the potential for new insights from data scientists working within a sandbox environment. Once they have a new “recipe” for analytics, they need to move it from an individual data scientist’s sandbox to a production environment that runs it every day. Moving from dev to production is a full lift-and-shift operation that is generally done manually, and a pipeline that ran fine on the dev cluster now has to be re-optimized for the production cluster. That tuning can require significant rework, especially if the dev environment differs in any way from production. (A config-driven promotion sketch follows this list.)
  5. Can’t manage end-to-end production workloads. Most organizations have focused on tooling that helps their data analysts and data scientists identify new insights. They have not made a similar investment in tooling for running data workflows in production, where you have to worry about starting, pausing, and restarting jobs, ensuring fault tolerance, handling notifications, and orchestrating multiple workflows so they don’t “collide.” (An orchestration sketch follows this list.)
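
To make point 1 concrete, here is a minimal sketch of a partitioned parallel ingest, assuming Spark’s JDBC reader as a stand-in for a tuned sqoop job. The connection URL, table, credentials, and bounds are hypothetical.

```python
# A minimal sketch for point 1: parallelized JDBC ingest with Spark standing in
# for a sqoop import. The URL, table, credentials, and bounds are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders_ingest").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-db:1521/ORCL")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "etl_user")
    .option("password", "***")  # pull from a secrets store in practice
    # These four options control parallelism: Spark issues numPartitions range
    # queries over partitionColumn instead of one long full-table scan.
    .option("partitionColumn", "ORDER_ID")
    .option("lowerBound", "1")
    .option("upperBound", "500000000")
    .option("numPartitions", "32")
    .load()
)

orders.write.mode("overwrite").parquet("/data/lake/raw/orders")
```

Picking the partition column, bounds, and partition count is exactly the tuning work described above: skew the bounds or use too few partitions and the load serializes; use too many and you hammer the source system.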
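
For point 2, a common workaround is a watermark-based incremental load: pull only the rows changed since the last run, then keep the newest version of each key. This is a sketch, not a prescribed method; table, column, and path names are illustrative.

```python
# A sketch of a watermark-based incremental load and merge for point 2.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("orders_incremental").getOrCreate()

last_watermark = "2018-06-01 00:00:00"  # in practice, read this from a control table

changes = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//source-db:1521/ORCL")
    .option("dbtable",
            f"(SELECT * FROM SALES.ORDERS WHERE UPDATED_AT > TIMESTAMP '{last_watermark}') delta")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

current = spark.read.parquet("/data/lake/orders")

# Emulate an upsert: union the existing snapshot with the changed rows and keep
# the latest row per key, since plain HDFS files cannot be updated in place.
latest_first = Window.partitionBy("ORDER_ID").orderBy(F.col("UPDATED_AT").desc())
merged = (
    current.unionByName(changes)
    .withColumn("rn", F.row_number().over(latest_first))
    .filter("rn = 1")
    .drop("rn")
)

# Write to a staging path and swap it in after validation.
merged.write.mode("overwrite").parquet("/data/lake/orders_staging")
```

A table format with native upserts (Hive ACID tables, for example) avoids the rewrite-and-swap step, but the watermark bookkeeping remains.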
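
Point 3 is often tackled by pre-aggregating the detail into a small, columnar summary sized for BI tools, whether it stays in the data lake or lands in a warehouse. A rough sketch, with illustrative column names:

```python
# A sketch for point 3: build a partitioned daily summary so tools like Tableau
# or Qlik query millions of rows instead of billions. Column names are made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_daily_summary").getOrCreate()

orders = spark.read.parquet("/data/lake/orders")

daily = (
    orders.groupBy("ORDER_DATE", "REGION", "PRODUCT_CATEGORY")
    .agg(
        F.sum("AMOUNT").alias("revenue"),
        F.countDistinct("CUSTOMER_ID").alias("customers"),
    )
)

# Partitioned, columnar output keeps interactive queries from scanning everything;
# many teams load this summary into a warehouse sized for concurrent BI users.
daily.write.mode("overwrite").partitionBy("ORDER_DATE").parquet("/data/marts/orders_daily")
```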
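
For point 4, one way to reduce the manual lift-and-shift is to keep everything environment-specific in configuration so the same pipeline code runs in dev and production. The file layout and keys below are assumptions, not a standard:

```python
# A sketch for point 4: the pipeline reads cluster-specific settings from a
# config file, so promotion is a config change rather than a code rewrite.
import json
import sys

from pyspark.sql import SparkSession

env = sys.argv[1] if len(sys.argv) > 1 else "dev"
with open(f"conf/{env}.json") as f:  # e.g. conf/dev.json, conf/prod.json
    cfg = json.load(f)

spark = (
    SparkSession.builder.appName(f"orders_pipeline_{env}")
    # Runtime tuning differs between clusters, so it lives in config, not code.
    # Cluster sizing (executor memory, instance counts) would come from the
    # same file and be passed to spark-submit.
    .config("spark.sql.shuffle.partitions", cfg["shuffle_partitions"])
    .getOrCreate()
)

orders = spark.read.parquet(cfg["input_path"])
orders.filter("STATUS = 'COMPLETE'").write.mode("overwrite").parquet(cfg["output_path"])
```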
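
Finally, for point 5, an orchestration layer is what supplies retries, notifications, and ordering. The sketch below uses Apache Airflow as one example of such a scheduler; the scripts, schedule, and addresses are hypothetical.

```python
# A sketch for point 5: an Airflow DAG that gives a daily pipeline automatic
# retries, failure emails, and enforced ordering between steps.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                          # restart failed tasks automatically
    "retry_delay": timedelta(minutes=10),
    "email": ["dataops@example.com"],
    "email_on_failure": True,              # notify when a run fails
}

with DAG(
    dag_id="orders_daily",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",         # run once a day at 02:00
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest",
                          bash_command="spark-submit ingest_orders.py prod")
    merge = BashOperator(task_id="merge",
                         bash_command="spark-submit merge_orders.py prod")
    summarize = BashOperator(task_id="summarize",
                             bash_command="spark-submit build_summary.py prod")

    ingest >> merge >> summarize           # each step waits for the one before it
```

None of this removes the need for monitoring, but it is the kind of production tooling the sandbox work never had to provide.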

