Big Data With Thought Leadership and Prioritizing ROI Over Data Collection

I was doing a project involving tons of data. The company was spending millions to analyze this data. But after almost three years, there was no ROI. Here's why.


Imagine: it's somewhere in 2014-2016, when big data was being hyped as a source of big business success. Fast-forward to 2016 and 2017, when we started seeing headlines about big data having no ROI, followed soon after by many posts analyzing what went wrong. Here, I'll talk about some of my real experiences with this.

I was working on a project involving tons of data. The project sourced data from a vendor into a data lake and analyzed it to compare the performance of the company's own product against other products in the market. The company was spending millions of dollars to get this data, process it, and analyze it using big data technologies.

But after almost three years, this investment had not shown any ROI. Here's why.

Isolation in Work Culture

In big organizations, this is a very common issue. Not only do business units work in isolation; even individual projects run in isolation. It's very complex to bring all business units together, but once isolation starts to keep the organization from moving towards its goals, there's a problem. When it comes to big data, sharing is a best practice, because the data moves fast, is huge, and requires quick action. Isolation kills the zeal of big data initiatives.

No-Action Data Analysis

When I joined the project, I was keen to know how it was going to help the business. The project had been running for a few years, and I wanted to know if there were any visible business benefits. In meetings with product owners, managers, and others, I asked about the realized benefit from the data analysis. It's been six months now, and I still haven't gotten a concrete answer. It was very clear that some analysis of the data was being done, but no action had been taken from that analysis. And if the project is three years old, that's $3+ million lost.

Lack of Skill Sets and Architectural Insight

The project was staffed with people very new to big data and was running without any architectural insight, so a lot of manual work went into running, fixing, and re-running jobs. That's why none of the jobs had been moved to production even after 2.5 years. The jobs were also taking a very long time to run; the suggestion was that the MapReduce engine was to blame, and a move to Spark had been proposed.

I looked into the code, added a few performance optimization parameters to the job, and was able to reduce runtime by almost 25%. Spark would be able to reduce runtime by about half.
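
To give a concrete flavor of both fixes, here's a minimal PySpark sketch of the proposed migration path with the kind of tuning knobs I mean. It's illustrative only: the table, columns, and setting values are hypothetical stand-ins, not the project's actual job.

```python
from pyspark.sql import SparkSession

# Hypothetical job: aggregate the vendor feed landed in the data lake.
spark = (
    SparkSession.builder
    .appName("vendor-feed-aggregation")
    # Right-size shuffle parallelism instead of relying on the default 200.
    .config("spark.sql.shuffle.partitions", "400")
    # Broadcast small dimension tables instead of shuffling them (50 MB cap).
    .config("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
    .enableHiveSupport()
    .getOrCreate()
)

# The same Hive query the MapReduce job ran, now on Spark's engine.
daily_perf = spark.sql("""
    SELECT product_id, trade_date, SUM(volume) AS total_volume
    FROM data_lake.vendor_feed
    GROUP BY product_id, trade_date
""")

daily_perf.write.mode("overwrite").saveAsTable("analytics.daily_product_perf")
```

Even settings this simple are the sort of thing an experienced architect would have baked in from day one.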

Which Tools and Technologies to Use

On the analytics side, SAS was being used. With big data in mind, the company was trying to move to SparkML or Revolution R. The project was not getting the go-ahead for SparkML from the directors because the focus was on making Rev-R the enterprise analytical tool. I checked with the ML lead about what the problem was with Rev-R. Apparently, R has a different execution model; it's not like SAS. I suggested we adopt it anyway. I think Rev-R is cool, it has a stronger library than SparkML, and the project wasn't doing any real-time analytics. So I guessed the problem was somewhere else. Again, it was not clear what was going on or what would be stable in the long term, and no one wants to take the risk of investing hard dollars and ending up nowhere.
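
For context, here is what the SparkML side of that debate looks like in practice: a minimal, hypothetical pipeline (the feature table and column names are made up) showing the assemble-then-fit style that differs from SAS's step-by-step scripting, much as R's execution model does.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = (
    SparkSession.builder
    .appName("sparkml-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical feature table built from the product-performance data.
train_df = spark.table("analytics.product_perf_features")

# SparkML expects all numeric inputs packed into one vector column.
assembler = VectorAssembler(
    inputCols=["total_volume", "avg_price", "market_share"],
    outputCol="features",
)
lr = LinearRegression(featuresCol="features", labelCol="next_day_volume")

# A Pipeline keeps feature prep and model fitting as one reproducible unit.
model = Pipeline(stages=[assembler, lr]).fit(train_df)
predictions = model.transform(train_df).select("product_id", "prediction")
```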

There weren't even any guidelines to help decide when to use Hive/MapReduce jobs or when to go for in-memory processing. 
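
Such a guideline doesn't have to be elaborate. One hypothetical rule of thumb it could start from: run one-pass batch transformations on the disk-based engine, and reserve in-memory processing for data that is read repeatedly. A sketch of the distinction (table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("engine-choice-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

feed = spark.table("data_lake.vendor_feed")  # hypothetical table

# One-pass batch aggregation: read once, write once. A disk-based
# Hive/MapReduce job handles this style of workload perfectly well.
feed.groupBy("product_id").count().write.mode("overwrite") \
    .saveAsTable("analytics.product_counts")

# Iterative access: the same data scanned again and again is where
# in-memory processing pays off. Cache once, then query repeatedly.
feed.cache()
for days in (7, 30, 90):
    feed.filter(f"days_old <= {days}").groupBy("product_id").count().show()
```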

Missing Use Cases

I also found that in meetings and elsewhere, big names were always invoked to propose new technologies: Facebook is using this, LinkedIn is using that. But the real questions were always missed: What is Facebook's use case? What is LinkedIn's use case? And most importantly, what is this organization's use case?

If we take these big organizations as models for big data transitions, then we should understand that they contribute heavily to open source to solve their own issues and bottlenecks, which might not be the case with this organization.

Very Tight Governance

Let's go a little deeper into the problem. As I mentioned earlier, the organization was divided into multiple units, each with its own tools, technologies, and practices for implementing big data projects. It takes a lot of effort to get data from one unit to another when it is needed for any kind of learning or analysis: there is a long approval cycle, plus restrictions on who can and cannot get the information. I researched big data governance models across many sources, and the consistent recommendation is not tight governance but intelligent governance for big data.

The Crisis Has Arrived

The same organization was now wavering between carrying on with big data technologies and rolling back to Netezza or Teradata. On the big data roadmap, the directors were divided. A few had clearly avoided big data technologies altogether and come out happy with foolproof, stable technologies.

But again, the organization was confused: if we roll back from big data this time, moving to it again might not be possible. So, they had to make a decision. They chose to take the initiative to look into the gaps, and they decided to migrate from the older version of the big data technology stack to a newer one, as the vendor had proposed more stable versions along with suggested use cases.

But again, to me, it looks like the right initiative has not been taken. I think the following things are missing from the renewed effort:

  1. Insight into the use cases for big data. What are they going to do differently with this big data? How is it different from the data they already have in the current data warehouse?

  2. Even where use cases are known and metrics are generated, there is no plan to bring them into action.

  3. Filling gaps in roles, responsibilities, and skills.

  4. Any focus on improving project quality.

  5. There is nobody to review what has been delivered. The jobs run and produce output by whatever means, sometimes taking five to six days and going through multiple failures.

  6. Expectations are too high. The assumption is that the earlier version was simply unstable and that the newer version will solve all of their problems.

  7. Even if they do move to a stable version, there's no plan to fill the gaps between business units.

  8. There are no plans to bring business and implementation teams together.

One more point worth mentioning: the organization has engaged many vendors as implementation partners, yet there is no clear, agreed-upon roadmap.

For me, the missing piece in this organization is thought leadership. Big data evangelism was never considered. The effort started with project implementations and no clear business strategy, because it was treated as a DWH migration to new technology.

When it comes to big data, the transition should be treated as a change in enterprise architecture. It should involve not only changes in technology but also changes in the work culture and mindset of the organization. Rather than IT, the transition should be driven by the business, with a clear strategy and roadmap. The corporate culture should move towards quick decision-making, an innovative work style, and intelligent governance. In short, an enterprise architecture is required to bring big data into the enterprise. Otherwise, the organization will keep piling data into data lakes with investments that have no returns.

The main idea of this article was to give insight into the areas we should think about before jumping into a big data project. I hope you enjoyed reading it!
