What You Need to Know Before Embarking on Your Next Data Analytics Project
Most big data projects never get beyond experimentation or piloting, and many are eventually abandoned. Check out these tips to avoid falling into that trap.
Gartner predicts that through 2017, 60% of big data projects will fail to go beyond piloting and experimentation and ultimately will be abandoned. In this article, we’ll outline the potential pitfalls that account for this low success rate and suggest ways you can address them.
Several internal challenges impede success: data-related, team-related, and executive-level.
There are three main challenges to consider here...
Disparate Data Sources and Data Silos
The typical enterprise that we encounter has many internal databases, sometimes even in the thousands, that don’t talk to each other. They also lack a robust API layer that enables programmatic access to data. Instead, they steer towards data warehouses where big data is collected and stored in one place before it’s used for analysis. Data warehouses have their place but are often overrated. Costly and time-consuming to implement, data warehouses also add to your data glut and introduce delays into the analytics process.
Data Warehouses Are Not the Only Option
Data warehouses can also increase your storage costs and make the data analytics process more laborious. Each time you need access to the data, the warehouse extracts it into data cubes, creating duplicate data that can quickly grow your data surplus, a phenomenon sometimes referred to as data explosion. Another downside of the warehouse approach is that the data-cube data isn’t updated automatically, potentially leaving you with stale data that creates noise in your system and makes the data cleaning process more protracted.
Dirty or inconsistent data is also a prevalent issue. Empty data fields, misspellings, and so on can mess up your insights. Data inconsistencies also create a problem. For example, a company’s marketing system may define the northeast U.S. in a way that includes Washington, D.C. However, the sales team categorizes D.C. as the Mid-Atlantic. When these datasets are merged, you have a problem. You can buy tools that can pull data out of disparate systems, but these typically don’t flag data conflicts.
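The region mismatch described above can be caught before two datasets are merged. The following is a minimal sketch, not a production tool: the dictionaries, the state codes, and the `find_region_conflicts` helper are all hypothetical and stand in for whatever your marketing and sales systems actually export.

```python
# Hypothetical region assignments exported from two internal systems
# (illustrative data only, not taken from any real system).
marketing_regions = {"DC": "Northeast", "NY": "Northeast", "VA": "Mid-Atlantic"}
sales_regions = {"DC": "Mid-Atlantic", "NY": "Northeast", "VA": "Mid-Atlantic"}


def find_region_conflicts(left, right):
    """Return {key: (left_value, right_value)} for keys both systems define differently."""
    shared_keys = left.keys() & right.keys()
    return {
        key: (left[key], right[key])
        for key in sorted(shared_keys)
        if left[key] != right[key]
    }


conflicts = find_region_conflicts(marketing_regions, sales_regions)
for state, (marketing_value, sales_value) in conflicts.items():
    print(f"{state}: marketing={marketing_value!r}, sales={sales_value!r}")
```

Running a check like this before the merge turns a silent inconsistency into an explicit list that the data owners can resolve.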
Let's talk about roadmaps, expertise, and data fiefdom.
Need for an Analytics Roadmap
Data sprawl is a huge problem — users generate data at alarming rates. Without an analytics planner, things can get out of control. An analytics roadmap is needed.
Internal vs. External Expertise
Finding the experience to manage data analytics projects from the inside is tough. Any team embarking on data mining, predictive analytics, and predictive modeling must be very well versed in math and statistics. External experts bring valuable know-how and a better understanding of what’s needed. Generating that kind of know-how internally is typically beyond the reach of most organizations.
Fear and Fiefdom
Data fiefdom can also get in the way of success. Data owners don’t want to lose their relevance or control of the data they generate/own. Sharing their data sets with other teams or external consultants may expose their mistakes or even hinder their ability to do their jobs.
Leadership Team Challenges
And, of course, there are going to be leadership challenges.
The C-suite and managerial team also have a role to play in why analytics projects fail. Many business leaders trust the old way of doing things. Instead of basing their decisions on actionable results, they want to review each bar of the “green bar report,” which creates delays.
Lack of Continuous Involvement
Executives are rarely involved throughout the analytics process. They may be involved in the kick-off meeting but fail to re-engage until several months down the line when they find that the outcomes are not what they wanted or that their priorities have changed. Of course, the analytics team was kept in the dark. This approach is the antithesis of Agile.
Consider the following external challenges that make analytics projects fail.
The Big Bang Approach vs. Low-Risk Approach
Our viewpoints are often shaped by external influencers. It’s challenging to go against the flow. If everyone says you need to take the big bang, warehouse, multi-million-dollar approach to data analytics, people don’t even consider other options.
Pretty Visualizations vs. Actionable Insights
A lot of vendors impress you with pretty visuals that look great but aren’t actionable. Sexy always wins over quality. This is augmented by a lack of leadership involvement throughout the process. Without continuous involvement, business leaders may not understand what’s presented to them or realize that the outcome or product isn’t what they really needed. This approach also means that mistakes aren’t detected along the way. However, with an Agile approach, stakeholders are involved throughout the project and mistakes can be detected and remediated in real-time. A pretty visualization is nice to have, but not a must-have. It’s merely a means to an end: actionable insights.
Common Misconceptions About Predictive Analytics
Analytics is a one-time effort: False. You can’t just set everything up and reap the benefits. Done right, analytics involves a continuous feedback loop — it never stops.
It’s a black box effort: False. The project must happen transparently, in the open, with immediate, disciplined, regular feedback.
Actionable insights require big data: False. Some of the most valuable business insights are derived from surprisingly small datasets (starting small also costs less and minimizes risk). Instead of focusing on big data, we recommend focusing on the right data at the right time and finding a way to ask the right questions of that data.
There must be a tool for this: Unfortunately, there's not. There’s no such thing as a predictive analytics tool that you can install, press a button on, and marvel at your insights. Instead, many enterprises invest in more than one tool (some on-premise, some in the cloud), each of which requires customization. Not all of these tools will be future-ready, and some may need to be switched out over time. As your business needs change and digital transformation reaches a new step on the maturity ladder, your choice of tools will also change.
My data will give me 100% certainty: False. There is no 100% certainty. In many cases, a margin of error of 30-40% may be good enough.
Actionable insights are essential: Not always. Sometimes, the goal could be to achieve no actionable data from your efforts. For example, if you’re using data and predictive analytics to alert you of potential equipment failure in a medical environment, no data in the form of alerts means there’s nothing to report and all is well with the equipment.
New, Nimbler Approaches to Analytics
Though there isn’t an approach that avoids all of the above-mentioned pitfalls, there are methods that help to avoid at least some of them. We’ve introduced several best practices into our client engagements that break down these barriers and produce rapid, iterative, actionable insights and give management what they need without alienating data owners or breaking the bank.
A Software Development Approach
This is a novel way of looking at things, but data analytics projects have many parallels with software product development. Instead of delivering a piece of software, you’re delivering a data product and product/software development best practices still apply. Yet, most data analytics consultants and external vendors don’t have that product mentality. They prefer to focus on maximizing billable hours. This prevents the transfer of knowledge and intellectual property to the product owner who is going to run the tool, aka the customer.
Just as a proof of concept (POC) in software development must prove the unknown, the POC in an analytics project uncovers any impediments to delivering value in the fastest possible way. Once the POC is affirmed, you are led to your first minimal viable prediction (MVP).
An MVP approach is all about disregarding the noise and assembling only the data that correlates with your number one problem, as fast as you can, and iterating from there. While your executives or business sponsors don’t need to be involved all the time, be sure to schedule periodic, short feedback loops. Make it your goal to deliver small pieces of progress quickly. This will ensure they see immediate, incremental results, get exactly what they need, and stay more engaged.
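One way to picture "assembling only the data that correlates with your number one problem" is to rank a handful of candidate signals by how strongly they track the target metric. The sketch below is an illustration under stated assumptions, not the author's method: the column names and weekly figures are invented, and Pearson correlation is just one simple choice of measure. It also assumes no column is constant, since that would make the correlation undefined.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither sequence is constant (otherwise this divides by zero).
    return cov / sqrt(var_x * var_y)


def rank_signals(candidates, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(values, target) for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical weekly data; "churned_customers" plays the number one problem here.
churned_customers = [12, 15, 11, 20, 25, 22]
candidates = {
    "support_tickets": [30, 36, 28, 50, 61, 55],
    "newsletter_opens": [400, 390, 410, 405, 395, 402],
    "late_deliveries": [3, 4, 2, 7, 9, 8],
}

ranking = rank_signals(candidates, churned_customers)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A ranking like this only suggests where to look first. Correlation does not establish cause, so the weakest signals are candidates to drop from the first iteration, not proof that they never matter.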
The MVP approach goes against the grain of an all-in, high-risk, long-haul data analytics practice. It also keeps you focused and helps you avoid many of the pitfalls above. With this kind of discipline, you’ll find it easier to say no to frivolous feature requests, charts, graphs, etc. because you need something actionable ASAP.
Adopting an iterative approach rather than diving into your big data all at once can pay dividends and put AI-driven decision-making quickly within your reach. This sounds straightforward, but it can be tempting to jump into the data lake in front of you instead of just following small, relevant, iteratively assembled data breadcrumbs.
Unfortunately, a software development approach to analytics doesn’t solve the problem of disparate systems. You may have the right method, but if your systems don’t talk to each other, getting to the data is challenging.
Data ownership, or fear and fiefdom, often impedes data sharing or makes it difficult. While you can counteract fiefdom through evangelism (it behooves your organization to prove how sharing data creates value across the board), we’ve come up with a concept to aid in this effort. We call it “API-in-a-Box.” API-in-a-Box has another benefit: it allows systems to quickly talk to each other without a costly and time-consuming traditional system integration.
API-in-a-Box breaks down data silos without data owners fearing they’ll lose control of their data. Packing all relevant technology into a container and giving each department access to the data that’s relevant to them (or that the data owner feels comfortable sharing) via an API makes silos easy to overcome. An API-in-a-Box can be spun up in days, eliminating the time-consuming data integration problem. Plus, after data errors are found and one department’s data is merged with another’s, actionable insights start to emerge and the barriers of fear and fiefdom start to break down.
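The article does not describe how API-in-a-Box is built, so the following is only a rough sketch of the underlying idea: the data owner decides which fields leave the department, and the API returns nothing else. The records, the field names, and the `shared_view` and `api_response` helpers are all hypothetical, and the container and HTTP layers are deliberately left out.

```python
import json

# Hypothetical records owned by a sales department (illustrative data only).
SALES_RECORDS = [
    {"account_id": 1, "region": "Mid-Atlantic", "revenue": 120000, "rep_notes": "renewal at risk"},
    {"account_id": 2, "region": "Northeast", "revenue": 87000, "rep_notes": "upsell pending"},
]

# Fields the data owner has agreed to share; everything else stays private.
SHARED_FIELDS = {"account_id", "region", "revenue"}


def shared_view(records, shared_fields):
    """Return copies of the records containing only the owner-approved fields."""
    return [
        {field: value for field, value in record.items() if field in shared_fields}
        for record in records
    ]


def api_response(records, shared_fields):
    """Serialize the shared view as the JSON body an API endpoint might return."""
    return json.dumps(shared_view(records, shared_fields), sort_keys=True)


print(api_response(SALES_RECORDS, SHARED_FIELDS))
```

Keeping the allow-list in the owner's hands is the design point: other teams get programmatic access, while the owner can widen or narrow what is shared without a system integration project.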
Internal vs. External Data Sources Concept
We also recommend moving away from internal vs. external data sources concepts. For your analytics project, it shouldn’t really matter where your data comes from. Use as many relevant data sources as you need and as few as you can. Try to create and maintain as little data as is necessary. If someone is already collecting, cleaning, and making that data available, why should you replicate their efforts? Data can be readily leased or purchased. For example, weather data is available for purchase and is much cheaper than collecting your own. Some companies generate so much data that it even generates new business opportunities.
Externally collected data can even have monetization value. One client assembles sports data from national leagues, packages it into an API, and licenses it. All the data is created and maintained by the leagues; the company simply packages that data and provides it, for a fee, to others who may not have the resources to assemble it themselves.
New Technologies Enabling these New Approaches
Getting started on any data analytics project can take several months – all that data gathering, cleaning, structuring, model building, enhancing, and reviewing takes time. However, emerging technology is enabling nimbler approaches. Cloud services, for example, have revolutionized analytics. Everything you need to crunch data is available out-of-the-box. Prior to these developments, you’d have to work with Software-as-a-Service (SaaS), manage your own data center, and deal with non-user-friendly processes. Quite frankly, it was a pain. Here are just a few examples of the numerous cloud services for analytics:
- Microsoft: Machine Learning, Azure Analysis Services, and Video Indexer.
- Amazon Web Services: QuickSight, Deep Learning AMIs, and Rekognition.
- Google Cloud: Cloud Dataflow, Cloud Speech API, and Cloud Machine Learning Engine.
- GovCloud: Provides additional security protections with the same user-friendly tools.
Cloud services are democratizing analytics and putting actionable insights into the hands of IT and non-IT teams in ways that weren’t previously possible.
Of course, data analytics is more than just best practices and data integration/storage/crunching tools. Data scientists often need access to huge computational power quickly, on very short notice, and for a short period of time. Likewise, proximity and access to cognitive services like artificial intelligence (AI) and machine learning in the cloud (Microsoft Azure and Amazon Web Services) are needed in unplannable ways.
Technologies such as Kublr enable data scientists to move huge amounts of data between clouds, the data center, or wherever your data needs to go. This provides unprecedented access to data and compute power that can scale up and down as quickly as you need it.
Food for Thought as Your Data Analytics Projects Evolve
Hopefully, these best practices and technology insights have provided some food for thought about how you approach your next data analytics project.
But don’t stop with your first MVP. As your organization’s analytics capabilities mature, you’ll be able to incrementally feed your engine more data, in terms of both volume and diversity. First, you’ll start understanding simple correlations, then you’ll start getting information, then predictions, and finally recommendations. As your system matures, it will slowly turn into an AI engine.
One final area to ponder is advancements in AI and how they factor into your BI strategy.
More and more decision-making is being delegated to machines. Whereas in the past machines were relied upon for alert-only notifications, now they enable real-time, context-aware decisions. For example, AI monitors and auto-corrects manufacturing processes, helps the C-suite make strategic decisions, and helps marketers determine which promotions to offer.
AI can seem overwhelming. But just as you approach your next data analytics project by starting small, iterating constantly, and providing regular feedback to business sponsors, apply the same baby step approach to AI.
Published at DZone with permission of Wolf Ruzicka. See the original article here.
Opinions expressed by DZone contributors are their own.