To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, "What are the most common failures you see in big data initiatives?" Here's what they told us:
Lack of Security
- As it applies to security, one failure is putting security policy definitions in the hands of the developer to begin with. Push the burden down to the person responsible for the data. The team defining the policy should be the one owning the liability and the burden. Take policies and push upward. Security needs to enforce policy, not developers. Boiling the ocean is still a problem.
- There is a struggle around data scientists' paranoia about data security: compliance, SARBOX. Data governance acts as a gatekeeper around the data. That creates too much of a barrier to getting at the data and being agile.
- The risk is higher in big companies. It comes down to an internal process. Enterprises become complacent. More experienced employees can become lax. Companies need to be more worried today because things are happening quicker, and attacks are more sophisticated. Bad actors have many more means to hide their presence.
- 1) Most failures emanate from a difficulty in integrating business processes and applications. 2) Management resistance, internal politics, lack of skills, security and governance challenges. These are the same reasons companies look for a big data platform. It enables them to overcome technical hurdles: integration connecting data repositories with no need to build connectors (they are already built and tested), different data formats (over 350), and cognitive analytics.
- The biggest challenge in big data initiatives, like in all data analytics projects, is the recruitment of qualified employees. As easy as the tool can make the work of analyzing data, some background knowledge and experience are still necessary. Not having people with the right skills to drive a big data analytics project might lead to a complex path and possibly to failure.
- One of the biggest issues in big data is that organizations try to code everything themselves. This is behavior that never seems to end. Back when relational databases were still new, companies would ask themselves, “Why should we buy Oracle when we can build a database ourselves?” Many organizations mistakenly employ the same approach with big data. They soon discover they’re not Google, Facebook, or Netflix. They don’t have the technical expertise or engineering capacity to keep up with all the changes today’s data-driven economy requires. The problem is particularly devastating for legacy companies that weren’t born digital. Their data projects stall as soon as they come face-to-face with the full size and scale of their integration, people capacity, and other needs. It just turns out to be much harder than they think, which is why they need automated platforms.
- 1) Integration is key in big data initiatives, and if organizations don’t equip themselves with a solution that allows for that, customers will go elsewhere. Data silos are a surefire way to experience failure with your big data initiative. The isolation of data makes it impossible to glean insight and patterns. But when everyone on the team has access to the same information and insight, decision making across the organization is aligned for better overall results and an improved bottom line. 2) Another common failure is a lack of a skilled services team. Skilled data analysts are imperative for a successful big data initiative. Knowing how to digest the data, distill it on a granular level, and translate it into actionable insights is critical for success when it comes to big data.
No Clear Business Goals
- Companies focus on collecting data but they're not able to answer questions from the beginning. The business strategy is to collect everything because I may need it later. Data lakes have massive amounts of information, the majority of which isn’t needed. Companies are taking on data projects assuming they have the data and in a format that's needed. One to two years in they realize that they don’t. There's a disconnect between technology and business reasons for collecting the data. Master data management (MDM) can identify the relationship between data. In a live system, you can push the golden record back to get all of that data. Look at data, identify relationships, and feed relationship metadata in to get more value from other data.
- There's been a lot of experimentation with technology and how to use it over the last eight years: "Watson learns ways that don’t work." Flip the switch to get more scalability and breadth of use. According to Gartner, “through 2019, 90% of information assets will be siloed and unusable across multiple business processes.” Stop thinking about experiments and get back to identifying classic business problems and using data to find solutions. The hype is over; there are great compute and data capabilities. It’s about the business and not the technology.
- Need to be transparent and learn from mistakes. The digital path is a hodgepodge improperly designed from a security perspective. Even if security is not the biggest problem, jumping into the project headfirst can set the wrong expectations and metrics. Whether it's a virtual data lake with real-time analysis or a traditional data lake with long-term data analysis, figure out the problem to be solved before deciding on the technology you choose.
- While the number of big data initiatives keeps growing, over half of them will not be successful in delivering business value. Top common failure scenarios are: 1) Not having a clear business objective – Conducting big data analytics just because of its hype often leads to failures as businesses jump into “the how” without defining “the what” and “the why.” This lack of objectives can cause some of the biggest failures. 2) Not having a way to act on the insights – Many times the big data/data science teams come up with insights that cannot be acted upon. For instance, the analytics cannot be implemented in a scalable fashion in production without causing significant latency issues, which are unacceptable to the business. Hypotheses are only as good as the actions that can follow. The lack of a scalable system that operationalizes these machine learning models is one of the biggest impediments leading to big data project failures.
- Not having clear, precise goals for any data project is a common failure. Too many people, even today, hear big data and immediately think of Hadoop. Or, as stated elsewhere here, just think about an analytics infrastructure rather than seeing operations/transactions as being a critical piece of any big data infrastructure. Underestimating demand for data and the challenges of trying to scale up a compromised architecture after the fact to try to deal with a larger demand and a broader user base is also a common failure of some big data initiatives. The correctness of data is also something that should be built into a new initiative from the start. How do you ensure you are starting with correct data and it stays correct throughout all the processing, especially when there are changes being made to the data (through transactions, updates, etc.)? In an always-on world, eventual consistency of different data stores is becoming more difficult to accept so thinking about full/strict consistency for data across the globe or across the organization should be a consideration for all big data initiatives.
- 1) There are a couple of classic failure modes for big data initiatives. All of the factors we discussed as essential to a successful program have a converse side – when they are missing you have failures. Lack of clear goals and mandate is a top cause of project failure. We’ve seen many big data initiatives fail through lack of data model flexibility. Overlooking data sources or even overlooking the need for interoperability altogether is another frequent cause of failure. 2) Beyond this, the most common failure we see is related to scaling and performance. Data volumes and velocity often grow well beyond what projects imagine, even if they understand they are “big data.”
Inability to Scale
- People don't really understand how the technology scales, and this affects how to build the data model to take advantage of it. To achieve scalability, you need to build your application a certain way.
- Start with a small dataset, then become familiar with the business need and go into production to get value out. If you pick the wrong tool, can it scale? You can spend six months to a year with an initial tool or database, get stuck, and never get to production because it would not scale. If you go into production, look at how quickly you can load the data. Finish in minutes versus days. Start over, talk with other vendors. Get proof that the tool or database will do what you need. Everyone has been burned. Time to value. We are asking better questions to understand the building blocks in the infrastructure. This helps the discovery of value for a customer. Scale and size of data are difficult for a vendor. Need to be able to understand. Tell me what you are going to do to solve my business problem.
- We were not able to use Spark or Hadoop earlier in this decade. Our data volumes caused out-of-memory (OOM) errors and failed syncs. However, with the open source community creating better data flow processes and memory management, these tools have become more usable at petabyte scale.
- A challenge a lot of projects have is that it's hard to test scale, volume, and variety within a PLC. How do you test a petabyte system? How do you work through this? You are almost building production systems to build a test. Failure at scale is a theme I’m seeing a lot. We have to get to a referential scale where everyone can go and look and check the box. How does the industry learn about scale, especially for emerging vendors?
- 1) Scope. This one thing you can do with AI. Know what you are getting into and what the capabilities are – people, process, and technology. Know what your people are capable of. What training have you put in place for them to be successful? 2) Security and data governance. If you don’t know the lineage of the data and can’t ensure it’s secure, you are asking for trouble. Must be able to verify the accuracy of what I’m doing with the data. Need to understand the data; it’s all wildly different. 3) Able to identify when something bad has the potential to happen. That’s the expectation, but it’s a long way off. The failure is improper expectations and not being realistic. Understand what you are doing and the questions you are asking. If you are too narrow or too specific for the data, you will never get to where you want to go. Set expectations early.
- One of the most common failures we see in big data initiatives is companies choosing technologies which fundamentally cannot provide the performance they ultimately will require. For example, a common, challenging scenario is a company that chose a distributed NoSQL disk-based database that is eventually consistent for high-value applications such as inventory management. As their business expands, they then find that the distributed nature of the database allows them to expand but they have trouble accessing the data easily without SQL, the system is not fast enough because it is disk-based, and the system loses transactions because it is eventually consistent, not strictly consistent. In these cases, a company can find that it has a large dataset deployed on a platform which can never provide the speed and accuracy it requires to run its business.
- In an unstructured world, we need to recognize what information is important. Inventory the data and understand how to recognize the value and the need to secure it. Organizations do data dumps and it causes risks and failure in success rates. The first step is to inventory information, what you have, its value, and the business process for use. Two petabytes of information are two billion documents. 100 TB is 143,000 documents which take 2,000 man-years to process. We can reduce that to seven man-years or 2.5 people over three years. That's still a long time but it's doable. Organizations don’t take the time necessary to understand, categorize, and tag their information. That’s where people are failing. People just shortcut and dump or keep. Mismanagement is not knowing what you have.
- One of the most common failures is trying to tackle monumental tasks in-house with a small team. As more companies embrace digital and cloud-based models, there’s still a struggle to move the needle with people as part of the equation. CIOs might oversee a group of frustrated and overworked IT staff and come to discover that using outsourced expertise can ensure in-house professionals do more than just keep the lights on.
- 1) Getting access to data. Failure patterns decide what questions get asked and how to find data. Today, it takes lots of data available to power users without restricting access so you can figure-in new and innovative uses. Need to be in control but more open. 2) If you segment off stats with business owners they might find something for the sake of finding something that is not valuable.
- Focus on improving, moving forward, and trying again. Ability to iterate. Revisit failure in a short timeframe. Move quickly to the second, third, and fourth iteration. Iterate continuously throughout the process. Move quickly and efficiently with smaller pieces to get value along the way.
- A blocker to big data success is when you have infrastructure challenges that limit data collection: infrastructure or legacy systems unable to collect all the data needed to solve the problem. Data sits in disconnected realms, physically unable to connect or with restrictions on sharing. Understand how the data relates to each other. Projects start collecting all the data, but digging out the value from all the data is blocked. Look at the data in a lot of different ways. Keep on working and trying different things.
- Engagement has failed, or people are taking incorrect actions. Plan on how to reach through the entire enterprise. Drive understanding of data and dissemination of the story. Companies have spent a ton of money and time to become more data-driven, but it’s not leading to more data-driven decisions. There is an inability to communicate at scale with an individual user.
- Organizations have data swamps because they ingest a lot of data that sits unused. They are unable to find the data they need, they can’t open it, and they’re paying for a lot of unusable data that keeps growing. We automatically tag data profiles and make sense of it. Without the data properly tagged or knowing the variables or the location of the data you can’t get value from it. If data is bad it’s hard to reverse engineer advanced analytics. Find what you need, make sense of it, decide if it’s worth using.
- There are a lot of advantages to big data, but organizations have to improve their maturity in using it. Many organizations are still in the early stages of learning. Make big data easier for enterprises to use, configure, and deploy. As companies standardize, start with infrastructure; the second phase is data governance, security, and access control; the third is application performance management to get more out of the infrastructure. Think through all three or you will constantly be fixing them. You can’t trust the platform if the infrastructure doesn’t support it. If you don’t have the right governance and control, organizations don’t want to risk their data with you. The platform needs to be open to everyone. When they write a query and response time is not what was promised, it creates a gaping hole. No one is willing to wait. You need to think through all of the nuances. If they don’t trust the platform, the process, and the data, they won’t come back to it.
- The most common failures are “bootstrap,” unfunded projects that start with some open source platform because the software is “free.” Organizations end up spending thousands, if not millions, more in time, scarce resources, training, and developing standard features that proven, purpose-built commercial data analytical platforms already provide. Beyond the financial costs, organizations lose the most in opportunity costs tied to the promise of big data analytics initiatives. Also, millions of developers and data scientists are increasingly adopting Python and R as de facto languages for ML model development. However, the problem is that, without an analytical database with ML built in, data scientists are building models based on a subset of data, so organizations need to be concerned with the accuracy of those models.
Here’s who we spoke to:
- Cheryl Martin, V.P. Research Chief Data Scientist, Alegion
- Adam Smith, COO, Automated Insights
- Amy O’Connor, Chief Data and Information Officer, Cloudera
- Colin Britton, Chief Strategy Officer, Devo
- OJ Ngo, CTO and Co-founder, DH2i
- Alan Weintraub, Office of the CTO, DocAuthority
- Kelly Stirman, CMO and V.P. of Strategy, Dremio
- Dennis Duckworth, Director of Product Marketing, Fauna
- Nikita Ivanov, founder and CTO, GridGain Systems
- Tom Zawacki, Chief Digital Officer, Infogroup
- Ramesh Menon, Vice President, Product, Infoworks
- Ben Slater, Chief Product Officer, Instaclustr
- Jeff Fried, Director of Product Management, InterSystems
- Bob Hollander, Senior Vice President, Services & Business Development, InterVision
- Ilya Pupko, Chief Architect, Jitterbit
- Rosaria Silipo, Principal Data Scientist and Tobias Koetter, Big Data Manager and Head of Berlin Office, KNIME
- Bill Peterson, V.P. Industry Solutions, MapR
- Jeff Healey, Vertica Product Marketing, Micro Focus
- Derek Smith, CTO and Co-founder and Katie Horvath, CEO, Naveego
- Michael LaFleur, Global Head of Solution Architecture, Provenir
- Stephen Blum, CTO, PubNub
- Scott Parker, Director of Product Marketing, Sinequa
- Clarke Patterson, Head of Product Marketing, StreamSets
- Bob Eve, Senior Director, TIBCO
- Yu Xu, Founder and CEO, and Todd Blaschka, CTO, TigerGraph
- Bala Venkatrao, V.P. of Product, Unravel
- Madhup Mishra, VP of Product Marketing, VoltDB
- Alex Gorelik, Founder and CTO, Waterline Data
Opinions expressed by DZone contributors are their own.