To gather insights on the state of big data in 2018, we talked to 22 executives from 21 companies who are helping clients manage and optimize their data to drive business value. We asked them, "What are the most common issues you see preventing companies from realizing the benefits of big data?" Here's what they told us.
- Depending on traditional systems. The people aspect is real, as expertise is needed to leverage big data systems. Companies must work out how to enable existing employees to use the data. Find a mix of people who can solve problems together. Have a desired end state but start small. Get wins. Stay focused. Have a thoughtful, methodical implementation.
- Inability to deal with the shifting sands and technical debt of legacy systems and new software.
- Willingness to embrace the cloud. Understand that there are multiple ways to approach it. It is not feasible to keep supporting legacy enterprise systems, because they cannot scale with the influx of data.
- Setting up the right backbone infrastructure (i.e. storage, transport, compute, failover). Getting data delivered from servers to analyze. How to deal with datasets. Scale, complexity, modeling.
- As organizations try to build big data projects, they are often unable to execute them successfully because budgets are constrained, they lack the right talent, they fail to adopt agile procedures, and they want to reuse their existing infrastructure. As a result, the business initiatives that depend on the big data foundation are often implemented in regional or line-of-business silos, ultimately failing to deliver a return on investment (ROI) or taking much longer to achieve results. Initiatives are sometimes limited by ideas and resources, or by lengthy delays in getting from new ideas to execution. Organizations also often fail to analyze big data because of its complexity, which is, in some cases, linked to a lack of data analysts and other IT professionals to help interpret the data.
Lack of Knowledge
- They don’t understand the cloud. They will do a “lift and shift,” adopting Infrastructure as a Service without gaining any efficiency because they do not understand the benefits. They’ll kill their IT department and end up outsourcing management of the cloud to a third-party provider, still not understanding the potential efficiency gains. It should be more like Salesforce, where they use the cloud for features, scalability, performance, and storage savings. Elastic cloud will scale up and down. You must use an SQL server network and other components to scale instantly. Public cloud providers are now providing cognitive and AI/ML services.
- While everyone is excited about big data, there are still some common issues that prevent some companies from realizing its benefits (although these are getting less and less pressing):
  - A zoo of technologies that make it hard to choose which one to bet on.
  - Lack of technical talent.
  - Organizational roadblocks on adopting common data formats.

  Our recommendation for companies that are early with big data adoption is to pay attention to the new wave of technologies, especially data streaming technologies such as Apache Flink to avoid being left behind as a result of using already-outdated big data technologies that weren’t built for real-time applications.
- They like the promise of big data but do not understand specific use cases. There’s a lack of buy-in from the different lines of business, or a lack of specific business drivers. There’s also a lack of understanding of the best technology for the job, be it a data lake, platform, cloud, or software. It’s a complex decision and one that changes daily with all of the new solutions being introduced. It’s not a good idea to rebuild a data warehouse in a Hadoop data lake. Skillsets are less of an issue because of public cloud toolsets, but you still need to understand the use case and the best tools to accomplish your goals.
- The customer might understand the potential benefit of big data based on what they see their competitors doing, but they don’t know where to start. If they stick with the same tools, the same data sources, and the same knowledge, they’re not going to get anywhere. Tap into new talent and tools to solve the problem. I see a lot of cases where the project doesn’t go well and the company walks away from its big data initiative. They need someone on board who knows how to approach big data projects. The first and foremost challenge is fear of the unknown (usually expressed as fear of change), but there are a number of other challenges, including those that I mentioned before, such as ensuring that data analytics meets ethical requirements and regulatory and legal frameworks, and the ever-present challenge of acquiring, retaining, and growing the proper talent in the data science arena.
Business Problem Definition
- Start with the application and the use case and work from there. You cannot treat data as an afterthought. Key to success is laying the groundwork for many applications. Pay attention to the underlying data store and data fabric. Look at the volume, variety, and velocity of a few solutions to solve for reality: mission-critical, multi-location.
- The proliferation of technologies and solutions in the space. Companies start with Hadoop and then realize they need different storage and streaming, which leads to Spark. The time spent configuring and managing open-source components in one place can hurt the ROI of your project. We recommend understanding what the best solution is for the problem you want to solve. Look for out-of-the-box solutions to reduce configuration and management time.
- Not understanding that big data analytics is a set of tools and technologies that must be selected and applied for measurable outcomes. For measurable outcomes, companies must apply enough rigor to the documentation and analysis of what they're trying to achieve. Companies must then base the selection of the tools and technologies on their capabilities to meet or exceed the desired outcomes. I’ve seen too much "download it, install it, use it," or "try without a defined purpose in mind." In technology, we don’t generally apply a cost to materials when tackling projects with the assumption computing power is available — unlike building a house, for example. However, we burn time on endeavors that are too often unfruitful or ill-fated due to little upfront planning.
Data Quality and Management
- Ability to get their head around the data. Move data from storage to compute and back as needed.
- Lack of focus on metadata — not looking at the problem holistically.
- The systems being used to record data in the first place. No easy way to get data out for comparisons. Data silos by schema and implementation. Inconsistencies in systems and schemas. We normalize data across all systems and schemas.
- One of the biggest challenges is their ability to use all the data that they have without a lot of manual and time-consuming processes to copy the data to where the analysis is happening. Moving data is very expensive and time-consuming.
- Unorganized or unstructured data collection and processing. For NLG, in particular, the narrative output is often limited by the cleanliness of the data input.
- Inability to scale up concurrently in Hadoop. Query engine with single threading. Security’s ability to conform to GDPR. Process technology in place to delete records. Leave data in place — local administrators can know local laws. Prevent queries that may break the law.
- Slow, manual, one-off efforts that are discarded. Too much time spent finding data. No common authoritative set of data assets for everyone to use. Preparing and cleaning data takes weeks, leaving insufficient time for analytics. Data lakes become data swamps from data that is inaccurate, incomplete, and without context.
- Complexity in the technology stack. Retailers want real-time information from shopping carts and 12 months of purchasing history, so they stitch three or four systems together. More moving parts result in more opportunities for breakage and latency. Help simplify the data pipeline for greater availability. Architect the enterprise’s data so that it’s able and ready to scale.
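One respondent above mentions normalizing data across systems and schemas. As a rough illustration only, a minimal Python sketch of that idea might look like the following. The source names (`crm`, `billing`), the field names, and the `normalize` helper are all hypothetical and are not drawn from any respondent's product.

```python
from datetime import date

# Hypothetical mappings from each source system's field names to a shared schema.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "signup": "signup_date"},
    "billing": {"CustomerID": "customer_id", "created_on": "signup_date"},
}


def normalize(record: dict, source: str) -> dict:
    """Rename a source record's fields to the shared schema and unify value types."""
    mapping = FIELD_MAPS[source]
    # Keep only the fields that the shared schema knows about.
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    # Store IDs as trimmed strings and dates as date objects in every source.
    out["customer_id"] = str(out["customer_id"]).strip()
    out["signup_date"] = date.fromisoformat(out["signup_date"])
    # Record where the row came from so it can be traced back to its silo.
    out["source"] = source
    return out


crm_row = {"cust_id": 42, "signup": "2018-01-15"}
billing_row = {"CustomerID": " 42 ", "created_on": "2018-01-15", "plan": "pro"}

unified = [normalize(crm_row, "crm"), normalize(billing_row, "billing")]
```

Once both rows share field names and value types, they can be compared or joined directly, which is the comparison step the siloed systems make difficult.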
Here’s who we spoke to:
Emma McGrattan, S.V.P. of Engineering, Actian
Neena Pemmaraju, VP, Products, Alluxio, Inc.
Tibi Popp, Co-founder and CTO, Archive360
Laura Pressman, Marketing Manager, Automated Insights
Sébastien Vugier, SVP, Ecosystem Engagement and Vertical Solutions, Axway
Kostas Tzoumas, Co-founder and CEO, Data Artisans
Shehan Akmeemana, CTO, Data Dynamics
Peter Smails, V.P. of Marketing and Business Development, Datos IO
Tomer Shiran, Founder and CEO and Kelly Stirman, CMO, Dremio
Ali Hodroj, Vice President Products and Strategy, GigaSpaces
Flavio Villanustre, CISO and V.P. of Technology, HPCC Systems
Fangjin Yang, Co-founder and CEO, Imply
Murthy Mathiprakasam, Director of Product Marketing, Informatica
Iran Hutchinson, Product Manager and Big Data Analytics Software/Systems Architect, InterSystems
Dipti Borkar, V.P. of Products, Kinetica
Adnan Mahmud, Founder and CEO, LiveStories
Jack Norris, S.V.P. Data and Applications, MapR
Derek Smith, Co-founder and CEO, Naveego
Ken Tsai, Global V.P., Head of Database and Data Management Product Marketing, SAP
Clarke Patterson, Head of Product Marketing, StreamSets
Seeta Somagani, Solutions Architect, VoltDB