To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, "What do you consider to be the most important elements of a successful big data initiative?"
- Choose the right project - a well-defined problem that’s a pain point. Start small and simple. Success requires planning, organizational buy-in, and executive sponsorship, along with a cultural change backed by new policies and procedures.
- Depending on the size of the company and projects, a plan is necessary. The plan does not have to be perfect from the beginning; it can be refined via a pilot project. However, before starting a big data initiative, it is necessary to define: 1) The projects touched by big data technology. 2) The key people with key competencies. 3) The time horizon. 4) Success factors/metrics to check along the way to determine if the project is on track (and still useful). 5) The right tool to ingest, transform, and analyze the data and eventually visualize the results. The tool should also integrate seamlessly with a variety of other open source and widely used data analytics and data visualization tools.
- Collecting the data they need, and identifying the value they have in their data. Companies don’t know what they want to use AI for. Defining the business problem they are trying to solve is transformational. Ask them to look at the data of their business and where the value proposition is in the data they have. What decisions are they making with the data? ML relies on statistics of the decisions being made that can be automated.
- Any big data initiative should have the following elements in mind: 1) What business decisions are you trying to drive? 2) What technology and analytics methodologies will help achieve those decisions? 3) How do you plan to operationalize the insights derived from big data? 4) What dimensions of quality will you uphold? 5) What types of skillsets do you need for people on the project team?
- Organizationally, it’s essential to have a clear set of goals and a mandate to achieve those goals. That may seem obvious, and it’s not unique to big data initiatives, but too often it’s missing. Technically, it’s important to have the flexibility to look at data using multiple different tools and data models. Most important applications require multiple data models. Even when it seems at first that you can do everything with one model such as relational tables, very often you discover mid-project there’s a new requirement, opportunity, or tool that works better using a different model. It’s also crucial to inventory data sources and assess data quality. Important data is usually spread across multiple silos, so interoperability and data transformation is a key part of successful projects.
- The most important element is getting the data from the application to a place to process it. This drove the adoption of the data lake but that didn’t solve the entire problem. Customers have many different data sources – it’s hard to maintain master data references and you end up with a lot of duplication. The master data management (MDM) problem is a context that you can apply to the rest of the data. It’s the building block to AI/ML with big data. Bridge the gaps of sifting through data to make scientists and analysts more effective.
- There's a lot of interest in data virtualization – accessing all of the sources for any potential use, whether analytics, visualization, or any user. There’s a bottleneck between data and analytics, and we need to make a change. Putting all the data in a warehouse or lake won’t work because there will always be more data. Define the most critical use cases. Identify where data is holding you back.
- Quality is about controls, not volume: source controls, security, and retention. You need reliable and valid data; the level of trust in your analytics is a function of the reliability of your data. Unstructured data comes out of file shares and ECM systems, gets put into big data repositories, and is used to make quick, informed decisions. We have the ability to look at source repositories, pull likenesses together, and enforce policies against categories. Sensitive information may have a security clearance of "highly secure" and may not be available to a big data repository. Look at the information, build categories, and build policies about what you can do with the data. If the security policy changes, be able to apply the changes to the base file documents and allow them to be used under the new security protocols.
- Virtual data lakes have multiple data sources, and you need to be able to access all of the data. OData is a standards-based, microservices-friendly way to get to the data in a structured format; like JDBC, API access can be enabled on any database. We are able to query data in real time. OData is dynamic in terms of adding fields and columns, and it enables the virtual data lake – it can be Salesforce, but the data can reside anywhere. Get real-time data updates.
- Provide enhanced value to your customers. Data has inherent value. Through the process of mining, you may uncover opportunities for improving your existing product and creating new insights for your customers.
- With the massive amount of data coming onboard, you need to worry about velocity and variety. How do you process the amount of data you are ingesting? Are you able to integrate the data you are ingesting with a variety of data sources?
- Organizations are collecting a variety of data and want to look at it. The operationalization of big data is becoming a task executives know about, and the results of earlier projects are showing the payoff of digital transformation. It feels like we are moving big data from a project to business as usual. Think of a "terabyte triangle": legacy functional systems, real-time streaming platforms, and historical data. As soon as you get to a terabyte, one of the three legs breaks – you cannot have all three with existing technologies. It's a multi-terabyte problem.
- Getting data into one place – one source of truth, in a Hadoop cluster, for example. Use data to ensure people are making data-driven decisions, and use data to improve every level of the enterprise. Help analysts and organizations tell a story from the data. Previously the focus was purely visual, but what do you do after dashboards? The last mile of communication around data is to talk to the individual viewer in a way they will understand – provide a relevant and actionable story for them. Think about the outcome you want. What do you want to accomplish?
- Know what’s in the data and how to make sense of it. Crawl through all the data and profile it so you can automatically tag it with the appropriate business terms. Look for data sets with policy numbers or department IDs; go through the fields and separate the two just like a human analyst would. Refine and use the data with fingerprinting so you don’t have to recrawl every time there is a new outcome. Use fingerprints to train the systems to recognize things. Once everything is tagged, you can search for things, understand the data, and apply business policies. Consistent labels, applied by automatic tagging, allow you to write policies for the data.
- Everyone has heard (or even used) the term “data swamp.” That happens when a company mistakenly mixes data from various sources, sources with different trust levels or certainty levels. The value of the entire data collection is reduced since you cannot assess the value and veracity of any particular piece of data so you can’t have high trust in the resulting computations or analyses. So, data accuracy and correctness are very important for all data projects — knowing that the data you are using is correct, true, current, from a reputable or trusted source (which can help with the other assessments). Likewise, everyone has, by now, heard of the three Vs of Big Data (sometimes expanded to five or more Vs). Volume is usually the focus for most people — dealing with the larger and larger amount of data that is available. Velocity, the speed at which new data is made available, also continues to increase. And Variety was used as a reason for the rise of NoSQL platforms which would easily work on new unstructured data that traditional RDBMS systems could not.
- On the technology side, the most successful big data initiatives all have one thing in common, the ability for the data-driven organization to take action with complete accuracy, by relying on a purpose-built, high-performance, open data analytical platform. For the business to be successful, they need to demand insight from that data in seconds with complete accuracy. For example, in health care, hospitals seek to predict and prevent infections, such as sepsis. Without an underlying platform that can analyze all of the data — not samples of the data — very quickly, then this infection can go undetected and untreated. That’s why big data analytical initiatives are not just mission critical, they are life critical.
- For the last five years we’ve been putting data into data lakes. Now we see the need to get better insights and outcomes from our data: predictive analytics, ML, AI, deeper pattern matching. Correlate and find patterns in the data in the data lake, and augment the analytical data drawn from it. One approach is to use graphs for the computational power to derive new information – connect the dots across new data sets. ML is more powerful now that we are able to analyze more data faster. Built-in ML inside of a graph enables us to see how doctors and patients are forming a community. Prediction is best solved by the graph itself: look for connections and predictions in the graph. It's able to provide evidence since it is easier to see connections and pathways.
- You need an "easy button" to break down complexity inherent in data processing platforms — Hadoop, Spark, Kafka. Leverage tooling to simplify and make more progress faster. Cloud has been big. Organizations are hedging their bets across multiple cloud providers. AWS and Azure come up the most, but people want to mix and match. Which services are accessed in different places? Hadoop in the cloud, Snowflake. There is a shift to take advantage of the cloud more than ever before.
- Look at software designed to collect performance statistics, trend analysis, and reporting. Intelligently scale out and up. Determine the resources you need to solve the business problem. Spin resources and servers up and down as needed. Collect and analyze the data to automate the process.
- Build AI and analytic applications using containers. Store, search, and use container data to power analytic applications; this is a function of the maturity of containers. Smart companies are saving everything else as well – transaction, legacy, and container data. You need to keep and store as much data as possible.
- Get the operations side right. You need to be able to run the solution reliably to meet the business requirements. We focus on operationalizing big data with an active application. The application is always on and we have discipline around operationalizing.
- What are the best practices to be successful with data? Five things: 1) Culture – change things to have a data-driven culture, with an executive sponsor in the business using data to innovate. One tool that is helpful is a cross-company data and analytics community to share projects, insights, methodology, and data. 2) Organization – how to organize around collecting, creating, ingesting, managing, and using data to find insights, and the kinds of skills needed. A centralized data engineering team works well focused on building a data asset, while analysts and data scientists are decentralized and aligned with the business. 3) Roadmap – how to build a roadmap in an agile, lean world. Define two-year strategic initiatives; manage risk, IoT, and the collection and use of customer data. Collect more data, do better analysis, and integrate insights into a business process or system. 4) Production – organizations get stuck between finding insights and getting them into production. What business process could they impact? What predictions could we make to help customers? How do we integrate into a business process or a system? 5) Right-sizing data governance – you want power users to have access to lots of data; conversely, you don’t want power users to access and use data inappropriately. Audit how insights are found, the lineage of data, and appropriate consent to use data that way.
- Expectation setting after five years of building data lakes: it's easy to get data in, but how do we get value out? There is a combination of challenges. First, the system is not easily accessible by most people – only the priests, the software engineers, have access to the data lake. Second, the data tends to be in a raw state; integrity is not ensured and governance is not in place. The data lake is like a big, slow-moving train – it cannot interact at the speed of thought. Reset expectations for the data lake and big data journey. Next steps you can take: proceed with a traditional data warehouse or data mart approach; if in the cloud, you have different options; or you can look at a data-as-a-service philosophy – self-service at the speed of thought, with agility and flexibility.
- We work with clients further down the journey, already using big data for customer 360 and ETL processing. They've gone all in with big data and likely have hardware and a relationship with Cloudera or Hortonworks. When end-users start using the platform and are not able to fulfill business objectives, the operations team is the first line of defense. Take Spark for fraud modeling: models fail and transactions leak, so you have to figure out how to make sure fraud pipelines complete on time. Or a client wants to run reports on big data and Hadoop that don’t complete on time, and they don’t know why. Make life easier: get a view of what’s going on, where the problem is, and how to fix it, with prescriptive suggestions on what to do using AI/ML.
- The most important element is the creation of a big data system which allows the real-time ingestion, transaction, and analysis of data as it is created. Creating a system which requires an ETL process often results in systems which are not fast and responsive enough to drive real-time decision making.
- We consider automation of both the development and operationalization of data pipelines to be absolutely critical. Organizations often focus mainly on speeding up the development process only to have their projects get stuck when they can’t put them into production. Creating a data pipeline that is repeatable, scalable, and resilient has a set of challenges that are distinct from just running a data analytic every once in a while. It’s imperative that both the development and ongoing management and governance of big data projects are automated to enable true big data agility.
- Big data success comes from three key program elements: 1) Accuracy – without the highest levels of data accuracy and integrity, analysis and targeting will not be effective, and in fact in many cases can cost more money or lower customer satisfaction. 2) Granularity – contrary to popular belief, success is not necessarily about “big data,” but rather the right data, with little attributes and big insights. To make big data valuable, we have to ensure deep levels of granularity in the attributes and insights associated with our data. 3) Activation – big data and brilliant insights are not effective without activation. We work with our clients to develop data and intelligence that is accessible for activation via BI teams, SaaS platforms, and marketing/sales programs.
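One respondent above describes building categories over documents and enforcing security policies against them, so that "highly secure" material never lands in a less-trusted big data repository. A minimal sketch of that idea follows; the category names, security levels, and document fields are all illustrative assumptions, not any product's real schema.

```python
# Category-based policy enforcement sketch: documents carry a category,
# each category maps to a security level, and only documents at or below
# the repository's clearance are admitted. All names are hypothetical.

# Security levels, ordered from least to most restrictive.
LEVELS = {"public": 0, "internal": 1, "highly-secure": 2}

# Policy: each category maps to the security level of its contents.
CATEGORY_POLICY = {
    "marketing": "public",
    "hr-records": "internal",
    "defense-contracts": "highly-secure",
}

def admissible(docs, repo_clearance):
    """Return documents whose category's security level does not
    exceed the repository's clearance level."""
    ceiling = LEVELS[repo_clearance]
    return [d for d in docs
            if LEVELS[CATEGORY_POLICY[d["category"]]] <= ceiling]

docs = [
    {"name": "q3-campaign.pdf", "category": "marketing"},
    {"name": "salaries.xlsx", "category": "hr-records"},
    {"name": "contract-441.docx", "category": "defense-contracts"},
]

# A repository cleared only for "internal" data excludes the
# highly-secure document.
allowed = admissible(docs, "internal")
```

If the security policy changes, only `CATEGORY_POLICY` needs updating; re-running the filter re-applies the new rules to the same base documents, which is the point the respondent makes about policy changes.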
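Another respondent points to OData as a structured way to query a virtual data lake in real time. As a sketch of what such a consumer-side query looks like, the snippet below composes a URL from the standard OData `$select`, `$filter`, and `$top` system query options; the service root and entity set are hypothetical.

```python
# Build an OData query URL from standard system query options.
# Only $select/$filter/$top are standard OData; the endpoint is made up.
from urllib.parse import urlencode

def odata_query(base_url, entity_set, select=None, filter_expr=None, top=None):
    """Compose an OData GET URL; parameters are percent-encoded."""
    params = {}
    if select:
        params["$select"] = ",".join(select)
    if filter_expr:
        params["$filter"] = filter_expr
    if top is not None:
        params["$top"] = str(top)
    return f"{base_url}/{entity_set}?{urlencode(params)}"

url = odata_query(
    "https://example.com/odata",   # hypothetical service root
    "Accounts",
    select=["Id", "Name"],
    filter_expr="Region eq 'EMEA'",
    top=10,
)
```

Because the query options are declarative, the same request shape works whether the entity set is backed by Salesforce or by data residing anywhere else, which is the "virtual" part of the virtual data lake.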
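The profile-and-tag approach described above (crawl fields, assign business terms like policy number or department ID, and fingerprint fields so they are not re-profiled on every crawl) can be sketched as follows. The patterns, tag names, and caching scheme are toy assumptions for illustration, not a real cataloging product's logic.

```python
# Toy field profiler: match sampled values against simple patterns to
# assign a business term, and cache a "fingerprint" per field so
# unchanged fields skip re-profiling on the next crawl.
import hashlib
import re

PATTERNS = {
    "policy-number": re.compile(r"^POL-\d{6}$"),
    "department-id": re.compile(r"^D\d{3}$"),
}

_fingerprints = {}  # fingerprint digest -> cached tag

def fingerprint(values):
    """Stable digest of a field's sampled values."""
    return hashlib.sha256("|".join(sorted(values)).encode()).hexdigest()

def tag_field(values):
    """Tag a field when every sampled value matches one business term."""
    fp = fingerprint(values)
    if fp in _fingerprints:          # seen before: skip re-profiling
        return _fingerprints[fp]
    tag = next((name for name, rx in PATTERNS.items()
                if all(rx.match(v) for v in values)), "unclassified")
    _fingerprints[fp] = tag
    return tag
```

Consistent tags like these are what make the last step in the quote possible: once fields carry labels, governance policies can be written against the labels rather than against individual data sets.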
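On "prediction is best solved by the graph itself": the simplest graph-native link-prediction heuristic is counting common neighbors, and it also supplies the evidence the respondent mentions, since the shared nodes are the visible pathways. Below is a minimal sketch; the doctor/patient edges are invented for the example, and common-neighbors is just one stand-in for the richer graph ML being described.

```python
# Score a possible link between two nodes by counting common neighbors.
from collections import defaultdict

edges = [("dr_a", "patient_1"), ("dr_b", "patient_1"),
         ("dr_a", "patient_2"), ("dr_b", "patient_2"),
         ("dr_c", "patient_3")]

# Build an undirected adjacency map.
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def common_neighbors(u, v):
    """Shared-neighbor count; a higher score suggests a likelier link,
    and the shared nodes themselves are the supporting evidence."""
    return len(adj[u] & adj[v])

# dr_a and dr_b share two patients, so a tie between them is far more
# plausible than between dr_a and dr_c, who share none.
```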
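The scale-out/scale-up point above (collect performance statistics, then spin servers up and down as needed) reduces to a sizing decision from collected metrics. Here is a hedged sketch of one such rule; the target utilization, bounds, and scaling formula are illustrative assumptions, not any vendor's algorithm.

```python
# Decide fleet size from collected per-server utilization samples.
import math

def desired_servers(utilizations, target=0.6, floor=1, ceiling=20):
    """Size the fleet so average utilization lands near `target`.

    `utilizations` holds one recent load sample (0.0-1.0) per server.
    """
    current = len(utilizations)
    avg = sum(utilizations) / current
    # Total work = avg * current; divide by the target to size the fleet.
    wanted = math.ceil(avg * current / target)
    return max(floor, min(ceiling, wanted))

# Three hot servers argue for growth; three idle ones argue for shrink.
grow = desired_servers([0.9, 0.9, 0.9])
shrink = desired_servers([0.1, 0.1, 0.1])
```

Automating this loop end to end (collect, analyze, resize) is what the respondent means by using the data itself to drive the process.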
Here’s who we spoke to:
- Cheryl Martin, V.P. Research Chief Data Scientist, Alegion
- Adam Smith, COO, Automated Insights
- Amy O’Connor, Chief Data and Information Officer, Cloudera
- Colin Britton, Chief Strategy Officer, Devo
- OJ Ngo, CTO and Co-founder, DH2i
- Alan Weintraub, Office of the CTO, DocAuthority
- Kelly Stirman, CMO and V.P. of Strategy, Dremio
- Dennis Duckworth, Director of Product Marketing, Fauna
- Nikita Ivanov, Founder and CTO, GridGain Systems
- Tom Zawacki, Chief Digital Officer, Infogroup
- Ramesh Menon, Vice President, Product, Infoworks
- Ben Slater, Chief Product Officer, Instaclustr
- Jeff Fried, Director of Product Management, InterSystems
- Bob Hollander, Senior Vice President, Services & Business Development, InterVision
- Ilya Pupko, Chief Architect, Jitterbit
- Rosaria Silipo, Principal Data Scientist and Tobias Koetter, Big Data Manager and Head of Berlin Office, KNIME
- Bill Peterson, V.P. Industry Solutions, MapR
- Jeff Healey, Vertica Product Marketing, Micro Focus
- Derek Smith, CTO and Co-founder and Katie Horvath, CEO, Naveego
- Michael LaFleur, Global Head of Solution Architecture, Provenir
- Stephen Blum, CTO, PubNub
- Scott Parker, Director of Product Marketing, Sinequa
- Clarke Patterson, Head of Product Marketing, StreamSets
- Bob Eve, Senior Director, TIBCO
- Yu Xu, Founder and CEO, and Todd Blaschka, CTO, TigerGraph
- Bala Venkatrao, V.P. of Product, Unravel
- Madhup Mishra, VP of Product Marketing, VoltDB
- Alex Gorelik, Founder and CTO, Waterline Data
Opinions expressed by DZone contributors are their own.