
Big Data Integration


Data management and governance are key to succeeding with big data given the influx of data from ever-evolving sources.


We recently connected with Isabelle Nuage, Director of Product Market, Big Data at Talend to understand how she views the current and future state of big data and analytics.

How is your company involved in the ingestion, management, analysis, and/or reporting of big data?

Talend is a cloud-first data integration solutions provider that delivers insight-ready data at scale for businesses, faster. We offer products for data management, data integration, and data governance for any data source, in the cloud and on-premises. Our customers use Talend to create 360-degree views of their customers, optimize business processes, manage data privacy and compliance, and innovate with new products and services.

What do you consider to be the most important elements of successful big data initiatives?

To be data-driven, you have to put useful data into the hands of all your employees. This requires self-service solutions to make it easy for teams to access data from anywhere. Governed self-service, including role-based access, masking rules, and workflow-based data curation, empowers decision makers without putting data at risk or undermining compliance.
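
As a rough illustration of governed self-service, the sketch below applies role-based masking rules to a record before it reaches a user. The role names, field names, `MASKING_POLICY`, and the `mask_record` helper are all hypothetical and are not taken from any Talend product; the sketch only shows the general idea of combining role-based access with masking rules.

```python
from typing import Callable, Dict


def mask_email(value: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def redact(value: str) -> str:
    """Hide the value entirely."""
    return "REDACTED"


# Hypothetical policy: for each role, the fields to mask and the rule to use.
MASKING_POLICY: Dict[str, Dict[str, Callable[[str], str]]] = {
    "analyst": {"email": mask_email, "ssn": redact},
    "steward": {"ssn": redact},
    "admin": {},
}


def mask_record(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the record with the role's masking rules applied.

    Unknown roles get every field redacted (deny by default).
    """
    if role not in MASKING_POLICY:
        return {field: redact(value) for field, value in record.items()}
    rules = MASKING_POLICY[role]
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }


# Illustrative record; the values are made up.
customer = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
print(mask_record(customer, "analyst"))
```

Denying by default for unknown roles is a design choice in this sketch: a new or misconfigured role sees nothing sensitive until someone explicitly grants it access.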

You can’t have successful, governed self-service, however, without a proper data management system. In today’s digital era, there are endless possible touchpoints with customers. Having a modern data platform in place allows organizations to create a single source of the truth that they can use to improve business performance from logistics to financial forecasting while enabling one-to-one buying experiences across multiple touchpoints.

On top of all this, organizations must still comply with privacy regulations and employ proper data governance. A well-crafted data governance strategy is fundamental for any organization that works with big data. Data governance ensures that roles related to data are clearly defined and that responsibility and accountability are agreed upon across the enterprise. A well-planned data governance framework covers strategic, tactical, and operational roles and responsibilities.

How do you secure data?

Securing data, especially at the enterprise level, is a daunting task, to say the least. We have to take control of the tools we have and tame them in order to benefit from them. This is where machine learning (ML) steps in to save the day. By applying ML to the data, we can quickly develop a map of historical data, correlate events between different security sources, and even predict negative and positive outcomes.

Additionally, a concrete data governance program establishes the policies, standards, data hubs, and controls needed to protect data effectively, publish it for decision making, and, finally, meet the mandates of the data privacy and sovereignty legislation that applies to the organization. An end-to-end approach to data governance for data sovereignty should address these main challenges:

  • Know your personal data by mapping critical data elements to data fields across your IT landscape using metadata management.

  • Foster accountability by creating workflows for data stewardship and managing end-user computing.

  • Establish a personal data hub with native data quality for consent management.

  • Track and trace data with audit trails and data lineage.

  • Deliver a privacy center so that consumers, customers, employees, and citizens can control rights for access, rectification, portability, and erasure.

At Talend, we live by the Five Pillars of GDPR to protect and secure our users’ data (and we help our customers do it, too). We’ve created and maintained a holistic data inventory to know what Personally Identifiable Information (PII) we have stored and processed. To achieve this, we employ the latest metadata management techniques to deliver everything from data capture, integration, classification, and lineage, to data anonymization, self-service curation, and data portability.

What are the most prevalent languages, tools, and frameworks you see being used in data ingestion, analysis, and reporting?

We see customers moving more and more of their big data workload from on-premises to the cloud and from batch to real-time data processing to meet their SLAs. We see customers adopting Spark more and more and creating data lakes in the cloud to enable more data, and more business use cases with ML, AI, and NLP that are resource intensive. These customers, like AstraZeneca, are also adopting containers and serverless technologies to scale based on seasonal business needs and pay for just what they use instead of being charged for idle servers.

What are a couple of big data use cases you’d like to highlight? What is the business problem solved?

  • AstraZeneca is a multinational pharmaceutical and biopharmaceutical company with close to 60,000 employees worldwide and $22.5B in revenue. They’re one of the few companies to span the entire lifecycle of a medicine, from research and development to manufacturing and supply to the global commercialization of primary care and specialty care medicines. Over a 3-year journey, AstraZeneca transformed their core IT and finance functions by building an event-driven, scalable data platform on a dynamic, elastic cloud architecture to meet business demands. AstraZeneca was a pioneer in building a serverless architecture and running Talend in Docker containers to support their massive month-end peak activity, which led to financial reporting in half the time and delivered twice the value at half the cost.

  • Euronext, the largest pan-European exchange in the eurozone, processes 1.5 billion new messages a day for 2 million daily transactions. On the day of the Brexit vote, the volume of transactions and the need for analysis were so high that the actual processing time exceeded the window allocated to batch processing. This was the key moment for Euronext to adopt a cloud-first strategy for its compliance and post-market analysis platform.

    Today, Euronext has built a data lake on AWS that allows them to manage 10 times more data at the same cost through the use of serverless technologies and to comply with regulations (GDPR, MiFID II) using Talend. In addition, with this new environment, Euronext's teams can deliver new data science platforms in just a few days, compared to an average of 45 days in the past, and are able to perform real-time analytics on streaming data. The platform also allows the monetization of data through a marketplace, which already accounts for 20% of their income.

  • Uniper is a global energy company with 100 years of experience, around 12,000 employees worldwide in more than 40 countries, and 1.7 billion euros EBITDA. Uniper generates, trades, and markets energy on a large scale. They also procure, store, transport, and supply commodities such as natural gas, liquefied natural gas, and coal as well as energy-related products. Using Talend, Uniper built a digital platform for 17+ Uniper functional entities, based on a governed data lake and a data catalog running on Microsoft Azure and Snowflake, to scale their analytics and drive data monetization. As a result, they were able to reduce the cost of integrating data by 80%, increase the speed of integrating this data by 75%, and gain 50% in synergies and efficiencies.

What are the most common failures you see in big data initiatives?

There are four common failure reasons in big data initiatives:

  1. Lack of Supporting Culture: Businesses often lack a culture that supports data-driven decisions. Instead of compiling the data and letting it work for them, they often make decisions based on gut feelings rather than data. When data initiatives are put into motion, they fail because the organization is not ready for this new decision-making process. If leaders are not trusting the data, why should employees? The only way to change this is by changing business culture and implementing a data-driven culture that embraces the power of analytics, which starts at the top.

  2. Ignored Middle Manager and Employee Feedback: Another area that follows closely to the culture and leadership elements above is the proper engagement of the front-line managers and business unit leaders. It is critical that your entire organization is involved in the data strategy process, especially those that are dealing with the areas that are being analyzed. Often, executives attempt to make all of the decisions without consulting the ones who deal with the data on a daily basis.

  3. Overwhelmed Data Swamps: Being able to draw a large amount of insight from your data is a huge feat, and big data does hold big promise for every industry. However, too much data can quickly turn into a data swamp. Data swamps are useless, difficult to manage, produce headaches for your team, and are a leading culprit of failing data initiatives. With data being created faster by the day, more data alone isn’t the answer.

  4. Too Much Data or Too Little Data: Expanding upon the “Data Swamp,” another important consideration is to determine whether or not you’re collecting the data you need. Do you have too much? Not enough? When you are gathering data on every parameter you can imagine, you eventually become overrun with useful data mixed with useless data. Gleaning insights turns into finding a needle in a haystack. At the same time, you could be scared to overwhelm your team and fail to collect enough data, hamstringing your team as they try to make decisions.

Do you have any concerns regarding the state of big data?

One of the biggest issues with big data today is data privacy, and how changing compliance requirements and regulations are affecting the way businesses approach data management. When GDPR took effect in Europe, a majority of companies failed to comply with the rules because they didn’t adequately track personal information. Complying with evolving data regulations requires a shift in the business mindset: businesses should treat compliance as an opportunity to better serve their customers, which calls for a new approach and new strategies rather than checking another box.

Although it’s challenging to adapt to GDPR, other countries like the U.S., Japan, and China have begun creating and implementing their own data regulations for businesses because privacy and data ownership are so important. As these regulations are implemented, we’ll encounter yet unforeseen challenges as both businesses and consumers strive to find and follow best practices.

What’s the future of big data ingestion, management, and analysis from your perspective — where do the greatest opportunities lie?

With growing volumes of raw data about people, places, and things, plus increasing computing power and real-time processing speeds, AI/ML technologies will have a tremendous impact on business processes. Before those capabilities are leveraged, IT must be able to bring datasets together from disparate and varied data sources into a secure, centralized, and scalable governed data lake – creating huge opportunities for big data ingestion and management to pave the way to implementing emerging technologies. If organizations don’t have effective data management strategies in place, then they won’t be able to take advantage of the game-changing technologies coming in the years ahead.

Serverless computing is one of those game-changing technologies for data integration. With serverless and functions-as-a-service, companies will have infinite possibilities for data on-demand. As serverless technologies grow in adoption, companies will be able to decide how, where, and when they process data in a way that's economically feasible, without wasting resources.

What do developers need to keep in mind when working with big data?

The big data and cloud ecosystem is in constant flux – technologies come and go and versions of technologies, such as Spark, get updated frequently. Developers who are building their big data stack using hand coding or betting on a specific set of technologies or versions might need to rewrite their entire project or get stuck in the past. It’s important to choose an open platform that allows developers to build portable data pipelines, so they can be much more agile and adopt new technologies and innovations much faster.

Big data might contain sensitive information and can quickly become a liability for an organization if data privacy and governance rules are not properly applied. Data is valuable across all business units, and individuals are recognizing that, but companies are struggling to cope with the big data ecosystem. The challenge for the enterprise is to distribute data skills across the organization while assuring quality. Data is no longer the responsibility of one person or department – assuring quality is a team sport and everyone now must be accountable.

Is there anything you’d like to know about what developers are or are not doing with regards to the big data projects they are working on?

Developers hand-coding their big data projects often overlook the need for data quality and governance, as well as the full continuous integration/continuous delivery process needed for a proper SDLC. They don’t usually anticipate the need to scale quickly for business needs or to integrate more and more data sources at various speeds. Also, the cost of deployment and maintenance is often an afterthought, and that oversight is costing companies dearly.

What have I failed to ask you that we need to consider with regards to optimizing value from data?

We know that 55% of a company’s data is not accessible. Only 45% of an organization’s structured data is actively used in making decisions, and less than 1% of its unstructured data is analyzed or used at all. Here are some best practices for making big data available for analytics:

  • Capture data across disparate formats and sources. For example, capture customer details or unstructured data like social media posts.
  • Establish data quality upfront. Check for accuracy, cleanse data, and reconcile data.
  • Leverage automation for faster processing. Use tools and machine learning instead of hand coding to speed up data processing.
  • Operationalize data preparation and analytics. Pull data in real time, and enable your team to see correlations and trends, make data-driven decisions, and have more time for innovation.
  • Make use of a Data Catalog. Catalog data sets so users may determine if an existing data set is available before creating something new.
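
Two of the practices above, establishing data quality upfront and checking a catalog before creating something new, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the in-memory `CATALOG`, the `find_dataset` and `cleanse` helpers, and the sample records are all hypothetical, and the email check is deliberately simplistic.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory catalog: data set name -> short description.
CATALOG: Dict[str, str] = {
    "customers_clean": "Deduplicated customer records with validated emails",
}


def find_dataset(name: str) -> Optional[str]:
    """Look up a data set in the catalog before building a new one."""
    return CATALOG.get(name)


def cleanse(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Trim whitespace, lowercase emails, drop rows without a plausible email,
    and remove duplicates by email (first occurrence wins)."""
    seen = set()
    cleaned = []
    for row in records:
        email = row.get("email", "").strip().lower()
        if "@" not in email or email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": row.get("name", "").strip(), "email": email})
    return cleaned


# Made-up sample input for illustration only.
raw = [
    {"name": " Ada ", "email": "Ada@Example.com"},
    {"name": "Ada", "email": "ada@example.com "},  # duplicate after normalization
    {"name": "Bob", "email": "not-an-email"},      # fails the accuracy check
]
clean = cleanse(raw)
print(clean)
```

In practice the catalog would be a shared service rather than a dictionary, but the order of operations is the point: look for an existing data set first, and cleanse and reconcile before the data reaches analysts.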


Topics:
big data, big data use cases, data security, big data tools

Opinions expressed by DZone contributors are their own.
