Building Data Lakes for GDPR Compliance
Because GDPR affects the entire data lifecycle, organizations must have an end-to-end understanding of their personal data.
If there's one key phenomenon that business leaders across all industries have latched onto in recent years, it's the value of data. The business intelligence and analytics market continues to grow at a massive rate (Gartner forecasts the market will reach $18.3 billion in 2017) as organizations invest in the solutions that they hope will enable them to harvest the potential of that data and disrupt their industries.
But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.
The European Union's GDPR stipulates stringent rules around how data must be handled. Because the regulation affects the entire data lifecycle, organizations must have an end-to-end understanding of their personal data, from its collection and processing through to its storage and, finally, its destruction.
As companies scramble to make the May 25 deadline, data governance is a key focus. But organizations cannot just think of the new regulations as a box to check. Continuous compliance is required, and most organizations are having to create new policies that will help them achieve a privacy-by-design model.
Diverse Data Assets
One of the great challenges in securely managing data is the rapid adoption of data analytics across businesses, as analytics moves from a back-office IT function to a core asset for business units. As a result, data often flows in many directions across the business, so it becomes difficult to understand the data about the data, such as its lineage (where it was created and how it got there).
Organizations may have personal data in many different formats and types (both structured and unstructured) across many different locations. Under the GDPR, it will be crucial to know where personal data resides across the business and to manage it accordingly. While no one is certain exactly how GDPR will be enforced, organizations will need to be able to demonstrate, at a moment's notice, that their data management processes are continually in compliance with the GDPR.
With the diverse sources and banks of data that many organizations have, consolidating this data will be key to effectively managing their compliance with the GDPR. With the numerous different types of data that must be held across an organization, data lakes are a clear solution to the challenge of storing and managing disparate data.
Pool Your Data
A data lake is a storage method that holds raw data, including structured, semi-structured, and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we're seeing data lakes used to centralize enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems, and more.
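The idea that structure is only defined once the data is needed is often called schema-on-read. The following is a minimal sketch of that idea in plain Python, not a description of any particular data lake product. The `RAW_ZONE` records, the `read_with_schema` helper, and the `SALES_SCHEMA` mapping are all hypothetical names invented for illustration.

```python
import json

# Hypothetical raw zone: records land as-is, with no schema enforced on write.
RAW_ZONE = [
    '{"source": "sales", "email": "ana@example.com", "amount": "42.50"}',
    '{"source": "social", "handle": "@ana", "text": "great service"}',
    '{"source": "sales", "email": "bo@example.com", "amount": "17"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in the schema,
    and casts each kept field with the schema's type function.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# The structure is decided only now, by the consumer that needs sales data.
SALES_SCHEMA = {"email": str, "amount": float}
sales_rows = read_with_schema(RAW_ZONE, SALES_SCHEMA)
```

In this sketch the social media record stays in the raw zone untouched. It is simply skipped by a reader that asks for the sales shape, and a different reader could apply a different schema to the same raw data.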
Data lakes, which typically rely on tools like Hadoop to store and track data within the environment, help organizations bring all their data together in one place where it can be maintained and governed collectively. The ability to store structured, semi-structured, and unstructured data is crucial to the value of this approach for consolidating data assets, compared to data warehouses, which primarily maintain structured, processed data. Enabling organizations to discover, integrate, cleanse, and protect data that can then be shared safely is essential for effective data governance.
Beyond the view across the full expanse of the data lake, organizations can look upstream to identify where data came from before it flowed into the lake. That way, organizations can trace specific data back to its source, such as the CX or marketing applications, providing end-to-end visibility across their entire data supply chain so that data can be identified and scrutinized as necessary.
This end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin of all their information. Besides enabling organizations to store, manage, and identify the source of all their data, data lakes provide a cost-effective means of keeping that data in one place. By contrast, managing the same volume of data in a data warehouse has a far higher total cost of ownership (TCO).
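As a rough illustration of tracing data back upstream, lineage can be modeled as a graph in which each dataset records what it was derived from. This is a sketch under stated assumptions: the `LINEAGE` catalog, the dataset names, and the `trace_origins` function are hypothetical and do not correspond to the API of any specific governance tool.

```python
# Hypothetical lineage catalog: each dataset lists the datasets it was derived from.
LINEAGE = {
    "lake.customer_360": ["lake.crm_contacts", "lake.web_events"],
    "lake.crm_contacts": ["source.crm_app"],
    "lake.web_events": ["source.marketing_app"],
    "source.crm_app": [],
    "source.marketing_app": [],
}


def trace_origins(dataset, lineage):
    """Walk upstream from a dataset and return its original sources.

    An origin is any dataset reached that has no recorded parents.
    The result is sorted so that it is stable across runs.
    """
    origins = set()
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            origins.add(current)
        stack.extend(parents)
    return sorted(origins)


origins = trace_origins("lake.customer_360", LINEAGE)
```

Here `origins` holds the two upstream applications that feed the combined customer view, which is the kind of answer an organization would need when asked where a person's data originally came from.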
Setting the Foundations
While data lakes currently present the best approach to data management and governance for GDPR compliance, they will not be the last stop in organizations' journey towards innovative, efficient and compliant data management. The data storage approaches of the future will be built with the new regulatory climate in mind and will be designed to meet the challenges it presents.
However, with the demand on organizations to create data policies and practices that will support the compliance of their future data storage and analytics endeavors, it is clear that businesses need to start refining processes and policies that will lay the foundations for compliant data innovation in the future. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.
The Clock Is Ticking
Time is running out for many organizations to achieve GDPR compliance, with just weeks until enforcement begins. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonize and identify the source of their data in compliance with the GDPR.
GDPR is bringing new dimensions to customer demand: customers now value trust and transparency, and they will vote with their feet. They will follow companies that can deliver personalized interactions while letting customers take full control of their personal data. Ultimately, companies that establish a system of trust at the core of their customer and employee relationships will win in the digital economy.
Published at DZone with permission of Jean-Michel Franco, DZone MVB.
Opinions expressed by DZone contributors are their own.