Intelligent Governance for Big Data
Data governance can be tricky. To help you get started, we put together a list of the basics.
Data governance in traditional data warehouses is often responsible for many aspects of the data, such as:
- Data Quality – Consumable data should be valid.
- Identifying PII elements.
- Identifying critical data elements.
- User roles and access permissions.
In a big data environment, where data flows into the ecosystem quickly, in many varieties, and with a schema inferred at run time, the biggest challenge is governing that data, and the need to govern it is often realized only at run time. How can we find out whether the data contains PII, whether it is valid, whether it is critical, which domain it belongs to, and so on?
If a data lake receives more than 3,000 feeds from various internal or external applications, each containing 100 elements on average, there are more than 3*10^5 elements to define. Doing this manually is impractical. However, some data governance rules can typically be applied automatically, such as finding elements that hold SSN information or checking a business rule to see whether an element's value is accurate. Data quality on large data sets can therefore be achieved if we build algorithms that intelligently identify data governance rules.
Let's discuss, one by one, how to build these rules intelligently.
- Data Quality rules – These include data validity, data format checks, SLA breaches, feed changes, data accuracy, and data completeness. For structured data we can define the data structures up front, but for data that arrives in large volumes we need to infer these properties from the values while the data is in motion. Every time a feed comes in, its attributes, types, formats, time of arrival (for SLA checks), and minimum, maximum, and average values can be stored in a repository. Data governance teams can continuously validate the outcomes and, over time, build a consistent repository of the elements.
- Identifying PII – A few standard patterns for PII or other sensitive information relevant to the business, such as SSNs, mobile numbers, zip codes, state codes, bank accounts, credit cards, and tax numbers, can be pre-built, and elements can be verified against these pre-defined patterns while the data is in motion. If an element matches a pattern that is classified as PII, it can be tagged as PII. This needs continuous analysis at the repository level using machine learning algorithms such as linear regression, anomaly detection, and logistic regression.
- Identifying critical data elements – These can be derived from how the data is used. The usage logs built on Hive, Spark, HBase, and Cassandra need to be analyzed, and the results stored in the repository, in order to build a glossary of critical data elements (CDEs).
- User roles and access permissions – These mostly depend on which domain the data belongs to. For example, is it customer data, policy data, or financial data? The domain can be derived from the elements' names, if they come as part of a feed. For example, if an element's name is 'general ledger', it is most likely financial data. Similarly, if the data contains a name, ID, and/or address, it is most likely customer data. Many methods can be developed to find the data domain. Again, continuous analysis and validation of the findings is required to accurately determine the final data domain. With knowledge of the data domain, sensitive information, and CDEs, we can define user access based on role, such as who can see the PII data.
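The profiling and PII-tagging rules above can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not a production implementation: the `PII_PATTERNS` table, the function names, the shape of the repository record, and the 0.8 match threshold are all hypothetical, and the regular expressions are deliberately simplified.

```python
import re
from statistics import mean

# Hypothetical, deliberately simplified patterns; real feeds need stricter rules.
PII_PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "us_zip": re.compile(r"\d{5}(-\d{4})?"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def infer_type(values):
    """Infer a coarse type for an element from its non-empty string values."""
    try:
        [float(v) for v in values]
        return "numeric"
    except ValueError:
        return "string"


def detect_pii(values, threshold=0.8):
    """Return the PII tags whose pattern fully matches at least `threshold` of the values."""
    tags = []
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if values and hits / len(values) >= threshold:
            tags.append(tag)
    return tags


def profile_column(name, raw_values):
    """Build one (hypothetical) repository record for a feed element."""
    values = [v.strip() for v in raw_values if v and v.strip()]
    col_type = infer_type(values) if values else "unknown"
    record = {
        "element": name,
        "type": col_type,
        "completeness": len(values) / len(raw_values) if raw_values else 0.0,
        "pii_tags": detect_pii(values),
    }
    if col_type == "numeric":
        numbers = [float(v) for v in values]
        record.update(min=min(numbers), max=max(numbers), avg=mean(numbers))
    return record
```

Calling `profile_column("tax_id", ["123-45-6789", "987-65-4321", ""])` would produce a record tagged with `ssn` and a completeness of two thirds. A governance team could review such records in the repository and correct the tags over time.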
Though the ideas above look easy, they need in-house knowledge of the data, master data, domain knowledge, and knowledge of the abbreviations used in the organization. We don't need to start with all of this knowledge; we can feed it to the algorithms as we come to recognize it. To develop this framework, the only thing data governance teams have to do is own the knowledge of the organization's data. Data scientists and data engineering teams play a big role in implementing better data governance technology.
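As a rough illustration of feeding organizational knowledge to an algorithm incrementally, the sketch below infers a feed's data domain from its element names using a keyword glossary that grows as the governance team confirms new terms. The glossary entries, the abbreviation table, the domain names, and the function names are all hypothetical, and simple keyword voting is only one of many possible methods.

```python
import re
from collections import Counter

# Hypothetical seed glossary: keyword -> data domain.
DOMAIN_KEYWORDS = {
    "ledger": "financial",
    "account": "financial",
    "name": "customer",
    "address": "customer",
    "policy": "policy",
}

# Hypothetical organization-specific abbreviations, expanded before lookup.
ABBREVIATIONS = {"gl": "ledger", "addr": "address", "acct": "account", "pol": "policy"}


def tokenize(element_name):
    """Split names like 'custAddr' or 'GL_ACCT' into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", element_name)
    return [t.lower() for t in re.split(r"[^A-Za-z]+", spaced) if t]


def infer_domain(element_names):
    """Vote on a feed's domain from its element names; None if nothing matches."""
    votes = Counter()
    for name in element_names:
        for token in tokenize(name):
            token = ABBREVIATIONS.get(token, token)
            domain = DOMAIN_KEYWORDS.get(token)
            if domain:
                votes[domain] += 1
    return votes.most_common(1)[0][0] if votes else None


def learn(keyword, domain):
    """Record a keyword-to-domain mapping confirmed by the governance team."""
    DOMAIN_KEYWORDS[keyword.lower()] = domain
```

A feed whose elements are all unknown returns `None` from `infer_domain`, which signals that a person should classify it. Once that decision is recorded with `learn`, later feeds that use the same term are classified automatically.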
There are many efforts being undertaken at the industry level to build such products and offer services to enterprises or organizations.