Intelligent Governance for Big Data

Data governance can be tricky. To help you get started, we put together a list of the basics.

by Mamta Chawla · May. 03, 19 · Analysis

Data governance in traditional data warehouses is often responsible for many aspects of the data, such as:

  • Data Quality – Consumable data should be valid.
  • Identifying PII elements.
  • Identifying critical data elements.
  • User roles and access permissions.

When data flows into the ecosystem quickly and in many varieties, the biggest challenge is governing it. In a big data environment, where data arrives fast with a schema inferred at run time, the need to govern data is often realized only at run time. How can we find out whether the data contains PII, whether it is valid, whether it is critical, which domain it belongs to, and so on?

If a data lake receives more than 3,000 feeds from various internal or external applications, each containing 100 elements on average, there are roughly 3*10^5 elements to define. Defining them manually is impractical, even though the data governance rules typically applied are simple ones, like finding elements that hold SSN information or checking a business rule to see whether an element's value is accurate. Data quality on large data sets can therefore be achieved if we build algorithms that intelligently identify data governance rules.
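
As an illustration of the kind of rule mentioned above, here is a minimal Python sketch that tags an element as holding SSN-like or ZIP-like values by sampling its contents. The pattern set, the 0.8 match threshold, and the names `PII_PATTERNS`, `tag_pii`, and `scan_feed` are assumptions made for this example, not part of any particular governance product.

```python
import re

# Hypothetical patterns; a real deployment would tune these to its own data.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}


def tag_pii(values, threshold=0.8):
    """Return the pattern names matched by at least `threshold` of the non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    tags = []
    for name, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            tags.append(name)
    return tags


def scan_feed(columns):
    """Map each element name to its PII tags, omitting elements with no tags."""
    result = {}
    for name, values in columns.items():
        tags = tag_pii(values)
        if tags:
            result[name] = tags
    return result
```

Because the check looks at values rather than element names, it can run on feeds whose schema is only inferred at run time. Matches are candidates for review, since a pattern alone cannot prove that a nine-digit string is really an SSN.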

Let's discuss, one by one, how to build these rules intelligently.

  1. Data Quality rules – Some of these rules are: data validity, data format checking, SLA breaches, feed changes, data accuracy, and data completeness. For structured data, we can define the data structures up front; for data that arrives in large volumes, we need to infer the above values while the data is in motion. Every time a feed comes in, its attributes, their types and formats, the time of arrival (for SLA checks), and the minimum, maximum, and average values can be stored in a repository. Data governance teams can then continuously validate the outcomes and, over time, build a consistent repository of the elements.
  2. Identifying PII – A few standard patterns for PII or sensitive information relevant to the business, like SSNs, mobile numbers, zip codes, state codes, bank accounts, credit cards, and tax numbers, can be pre-built. When the data is in motion, elements can be verified against these pre-defined patterns. If an element matches a pattern that is classified as PII, it can be tagged as PII. This needs continuous analysis at the repository level using machine learning algorithms, like linear regression, anomaly detection, and logistic regression.
  3. Identifying critical data elements (CDEs) – This can be derived from how the data is used. The logs built on Hive, Spark, HBase, and Cassandra need to be analyzed and stored in the repository in order to build a glossary of CDEs.
  4. User roles and access permissions – These mostly depend on who the data belongs to. For example, is it customer data, policy data, or financial data? This can be derived from the elements' names, if they come as part of a feed. For example, if an element's name is 'general ledger', it's most likely financial data. Similarly, if data contains a name, id, and/or address, it's most likely customer data. Many methods can be developed to find the data domain. Again, continuous analysis and validation of the findings is required to accurately determine the final data domain. With knowledge of the data domain, sensitive information, and CDEs, we can define user access based on role, such as who can see the PII data.
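
The profiling idea in the first item can be sketched in Python as follows. This is a minimal sketch under stated assumptions: a feed is modeled as a dictionary of element names to lists of string values, the SLA is a daily time-of-day deadline, and the function names (`infer_type`, `profile_element`, `profile_feed`, `detect_changes`) are hypothetical. A real repository would persist these profiles rather than keep them in memory.

```python
from datetime import datetime, time
from statistics import mean


def infer_type(values):
    """Infer a coarse type ('int', 'float', or 'str') from a list of string values."""
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)
            return label
        except ValueError:
            continue
    return "str"


def profile_element(values):
    """Summarize one element: inferred type, completeness, and numeric statistics."""
    present = [v for v in values if v not in (None, "")]
    completeness = len(present) / len(values) if values else 0.0
    profile = {
        "type": infer_type(present) if present else "unknown",
        "completeness": completeness,
    }
    if profile["type"] in ("int", "float"):
        numbers = [float(v) for v in present]
        profile.update(min=min(numbers), max=max(numbers), avg=mean(numbers))
    return profile


def profile_feed(columns, arrived_at, sla_deadline):
    """Build the record stored in the repository for one arrival of a feed."""
    return {
        "arrived_at": arrived_at.isoformat(),
        "sla_breached": arrived_at.time() > sla_deadline,
        "elements": {name: profile_element(vals) for name, vals in columns.items()},
    }


def detect_changes(previous, current):
    """List element-level differences between two profiles of the same feed."""
    changes = []
    prev, curr = previous["elements"], current["elements"]
    for name in sorted(prev.keys() - curr.keys()):
        changes.append(f"removed: {name}")
    for name in sorted(curr.keys() - prev.keys()):
        changes.append(f"added: {name}")
    for name in sorted(prev.keys() & curr.keys()):
        if prev[name]["type"] != curr[name]["type"]:
            changes.append(
                f"type changed: {name} {prev[name]['type']} -> {curr[name]['type']}"
            )
    return changes
```

Comparing each new profile with the previous one is one simple way to surface feed changes for the governance team to validate. The stored minimum, maximum, and average values could later feed anomaly detection.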

Though the ideas above look easy, they require in-house knowledge of the data, master data, the business domain, and the abbreviations used in the organization. We don't need to start with all of this knowledge; we can feed it to the algorithms as we come to recognize it. To develop this framework, the only thing data governance teams have to do is own the knowledge of the organization's data. Data scientists and data engineering teams then play a big role in implementing better data governance technology.
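
One way to picture feeding organizational knowledge to the algorithms incrementally is a keyword glossary that grows as new abbreviations are recognized. The Python sketch below is illustrative only: the starter keywords, the scoring by keyword hits, and the names `DOMAIN_KEYWORDS`, `tokenize`, `infer_domain`, and `learn` are assumptions made for this example.

```python
import re

# Hypothetical starter glossary; each organization would supply its own terms.
DOMAIN_KEYWORDS = {
    "financial": {"ledger", "gl", "balance", "invoice"},
    "customer": {"name", "id", "address", "cust"},
    "policy": {"policy", "premium", "claim"},
}


def tokenize(element_name):
    """Split names like 'custAddress', 'general ledger', or 'GL_BAL' into lowercase tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", element_name)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t]


def infer_domain(element_names, glossary=DOMAIN_KEYWORDS):
    """Return the domain with the most keyword hits, or None when nothing matches."""
    if not glossary:
        return None
    scores = {domain: 0 for domain in glossary}
    for name in element_names:
        tokens = set(tokenize(name))
        for domain, keywords in glossary.items():
            scores[domain] += len(tokens & keywords)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def learn(glossary, domain, keyword):
    """Record a newly recognized abbreviation or term for a domain."""
    glossary.setdefault(domain, set()).add(keyword.lower())
```

A feed that matches no keywords returns `None`, which marks it for the governance team to classify. Calling `learn` with that classification means similar feeds can be recognized the next time they arrive.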

There are many efforts being undertaken at the industry level to build such products and offer services to enterprises or organizations.
