
Self-Service Data Presentation: Data Quality, Lineage, and Cataloging

The data catalog is the foundation of the self-service capability for a business-facing data presentation and transformation layer.

by Adam Diaz · Nov. 04, 16 · Opinion

Once an organization has mastered automated data ingest and the appropriate application of metadata, a number of additional concerns arise when using data at scale. These include data quality, data lineage, and a searchable data catalog. Together, these determine whether the catalog presented to users is effective and useful. The data catalog is the foundation of the self-service capability for a business-facing data presentation and transformation layer.

Data Quality

It is best to apply data quality rules in an automated fashion. Simple rules can be defined, like operators in a programming language, corresponding to atomic operations such as greater than or less than. Those simple operators can be organized hierarchically into collections of operations to establish basic rules such as Social Security number validation. Those rules can in turn be grouped into “rule sets”. For example, a rule set might contain all the data quality operations needed for a specific data feed, such as business transactions that require application validation and pre-processing (loan processing, or credit line increase requests). All the fields in that set of data can then be analyzed automatically, creating clean records for downstream analytics. This type of data quality pre-processing is exactly what we as consumers of data should be demanding of our data ingest process.
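The layering described above (atomic operators, rules built from them, and rule sets per feed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every name here (`greater_than`, `Rule`, `RuleSet`, the `loan_application_feed` fields) is hypothetical and does not come from any particular product, and the Social Security number check validates format only.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# An atomic check takes one field value and returns True when it passes.
Check = Callable[[Any], bool]


def greater_than(limit: float) -> Check:
    return lambda v: isinstance(v, (int, float)) and v > limit


def less_than(limit: float) -> Check:
    return lambda v: isinstance(v, (int, float)) and v < limit


def matches(pattern: str) -> Check:
    compiled = re.compile(pattern)
    return lambda v: isinstance(v, str) and compiled.fullmatch(v) is not None


def all_of(*checks: Check) -> Check:
    # Combine atomic operators into one composite check.
    return lambda v: all(check(v) for check in checks)


@dataclass
class Rule:
    name: str
    field: str
    check: Check


@dataclass
class RuleSet:
    name: str
    rules: List[Rule]

    def failures(self, record: Dict[str, Any]) -> List[str]:
        # Names of every rule the record violates (missing fields fail).
        return [r.name for r in self.rules if not r.check(record.get(r.field))]

    def split(self, records: List[Dict[str, Any]]) -> Tuple[list, list]:
        # Separate clean records from rejected ones, keeping the reasons.
        clean, rejected = [], []
        for record in records:
            failed = self.failures(record)
            if failed:
                rejected.append((record, failed))
            else:
                clean.append(record)
        return clean, rejected


# Hypothetical rule set for a loan application feed.
loan_rules = RuleSet("loan_application_feed", [
    Rule("ssn_format", "ssn", matches(r"\d{3}-\d{2}-\d{4}")),  # format only
    Rule("amount_in_range", "amount",
         all_of(greater_than(0), less_than(1_000_000))),
])

records = [
    {"ssn": "123-45-6789", "amount": 25000},
    {"ssn": "123456789", "amount": 25000},
    {"ssn": "987-65-4321", "amount": -5},
]
clean, rejected = loan_rules.split(records)
```

Here only the first record lands in `clean`; the other two are returned in `rejected` along with the names of the rules they failed, which is the kind of information a downstream consumer would want preserved.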

Data Lineage

Knowing where data came from and how it was transformed, not only by data quality rules but by all of the transformations required by the business rules of the use case at hand, is invaluable. Being able to retrace the steps in data ingest and processing is critical to many data users because of regulatory requirements. Data in both banking and pharmaceutical research, for example, must be traceable under the regulations of the respective industries. At a lower, data engineering level, being able to track data lineage is also a very valuable debugging tool. This covers both business process debugging (i.e., is our business process correct?) and data science debugging (i.e., are advanced algorithms functioning correctly?). With Bedrock, all of these concerns can be addressed by maintaining proper data lineage for datasets over time.
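To make "retracing the steps" concrete, here is a minimal Python sketch of a lineage log. It is not Bedrock's API or data model; `LineageLog`, `LineageStep`, and the dataset names are hypothetical, and the sketch assumes each dataset is produced by exactly one recorded step.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageStep:
    output: str        # dataset produced by this step
    inputs: List[str]  # datasets it was derived from
    operation: str     # description of the transformation applied


@dataclass
class LineageLog:
    steps: Dict[str, LineageStep] = field(default_factory=dict)

    def record(self, output: str, inputs: List[str], operation: str) -> None:
        # Assumption: one producing step per dataset; a re-record overwrites.
        self.steps[output] = LineageStep(output, list(inputs), operation)

    def trace(self, dataset: str) -> List[str]:
        """Return the steps that led to `dataset`, earliest first."""
        history: List[str] = []
        seen = set()

        def visit(name: str) -> None:
            if name in seen:
                return
            seen.add(name)
            step = self.steps.get(name)
            if step is None:  # a raw source with no recorded parent
                return
            for parent in step.inputs:
                visit(parent)
            history.append(
                f"{step.operation}: {', '.join(step.inputs)} -> {step.output}"
            )

        visit(dataset)
        return history


log = LineageLog()
log.record("loans_clean", ["loans_raw"], "apply rule set")
log.record("loans_scored", ["loans_clean", "credit_bureau"], "join and score")
trace = log.trace("loans_scored")
```

Walking the log backwards from `loans_scored` yields the rule-set step followed by the join, which is the trail an auditor or a debugging engineer would follow.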

Data Catalog

A functioning data catalog seems like such a simple concept in this day and age. The fact is that many organizations still face the challenge of simply finding data. Having data is not the problem. Everyone has data now. What many are missing is an effective method for finding datasets in what can be a sea of ingested data. Part of the issue is simple organization: having an ingestion process that cleans, tags, and lands data in a known location while partitioning it along the way. Even then, finding datasets from which to start developing new analytics can be challenging without an effective way to search them rapidly. In this case, search doesn't mean SQL queries on a dataset but rather a search of metadata such that the right dataset can be found. Some groups rely on a single person or group of people to be the human curator of datasets for an organization. Some groups rely on a system of codes embedded in directory structures and file names in an attempt to make data discoverable. Having an easily searchable data catalog is the main key to eventually maturing an organizational data culture toward self-service operations.
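The distinction between querying a dataset and searching its metadata can be illustrated with a small in-memory catalog in Python. This is a sketch only: `CatalogEntry`, `DataCatalog`, the paths, and the tags are all hypothetical, and a real catalog would use a proper search index rather than substring matching.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogEntry:
    name: str
    location: str      # where the ingest process landed the data
    description: str
    tags: Set[str] = field(default_factory=set)


class DataCatalog:
    def __init__(self) -> None:
        self._entries: List[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, query: str) -> List[CatalogEntry]:
        # Match entries whose name, description, or tags contain every term.
        terms = query.lower().split()
        results = []
        for entry in self._entries:
            haystack = " ".join(
                [entry.name, entry.description, *sorted(entry.tags)]
            ).lower()
            if all(term in haystack for term in terms):
                results.append(entry)
        return results


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="loans_clean",
    location="/data/clean/loans/",  # hypothetical path
    description="Validated loan applications, partitioned by date",
    tags={"loans", "pii", "validated"},
))
catalog.register(CatalogEntry(
    name="iot_sensor_raw",
    location="/data/raw/iot/",      # hypothetical path
    description="Sensor readings as received",
    tags={"iot", "raw"},
))

hits = catalog.search("loan validated")
```

The search touches only metadata, never the records themselves, so an analyst can find `loans_clean` and its location without knowing the directory layout or asking a human curator.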

Self-Service Data Presentation and Transformation

In terms of operational maturity, the goal of many organizations is to provide IT services in a self-service mode. That is, make the presentation of that service “easy” while still offering enough technical depth to be useful to the end user. A properly groomed data catalog allows independent metadata investigation by analysts and data scientists who are experts in advanced analytics. They are also more closely aligned with the business process requirements than, perhaps, the data engineer, who typically has more of an IT process focus. The analytical consumer of the data often has insight into improvements to the data transformation process that, in the long term, should rightfully be pushed back into the automated data transformation that happens during data ingestion, ahead of the self-service layer. A self-service portal like Mica enables such a feedback loop in that it provides access to the right users, who can iterate over the data to discover additional transforms that become permanent process improvements. In the other direction, a self-service portal allows more rapid access to clean data for the discovery of new analytics and the improvement of existing analytics by a variety of business data consumers.

Automation of data processing is the perfect use of a computer's time. Manual intervention is not only unnecessary now; it's actually a critical business problem. As data begins to move faster and technology trends like the Internet of Things deliver data to us faster, automation will be the key. Putting automated policies in place to capture, clean, and tag data, then feeding that data into a system in which datasets are discoverable for analytic application and analytic creation, will be important to all organizations. Regardless of data format, data retention time, or delivery method, automated self-enabling data technologies will provide the operational flexibility to empower tomorrow's advanced analytics.

Published at DZone with permission of Adam Diaz, DZone MVB.
