
How To Avoid “Schema Drift”

This article will explain the existing solutions and strategies to mitigate the challenge and avoid schema drift, including data versioning using lakeFS.

By Yaniv Ben Hemo · Feb. 03, 23 · Tutorial


We are all familiar with drift in app configuration and IaC. We start with a specific configuration backed by IaC files. Soon after, we face a “drift”: a gap between what is actually configured in our infrastructure and what is in the files. The same behavior happens with data. The schema starts in a specific shape. As data ingestion grows and scales to different sources, we get schema drift: a messy, unstructured database and an analytical layer that keeps failing due to a bad schema. In this article, we will learn how to deal with this scenario and how to work with dynamic schemas.

Schemas Are a Major Struggle

A schema defines the structure of a data format. Keys, values, formats, types: a combination of all of these results in a defined structure, or simply, a schema.

Developers and data engineers: have you ever had to recreate a NoSQL collection, or recreate an object layout on a bucket, because different documents arrived with different keys or structures? You probably have.

An unaligned record structure across your data ingestion will crash your visualization, analysis jobs, and backend, and fixing it is a never-ending chase.

Schema Drift

Schema drift is what happens when your sources frequently change their metadata: fields, keys, columns, and types can be added, removed, or altered on the fly. Without handling schema drift, your data flow becomes vulnerable to upstream data source changes. Typical ETL patterns fail when incoming columns and fields change, because they tend to be tightly coupled to those sources. Streaming requires a different toolset.
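To make the failure mode concrete, here is a minimal hand-rolled sketch (plain JavaScript, not any particular tool’s API; the field names are invented for illustration) that diffs an incoming record’s keys against a reference schema to surface drift before it breaks downstream jobs:

```javascript
// Reference schema: the fields downstream jobs were built against (illustrative).
const referenceFields = ["id", "amount", "currency"];

// Report which fields an incoming record added or dropped relative to the reference.
function detectDrift(record) {
  const incoming = Object.keys(record);
  return {
    added: incoming.filter((key) => !referenceFields.includes(key)),
    missing: referenceFields.filter((key) => !incoming.includes(key)),
  };
}

// A source silently renamed "currency" to "curr" and started sending "channel":
const report = detectDrift({ id: "1", amount: 9.99, curr: "USD", channel: "web" });
console.log(report); // added: ["curr", "channel"], missing: ["currency"]
```

Real schema-management tools go much further (types, nesting, evolution rules), but this is the essential check they automate.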


Comparing Confluent Schema Registry and Memphis Schemaverse

Confluent—Schema Registry

Confluent Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your Avro, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows the evolution of schemas according to the configured compatibility settings and expanded support for these schema types. It provides serializers that plug into Apache Kafka clients. They handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats.

The Good

  • Supports Avro, JSON Schema, and Protobuf
  • Enhanced security
  • Schema enforcement
  • Schema evolution

The Bad

  • Hard to configure
  • No external backup
  • Manual serialization
  • Can be bypassed
  • Requires maintenance and monitoring
  • Mainly supports Java
  • No validation

(Diagrams omitted: Schema Registry and schema architecture. Source: Confluent documentation.)

Memphis—Schemaverse

Schemaverse provides a robust schema store and schema management layer on top of the Memphis broker without standalone compute or dedicated resources. With a unique and programmatic approach, technical and non-technical users can create and define different schemas, attach a schema to multiple stations, and choose whether the schema should be enforced.
Memphis’ low-code approach removes the serialization step, as it is embedded within the producer library. Schemaverse supports versioning, GitOps methodologies, and schema evolution.

The Good

  • Great programmatic approach
  • Embedded within the broker
  • Zero-trust enforcement
  • Versioning
  • Out-of-the-box monitoring
  • GitOps: working with files
  • Low/no-code validation and serialization
  • No configuration
  • Native support in Python, Go, and Node.js

The Bad

  • Not all formats are supported yet: currently Protobuf and JSON only, with GraphQL and Avro next.


Avoiding Schema Drift Using Confluent Schema Registry

Avoiding schema drift means that you want to enforce a particular schema on a topic or station and validate each produced message.

This tutorial assumes that you are using Confluent Cloud with the Schema Registry configured. If not, visit Confluent’s official website to learn how to set it up.

To make sure you are not drifting and to maintain a single standard schema structure in your topic, you will need to:

1. Copy the ccloud-stack config file to $HOME/.confluent/java.config.

2. Create a topic.

3. Define a schema. For example:

{
 "namespace": "io.confluent.examples.clients.basicavro",
 "type": "record",
 "name": "Payment",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "amount", "type": "double"}
 ]
}


4. Enable schema validation over the newly created topic.

5. Configure Avro/Protobuf both in the app and with the registry.

Example producer code (from a Java Maven project):

 
...
import io.confluent.kafka.serializers.KafkaAvroSerializer;
...
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
...
KafkaProducer<String, Payment> producer = new KafkaProducer<String, Payment>(props);
final Payment payment = new Payment(orderId, 1000.00d);
final ProducerRecord<String, Payment> record = new ProducerRecord<String, Payment>(TOPIC, payment.getId().toString(), payment);
producer.send(record);


Because the pom.xml includes the avro-maven-plugin, the Payment class is automatically generated during compilation.
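For reference, the plugin declaration in pom.xml typically looks something like the following (the version and directory paths here are assumptions; check the Confluent example project for the exact values it uses):

```xml
<!-- Generates Java classes (e.g. Payment) from Avro schemas at build time. -->
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.1</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources/avro</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```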

If a producer tries to produce a message that is not structured according to the defined schema, the message will not be ingested.
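To illustrate what that broker-side check enforces, here is a small sketch in plain JavaScript (illustrative only; Confluent’s Avro serializers perform the real validation against the registry) that tests a record against the Payment schema’s declared fields and types:

```javascript
// The Payment schema from above, reduced to its field declarations.
const paymentSchema = {
  fields: [
    { name: "id", type: "string" },
    { name: "amount", type: "double" },
  ],
};

// Map Avro primitive types to JavaScript typeof results (illustrative subset).
const avroToJsType = { string: "string", double: "number" };

function conformsTo(schema, record) {
  // Every declared field must be present with the expected type,
  // and the record must not carry undeclared fields.
  const declared = new Set(schema.fields.map((f) => f.name));
  for (const field of schema.fields) {
    if (typeof record[field.name] !== avroToJsType[field.type]) return false;
  }
  return Object.keys(record).every((key) => declared.has(key));
}

console.log(conformsTo(paymentSchema, { id: "order-1", amount: 1000.0 })); // true
console.log(conformsTo(paymentSchema, { id: "order-1", amount: "1000" })); // false: drifted type
```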

Avoiding Schema Drift Using Memphis Schemaverse

1. Create a new schema (currently only available through Memphis GUI).


2. Attach Schema: Head to your station, and on the top-left corner, click on “+ Attach schema.”

3. Code example in Node.js.

Memphis abstracts the need for external serialization functions and embeds them within the SDK.

Producer (Protobuf example):

 
const memphis = require("memphis-dev");
var protobuf = require("protobufjs");

(async function () {
    try {
        await memphis.connect({
            host: "localhost",
            username: "root",
            connectionToken: "*****"
        });
        const producer = await memphis.producer({
            stationName: "marketing-partners.prod",
            producerName: "prod.1"
        });
        var payload = {
            fname: "AwesomeString",
            lname: "AwesomeString",
            id: 54,
        };
        try {
            await producer.produce({
                message: payload
            });
        } catch (ex) {
            console.log(ex.message)
        }
    } catch (ex) {
        console.log(ex);
        memphis.close();
    }
})();


Consumer (Requires .proto file to decode messages):

const memphis = require("memphis-dev");
var protobuf = require("protobufjs");

(async function () {
    try {
        await memphis.connect({
            host: "localhost",
            username: "root",
            connectionToken: "*****"
        });

        const consumer = await memphis.consumer({
            stationName: "marketing",
            consumerName: "cons1",
            consumerGroup: "cg_cons1",
            maxMsgDeliveries: 3,
            maxAckTimeMs: 2000,
            genUniqueSuffix: true
        });

        const root = await protobuf.load("schema.proto");
        var TestMessage = root.lookupType("Test");

        consumer.on("message", message => {
            const x = message.getData()
            var msg = TestMessage.decode(x);
            console.log(msg)
            message.ack();
        });
        consumer.on("error", error => {
            console.log(error);
        });
    } catch (ex) {
        console.log(ex);
        memphis.close();
    }
})();



Schema Enforcement With Data Versioning

Now that we know the many ways to enforce a schema using Confluent or Memphis, let us look at how a versioning tool like lakeFS, operating over object stores, can seal the deal for you.

What Is lakeFS?

lakeFS is a data versioning engine that allows you to manage data like code. Through Git-like branching, committing, merging, and reverting operations, managing data, and in turn the schema, over the entire data life cycle becomes simpler.

How To Achieve Schema Enforcement With lakeFS

By leveraging the lakeFS branching feature and webhooks, you can implement a robust schema enforcement mechanism on your data lake.



lakeFS hooks allow you to automate and ensure that a given set of checks and validations runs before important life-cycle events. They are conceptually similar to Git hooks, but unlike Git hooks, lakeFS hooks trigger a remote server that runs the tests, so their execution is guaranteed.

You can configure lakeFS hooks to check for a specific table schema while merging data from development or test branches to production. That is, a pre-merge hook can be configured on the production branch for schema validation. If the hook fails, lakeFS blocks the merge operation.

This extremely powerful guarantee can help implement schema enforcement and automate the rules and practices that all data sources and producers should adhere to.

Implementing Schema Enforcement Using lakeFS

  • Start by creating a lakeFS data repository on top of your object store (say, AWS S3 bucket for example).
  • All the production data (single source of truth) can live on the “main” or “production” branch in this repository.
  • You can then create a “dev” or a “staging” branch to persist the incoming data from the data sources.
  • Configure a lakeFS webhooks server. For example, a simple Python Flask app that can serve HTTP requests can be a webhooks server. Refer to the sample Flask app that the lakeFS team has put together to get started.
  • Once you have the webhooks server running, enable webhooks on a specific branch by adding actions.yaml file under the _lakefs_actions directory.

A sample actions.yaml file for schema enforcement on the master branch registers a pre-merge hook on that branch that calls your webhooks server.
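Based on the lakeFS hooks documentation, a minimal actions.yaml for this setup might look like the following sketch (the hook id and webhook URL are placeholders):

```yaml
name: Schema enforcement on master
on:
  pre-merge:
    branches:
      - master
hooks:
  - id: schema_validator
    type: webhook
    description: Validate table schema before merging to master
    properties:
      url: "http://<your-webhooks-server>/webhooks/schema"
```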

  • Suppose the schema of the incoming data differs from that of the production data. The configured lakeFS hooks will be triggered on the pre-merge condition and will fail the merge operation to the master branch.

This way, you can use lakeFS hooks to enforce the schema, validate your data to avoid PII column leakage into your environment, and add data quality checks as required.
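To make the webhook’s job concrete, here is a small illustrative sketch in plain JavaScript (standing in for the Python Flask sample; the schema and column names are invented) of the comparison a schema-validation hook might run before allowing a merge:

```javascript
// Production ("master" branch) schema: column name -> type (illustrative).
const productionSchema = { id: "string", amount: "double", currency: "string" };

// Compare an incoming table schema to production and collect every mismatch.
function validateSchema(incomingSchema) {
  const errors = [];
  for (const [name, type] of Object.entries(productionSchema)) {
    if (incomingSchema[name] !== type) {
      errors.push(`column "${name}": expected ${type}, got ${incomingSchema[name]}`);
    }
  }
  for (const name of Object.keys(incomingSchema)) {
    if (!(name in productionSchema)) {
      errors.push(`unexpected column "${name}"`);
    }
  }
  return { ok: errors.length === 0, errors };
}

// Incoming branch drifted: "amount" became a string and "currency" disappeared.
console.log(validateSchema({ id: "string", amount: "string" }));
// ok: false, with two errors
```

In a real webhooks server, a non-empty errors list would translate into a non-2xx HTTP response, which is what causes lakeFS to block the merge.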

Conclusion

In this article, you have learned about existing solutions and strategies to help avoid schema drift, including data versioning using lakeFS. At this point, I hope you have a better understanding of the good and the bad of Confluent Schema Registry and Memphis Schemaverse, and of how to avoid schema drift. If you enjoyed this article, please leave a like and a comment!


Published at DZone with permission of Yaniv Ben Hemo. See the original article here.

Opinions expressed by DZone contributors are their own.
