Schemas, Contracts, and Compatibility (Part 2)
A discussion of the ways a schema registry helps build resilient data pipelines by managing schemas and enforcing compatibility guarantees.
Welcome back! If you missed Part 1, you can check it out here.
Schema Evolution and Compatibility
The key difference between schemas and APIs is that events, together with their schemas, are stored for a long duration. Once you finish upgrading all the applications that call the API and move them from API v1 to v2, you can safely assume that v1 is gone. Admittedly, this can take some time, but more often than not it is measured in weeks, not years. This is not the case with events, where old versions of the schema could be stored forever.
This means that when you start modifying schemas, you need to take into account questions like:
- Who do we upgrade first: consumers or producers?
- Can new consumers handle the old events that are still stored in Kafka?
- Do we need to wait before we upgrade consumers?
- Can old consumers handle events written by new producers?
Enter Schema Registry for Apache Kafka
Avro requires that readers have access to the original writer's schema in order to deserialize the binary data. While this is not a problem for data files, where you can store the schema once for the entire file, providing the schema with every event in Kafka would be particularly inefficient in terms of network and storage overhead.
As an architecture pattern, a schema registry is quite simple and has only two components:
- REST service for storing and retrieving schemas.
- Java library for fetching and caching schemas.
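To make the two components concrete, here is a minimal sketch of the Java library side using Confluent's CachedSchemaRegistryClient. The registry URL, subject name, and schema below are placeholders, and the exact method signatures vary between client versions, so treat this as illustrative rather than definitive:

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import org.apache.avro.Schema;

public class RegistryClientExample {
    public static void main(String[] args) throws Exception {
        // Client that talks to the REST service and caches up to 100 schemas.
        CachedSchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // An illustrative Avro schema for a "Payment" event.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":"
              + "[{\"name\":\"amount\",\"type\":\"double\"}]}");

        // Register the schema under a subject and get back its unique ID.
        int id = client.register("payments-value", schema);

        // Fetch (and cache) the schema by ID, as a consumer would.
        Schema fetched = client.getById(id);
        System.out.println("Registered schema " + id + ": " + fetched);
    }
}
```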
Although the concept of a schema registry for Apache Kafka was first introduced at LinkedIn, since then quite a few companies have created their own versions and talked about them in public: Confluent, Yelp, Airbnb, Uber, Monsanto, Ancestry.com, and probably a few more. Confluent's Schema Registry is available through Confluent Platform under the Confluent Community License.
As the number of public use cases has grown, so has the list of ways in which the Confluent Schema Registry is used at different companies, from basic schema storage and serving to sophisticated metadata discovery interfaces. Let's take an in-depth look at how Confluent Schema Registry is used to optimize data pipelines and guarantee compatibility.
Enabling Efficiently Structured Events
This is the most basic use case and the one LinkedIn engineers had in mind when originally developing the schema registry. Storing the schema with every event would waste lots of memory, network bandwidth and disk space. Rather, we wanted to store the schema in the schema registry, give it a unique ID and store the ID with every event instead of the entire schema. These are the roots from which Confluent Schema Registry evolved.
Doing so requires integrating the schema registry with Kafka producers and consumers, which store and retrieve schemas through Avro serializers and deserializers.
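As a sketch, configuring a producer to use Confluent's Avro serializer looks roughly like the following. The topic name, schema, and registry URL are illustrative:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Tells the serializer where to register and look up schemas.
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":"
              + "[{\"name\":\"amount\",\"type\":\"double\"}]}");
        GenericRecord payment = new GenericData.Record(schema);
        payment.put("amount", 42.0);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers (or looks up) the schema and embeds
            // only its ID in the message, not the schema itself.
            producer.send(new ProducerRecord<>("payments", "order-1", payment));
        }
    }
}
```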
When a producer produces an Avro event, we take the schema of the event and look for it in the schema registry. If the schema is new, we register it in the schema registry and generate an ID. If the schema already exists, we take its ID from the registry. Either way, every time we produce an event with that schema, we store the ID together with the event. Note that the schema and the ID are cached, so as long as we keep producing events with the same schema, we don't need to talk to the schema registry again.
When a consumer encounters an event with a schema ID, it looks the schema up in the schema registry using the ID, then caches the schema, and uses it to deserialize every event it consumes with this ID.
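The consumer side mirrors this, using the Avro deserializer. Again, the topic, group ID, and registry URL below are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payments-reader");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            ConsumerRecords<String, GenericRecord> records =
                    consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                // The deserializer reads the schema ID from the event, fetches
                // the schema from the registry (once, then cached), and decodes.
                System.out.println(record.value().get("amount"));
            }
        }
    }
}
```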
As you can see, this is a simple yet nice and efficient way to avoid having to attach the schema to every event, particularly since in many cases the schema is much larger than the event itself.
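For the curious, Confluent's Avro serializer frames each event with a single magic byte followed by the four-byte schema ID, then the Avro-encoded payload. A rough sketch of pulling the ID back out of a raw event value:

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // Extract the schema ID from a raw serialized event, following the
    // framing used by Confluent's Avro serializer: a magic byte (0),
    // a 4-byte big-endian schema ID, then the Avro binary payload.
    public static int schemaId(byte[] rawValue) {
        ByteBuffer buf = ByteBuffer.wrap(rawValue);
        byte magic = buf.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Unknown magic byte: " + magic);
        }
        return buf.getInt();
    }
}
```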
Validating the Compatibility of Schemas
This is the most critical use case of the Confluent Schema Registry. If you recall, we opened this blog post with a discussion of the importance of maintaining compatibility of events used by microservices, even as requirements and applications evolve. In order to maintain data quality and avoid accidentally breaking consumer applications in production, we need to prevent producers from producing incompatible events.
The context for compatibility is what we call a subject, which is a set of mutually compatible schemas (i.e., different versions of the base schema). Each subject belongs to a topic, but a topic can have multiple subjects. Normally, there will be one subject for message keys and another subject for message values. There may also be a special case where you want to store multiple event types in the same topic, in which case each event type will have its own subject.
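With Confluent's serializers, the default strategy derives subjects from the topic name, and the multiple-event-types case is handled by switching to a strategy based on the record's schema name. A minimal sketch (the property value is Confluent's strategy class; the rest of the producer config is omitted):

```java
import java.util.Properties;

public class SubjectStrategyConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        // Default behavior: subjects are derived from the topic name, e.g.
        // the "payments" topic uses subjects "payments-key" and "payments-value".
        // To store multiple event types in one topic, derive the subject from
        // the record's fully qualified schema name instead:
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return props;
    }
}
```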
In the previous section, we explained how producers look up the schemas they are about to produce in the schema registry, and if the schema is new, they register a new schema and create a new ID. When the schema registry is configured to validate compatibility, it will always validate a schema before registering it.
Validation is done by looking up previous schemas registered with the same subject and using Avro's compatibility rules to check whether the new schema and the old schemas are compatible. Avro supports different types of compatibility, such as forward compatibility and backward compatibility, and data architects can specify the compatibility rules that will be used when validating schemas for each subject.
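The registry performs these checks server-side, but Avro's Java library exposes the same underlying rules, which makes for a handy local illustration. The sketch below checks that adding a field with a default value is a backward-compatible change (the Payment schema is made up for the example):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilityCheck {
    public static void main(String[] args) {
        // Old schema: a Payment with only an amount.
        Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":"
              + "[{\"name\":\"amount\",\"type\":\"double\"}]}");

        // New schema: adds a field WITH a default, so a reader using the new
        // schema can still decode old events. That is backward compatibility.
        Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":"
              + "[{\"name\":\"amount\",\"type\":\"double\"},"
              + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"USD\"}]}");

        // Backward check: can a v2 reader read data written with v1?
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType()); // COMPATIBLE
    }
}
```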
If the new schema is compatible, it will be registered successfully, and the producer will receive an ID to use. If it is incompatible, a serialization exception will be thrown, the schema will not be registered, and the message will not be sent to Kafka.
The developers responsible for the application producing the events will need to resolve the schema compatibility issue before they can produce events to Kafka. The consuming applications will never see incompatible events.
Compatibility Validation in the Development Process
Stopping incompatible events before they are written to Kafka and start breaking consumer applications is a great step, but ideally we'd catch compatibility mistakes much sooner. Some organizations only deploy new code to production every few months, and no one wants to wait that long to discover what could be a trivial mistake.
Ideally, a developer is able to run schema validation on their own development machine in the same way they run unit tests. They'd also be able to run schema validation in their CI/CD pipeline as part of a pre-commit hook, post-merge hook, or nightly build process.
Confluent Schema Registry enables all this via the Schema Registry Maven Plugin, which can take a new schema, compare it to the existing schema in the registry and report whether or not they are compatible. This can be done without registering a new schema or modifying the schema registry in any way, which means that the validation can safely run against the production schema registry.
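A rough sketch of what the plugin configuration might look like, assuming a subject named transactions-value and a local schema file (the version number and paths are placeholders; check the plugin documentation for the authoritative options):

```xml
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.3.0</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://localhost:8081</param>
    </schemaRegistryUrls>
    <subjects>
      <!-- subject name mapped to the local schema file under test -->
      <transactions-value>src/main/avro/transaction.avsc</transactions-value>
    </subjects>
  </configuration>
</plugin>
```

Running `mvn schema-registry:test-compatibility` then reports whether the local schema is compatible with the latest version registered under that subject, without changing anything in the registry.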
So far, we have assumed that new schemas will be registered in production by producers when they attempt to produce events with a new schema. When done this way, the registration process is completely automated and transparent to developers.
In some environments, however, access to the schema registry is controlled, and special privileges are required in order to register new schemas. We can't assume that every application will run with sufficient privileges to register new schemas. In these cases, the automated process of deploying changes into production will register schemas using either the Maven plugin or the schema registry REST APIs.
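For the REST route, registration is a single POST to the subject's versions endpoint. A minimal sketch using Java's built-in HTTP client (the URL, subject, and schema are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSchemaViaRest {
    public static void main(String[] args) throws Exception {
        // The registry expects the Avro schema as an escaped JSON string
        // under the "schema" key, and responds with {"id": <schema id>}.
        String body = "{\"schema\": \"{\\\"type\\\": \\\"string\\\"}\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/payments-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"id":1}
    }
}
```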
Because the rules of forward and backward compatibility are not always intuitive, it can be a good idea to have a test environment with the applications that produce and consume events. This way, the effects of registering a new schema and upgrading those applications can be tested empirically.
We started out by exploring the similarities between schemas and APIs, and the importance of being able to modify schemas without the risk of breaking consumer applications. We went into the details of what compatibility really means for schemas and events (and why it's so critical!), and then discussed multiple ways a schema registry helps build resilient data pipelines by managing schemas and enforcing compatibility guarantees.
You can explore Schema Registry in more depth by signing up for Confluent Cloud, Apache Kafka as a fully managed event streaming service. Confluent Cloud includes not just a fully managed schema registry, but also a web-based UI for creating schemas, exploring, editing, and comparing different versions of schemas.