Java Object Serialization, Schema Compositionality, and Apache Avro
Apache Avro is rapidly becoming a popular choice for Java object serialization in Apache Kafka-based solutions.
Apache Avro, a serialization framework that originated from Hadoop, is rapidly becoming a popular choice for general Java object serialization in Apache Kafka-based solutions, due to its compact binary payloads and stringent schema support.
In its simplest form, however, it lacks an important feature of a good schema formalism: the ability to decompose a schema into smaller, reusable schema components. This can be accomplished, but it requires some additional work or an alternative schema syntax.
You may also like: Kafka, Avro Serialization, and the Schema Registry
Serialization & Schema Formalisms
Data serialization plays a central role in any distributed computing system, be it message-oriented or RPC-based. Ideally, the involved parties should be able to exchange data in a way that is both efficient and robust, and which can evolve over time. I've seen many data serialization techniques come and go during the last 20 years, shifting with the current technical trends: fixed-position binary formats, tag-based binary formats, separator-based formats, XML, JSON, etc. Early frameworks were usually backed by supporting tools (yes, I'm old enough to remember when data dictionaries were state of the art ...), whereas more recent serialization frameworks usually provide a formal schema language to enforce data correctness and enable contracts to evolve in a controlled way. The schema formalism usually also provides a data binding mechanism to allow for easy usage in various programming languages.
In order to support non-trivial domain/information models, the schema language should provide support for composition, where a complex schema may be composed of smaller, reusable schemas. This is usually achieved by some kind of include mechanism in the schema formalism, and optionally additional build-time configuration for any code generation data binding support.
Apache Kafka and Serialization
Event-driven architectures are becoming increasingly popular, partly due to the challenges with tightly coupled microservices. When streaming events at scale, a highly scalable messaging backbone is a critical enabler. Apache Kafka is widely used, due to its distributed nature and hence extreme scalability. In order for Kafka to really deliver, individual messages need to be fairly small (see, e.g., the Kafka benchmark). Hence, verbose data serialization formats like XML or JSON might not be appropriate for event sourcing.

While there are several serialization protocols offering compact binary payloads (among them, Google Protobuf stands out as a modern and elegant framework), Apache Avro is frequently used together with Kafka. While not necessarily the most elegant serialization framework, Avro has one advantage here: the Confluent Kafka packaging provides a Schema Registry, which allows message schemas and schema versions to be managed in a structured way, and the Schema Registry is based on Avro schemas.
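To make that concrete, here is a minimal sketch of how a producer is typically wired up against the Schema Registry using Confluent's KafkaAvroSerializer. The broker address, registry URL, topic name, and wrapper class name below are assumptions made purely for illustration:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerSketch {

    public static void main(String[] args) {
        // Minimal producer configuration using Confluent's Avro serializer and the Schema Registry.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

        // The value would typically be an instance of an Avro-generated record class;
        // Object keeps this sketch independent of the schemas defined later in the article.
        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-car-relations", "some-key", avroRecord()));
        }
    }

    // Placeholder for an Avro-generated record instance (hypothetical helper).
    private static Object avroRecord() {
        return null; // in a real application, return e.g. a generated record instance
    }
}

The serializer registers the record's schema with the registry (if not already present) and prepends the schema id to each binary payload, which is how the registry can later enforce compatibility rules between schema versions.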
Surprisingly, while the formal support for managing schema versioning (and automatically detecting schema changes that are not backwards compatible) is really powerful, vanilla Avro lacks a decent include mechanism to enable compositional schemas that adhere to the DRY principle. The standard JSON-based syntax for Avro schemas allows a composite type to refer to other fully-qualified types, but the composition is not enforced by the schema itself. Consider the following schema definitions, where the composite UserCarRelation is composed from the simpler User and Car schemas:
{"namespace": "se.callista.blog.avro.user",
"type": "record",
"name": "user",
"fields": [
{"name": "userid", "type": "string"},
]
}
{"namespace": "se.callista.blog.avro.car",
"type": "record",
"name": "car",
"fields": [
{"name": "vehicleidentificationnumber", "type": "string"},
]
}
{"namespace": "se.callista.blog.avro.usercarrelation",
"type": "record",
"name": "usercarrelation",
"fields": [
{"name": "user", "type": "se.callista.blog.avro.user.user"},
{"name": "car", "type": "se.callista.blog.avro.car.car"},
]
}
In order for the Avro compiler to interpret and properly generate code for the UserCarRelation schema, it needs to be aware of the included schemas (in the correct order). The Avro Maven plugin provides explicit support for this missing inclusion mechanism:
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources/avro/usercarrelation</sourceDirectory>
        <imports>
          <import>${project.basedir}/src/main/resources/avro/user/user.avsc</import>
          <import>${project.basedir}/src/main/resources/avro/car/car.avsc</import>
        </imports>
      </configuration>
    </execution>
  </executions>
</plugin>
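With this in place, code generation produces Java classes for the three records. As a quick illustration, here is a minimal sketch, assuming the plugin has generated User, Car, and UserCarRelation classes in the packages given by the namespaces above (the wrapper class name is made up for illustration), that builds the composite record and serializes it to Avro's compact binary format:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

import se.callista.blog.avro.car.Car;
import se.callista.blog.avro.user.User;
import se.callista.blog.avro.usercarrelation.UserCarRelation;

public class UserCarRelationSerializationSketch {

    public static byte[] serialize(UserCarRelation relation) throws IOException {
        // The SpecificDatumWriter uses the schema embedded in the generated class.
        DatumWriter<UserCarRelation> writer = new SpecificDatumWriter<>(UserCarRelation.class);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(relation, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build the composite record from its reusable parts.
        UserCarRelation relation = UserCarRelation.newBuilder()
                .setUser(User.newBuilder().setUserId("user-1").build())
                .setCar(Car.newBuilder().setVehicleIdentificationNumber("VIN-123").build())
                .build();

        byte[] payload = serialize(relation);
        System.out.println("Binary payload size: " + payload.length + " bytes");
    }
}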
As seen, this inclusion is only handled by the data binding toolchain and is not explicitly present in the schema itself. Hence, it won't work with the Kafka Schema Registry.
Avro IDL
In more recent versions of Avro, there is, however, an alternative syntax for describing schemas. Avro IDL is a custom DSL for describing data types and RPC operations. The top-level concept in an Avro IDL definition file is a protocol, a collection of operations and their associated data types. While the syntax at first glance seems to be geared toward RPC, the RPC operations can be omitted, and hence a protocol may be used to define data types only. Interestingly enough, Avro IDL does contain a standard include mechanism, where other IDL files, as well as JSON-defined Avro schemas, may be properly included. Avro IDL originated as an experimental feature in Avro but is now a supported alternative syntax.
Below is the same example as above, in Avro IDL:
@namespace("se.callista.blog.avro.user")
protocol userprotocol {
record user {
string userid;
}
}
@namespace("se.callista.blog.avro.car")
protocol carprotocol {
record car {
string vehicleidentificationnumber;
}
}
@namespace("se.callista.blog.avro.usercarrelation")
protocol usercarrelationprotocol {
import idl "../user/user.avdl";
import idl "../car/car.avdl";
record usercarrelation {
se.callista.blog.avro.user.user user;
se.callista.blog.avro.car.car car;
}
}
Now, the build system configuration can be correspondingly simplified:
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>idl-protocol</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources/avro/usercarrelation</sourceDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
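Regardless of which schema syntax is used, the generated record classes, and thus the binary encoding of the records, end up the same. For completeness, here is a minimal sketch of reading such a binary payload back, again assuming a generated UserCarRelation class (the wrapper class name is made up for illustration):

import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;

import se.callista.blog.avro.usercarrelation.UserCarRelation;

public class UserCarRelationDeserializationSketch {

    public static UserCarRelation deserialize(byte[] payload) throws IOException {
        // The SpecificDatumReader uses the schema embedded in the generated class.
        DatumReader<UserCarRelation> reader = new SpecificDatumReader<>(UserCarRelation.class);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}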
Conclusion
Compositionality is an important aspect of a well-designed information or message model, in order to highlight important structural relationships and to eliminate redundancy. If Apache Avro is used as your serialization framework, I believe Avro IDL should be the preferred way to express the schema contracts.
Further Reading
Kafka, Avro Serialization, and the Schema Registry
Using Avro for Big Data and Data Streaming Architectures: An Introduction