Java Object Serialization, Schema Compositionality, and Apache Avro

Apache Avro is rapidly becoming a popular choice for Java object serialization in Apache Kafka-based solutions.

Björn Beskow · Oct. 07, 19 · Presentation


Apache Avro, a serialization framework that originated from Hadoop, is rapidly becoming a popular choice for general Java object serialization in Apache Kafka-based solutions, due to its compact binary payloads and stringent schema support.

In its simplest form, however, it lacks an important feature of a good schema formalism: the ability to decompose a schema into smaller, reusable schema components. Composition can be accomplished, but it requires either some additional build-time work or an alternative schema syntax.

You may also like: Kafka, Avro Serialization, and the Schema Registry

Serialization & Schema Formalisms

Data serialization plays a central role in any distributed computing system, be it message-oriented or RPC-based. Ideally, the involved parties should be able to exchange data in a way that is both efficient and robust, and that can evolve over time. I've seen many data serialization techniques come and go during the last 20 years, shifting with the current technical trends: fixed-position binary formats, tag-based binary formats, separator-based formats, XML, JSON, etc. Early frameworks were usually backed by supporting tools (yes, I'm old enough to remember when data dictionaries were state of the art...), whereas more recent serialization frameworks usually provide a formal schema language to enforce data correctness and enable contracts to evolve in a controlled way. The schema formalism usually also provides a data binding mechanism to allow for easy usage in various programming languages.

In order to support non-trivial domain/information models, the schema language should provide support for composition, where a complex schema may be composed of smaller, reusable schemas. This is usually achieved by some kind of include mechanism in the schema formalism, optionally backed by additional build-time configuration for any code-generating data binding support.

Apache Kafka and Serialization

Event-driven architectures are becoming increasingly popular, partly due to the challenges with tightly coupled microservices. When streaming events at scale, a highly scalable messaging backbone is a critical enabler. Apache Kafka is widely used, due to its distributed nature and thus extreme scalability. In order for Kafka to really deliver, individual messages need to be fairly small (see, e.g., this Kafka benchmark). Hence, verbose data serialization formats like XML or JSON might not be appropriate for event sourcing.


While there are several serialization protocols offering compact binary payloads (among them, Google Protobuf stands out as a modern and elegant framework), Apache Avro is frequently used together with Kafka. While not necessarily the most elegant serialization framework, the Confluent Kafka packaging provides a schema registry, which allows a structured way to manage message schemas and schema versions, and the schema registry is based on Avro schemas.

Surprisingly, while the formal support for managing schema versioning (and automatically detecting schema changes that are not backwards compatible) is really powerful, vanilla Avro lacks a decent include mechanism to enable compositional schemas that adhere to the DRY principle. The standard JSON-based syntax for Avro schemas allows a composite type to refer to other fully-qualified types, but the composition is not enforced by the schema itself. Consider the following schema definitions, where the composite UserCarRelation is composed from the simpler User and Car schemas:

{"namespace": "se.callista.blog.avro.user",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "userId", "type": "string"}
 ]
}
{"namespace": "se.callista.blog.avro.car",
 "type": "record",
 "name": "Car",
 "fields": [
     {"name": "vehicleIdentificationNumber", "type": "string"}
 ]
}
{"namespace": "se.callista.blog.avro.usercarrelation",
 "type": "record",
 "name": "UserCarRelation",
 "fields": [
     {"name": "user", "type": "se.callista.blog.avro.user.User"},
     {"name": "car", "type": "se.callista.blog.avro.car.Car"}
 ]
}
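The ordering requirement is visible already at the API level: a single Schema.Parser instance remembers previously parsed named types, so the dependent schemas must be parsed before the composite one. A minimal sketch, assuming Avro is on the classpath (schema strings inlined from the definitions above):

```java
import org.apache.avro.Schema;

public class SchemaOrdering {
    public static void main(String[] args) {
        // One parser instance accumulates named types across parse() calls.
        Schema.Parser parser = new Schema.Parser();

        // Dependencies first -- parsing UserCarRelation before these fails
        // with an "undefined name" error.
        parser.parse(
            "{\"namespace\": \"se.callista.blog.avro.user\", \"type\": \"record\","
            + " \"name\": \"User\", \"fields\": [{\"name\": \"userId\", \"type\": \"string\"}]}");
        parser.parse(
            "{\"namespace\": \"se.callista.blog.avro.car\", \"type\": \"record\","
            + " \"name\": \"Car\", \"fields\": [{\"name\": \"vehicleIdentificationNumber\", \"type\": \"string\"}]}");

        // Now the composite schema can resolve its type references.
        Schema relation = parser.parse(
            "{\"namespace\": \"se.callista.blog.avro.usercarrelation\", \"type\": \"record\","
            + " \"name\": \"UserCarRelation\", \"fields\": ["
            + "{\"name\": \"user\", \"type\": \"se.callista.blog.avro.user.User\"},"
            + "{\"name\": \"car\", \"type\": \"se.callista.blog.avro.car.Car\"}]}");

        System.out.println(relation.getField("user").schema().getFullName());
        // prints "se.callista.blog.avro.user.User"
    }
}
```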


In order for the Avro compiler to interpret and properly generate code for the UserCarRelation schema, it needs to be aware of the included schemas (in the correct order). The Avro Maven plugin provides explicit support for this missing inclusion mechanism:

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
      <execution>
        <phase>generate-sources</phase>
        <goals>
          <goal>schema</goal>
        </goals>
        <configuration>
          <sourceDirectory>${project.basedir}/src/main/resources/avro/usercarrelation</sourceDirectory>
          <imports>
            <import>${project.basedir}/src/main/resources/avro/user/user.avsc</import>
            <import>${project.basedir}/src/main/resources/avro/car/car.avsc</import>
          </imports>
        </configuration>
      </execution>
    </executions>
  </plugin>


As seen above, this inclusion is handled only by the data binding toolchain and is not explicitly present in the schema itself. Hence, it won't work with the Kafka schema registry.
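For context, this is roughly how a producer is wired up against Confluent's schema registry. The serializer and registry property are Confluent's, while the host names are placeholders; actually creating a producer additionally requires kafka-clients and kafka-avro-serializer on the classpath:

```java
import java.util.Properties;

public class AvroProducerConfig {

    // Sketch: producer properties for Confluent's Avro serializer.
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");          // placeholder host
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The serializer registers and looks up Avro schemas here,
        // failing fast on incompatible schema changes.
        props.put("schema.registry.url", "http://schema-registry:8081");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps().getProperty("value.serializer"));
        // prints "io.confluent.kafka.serializers.KafkaAvroSerializer"
    }
}
```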

Avro IDL

In more recent versions of Avro, there is, however, an alternative syntax for describing schemas. Avro IDL is a custom DSL for describing data types and RPC operations. The top-level concept in an Avro IDL definition file is a protocol, a collection of operations and their associated data types. While the syntax at first glance seems to be geared toward RPC, the RPC operations can be omitted, and hence a protocol may be used to define data types only. Interestingly enough, Avro IDL does contain a standard include mechanism, where other IDL files, as well as JSON-defined Avro schemas, may be properly included. Avro IDL originated as an experimental feature in Avro but is now a supported alternative syntax.

Below is the same example as above, in Avro IDL:

@namespace("se.callista.blog.avro.user")
protocol UserProtocol {

  record User {
    string userId;
  }

}
@namespace("se.callista.blog.avro.car")
protocol CarProtocol {

  record Car {
    string vehicleIdentificationNumber;
  }

}
@namespace("se.callista.blog.avro.usercarrelation")
protocol UserCarRelationProtocol {

  import idl "../user/user.avdl";
  import idl "../car/car.avdl";

  record UserCarRelation {
    se.callista.blog.avro.user.User user;
    se.callista.blog.avro.car.Car car;
  }

}


Now, the build system configuration can be correspondingly simplified:

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
      <execution>
        <phase>generate-sources</phase>
        <goals>
          <goal>idl-protocol</goal>
        </goals>
        <configuration>
          <sourceDirectory>${project.basedir}/src/main/resources/avro/usercarrelation</sourceDirectory>
        </configuration>
      </execution>
    </executions>
  </plugin>
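With the classes generated from the IDL above, a binary round trip looks roughly as follows. This is a sketch: the builder-style API is what the Avro compiler generates for SpecificRecord classes, the sample field values are made up, and running it requires Avro plus the generated classes on the classpath:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import se.callista.blog.avro.car.Car;
import se.callista.blog.avro.user.User;
import se.callista.blog.avro.usercarrelation.UserCarRelation;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // Builders are part of the generated SpecificRecord classes.
        UserCarRelation relation = UserCarRelation.newBuilder()
            .setUser(User.newBuilder().setUserId("u-1").build())
            .setCar(Car.newBuilder().setVehicleIdentificationNumber("vin-1").build())
            .build();

        // Serialize to Avro's compact binary format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<>(UserCarRelation.class).write(relation, encoder);
        encoder.flush();

        // ...and read it back again.
        BinaryDecoder decoder =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        UserCarRelation copy =
            new SpecificDatumReader<>(UserCarRelation.class).read(null, decoder);

        System.out.println(copy.getUser().getUserId());
        // prints "u-1"
    }
}
```

Note that the payload carries no field names at all; the schema is the contract, which is exactly why registry-level schema management matters.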


Conclusion

Compositionality is an important aspect of a well-designed information or message model, in order to highlight important structural relationships and to eliminate redundancy. If Apache Avro is used as your serialization framework, I believe Avro IDL should be the preferred way to express schema contracts.

Further Reading

Kafka, Avro Serialization, and the Schema Registry

Using Avro for Big Data and Data Streaming Architectures: An Introduction

Kafka Avro Scala Example


Published at DZone with permission of Björn Beskow. See the original article here.

Opinions expressed by DZone contributors are their own.
