
A Gentle (and Practical) Introduction to Apache Avro (Part 1)

Time to get your feet wet with Avro.

By Anton Rodriguez · Dec. 22, 20 · Tutorial

This post is a gentle introduction to Apache Avro. It started with several discussions with Dario Cazas about what’s possible with Apache Avro; he did some research and summarized it in an email. I found myself looking for that email several times to forward it to different teams and clarify doubts about Avro. After a while, I thought it could be useful for others, and this is how this series of three posts was born.

In summary, Apache Avro is a binary format with the following characteristics:

  • It’s binary, which makes it very efficient (the keys of your data aren’t repeated in every record as they are with JSON), but you can’t read it in a text editor.
  • It’s a row format: each record is stored independently (unlike Parquet, which is a columnar format), so it’s a poor fit for aggregations but works well for sending records from one place to another.
  • It has great support for managing the schema of the data. The schema is typically defined in JSON format.

These characteristics make Apache Avro very popular in event streaming architectures based on Apache Kafka, but that isn’t its only possible use.

If you have more interest in Apache Avro, take a look at the Apache Avro Wikipedia page.
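
To make the binary, schema-driven nature concrete before we get to Kafka, here is a minimal sketch using the plain Apache Avro Java library (the class name is made up; the schema is the same one used later in this post). Only the field values end up in the output bytes; the field names live in the schema, which is defined in JSON.

Java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class AvroBinarySketch {
    public static void main(String[] args) throws Exception {
        // The same schema we'll use later in this post
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Test\","
            + "\"namespace\":\"com.galiglobal.examples.testavro\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"test\",\"type\":\"double\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "id0");
        record.put("test", 1000.0);

        // Binary encoding: field values only, no field names, no embedded schema
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        System.out.println("Avro payload size: " + out.size() + " bytes");
    }
}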

Avro With the Schema Registry and Kafka

Apache Avro plays well with Apache Kafka because it provides good performance and an easy way to govern schemas. There is an important thing to note: because Apache Avro is a binary format, consumers need to know the schema of the data stored in a message in order to deserialize it.

The most common way to do this is with the Schema Registry, aka SR. We are going to talk about the Confluent implementation, but it isn’t the only one and it isn’t part of the Kafka project. The workflow is quite simple: the producer looks up the ID of the schema in the SR (or registers a new one if it doesn’t exist yet) and adds that ID to the message. The consumer retrieves the schema from the SR using that ID and deserializes the message.

The way the ID is added to the message is also simple: in the Confluent case, one magic byte with the value 0, followed by 4 bytes with the schema ID, followed by the Avro-encoded data. It’s documented in the Wire Format entry of the Confluent documentation.
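
As a rough illustration of that layout, the following sketch unpacks the value of a Confluent-serialized message; the class and method names are hypothetical, but the byte layout follows the documented wire format.

Java
import java.nio.ByteBuffer;

// Sketch only: split a Confluent-serialized message value into its three parts.
public class WireFormatSketch {

    public static void inspect(byte[] messageValue) {
        ByteBuffer buffer = ByteBuffer.wrap(messageValue);
        byte magicByte = buffer.get();      // always 0 in the Confluent implementation
        int schemaId = buffer.getInt();     // the 4-byte ID registered in the Schema Registry
        byte[] avroPayload = new byte[buffer.remaining()];
        buffer.get(avroPayload);            // the Avro-encoded record itself

        System.out.printf("magic=%d, schemaId=%d, payload=%d bytes%n",
                magicByte, schemaId, avroPayload.length);
    }
}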

Environment Setup

Using the Confluent Avro serializer/deserializer, the process is quite straightforward. Let’s try it using the Confluent Community Docker version. The setup is documented in the Quick Start for Apache Kafka using Confluent Platform Community Components (Docker), summarized here:

Shell
git clone https://github.com/confluentinc/cp-all-in-one.git
cd cp-all-in-one/cp-all-in-one-community/
docker-compose up -d

Let’s start creating a topic:

Shell
docker-compose exec broker kafka-topics \
  --create \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 1 \
  --topic test

The output should be:

Created topic test.

To test it, we are going to create a Kafka Producer and a Kafka Consumer.

Kafka Producer with Confluent Schema Registry

Download the kafka-java-client-examples project and open it with your favorite IDE. We are going to work with a schema located in the src/main/resources folder:

JSON
{
 "namespace": "com.galiglobal.examples.testavro",
 "type": "record",
 "name": "Test",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "test", "type": "double"}
 ]
}

Building the project generates a Test class from this Avro schema that you can use in your code.

Note for IntelliJ IDEA users: you need to generate the classes from the Avro file. Right-click on your project and choose Maven > Generate Sources and Update Folders. It’s important to do this each time you change the schema.

You can now run ConfluentProducerExample, and it should print:

Successfully produced 10 messages to a topic called test

The most relevant parts are the properties of the producer:

Java
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

Here we indicate how to connect to the SR and which serializer to use; the serializer registers the schema with the SR under the hood. In the class io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient you can find the REST client used to request schemas from the SR over HTTP.
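
To put those properties in context, a minimal producer could look like the sketch below. It assumes the Test class generated from the schema and the local broker and SR started with docker-compose; the actual ConfluentProducerExample may differ in the details, and the ProducerSketch class name is made up.

Java
import com.galiglobal.examples.testavro.Test;
import io.confluent.kafka.serializers.AbstractKafkaSchemaSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

        try (KafkaProducer<String, Test> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // The all-args constructor is generated from the Avro schema
                Test value = new Test("id" + i, 1000.0);
                producer.send(new ProducerRecord<>("test", "id" + i, value));
            }
            producer.flush();
        }
    }
}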

If you check the SR, you can see the schema that the producer registered:

Shell
curl http://localhost:8081/subjects/test-value/versions/1

It should return:

JSON
{
  "subject": "test-value",
  "version": 1,
  "id": 1,
  "schema": "{\"type\":\"record\",\"name\":\"Test\",\"namespace\":\"com.galiglobal.examples.testavro\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"test\",\"type\":\"double\"}]}"
}

Kafka Consumer

We are going to consume the messages with a Kafka consumer by simply executing the ConfluentConsumerExample class. It should print something similar to:

key = id0, value = {"id": "id0", "amount": 1000.0}
key = id1, value = {"id": "id1", "amount": 1000.0}
key = id2, value = {"id": "id2", "amount": 1000.0}
key = id3, value = {"id": "id3", "amount": 1000.0}
key = id4, value = {"id": "id4", "amount": 1000.0}
key = id5, value = {"id": "id5", "amount": 1000.0}

The relevant part is the configuration of the SR and the deserializer:

Java
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

The SR URL and deserializer configuration mirror the producer’s. SPECIFIC_AVRO_READER_CONFIG indicates that we want to deserialize to a Test object instead of a GenericRecord.
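
For completeness, here is a matching consumer sketch under the same assumptions (generated Test class, local broker and SR); ConfluentConsumerExample will be similar, but not necessarily identical, and the ConsumerSketch class name is made up.

Java
import com.galiglobal.examples.testavro.Test;
import io.confluent.kafka.serializers.AbstractKafkaSchemaSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-consumer-group");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
        props.put(AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
        // Deserialize to the generated Test class instead of GenericRecord
        props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

        try (KafkaConsumer<String, Test> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            ConsumerRecords<String, Test> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, Test> record : records) {
                System.out.printf("key = %s, value = %s%n", record.key(), record.value());
            }
        }
    }
}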

If we try to consume directly from the topic without using the Confluent deserializer, the result isn’t quite legible:

Shell
docker-compose exec broker kafka-console-consumer \
  --topic test \
  --bootstrap-server localhost:9092 \
  --from-beginning \
  --property print.key=true \
  --property key.separator=" : " \
  --key-deserializer "org.apache.kafka.common.serialization.StringDeserializer" \
  --value-deserializer "org.apache.kafka.common.serialization.StringDeserializer"

As you can see, it’s a binary protocol, and quite efficient! We aren’t sending the schema with every record as we would with JSON or any other text-based format, and that’s a significant saving.

Schema Compatibility

Efficiency isn’t the only positive point of this approach. One of the nice things about a Schema Registry is the ability to govern schemas and make sure they are being used properly.

One of the big issues with asynchronous communication is how to evolve the schema without affecting the consumers of a particular topic. A Schema Registry helps with that because it can check changes to the schema and validate whether they break compatibility. There are different types of compatibility; you can read more in Schema Evolution and Compatibility. Let’s test it. First, we’ll check what type of compatibility the SR is enforcing:

Shell
curl -X GET http://localhost:8081/config

By default, it should return:

{"compatibilityLevel":"BACKWARD"}

Backward compatibility means consumers using the new schema can read records written with the old one; consumers still on the old schema aren’t guaranteed to be able to read new messages, so they need to upgrade to the new version first.

We could test it with curl, but it’s a bit tricky because we would have to escape the JSON file. Let’s do it instead with the producer, adding one field to the schema:

JSON
{
 "namespace": "com.galiglobal.examples.testavro",
 "type": "record",
 "name": "Test",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "test", "type": "double"},
     {"name": "boom", "type": "double"}
 ]
}


If we modify ConfluentProducerExample accordingly and run it again, an exception is thrown:

org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: {"type":"record","name":"Test","namespace":"com.galiglobal.examples.testavro","fields":[{"name":"id","type":"string"},{"name":"boom","type":"string"}]} Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema being registered is incompatible with an earlier schema for subject "test-value"; error code: 409

Adding a new field isn’t a backward-compatible change because consumers using the new schema can’t read old messages: they have no way to fill the new field, which is mandatory. One way to make this change backward compatible is to give the new field a default value, so consumers know what value to use when the field isn’t present in a message.
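
To see how a default behaves at the Avro level, here is a small sketch outside Kafka that uses plain Avro schema resolution: a record written with the original schema (no boom field) is read back with an evolved schema whose boom field has a default of 0.0 (exactly the change we are about to make), and the reader fills in the default. The DefaultValueSketch class name is made up.

Java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;

public class DefaultValueSketch {
    public static void main(String[] args) throws Exception {
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Test\",\"namespace\":\"com.galiglobal.examples.testavro\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"test\",\"type\":\"double\"}]}");
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Test\",\"namespace\":\"com.galiglobal.examples.testavro\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"test\",\"type\":\"double\"},"
            + "{\"name\":\"boom\",\"type\":\"double\",\"default\":0.0}]}");

        // Write a record with the old schema (no boom field)
        GenericRecord oldRecord = new GenericData.Record(oldSchema);
        oldRecord.put("id", "id0");
        oldRecord.put("test", 1000.0);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(oldSchema).write(oldRecord, encoder);
        encoder.flush();

        // Read it back with the new schema: boom gets its default value
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord upgraded =
            new GenericDatumReader<GenericRecord>(oldSchema, newSchema).read(null, decoder);
        System.out.println(upgraded);  // {"id": "id0", "test": 1000.0, "boom": 0.0}
    }
}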

Let’s add a default value to the new field in the schema:

JSON
{
 "namespace": "com.galiglobal.examples.testavro",
 "type": "record",
 "name": "Test",
 "fields": [
     {"name": "id", "type": "string"},
     {"name": "test", "type": "double"},
     {"name": "boom", "type": "double", "default": 0.0}
 ]
}



If we make the corresponding changes and run ConfluentProducerExample again, it will produce 10 new events and register a new version of the schema:

Shell
curl http://localhost:8081/subjects/test-value/versions/2

It should return:

JSON
{
  "subject": "test-value",
  "version": 2,
  "id": 2,
  "schema": "{\"type\":\"record\",\"name\":\"Test\",\"namespace\":\"com.galiglobal.examples.testavro\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"test\",\"type\":\"double\"},{\"name\":\"boom\",\"type\":\"double\",\"default\":0.0}]}"
}


Summary and Next Steps

We have covered the basics of Apache Avro in an Apache Kafka architecture. It has important advantages in terms of performance, reduction of message size, and governance of the schemas.

But it also has some problems, especially when we are dealing with hybrid and/or multi-tenant architectures. In the following two parts of this series, we’ll cover these problems in detail and the different alternatives we have with Avro to deal with them.


Published at DZone with permission of Anton Rodriguez. See the original article here.


