How to Choose the Right Schema Definition for Topics in a Publisher-Subscriber Model

A discussion of pub-sub architectural models and how to use the proper schema for your microservice data in such an architecture.

If you have ever been confused while defining the schema for your topics in Kafka (or any publisher-subscriber system), then this article is for you.

One of the main architectural decisions in creating a publisher-subscriber model is to finalize schema definitions for topics.

We often face questions like:

  • Which fields should be there in the schema?

  • What should be the structure of the schema (e.g. nested or flat)?

Example Use Case

This article is not bound to any specific technology or language, but I am using Java as the coding language, Kafka as the publisher-subscriber system, and JSON for the schema.

We will take the example of an Employee object, and will see which fields and structures we should pick in our schema definition.

import java.util.List;

public class Employee {
  private int id;
  private String name;
  private String designation;
  private List<JobRole> roles;
  // getters and setters
}

public class JobRole {
  private int id;
  private String name;
  private String type;
  // getters and setters
}

There are mainly two types of schema definitions: producer-centric and consumer-centric.

Producer-Centric Schema Definition

As the name suggests, we create this kind of schema by keeping in mind the structure of the object on the producer side. Such schemas are generic, and are not specific to any consumer. 

When to Choose This Schema

  • When we follow message-driven communication in our service architecture, and we believe that the object we are creating or updating will be required by other services, then we can publish such objects to a topic on a centralized Kafka cluster. Any service can consume that topic.

  • If both the producers and consumers are in our control (i.e. there's no third party involved), then we should prefer this model, as it is extensible.

  • When we do not have any specific requirements from the client, or there are n clients, then instead of having n different producers, we should go with a producer-centric schema. Otherwise, we would encounter maintainability issues.
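The "publish once, consume many times" idea behind these points can be sketched without any Kafka dependency. The `InMemoryTopic` class below is a hypothetical stand-in for a topic, not a Kafka API; it only illustrates that the producer publishes a single payload and every subscribed service receives the same one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a topic (an assumption for illustration).
// A real system would use a Kafka client instead.
class InMemoryTopic {
    private final List<Consumer<String>> subscribers = new ArrayList<>();

    // Any service can register interest in the topic.
    void subscribe(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    // The producer publishes one message; every subscriber gets the same payload.
    void publish(String message) {
        for (Consumer<String> subscriber : subscribers) {
            subscriber.accept(message);
        }
    }
}
```

With a producer-centric schema, adding an (n+1)th consumer means one more `subscribe` call and no change on the producer side.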

What Should the Schema Definition Be?

Producer-centric schema structures should stay close to the object definition on the producer side. While defining such schemas, we should expose all fields that we believe other services will need.

For our employee example, the schema should be:

{
  "id": "emp id",
  "name": "emp name",
  "designation": "emp designation",
  "roles": [{
    "id": "role id",
    "name": "role name",
    "type": "role type"
  }]
}
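As a rough sketch of how a producer might emit this structure, the classes below mirror Employee and JobRole and build the JSON by hand. The class names `EmployeeMessage` and `JobRoleMessage` are hypothetical, and the code assumes that field values contain no characters that need JSON escaping; in practice a JSON library would normally handle serialization.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical message class mirroring JobRole on the producer side.
class JobRoleMessage {
    final int id;
    final String name;
    final String type;

    JobRoleMessage(int id, String name, String type) {
        this.id = id;
        this.name = name;
        this.type = type;
    }

    // Assumes values need no JSON escaping (illustration only).
    String toJson() {
        return "{\"id\":" + id + ",\"name\":\"" + name + "\",\"type\":\"" + type + "\"}";
    }
}

// Hypothetical message class mirroring Employee; every field is exposed.
class EmployeeMessage {
    final int id;
    final String name;
    final String designation;
    final List<JobRoleMessage> roles;

    EmployeeMessage(int id, String name, String designation, List<JobRoleMessage> roles) {
        this.id = id;
        this.name = name;
        this.designation = designation;
        this.roles = roles;
    }

    // The nested structure stays close to the producer's own object model.
    String toJson() {
        String rolesJson = roles.stream()
                .map(JobRoleMessage::toJson)
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"id\":" + id
                + ",\"name\":\"" + name + "\""
                + ",\"designation\":\"" + designation + "\""
                + ",\"roles\":" + rolesJson + "}";
    }
}
```

Here the ids are written as JSON numbers because the Java fields are `int`; the placeholder strings in the schema above only describe what each field carries.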

Do not leave out any field that you may need to expose in the future, because adding it later causes backward compatibility issues. For example, suppose no consumer currently wants 'role type', so you have not exposed it, but after some time a consumer needs it. The messages already published would not contain that field, so the new consumer could not rely on it for existing data. This breaks backward compatibility and defeats the purpose of a generic schema.
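To make the problem concrete, here is a minimal sketch of what a consumer has to do when some messages were published before 'type' was exposed. The `RoleTypeReader` helper and its `UNKNOWN` placeholder are hypothetical, and the `Map` stands in for a role object that has already been parsed from JSON.

```java
import java.util.Map;

// Hypothetical consumer-side helper (an assumption for illustration).
// The Map stands in for an already-parsed JSON role object.
class RoleTypeReader {
    static final String UNKNOWN = "UNKNOWN";

    // Returns the role type, or a placeholder when the message predates the field.
    static String roleType(Map<String, Object> role) {
        Object type = role.get("type");
        return type == null ? UNKNOWN : type.toString();
    }
}
```

Every consumer that needs the late-added field ends up carrying this kind of fallback logic, which is the cost that exposing the field from the start avoids.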

Advantages

  • It helps to achieve loose coupling between services.

  • Reduces the load from the publisher side, as the publisher has to take care of publishing only one schema.

  • It's maintainable and extensible.

Disadvantages

  • The schema would be heavy, as we would expose fields which no consumer would ever need.

  • It would not work if your consumer is a third party, or a generic framework in your project, that needs the schema to be in a specific format.

Consumer-Centric Schema Definition

This kind of schema definition is totally governed by the consumer. 

When to Choose This Schema

  • When the consumer is a service or framework which is not able to consume different kinds of schema to gather the information required.

  • When the consumer needs to consume data from n producers for the same use case (e.g. user activity information), getting different JSON objects and parsing them would be painful. To avoid this, the consumer can request other services to produce data in a given format to a consumer-centric topic.

What Should the Schema Definition Be?

Schema definitions and structures should be defined by the consumer. For the same object, different consumers can request different fields and structures. 

For our employee example, the schema could be any of the below (or any other combination of fields and structures):

// For topic 1
{
  "emp_id": "emp id",
  "emp_name": "emp name",
  "roles": ["role 1 name", "role 2 name"]
}


// For topic 2
{
  "empId": "emp id",
  "empName": "emp name",
  "roles":[{
    "name": "role name 1"
  },{
    "name": "role name 2"
  }]
}
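On the producer side, each of these shapes needs its own mapping code. The sketch below shows one mapper per topic; the class and method names are hypothetical, ids are written as JSON numbers, and values are assumed to need no JSON escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical producer-side mappers, one per consumer-centric topic.
// Assumes values need no JSON escaping (illustration only).
class ConsumerCentricMappers {

    // Topic 1 shape: snake_case keys, roles flattened to a list of names.
    static String forTopic1(int id, String name, List<String> roleNames) {
        String roles = roleNames.stream()
                .map(r -> "\"" + r + "\"")
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"emp_id\":" + id + ",\"emp_name\":\"" + name + "\",\"roles\":" + roles + "}";
    }

    // Topic 2 shape: camelCase keys, roles as objects carrying only the name.
    static String forTopic2(int id, String name, List<String> roleNames) {
        String roles = roleNames.stream()
                .map(r -> "{\"name\":\"" + r + "\"}")
                .collect(Collectors.joining(",", "[", "]"));
        return "{\"empId\":" + id + ",\"empName\":\"" + name + "\",\"roles\":" + roles + "}";
    }
}
```

Each new consumer format adds another method like these, and any change to the employee object has to be reflected in all of them.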

Advantages

  • We expose only required fields, so schemas are not heavy.

  • Reduces the complexity on the consumer side and supports consumers that are not extensible.

  • If the object on the producer side is too heavy, or too complex to expose and maintain, then choosing a consumer-specific schema is a better option.

Disadvantages

  • If, for the same object, a producer supports several consumer-specific schemas, then it creates maintainability issues on the producer side.

  • This is not the ideal way to achieve loose coupling in message-driven architecture.

Thanks for reading!
