The Story of a Microservice Transformation in Hepsiburada
We launched the new checkout ecosystem, 'Erebor,' a few months ago in Hepsiburada. In this article, I will talk about what we have experienced during this microservice transformation.
“A software ecosystem is the interaction of a set of actors on top of a common technological platform that results in a number of software solutions or services.” (Software Ecosystem: Understanding an Indispensable Technology and Industry, David G. Messerschmitt and Clemens Szyperski)
I prefer the keyword 'ecosystem' to describe such a large structure because of its depth, capabilities, and scope, and that is exactly why I use the term in a few places throughout the rest of the article.
We work as vertical product teams at Hepsiburada. All of these product teams work autonomously, like independent startups, and together they build Hepsiburada. The Checkout team is just one of thirty product teams. We try to make our customers’ lives easier by managing the products and technologies behind the process, from the basket flow until the order is completed. With Erebor, we divided our monolithic checkout system into nearly 40 microservices. I will share what we learned on this journey in three chapters:
- Chapter One: Definition of the Problem
- Chapter Two: Choosing the Right Technology Specs
- Chapter Three: The Complexity Behind the Simplicity
Chapter One: Definition of the Problem
In fact, every good solution starts with defining the problem well; the correctness of the solution depends on the precision of that definition. Based on this, we gathered our problems under four topics:
- There is no bridge between the problem space and the solution space.
- Different loads in the same spot.
- Surviving dependency hell.
- Stuck at the borders.
We improved our entire ecosystem with solutions to these topics.
There Is No Bridge Between the Problem Space and the Solution Space
The problem space contains all stakeholders of a product. These stakeholders have comprehensive knowledge of their products and can produce many non-technical solutions related to them. The solution space, on the other hand, is where engineers in different roles produce technical solutions. The productivity of our product and the quality of our codebase were adversely affected over time by this complex communication problem, and, unfortunately, complexity became our standard rather than an option weighed against simplicity.
So, we used domain-driven design (DDD) to solve this communication problem. We could have preserved our codebase for a while by applying refactoring methodologies or extreme programming practices, but all such resistance mechanisms lose their effectiveness over time as long as there is no common language between the two spaces. Indeed, all the resources on DDD lead us to the following truth.
DDD builds a solid bridge between these two spaces using the ubiquitous language.
There are many ways to design the ubiquitous language. We chose 'the domain storytelling method' invented by Stefan Hofer and Henning Schwentner because it uses imagination and stories to simplify even very complex problems.
The best way to learn a language is to listen to other people speak that language. Try to repeat what you hear and mind their feedback. Gradually, you will progress from individual words to phrases and to complete sentences. The more you speak, the faster you will learn.
We established four golden rules to keep everything on track while applying DDD.
- Learn the business first and then design the domain.
- Write or apply your code like you speak.
- Don’t worry about tactical design at first; focus on strategic design.
- Use fitness functions as much as possible.
We learned that mistakes in tactical design can be fixed quickly, but mistakes in strategic design, unfortunately, lead to an anemic domain model. We also implemented as many fitness functions as possible to keep the system robust, and we are currently working to integrate the concept of 'fitness function-driven development' into our system.
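A fitness function is simply an executable check of an architectural characteristic. A minimal sketch in Go (the thresholds, names, and checks here are ours, purely illustrative, not the actual Erebor fitness suite):

```go
package main

import "fmt"

// LatencyFitness returns true when the observed p95 latency stays within
// the agreed budget, expressed in milliseconds.
func LatencyFitness(p95Millis, budgetMillis int) bool {
	return p95Millis <= budgetMillis
}

// DependencyFitness returns true when a service's downstream fan-out stays
// under a limit, guarding against the "dependency hell" described later.
func DependencyFitness(downstreamCalls, limit int) bool {
	return downstreamCalls <= limit
}

func main() {
	fmt.Println("latency ok:", LatencyFitness(180, 200)) // latency ok: true
	fmt.Println("fan-out ok:", DependencyFitness(3, 5))  // fan-out ok: true
}
```

Checks like these can run in CI, failing the build when an architectural characteristic regresses.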
As a result of all this, we divided our system into four main domains.
The Domain Design (sample)
First of all, we designed our ubiquitous language. Then we determined the bounded contexts for each domain.
After deciding on our bounded contexts, we modeled our communication standard with these contexts via context mapping tools.
After the strategic design, we moved on to tactical design tools and defined our aggregates, entities, value objects, repositories, factories, and services.
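To make these tactical building blocks concrete, here is a minimal sketch of a basket aggregate in Go; the types, fields, and invariants are illustrative, not the actual Erebor model:

```go
package main

import (
	"errors"
	"fmt"
)

// Money is a value object: immutable and compared by value, not identity.
type Money struct {
	Amount   int64 // minor units
	Currency string
}

// LineItem is an entity inside the Basket aggregate, identified by SKU.
type LineItem struct {
	SKU      string
	Quantity int
	Price    Money
}

// Basket is the aggregate root; all changes go through its methods so
// invariants (e.g. positive quantities) hold at every step.
type Basket struct {
	ID    string
	Items []LineItem
}

func (b *Basket) AddItem(item LineItem) error {
	if item.Quantity <= 0 {
		return errors.New("quantity must be positive")
	}
	b.Items = append(b.Items, item)
	return nil
}

// Total sums the line items, returning a single value object.
func (b *Basket) Total() Money {
	var total int64
	for _, it := range b.Items {
		total += it.Price.Amount * int64(it.Quantity)
	}
	return Money{Amount: total, Currency: "TRY"}
}

func main() {
	b := &Basket{ID: "basket-1"}
	b.AddItem(LineItem{SKU: "sku-42", Quantity: 2, Price: Money{Amount: 1500, Currency: "TRY"}})
	fmt.Println(b.Total()) // {3000 TRY}
}
```

A repository would load and persist whole `Basket` aggregates, never individual `LineItem`s, which is what keeps the invariants enforceable.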
Different Loads in the Same Spot
Our monolithic checkout service was used by about 10 different teams, so any availability problem in our service directly affected them. The metrics we collected led us to the following conclusions:
- There is a large difference between the number of reads and writes (roughly 10/3), so we must scale the two sides independently.
- Performance is critical, and we can optimize the read and write sides independently. We also see a low number of parallel operations on the same set of data.
- We must normalize the write database to make writes efficient, but we don’t need normalized data on the read side (projection-per-client/projection-per-business).
Based on this data, we decided to implement CQRS with event sourcing. As you know, there are three flavors of CQRS:
- Separated class structure using domain classes for commands and DTOs for returning read data, which will introduce some duplication.
- Separated model with different APIs and models for reads and writes, respectively. In addition to optimized queries, this also enables caching, making it interesting for a high load on reads.
- Separated storage optimized for queries enabling even more scaling of reads and separate types of storage for writing and querying, e.g., a relational database and a NoSQL type. Synchronization of read storage commonly runs in the background causing eventual consistency on the read side. Together with the best scalability, this pattern also brings the highest complexity.
We chose the separated-storage strategy because it solves our problems more cleanly than the others. But, as you know, 'everything in life is a trade-off.'
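The separated-storage idea can be sketched in a few lines of Go: the write side is an append-only event store, and the read side is a denormalized projection built by replaying those events (normally in the background, hence eventual consistency). All type and event names below are illustrative:

```go
package main

import "fmt"

// Event is a minimal domain event; real events carry timestamps, versions, etc.
type Event struct {
	AggregateID string
	Kind        string // e.g. "ItemAdded"
	SKU         string
	Quantity    int
}

// EventStore is the write side: append-only and optimized for writes.
type EventStore struct {
	events []Event
}

func (s *EventStore) Append(e Event)  { s.events = append(s.events, e) }
func (s *EventStore) Events() []Event { return s.events }

// Projection is the read side: a denormalized view (SKU -> total quantity).
type Projection map[string]int

// Project replays events into a fresh read model.
func Project(events []Event) Projection {
	p := Projection{}
	for _, e := range events {
		if e.Kind == "ItemAdded" {
			p[e.SKU] += e.Quantity
		}
	}
	return p
}

func main() {
	store := &EventStore{}
	store.Append(Event{AggregateID: "basket-1", Kind: "ItemAdded", SKU: "sku-42", Quantity: 1})
	store.Append(Event{AggregateID: "basket-1", Kind: "ItemAdded", SKU: "sku-42", Quantity: 2})
	fmt.Println(Project(store.Events())["sku-42"]) // 3
}
```

In the real system the two sides would live in different storage engines (e.g. Cassandra for events, a document store for projections), which is exactly where the extra complexity of this flavor comes from.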
We recommend researching the common CQRS myths before applying it; Greg Young has addressed several of them on Stack Overflow, and his answers are worth reading.
We had to find answers to some basic questions while implementing the CQRS with Event Sourcing.
- Did we really need to apply it to all domains?
No. In fact, this solution perfectly illustrates 'the complexity behind the simplicity.' If we had applied it to all domains, we would have lost our main motivation: simplification. It would also not be sustainable for every domain because of its cost. We decided to implement it only in the basket domain, which carries the most load and needs it the most. In some other domains, we applied CQRS with separated models and different APIs, considering cost, feasibility, and sustainability.
- How were we going to prevent event loss?
Processing data without delay is critical for us; our system is definitely not suitable for an eventually consistent solution. We could have used the Microsoft Distributed Transaction Coordinator or the transactional outbox pattern to prevent losing events, and we could have distributed the load in a balanced way with consistent hashing. However, we did not choose them because they introduce small delays. Instead, we use two-phase commit (2PC) to protect the integrity of events.
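Two-phase commit can be sketched as follows: the coordinator commits only when every participant votes "prepared"; if anyone cannot prepare, everyone rolls back. This is a toy Go illustration of the protocol, not our production coordinator:

```go
package main

import "fmt"

// Participant is one resource taking part in the distributed transaction.
type Participant struct {
	Name       string
	CanPrepare bool
	State      string // "", "prepared", "committed", or "aborted"
}

// TwoPhaseCommit runs the voting phase, then commits or aborts atomically
// from the coordinator's point of view.
func TwoPhaseCommit(parts []*Participant) bool {
	// Phase 1: prepare (voting). Any "no" vote aborts everyone.
	for _, p := range parts {
		if !p.CanPrepare {
			for _, q := range parts {
				q.State = "aborted"
			}
			return false
		}
		p.State = "prepared"
	}
	// Phase 2: commit. All participants voted yes.
	for _, p := range parts {
		p.State = "committed"
	}
	return true
}

func main() {
	parts := []*Participant{
		{Name: "event-store", CanPrepare: true},
		{Name: "projection-store", CanPrepare: true},
	}
	fmt.Println("committed:", TwoPhaseCommit(parts)) // committed: true
}
```

The trade-off the article accepts is visible here: the coordinator blocks until every participant answers, which costs latency but never loses an acknowledged event.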
- How would we ensure data consistency on the read and write side?
In fact, we followed Greg Young's guidance on this step and used hash codes: we ensure that the hash code of the projection we build by combining our events matches the hash code of the aggregate. We also created solutions for some exceptional situations:
- We store a summary of the events as metadata to ensure data integrity: the number of events on the aggregate and the aggregate's hash code.
- We rebuild from the events when the projection data is not available, and we check the consistency and reliability of the new projection by comparing it against the metadata.
- If the data is still unreliable, we rebuild the events from the aggregate itself: we destroy all existing events and build reliable projections again.
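The consistency check described above can be sketched in Go: a projection is trusted only when its stored metadata (event count plus hash) matches what we recompute from the events. The hashing scheme below is illustrative, not the production one:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Metadata is the summary stored alongside a projection.
type Metadata struct {
	EventCount int
	Hash       string
}

// HashEvents computes a deterministic digest over the ordered event stream.
func HashEvents(events []string) string {
	sum := sha256.Sum256([]byte(strings.Join(events, "|")))
	return hex.EncodeToString(sum[:])
}

func BuildMetadata(events []string) Metadata {
	return Metadata{EventCount: len(events), Hash: HashEvents(events)}
}

// Consistent reports whether the stored metadata still matches the events;
// a false result triggers a rebuild of the projection.
func Consistent(meta Metadata, events []string) bool {
	return meta.EventCount == len(events) && meta.Hash == HashEvents(events)
}

func main() {
	events := []string{"ItemAdded:sku-42:2", "ItemRemoved:sku-42:1"}
	meta := BuildMetadata(events)
	fmt.Println("consistent:", Consistent(meta, events)) // consistent: true
	// A lost event changes both count and hash, so the mismatch is detected.
	fmt.Println("after loss:", Consistent(meta, events[:1])) // after loss: false
}
```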
So what did we get at the end?
We solved many availability problems in our entire system with CQRS and Event Sourcing. In the basket domain, where we use these two solutions:
- We can process 2.5M near-simultaneous events in approximately 60 ms.
- We have a projection-store hit ratio of 98%.
- The average response time of our back-end services is approximately 180–200 milliseconds.
Surviving Dependency Hell
In our system, different microservices had to call the same services cumulatively. This caused a 'dependency hell' problem at both the network level and between microservices.
For example, we needed data about the basket, delivery options, and payment options to complete a payment and create a pre-order state (called a snapshot).
These services are stateless; they certainly don’t know each other’s results.
So how did we solve this 'dependency hell' here?
Actually, we used a simple concept: services that compute the same state within a given time window can share that state with each other. We compress the state with Brotli and transmit it between services in an HTTP header. Services that need the same state then use the data from the header instead of making another service call. We call this the dependency reducer.
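The dependency reducer can be sketched as follows: a service compresses its resolved state, base64-encodes it, and forwards it in a header so downstream services skip the duplicate call. Go's standard library has no Brotli, so this sketch substitutes gzip; the header name and URL are also illustrative:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
	"net/http"
)

// stateHeader is an illustrative header name, not the production one.
const stateHeader = "X-Shared-State"

// EncodeState compresses the state and makes it header-safe via base64.
func EncodeState(state []byte) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(state); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// DecodeState reverses EncodeState on the receiving service.
func DecodeState(encoded string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	basketState := []byte(`{"basketId":"b-1","total":3000}`)
	encoded, _ := EncodeState(basketState)

	// The caller attaches the state; the callee reads it instead of re-fetching.
	req, _ := http.NewRequest(http.MethodGet, "http://delivery-options.local/options", nil)
	req.Header.Set(stateHeader, encoded)

	decoded, _ := DecodeState(req.Header.Get(stateHeader))
	fmt.Println(string(decoded)) // prints the original JSON state
}
```

Brotli typically compresses small JSON payloads tighter than gzip, which matters when the state travels in a header on every hop.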
Stuck at the Borders
We did not have adequate system and application metrics, so for a long time we were stuck behind limits that were not real.
- We could not observe the impact of the improvements we made.
- We could not track the results of the improvements we made for our customers.
- We couldn’t see our technical limits, and we couldn’t move forward.
As a result, we often made wrong decisions or wasted time on useless solutions. Once we had solved most of our technical problems, we decided to make our product data-centric in order to determine our own limits.
I have to say that, per Conway's Law, a single product team inside a large-scale technology company such as Hepsiburada cannot become data-centric on its own; the entire organization needs to become data-centric as fast as possible.
Chapter Two: Choosing the Right Technology Specs
We had to make the right, sustainable, and robust technological choices after separating all the domains from each other. In theory, we divided our system into 40 different microservices, but we did not decide how or in what way we would design them. We followed the rules below while making this decision:
- Stay away from hype-driven development.
- Trust your feelings.
- Consider 'The Problem Space.'
- Check the community.
- Consider the quality of developers you want to attract.
- Use technologies that fit your company’s core values.
Our monolithic service was developed in C#, so we decided to use C# for the Erebor services where complex logic is concentrated. We developed our less complex services with Go, and we preferred Node.js for our cross-cutting services. Thus, we determined the language distribution across our services as follows:
We used Apache Cassandra to write and store our events because of its flexible schema, its high scalability and availability with no single point of failure, its very high write throughput, and its good read throughput. We also used it to store our projections.
We preferred MongoDB (as a sharded cluster) to store and serve our aggregates. We use five different MongoDB clusters with approximately 100 nodes in total.
We use an Apache Kafka cluster for our domain events. Kafka can handle high-velocity, high-volume data on modest hardware, supporting throughput of thousands of messages per second with latency in the low milliseconds, which most new use cases demand.
On the other hand, we preferred RabbitMQ for our integration events. As you know, RabbitMQ offers a variety of features that let you trade performance against reliability, including persistence, delivery acknowledgments, publisher confirms, and high availability. Messages are routed through exchanges before arriving at queues; RabbitMQ ships with several built-in exchange types for typical routing logic, and for more complex routing you can bind exchanges together or even write your own exchange type as a plugin.
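To make the exchange routing concrete, here is a standalone Go implementation of the matching rules RabbitMQ documents for topic exchanges: `*` matches exactly one dot-separated word, and `#` matches zero or more. The routing keys in `main` are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// TopicMatch reports whether a routing key matches a binding pattern,
// following the topic-exchange rules ("*" = one word, "#" = zero or more).
func TopicMatch(pattern, key string) bool {
	return match(strings.Split(pattern, "."), strings.Split(key, "."))
}

func match(pat, key []string) bool {
	if len(pat) == 0 {
		return len(key) == 0
	}
	switch pat[0] {
	case "#":
		// "#" may swallow zero or more words; try every split point.
		for i := 0; i <= len(key); i++ {
			if match(pat[1:], key[i:]) {
				return true
			}
		}
		return false
	case "*":
		return len(key) > 0 && match(pat[1:], key[1:])
	default:
		return len(key) > 0 && pat[0] == key[0] && match(pat[1:], key[1:])
	}
}

func main() {
	fmt.Println(TopicMatch("order.*.created", "order.basket.created")) // true
	fmt.Println(TopicMatch("order.#", "order.payment.completed.v2"))   // true
	fmt.Println(TopicMatch("order.*", "order.payment.completed"))      // false
}
```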
Below you can see all the products we use:
Chapter Three: The Complexity Behind the Simplicity
After we launched our system, we faced new problems that forced us to change our perspective:
Was a Single Product Team Enough for This Many Microservices?
As you know, Amazon uses a simple rule called 'the two-pizza rule' to maximize meeting efficiency. We believe this rule also applies to product teams: if our motivation is to increase the productivity of the product and the production capacity of the team, we should prefer small vertical teams over large horizontal ones. For this reason, we decided to transform horizontal teams like ours into vertical teams over time.
“Communication problems increase exponentially as team size increases. Ironically, the larger the team, the more time will be spent on communication instead of producing work.” (J. Richard Hackman)
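Hackman's point can be put in numbers: a team of n people has n(n-1)/2 pairwise communication links, so the coordination burden grows quadratically with team size. A quick sketch:

```go
package main

import "fmt"

// CommunicationLinks returns the number of pairwise links in a team of n
// people: n choose 2, i.e. n*(n-1)/2.
func CommunicationLinks(n int) int {
	return n * (n - 1) / 2
}

func main() {
	// Doubling a five-person team more than quadruples the links.
	for _, n := range []int{2, 5, 10, 15} {
		fmt.Printf("team of %2d -> %3d links\n", n, CommunicationLinks(n))
	}
}
```

A five-person (two-pizza) team carries 10 links; a fifteen-person team carries 105, which is why splitting into small vertical teams pays off.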
How Could Such Microservices Be Maintained?
In fact, we haven’t decided on a usage strategy for our repositories yet; we are still debating mono-repo, multi-repo, or a hybrid. We are also still thinking about improvements at many points, including CI/CD. I think we will learn by experience what the truth is.
How Would the Learning Curve Be Affected?
The biggest disadvantage of microservices and event-based architectures is the learning challenge: the learning curve of a monolith is much gentler than that of microservices, and using different languages and products together raises the threshold further. If you are a horizontally designed team like ours, you probably won’t solve this problem in the long run. You can hold regular 'lunch and learn' sessions to reduce the impact and build the necessary maturity, or you can try pair programming, which we found the most effective solution.
How Will We Maintain the Strength of These Services?
Actually, we use the attributes specified in Building Evolutionary Architectures for this, testing each chosen attribute both technically and theoretically. The attributes we selected are: adaptability, autonomy, availability, configurability, correctness, effectiveness, durability, usability, failure transparency, fault tolerance, maintainability, manageability, scalability, stability, traceability, and testability. We have not yet automated the testing of these attributes, but we will implement 'fitness function-driven development' as soon as possible.
How Would We Monitor so Many Microservices?
Monitoring is an extremely difficult problem in microservices and event-based systems.
Each application has unique monitoring needs, but there are a few common categories of metrics you will want to record. They include:
- Application Metrics
The system must be able to collect and serve top-level data. These top-level data are useful for development teams and the organization to understand the functional behavior of the system.
- Platform Metrics
These metrics report on the nuts and bolts of your infrastructure. These metrics provide a dashboard that can be used to understand low-level system performance and behavior.
- System Events
Operations staff knows there is a strong correlation between new code deployments and system failures. Scaling events, configuration updates, and other operational changes are also relevant and should be recorded. Recording these events will also make it possible to correlate them with system behavior.
- Business Metrics
These metrics should be collected to track users’ behavior. In addition, improvements can only be made based on these metrics.
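The categories above can be illustrated with a minimal in-process recorder: counters for application and business metrics, and latency samples with a percentile helper for platform metrics. In production these would be exported to an APM; all metric names below are illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Metrics is a tiny thread-safe recorder for counters and latency samples.
type Metrics struct {
	mu       sync.Mutex
	counters map[string]int
	samples  map[string][]float64
}

func NewMetrics() *Metrics {
	return &Metrics{counters: map[string]int{}, samples: map[string][]float64{}}
}

// Inc bumps a counter, e.g. a business metric like completed orders.
func (m *Metrics) Inc(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counters[name]++
}

func (m *Metrics) Count(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counters[name]
}

// Observe records one sample, e.g. a request latency in milliseconds.
func (m *Metrics) Observe(name string, v float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.samples[name] = append(m.samples[name], v)
}

// Percentile returns the p-th percentile (0-100, nearest-rank style)
// of the recorded samples, or 0 when there are none.
func (m *Metrics) Percentile(name string, p float64) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	s := append([]float64(nil), m.samples[name]...)
	if len(s) == 0 {
		return 0
	}
	sort.Float64s(s)
	idx := int(p / 100 * float64(len(s)-1))
	return s[idx]
}

func main() {
	m := NewMetrics()
	m.Inc("orders.completed")             // business metric
	m.Observe("checkout.latency_ms", 180) // platform metric
	m.Observe("checkout.latency_ms", 200)
	fmt.Println(m.Count("orders.completed"), m.Percentile("checkout.latency_ms", 50))
}
```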
We collect application and platform metrics using an APM, and we use our in-house services to collect business metrics.
I would like to share some of the results we achieved after all these efforts: we measured that our new system can handle eight times more load than our previous peak, and average response times stay low even under heavy load.
I would like to thank my teammates and the whole team, who made an incredible effort during this process, delivered this system to users with superhuman energy, and never gave up.
References
- Software Ecosystem: Understanding an Indispensable Technology and Industry by David G. Messerschmitt and Clemens Szyperski
- Domain Storytelling by Stefan Hofer and Henning Schwentner
- Fitness Function-Driven Development
- Distributed Systems by Maarten van Steen and Andrew S. Tanenbaum
- Building Microservices: Designing Fine-Grained Systems by Sam Newman
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann
- Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services by Brendan Burns
- Building Event-Driven Microservices: Leveraging Organizational Data at Scale by Adam Bellemare
- Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith by Sam Newman
- Microservices Patterns: With Examples in Java by Chris Richardson
- Flow Architectures: The Future of Streaming and Event-Driven Integration by James Urquhart
- Practical Microservices: Build Event-Driven Architectures with Event Sourcing and CQRS by Ethan Garofolo
Published at DZone with permission of Cem Basaranoglu. See the original article here.