How to Improve Software Architecture in a Cloud Environment
This article explains how we can improve the existing cloud, microservices architecture on a project to make it more robust and powerful for all of today’s challenges.
The need for software architecture today has grown more critical due to the increasing complexity, scale, and expectations of modern software systems. Applications today aren't simple. They involve multiple layers: frontend, backend, databases, integrations, microservices, and sometimes even AI/ML components. A strong architecture provides a roadmap for organizing this complexity into manageable pieces, making it easier to develop, maintain, and scale.
This article explains how we can improve the existing architecture on a project to make it more robust and powerful for all of today’s challenges.
Introduction
When creating an architecture for any application, the first step is to understand the requirements, that is, the formal description of what we need to build. The types of requirements are:
- Functional requirements – Features of the system
- Non-Functional requirements – Quality Attributes
- System Constraints – Limitations and boundaries
For each of these requirements, there is a detailed procedure for developing the architecture, which I will not cover here.
Depending on the requirements of the app you are building, you can use one of the following software architecture patterns:
- Two-Tier Architecture (mobile app, desktop app)
- Multi-Tier Architecture (monolithic app with or without an API Gateway)
- Microservices Architecture
- Event-Driven Architecture
Existing Application
Now, we can focus on an example of how we can improve the software architecture of an existing application, which is the goal of this post.
Let’s assume we have an existing app similar to Instagram, Tripadvisor, or TikTok, with the following features:
- A user can sign up and log in to post, vote, or comment
- A post consists of a title, topic tags, and a body (text and/or uploaded images and/or uploaded files)
- A user should be able to comment on any existing post
- Comments are ordered chronologically as a flat list
- A logged-in user can upvote or downvote an existing post or comment
- The homepage presents the most popular posts of the last 24 hours
The current implementation of the app is shown in the architecture diagram and consists of the following services:
- Web App Service – A central service that serves the web pages covering the content presentation part of the application
- Users Service – Handles user registration and login
- Post & Comments Service – Creates and stores posts and comments
- Ranking Service – Ranks posts efficiently by popularity over the last 24 hours
- Votes Service – Stores the votes
Next, we consider the non-functional requirements, which describe the quality of our application under real-world usage. These are the non-functional requirements for this application:
- Scalability (the app should support millions of daily users)
- Performance (response time below 500 ms at the 99th percentile)
- Fault Tolerance/High Availability (99.9%)
- Availability and Partition Tolerance (we prioritize “Availability - Partition Tolerance” over “Consistency - Partition Tolerance”; according to the CAP theorem, a distributed system cannot guarantee both availability and strong consistency while a network partition is in progress)
- Durability
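To make the 99.9% availability target concrete, it can be converted into a downtime budget. The following is a minimal sketch; the function name and the 30-day month are assumptions chosen for illustration, not part of the original requirements.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Return the hours of downtime an availability target permits per period."""
    return (1 - availability_percent / 100) * period_hours


# Downtime budget for the 99.9% target from the list above.
yearly_hours = allowed_downtime_hours(99.9)
monthly_minutes = allowed_downtime_hours(99.9, period_hours=30 * 24) * 60

print(f"{yearly_hours:.2f} hours per year")              # 8.76 hours per year
print(f"{monthly_minutes:.1f} minutes per 30-day month")  # 43.2 minutes per 30-day month
```

In other words, the whole system may be unavailable for less than nine hours in a year, which is why the later sections rely on redundancy rather than on fast manual recovery.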
The question is: how can we improve the current architecture so that it covers the required non-functional requirements?
It becomes manageable if we go step by step and address the requirements one by one.
Improvements
Scalability
How can we improve scalability? We can place a Load Balancer in front of the web app service and each one of the other services. This lets us scale out each service independently, depending on the traffic it receives.
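As an illustration of what the load balancer does for one service, here is a minimal round-robin sketch. The class, its methods, and the instance names are hypothetical; a real deployment would normally use a managed or off-the-shelf load balancer with health checks rather than custom code.

```python
class RoundRobinBalancer:
    """Distributes requests across service instances in rotating order."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._next = 0  # total number of requests dispatched so far

    def add_instance(self, name):
        """Scale out: register one more instance of the service."""
        self._instances.append(name)

    def next_instance(self):
        if not self._instances:
            raise RuntimeError("no instances registered")
        instance = self._instances[self._next % len(self._instances)]
        self._next += 1
        return instance


balancer = RoundRobinBalancer(["web-1", "web-2"])
first_picks = [balancer.next_instance() for _ in range(2)]

# Traffic grows, so a third instance is added without touching the clients.
balancer.add_instance("web-3")
later_picks = [balancer.next_instance() for _ in range(3)]

print(first_picks)  # ['web-1', 'web-2']
print(later_picks)  # ['web-3', 'web-1', 'web-2']
```

Because clients only talk to the balancer, instances can be added or removed per service without any client-side change.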
Then we can place an API Gateway to decouple the frontend code from the system’s internal structure. This will allow us to make changes internally and increase our organizational scalability.
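The decoupling provided by the API Gateway can be pictured as a routing table that maps public paths to internal services. This is only a sketch: the URL prefixes and the internal service identifiers are assumptions made up for the example.

```python
# Hypothetical mapping from public URL prefixes to internal services.
ROUTES = {
    "/users": "users-service",
    "/posts": "posts-comments-service",
    "/comments": "posts-comments-service",
    "/votes": "votes-service",
    "/ranking": "ranking-service",
}


def resolve_backend(path: str) -> str:
    """Return the internal service responsible for a public request path."""
    for prefix, backend in ROUTES.items():
        # Match the prefix itself or anything nested below it.
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    raise LookupError(f"no route for {path}")


print(resolve_backend("/posts/42"))   # posts-comments-service
print(resolve_backend("/votes"))      # votes-service
```

If the posts and comments are later split into two services, only the table changes; the frontend keeps calling the same paths.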
On the posts and comments database side, as we store more posts and comments, our database instance may run out of space. It also may not be able to handle the traffic from so many users.
So the solution will be sharding our posts and comments collection across different database shards. Database sharding is the process of storing a large database across multiple machines. A single machine, or database server, can store and process only a limited amount of data. Database sharding overcomes this limitation by splitting data into smaller chunks, called shards, and storing them across several database servers. All database servers usually have the same underlying technologies, and they work together to store and process large volumes of data. This way we can scale to as many posts or comments as we want to, by adding more instances and distributing our data across all those instances.
We will use a range sharding strategy which many NoSQL databases support. Range-based sharding involves dividing data into contiguous ranges determined by the shard key values. In this model, documents with "close" shard key values are likely to be in the same chunk or shard. This allows for efficient queries where reads target documents within a contiguous range.
In our case, the best fit for the range sharding strategy is to shard on a compound index of the post ID and the comment ID, so that the comments of one post are stored next to each other.
In the worst case, a single post will be split across two shards, as shown in the picture below:
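The routing logic behind range sharding on a compound key can be sketched as follows. The split points and shard names are hypothetical; in practice the database chooses and rebalances the ranges itself.

```python
import bisect

# Hypothetical split points on the compound key (post_id, comment_id).
# A shard owns every key from the previous split point (inclusive)
# up to its own split point (exclusive).
SPLIT_POINTS = [(1000, 0), (1500, 50)]
SHARD_NAMES = ["shard-a", "shard-b", "shard-c"]


def shard_for(post_id: int, comment_id: int) -> str:
    """Return the shard that stores the given comment."""
    # Tuples compare element by element, so keys are ordered by post ID
    # first and by comment ID second.
    index = bisect.bisect_right(SPLIT_POINTS, (post_id, comment_id))
    return SHARD_NAMES[index]


print(shard_for(10, 1))      # shard-a
print(shard_for(1200, 3))    # shard-b
# Worst case: a split point falls inside post 1500, so its comments
# end up on two neighboring shards.
print(shard_for(1500, 49))   # shard-b
print(shard_for(1500, 50))   # shard-c
```

Even in that worst case, a query for the comments of one post touches at most the neighboring shards rather than all of them.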
Performance
One potential performance bottleneck is the image blob store. Every time a user loads a post or a news feed with the most popular posts, we need to load images from our system. Loading those images in the browser can add a lot of latency.
To solve this problem, we can utilize a global network of CDN servers. A content delivery network (CDN) is a geographically distributed group of servers that caches content close to end users. A CDN allows for the quick transfer of assets needed for loading Internet content, including HTML pages, JavaScript files, stylesheets, images, and videos.
By using a pull model with a long TTL (time to live; in the context of a CDN, the length of time a cached copy is served before the CDN checks the origin for a fresh version), images of popular posts will be distributed among the CDN's edge servers, which are located very close to users in different geographical locations. This reduces the traffic load on our system and also significantly reduces the page load time for users. A few popular CDN solutions are Cloudflare CDN, Amazon CloudFront, Google Cloud CDN, Azure CDN, and Oracle CDN.
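With a pull model, the origin typically tells the CDN how long it may keep an image through the standard HTTP `Cache-Control` header. The sketch below shows one way the origin could build its response headers; the function name and the one-week TTL are assumptions for illustration, and the right TTL depends on how often images change.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def image_response_headers(content_type: str,
                           ttl_seconds: int = ONE_WEEK_SECONDS) -> dict:
    """Build origin response headers that let a CDN cache an image."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return {
        "Content-Type": content_type,
        # "public" allows shared caches such as CDN edge servers to store
        # the response; "max-age" is the TTL in seconds.
        "Cache-Control": f"public, max-age={ttl_seconds}",
    }


headers = image_response_headers("image/jpeg")
print(headers["Cache-Control"])  # public, max-age=604800
```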
At the API gateway level, we can introduce caching. At peak times, when a small group of posts is viewed by many users simultaneously, we can store those posts in a cache and update this cache every few minutes. This will also reduce the response time for our users and goes hand in hand with our decision to prioritize availability over consistency.
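A cache that is refreshed every few minutes can be sketched as a small time-based (TTL) cache. The class, the loader function, and the 120-second TTL are hypothetical; a production gateway would more likely use its built-in caching or a dedicated cache server.

```python
import time


class TtlCache:
    """Caches loaded values and reloads them once their TTL has expired."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader      # called on a cache miss
        self._clock = clock        # injectable so the demo can control time
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # still fresh: serve from the cache
        value = self._loader(key)  # missing or expired: go to the backend
        self._entries[key] = (value, now + self._ttl)
        return value


backend_calls = []


def load_post(post_id):
    """Stand-in for a request to the Post & Comments Service."""
    backend_calls.append(post_id)
    return {"id": post_id, "title": f"Post {post_id}"}


fake_now = [0.0]
cache = TtlCache(ttl_seconds=120, loader=load_post, clock=lambda: fake_now[0])

cache.get(7)           # miss: loads from the backend
cache.get(7)           # hit: served from the cache
fake_now[0] = 121.0    # more than two minutes later
cache.get(7)           # expired: loads from the backend again

print(len(backend_calls))  # 2
```

Between refreshes, users may see a slightly stale post, which is the trade-off accepted by prioritizing availability over consistency.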
On the database side, we can also add an index on the post ID. Looking up a particular post then takes near-constant time with a hash index, or logarithmic time with a tree-based index, instead of requiring a scan through the whole collection of posts.
Similarly, we can use the same compound index we created for sharding the comments, based on the post ID and the comment ID. This way, requesting a batch of comments for a given post will also be much faster. Of course, we can also use indexing for the other databases.
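The effect of the compound index can be tried out with SQLite from the Python standard library. SQLite is only a stand-in here, since the article assumes a NoSQL database; the table, column, and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (post_id INTEGER, comment_id INTEGER, body TEXT)"
)
# Compound index on (post_id, comment_id), mirroring the shard key.
conn.execute(
    "CREATE INDEX idx_comments_post_comment ON comments (post_id, comment_id)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [(1, 2, "second"), (1, 1, "first"), (2, 1, "other post")],
)

QUERY = (
    "SELECT comment_id, body FROM comments "
    "WHERE post_id = ? ORDER BY comment_id LIMIT ?"
)


def comments_for_post(post_id, limit=50):
    """Return one batch of comments for a post in chronological (ID) order."""
    return conn.execute(QUERY, (post_id, limit)).fetchall()


print(comments_for_post(1))

# Ask SQLite how it plans to run the query; the last column of each row
# describes the access path, including the index it uses.
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1, 50)):
    print(row[-1])
```

The query plan should mention `idx_comments_post_comment`, meaning the database searches the index instead of scanning every comment.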
When it comes to votes, we also have a performance problem. As we recall, the votes are stored in the Votes Service database as individual records, one per vote. However, when we show a post or a comment in the UI, we want to show the total number of upvotes and downvotes for each entity. Counting the individual records on every page view would be expensive.
What we can do is introduce a Message Broker between the Votes Service and the Post & Comments Service. Additionally, we can add two fields to the posts and comments collections: one for total upvotes and one for total downvotes. Any time a user upvotes or downvotes a post, the Votes Service adds that entry to its own database and publishes it as an event to the Post & Comments Service. The Post & Comments Service then consumes this event and updates the vote totals, which are stored with the post record.
This approach really helps if we have a short but sudden traffic spike when many people start voting on the same post or on several posts simultaneously, since we can buffer those votes in the message broker, apply the updates gradually, and complete them after the traffic spike is over. Of course, during this time, users will not see the most up-to-date totals of upvotes or downvotes, but that’s okay because we prioritized availability over consistency when we gathered our non-functional requirements.
Finally, now that we have the total number of up votes and down votes stored together with each post and comment, we can load that data easily and efficiently to the user.
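The flow of votes through the message broker can be sketched in a single process, with `queue.Queue` standing in for the broker. All function and variable names are hypothetical, and a real system would use a durable broker with the two services running separately.

```python
import queue
from collections import defaultdict

vote_events = queue.Queue()  # stand-in for the message broker
vote_records = []            # Votes Service store: one record per vote
post_totals = defaultdict(lambda: {"up": 0, "down": 0})  # fields kept with each post


def cast_vote(user_id, post_id, direction):
    """Votes Service: store the individual vote and publish an event."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    vote_records.append((user_id, post_id, direction))
    vote_events.put({"post_id": post_id, "direction": direction})


def drain_vote_events():
    """Post & Comments Service: consume buffered events, update the totals."""
    processed = 0
    while True:
        try:
            event = vote_events.get_nowait()
        except queue.Empty:
            return processed
        post_totals[event["post_id"]][event["direction"]] += 1
        processed += 1


cast_vote("alice", 42, "up")
cast_vote("bob", 42, "up")
cast_vote("carol", 42, "down")

before = dict(post_totals[42])   # totals lag behind while events are buffered
processed = drain_vote_events()
after = dict(post_totals[42])

print(before)  # {'up': 0, 'down': 0}
print(after)   # {'up': 2, 'down': 1}
```

The gap between `before` and `after` is the eventual consistency described above: the votes are safely recorded right away, while the totals catch up once the consumer has processed the buffered events.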
Fault Tolerance/ High Availability (99.9%)
To ensure we don’t lose any data if a database instance or database shard crashes, we can introduce database replication. Similarly, every message broker, data store, and microservice is replicated, adding to our overall system’s fault tolerance.
In addition, we can run our system across multiple data centers in different geographical locations and utilize a global load balancing service. This way, we can always fail over to another region if there is a natural disaster in one of those regions. Of course, as an added benefit, this also improves our performance as our system gets physically closer to our users on different continents.
The following image shows a situation where the closest server is unavailable to the user. In this situation, the user contacts the second closest server that is available.
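The failover decision described here can be sketched as choosing the closest region that is currently healthy. The region names and latency figures are invented for the example; a global load balancing service makes this decision based on its own health checks and routing policies.

```python
def pick_region(latency_ms_by_region, healthy_regions):
    """Return the healthy region with the lowest latency to the user."""
    candidates = [r for r in latency_ms_by_region if r in healthy_regions]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=latency_ms_by_region.get)


# Hypothetical latencies measured from one user's location.
latencies = {"eu-west": 20, "us-east": 95, "ap-south": 180}

normal = pick_region(latencies, {"eu-west", "us-east", "ap-south"})
# The closest region goes down, so the user is sent to the next closest one.
during_outage = pick_region(latencies, {"us-east", "ap-south"})

print(normal)         # eu-west
print(during_outage)  # us-east
```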
Conclusion
Architecture is what helps keep the chaos in check. Software will keep getting more complex, but with a solid foundation, it doesn’t have to feel overwhelming. Taking the time to improve your architecture now can save a lot of headaches later and make life easier for everyone working on the project. It’s not about perfection. It’s about making smart choices that set you up for whatever comes next.
Published at DZone with permission of Milan Karajovic. See the original article here.