Welcome back to this series all about file uploads for the web. In the previous posts, we covered the things we had to do to upload files on the front end, the things we had to do on the back end, and optimizing costs by moving file uploads to Object Storage. The series so far:

- Upload files with HTML
- Upload files with JavaScript
- Receive uploads in Node.js (Nuxt.js)
- Optimize storage costs with Object Storage
- Optimize performance with a CDN
- Secure uploads with malware scans

Today, we'll do more architectural work, but this time it'll be focused on optimizing performance.

Recap of Object Storage Solution

By now, we should have an application that stores uploaded files somewhere in the world. In my case, it's an Object Storage bucket from Akamai cloud computing services, and it lives in the us-southeast-1 region. So when I upload a cute photo of Nugget making a big ol' yawn, I can access it at austins-bucket.us-southeast-1.linodeobjects.com/files/nugget.jpg.

Nugget is a super cute dog. Naturally, a lot of people are going to want to see this. Unfortunately, this photo is hosted in the us-southeast-1 region, so anyone living far away from that region has to wait longer before their eyes can feast on this beast. Latency sucks. And that's why CDNs exist.

What Is a CDN?

CDN stands for "content delivery network": a globally distributed network of connected computers that store copies of the same files, so that when a user requests a specific file, it can be served from the computer nearest to that user. By using a CDN, the distance a request must travel is reduced, and requests resolve faster regardless of a user's location.

Here's a WebPageTest result for that photo of Nugget. The request was made from their servers in Japan, and it took 1.1 seconds to complete.

Instead of serving the file directly from my Object Storage bucket, I can set up a CDN in front of my application to cache the photo all over the world. Users in Tokyo will get the same photo, but served from their nearest CDN location (which is probably in Tokyo), and users in Toronto will get that same file served from their nearest CDN location (which is probably in Toronto). This can have significant performance implications.

Let's look at that same request, but served from behind a CDN. The new WebPageTest results still show the same photo of Nugget, and the request still originated from Tokyo, but this time it took only 0.2 seconds: a fraction of the time!

When a request is made for this image, the CDN checks whether it already has a cached version. If it does, it can respond immediately. If it doesn't, it fetches the original file from Object Storage, then saves a cached version for any future requests.

Note: the numbers reported above are from a single test. They may vary depending on network conditions.

The Compounding Returns of CDNs

The example above focused on improving the delivery speed of uploaded files. In that context, I was only dealing with a single image uploaded to an Object Storage bucket. It shows almost a full second of improvement in response time, which is great, but things get even better when you consider other types of assets. CDNs are great for any static asset (CSS, JavaScript, fonts, images, icons, etc.), and by putting one in front of my application, all the other static files automatically get cached as well. This includes the files that Nuxt.js generates in the build process, which are hosted on the application server.
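To make the hit/miss behavior concrete, here is a minimal, purely conceptual sketch of what an edge node does when a request arrives. A real CDN uses a distributed cache and honors cache-control headers; the TTL here is an arbitrary placeholder, and the origin URL is the bucket from this article (it may not be reachable from your machine).

```python
import time
import urllib.request

ORIGIN = "https://austins-bucket.us-southeast-1.linodeobjects.com"  # origin bucket from this article
TTL_SECONDS = 3600  # hypothetical cache lifetime
cache = {}          # a real CDN keeps this per edge location, not in one process

def serve(path: str) -> bytes:
    """Return the file for `path`, preferring the locally cached (edge) copy."""
    entry = cache.get(path)
    if entry and time.time() - entry["stored_at"] < TTL_SECONDS:
        return entry["body"]                                  # cache hit: respond immediately
    with urllib.request.urlopen(f"{ORIGIN}{path}") as resp:   # cache miss: fetch from origin
        body = resp.read()
    cache[path] = {"body": body, "stored_at": time.time()}    # store for future requests
    return body

# The first request pays the origin round-trip; later requests are served locally.
photo = serve("/files/nugget.jpg")
```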
Static asset caching is especially relevant when you consider the "critical rendering path" and render-blocking resources like CSS, JavaScript, or fonts. When a webpage loads and the browser comes across a render-blocking resource, it pauses parsing and downloads the resource before it continues (hence "render-blocking"). So any latency that affects a single asset may also impact the performance of other assets further down the network cascade. This means the performance improvements from a CDN are compounding. Nice!

So is this about showing cute photos of my dog to more people even faster, or is it about helping you make your applications run faster? YES! Whatever motivates you to build faster websites, including a CDN as part of your application infrastructure is a crucial step if you plan on serving customers from more than one region.

Connect Akamai CDN to Object Storage

I want to share how I set up Akamai with Object Storage because I didn't find much information on the subject, and I'd like to help anyone who's looking for a solution. If it doesn't apply to your use case, feel free to skip this section.

Akamai is the largest CDN provider in the world, with something like 300,000 servers across 4,000 locations. It's used by some of the largest companies in the world, but most enterprise clients don't like sharing which tools they use, so it's hard to find Akamai-related content. (Note: you will need an Akamai account and access to your DNS editor.)

In the Akamai Control Center, I created a new property using the Ion Standard product, which is great for general-purpose CDN delivery. After clicking Create Property, you'll be prompted to choose whether to use the setup wizard to guide you through creating the property or to go straight to the Property Manager settings for the new property. I chose the latter.

In the Property Manager, I had to add a new hostname in the Property Hostnames section. I added the hostname for my application. This is the URL where users will find your application; in my case, it was "uploader.austingil.com". Part of this process also requires setting up an SSL certificate for the hostname. I left the default value, Enhanced TLS, selected.

With all that set up, Akamai shows me the following Property Hostname and Edge Hostname. We'll come back to these later when it's time to make DNS changes.

- Property Hostname: uploader.austingil.com
- Edge Hostname: uploader.austingil.com-v2.edgekey.net

Next, I had to set up the actual property's behavior, which meant editing the Default Rule under the Property Configuration Settings. Specifically, I had to point the Origin Server Hostname to the domain where my origin server will live. In my DNS, I created a new A record pointing origin-uploader.austingil.com to my origin server's IP address, then added a CNAME record that points uploader.austingil.com to the Edge Hostname provided by Akamai.

- A: origin-uploader.austingil.com -> origin server IP
- CNAME: uploader.austingil.com -> uploader.austingil.com-v2.edgekey.net

This lets me build out my CDN configuration and test it as needed, only sending traffic through the CDN when I'm ready.

Finally, to serve files in my Object Storage instance through Akamai, I created a new rule based on the blank rule template. I set the rule criteria to apply to all requests going to the /files/* sub-route. The rule behavior rewrites the request's Origin Server Hostname to my Object Storage location: npm.us-southeast-1.linodeobjects.com.
This way, any request that goes to uploader.austingil.com/files/nugget.jpg is served through the CDN, but the file originates from the Object Storage location. And when you load the application, all the static assets generated by Nuxt are served from the CDN as well. All other requests are passed through Akamai and forwarded to origin-uploader.austingil.com, which points to the origin server.

So that's how I've configured Akamai CDN to sit in front of my application. Hopefully it all made sense, but if you have questions, feel free to ask me.

To Sum Up

Today we looked at what a CDN is, the role it plays in reducing network latency, and how to set up Akamai CDN with Object Storage. But this is just the tip of the iceberg. There's a whole world of CDN configuration tweaks that can squeeze out even more performance. There are also a lot of other performance and security features a CDN can offer beyond static file caching: web application firewalls, faster network path resolution, DDoS protection, bot mitigation, edge compute, automated image and video optimization, malware scanning, request security headers, and more. My colleague, Mike Elissen, also covers some great security topics on his blog.

The most important thing I wanted to convey today is that using a CDN improves file delivery performance by caching content close to the user. I hope you're enjoying the series so far and plan on sticking around until the end. We'll continue next time by looking at ways to protect our servers from malicious file uploads.

Thank you so much for reading. If you liked this article and want to support me, the best ways to do so are to share it and follow me on Twitter.
When it comes to online services, uptime is crucial, but it's not the only thing to consider. Imagine running an online store: having your site available 99.9% of the time might sound good, but what if that 0.1% of downtime happens during the holiday shopping season? That could mean losing out on big sales. And what if most of your customers are only interested in a few popular items? If those pages aren't available, it doesn't matter that the rest of your site is working fine. Sometimes, being available during peak moments can make or break your business.

It's not just e-commerce: a small fraction of airports handle most of the air traffic, just a tiny minority of celebrities are household names, and only a handful of blockbuster movies dominate the box office each year. It's the same distribution pattern everywhere. To be successful, it's important to not only maintain uptime but also be ready for significant events.

Some teams implement change freezes before key times, such as Prime Day, Black Friday, or Cyber Monday. This approach is reasonable, but it can be limiting, as it doesn't allow teams to respond quickly to unexpected opportunities or critical situations. Additionally, not all demand can be predicted, and it's not always clear when those high-impact events will happen.

This is where "reliability when it matters" comes in. We need to be able to adapt and respond quickly to changes in customer demand without being held back by code-freeze periods, while staying prepared for unforeseen situations. By treating time as a valuable resource and understanding the relative significance of different moments, organizations can better translate customer value and adjust risk and availability budgets accordingly. This approach allows organizations to be flexible and responsive to changes in demand without missing out on crucial features or opportunities. In the end, it's about being ready when luck comes your way.

It's important to note that a system is not static; it is constantly changing. The system itself, the infrastructure it's hosted on, and the engineering organization all change over time. This means that knowledge about the system also changes, which can impact reliability. Besides that, incidents and outages are inevitable, no matter how much we try to prevent them. Bugs will be shipped, bad configurations will be deployed, and human error will occur. There can also be interdependencies that amplify outages. An incident rarely has a single cause and is often a combination of factors coming together. The same goes for solutions, which are most effective when they involve a combination of principles and practices working together to mitigate the impact of outages.

Operating a system often means dealing with real-world pressures, such as time, market, and management demands to deliver faster. This can lead to shortcuts being taken and potentially compromise the reliability of the system. Growth and expansion of the user base and organization can also bring additional complexity and result in unintended or unforeseen behaviors and failure modes. However, by adopting a holistic approach and utilizing the principles and practices of engineering I'm going to cover below, we can have the best of both worlds: speed and reliability. It's not an either-or scenario but rather a delicate balance between the two.

What Is Reliability?

Reliability is a vital component of any system, as it guarantees not only availability but also proper functioning.
A system may be accessible, yet if it fails to operate accurately, it lacks reliability. The objective is to achieve both availability and correctness within the system, which entails containing failures and minimizing their impact. However, not all failures carry equal weight. For instance, an issue preventing checkout and payment is far more crucial than a minor glitch in image loading. It's important to focus on ensuring that important functions work correctly during critical moments. In other words, we want to be available and functioning correctly during peak times, serving the most important functionality, whether that's popular pages or critical parts of the process.

Making sure systems work well during busy times is tough, but it's important to approach it in a thoughtful and thorough way. This includes thinking about the technical, operational, and organizational aspects of the system. Key parts of this approach include:

- Designing systems that are resilient, fault-tolerant, and self-healing.
- Proactively testing systems under extreme conditions to identify potential weak spots and prevent regressions.
- Effective operational practices: defining hosting topology, auto-scaling, automating deployments and rollbacks, implementing change management, monitoring, and incident response protocols.
- Navigating the competing pressures of growth, market demands, and engineering quality.
- Cultivating a culture that values collaboration, knowledge sharing, open-mindedness, simplicity, and craftsmanship. It also requires a focus on outcomes in order to avoid indecision and provide the best possible experience for customers.

Below, we're going to expand on the concept of "reliability when it matters" and provide practical steps for organizations to ensure availability and functionality during critical moments. We'll discuss key elements such as designing systems for reliability, proactively testing and monitoring, and also delve into practical steps like automating deployment and incident response protocols.

Reliability Metrics: A Vital Tool for Optimization

When optimizing a service or system, it's essential to first define your objectives and establish a method for monitoring progress. The metrics you choose should give you a comprehensive view of the system's reliability, be easy to understand and share, and highlight areas for improvement. Here are some common reliability metrics:

- Incident frequency: the number of incidents per unit of time.
- Incident duration: the total amount of time incidents last.

While these metrics are a good starting point, they don't show the impact of incidents on customers. Let's consider the following graph (blue: the number of requests per five minutes; red: errors; green: reliability on a 0..1 scale).

Suppose we have two incidents, one at 1 am and one at 2 pm, each causing about 10% of requests to fail for an equal duration of 30 minutes. Treating these incidents as equally impactful on reliability wouldn't reflect their true effects on customers. By considering traffic volume, the reliability metric can better show that an incident during peak traffic has a bigger impact and deserves higher priority. Our goal is to have a clear signal that an incident during peak traffic is a major problem that should be fixed. This distinction helps prioritize tasks and make sure resources are used effectively. For example, it can prevent the marketing team's efforts to bring in more visitors from being wasted.
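To see why traffic weighting matters, here is a small illustrative calculation. The numbers are invented to mirror the two incidents described above (equal duration, equal error rate, very different traffic) and are not from any real system.

```python
# Two incidents of equal duration and equal error rate (10% of requests failing
# for 30 minutes), one at 1 am (low traffic) and one at 2 pm (peak traffic).
# Each bucket is (total_requests, failed_requests) for a five-minute interval.
night_incident = [(1_000, 100)] * 6       # six 5-minute buckets around 1 am
peak_incident  = [(50_000, 5_000)] * 6    # six 5-minute buckets around 2 pm
quiet_rest     = [(20_000, 0)] * 276      # the remaining buckets of the day, no errors

day = night_incident + peak_incident + quiet_rest

# Naive view: both incidents look identical (10% errors for 30 minutes each).
# Traffic-weighted view: failures count in proportion to the requests they affected.
total_requests = sum(total for total, _ in day)
total_failures = sum(failed for _, failed in day)
weighted_reliability = 1 - total_failures / total_requests

night_affected = sum(failed for _, failed in night_incident)
peak_affected = sum(failed for _, failed in peak_incident)
print(f"traffic-weighted reliability for the day: {weighted_reliability:.4f}")
print(f"requests affected at night: {night_affected}, at peak: {peak_affected}")
```

Running this shows the peak incident affecting fifty times as many requests as the night one, which is exactly the signal a duration-only metric hides.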
Additionally, tracking the incident frequency per release can help improve the deployment and testing processes and reduce unexpected issues. In the end, this should lead to faster delivery with lower risk.

Digging Deeper Into Metrics

To get a deeper understanding of these metrics and find areas for improvement, try tracking the following:

- Time to detection: how long it takes to notice an incident.
- Time to notification: how long it takes to notify relevant parties.
- Time to repair: how long it takes to fix an incident.
- Time between incidents: this can reveal patterns or trends in system failures.
- Action item completion rate: the percentage of tasks completed.
- Action item resolution time: the time it takes to implement solutions.
- Percentage of high-severity incidents: this measures the overall reliability of the system.

Finally, regularly reviewing these metrics during weekly operations can help focus on progress, recognize successes, and prioritize. By making this a regular part of your culture, you can use the data from these metrics to drive better decisions and gradually optimize the system. Remember, the usefulness of metrics lies in the actions taken from them and their ability to drive progress. It's a continuous feedback loop of refining both the data and the action items to keep the system improving.

Designing for Resilience

A system that isn't designed to be resilient probably won't handle peak times smoothly. Here are some considerations that can help ensure a system's reliability under a variety of conditions.

Do's:

- Prepare for component failure: by partitioning the service or using isolation, you can limit the blast radius and reduce the impact of failures.
- Implement fault tolerance: mechanisms like retries, request hedging, and backpressure will improve the system's availability and performance (a minimal retry sketch follows at the end of this section).
- Use rate-limiting and traffic quotas: don't rely solely on upstream dependencies to protect themselves; use rate-limiting and traffic quotas to ensure that your system remains reliable.
- Categorize functionality: prioritize functions by categorizing them into "critical," "normal," and "best-effort" tiers. This will help keep essential functions available at all costs during high demand.
- Implement error-pacing and load-shedding: these mechanisms help prevent or mitigate traffic misuse or abuse.
- Continuously challenge the system: keep challenging the system and considering potential failures to identify areas for improvement.
- Plan for recovery: implement failover mechanisms and plan for recovery in the event of a failure. This will help reduce downtime and ensure that essential services are available during challenging conditions.
- Make strategic trade-offs: prioritize essential services during challenging external conditions.

Don'ts:

- Don't assume callers will use your service as intended.
- Don't neglect rare but possible failures; plan and design prevention measures for them.
- Don't overlook the possibility of hardware failures.

I explored some of these ideas in the following blog posts:

- Ensuring Predictable Performance in Distributed Systems
- Navigating the Benefits and Risks of Request Hedging for Network Services
- FIFO vs. LIFO: Which Queueing Strategy Is Better for Availability and Latency?
- Isolating Noisy Neighbors in Distributed Systems: The Power of Shuffle-Sharding
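As a small illustration of the fault-tolerance items above, here is a minimal retry helper with capped exponential backoff and full jitter. It is a sketch only: the function name and limits are hypothetical, and production code should also respect deadlines, retry budgets, and idempotency.

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry `operation` on failure with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise                               # give up: let the caller shed or degrade
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, cap))      # jitter avoids synchronized retry storms

# Usage with a hypothetical dependency call:
# inventory = call_with_retries(lambda: fetch_from_upstream("inventory"))
```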
Reliability Testing

Reliability testing is essential for maintaining the availability and functionality of a system during high demand. To ensure a reliable system, it is important to:

- Design for testability so each component can be tested individually.
- Have good enough test coverage as a prerequisite for being agile.
- Calibrate testing by importance, focusing on essential functions and giving a bit of slack to secondary or experimental features.
- Perform extensive non-functional testing, such as load testing, stress testing, failure-injection testing, soak testing, and fuzzing/combinatorial testing.

It's crucial to avoid:

- Blindly pursuing high coverage numbers.
- Assuming that a single data point provides a comprehensive understanding; ensure that results are robustly reproducible.
- Underinvesting in testing environments and tooling.

Proper testing not only ensures correctness, serves as living documentation, and prevents non-functional regressions, but also helps engineers understand the system more deeply, flex their creative muscles while trying to challenge it, and ultimately create more resilient, reliable systems for the benefit of all stakeholders. Remember: if you don't deliberately stress test your system, your users will do it for you. And you won't be able to choose when that moment comes.

Reliability-Oriented Operations

Operating a distributed system is like conducting an orchestra: a delicate art that requires a high level of skill and attention to detail. Many engineers tend to underestimate the importance of operations or view it as secondary to software development. In reality, operations can have a significant impact on the reliability of a system, just as a conductor's skill and understanding of the orchestra are vital to a harmonious performance. For example, cloud computing providers often offer services built on open-source products. It's not just about using the software but how you use it; this is a big part of the cloud computing provider business.

To ensure reliability, there are three key aspects of operations to consider:

- Running the service: hosting configuration, deployment procedures, and regular maintenance tasks like security patches, backups, and more.
- Incident prevention: monitoring systems in real time to quickly detect and resolve issues, regularly testing the system for performance and reliability, capacity planning, etc.
- Incident response: clear incident response protocols that define the roles and responsibilities of team members during an incident, as well as effective review, communication, and follow-up mechanisms to address issues and prevent similar incidents, or minimize their impact, in the future.

The incident response aspect is particularly crucial, as it serves as a reality check: after all, the measures we took turned out to be insufficient. It's a moment to be humble and realize that the world is much more complex than we thought. We need to be as honest as possible to identify all the engineering and procedural weaknesses that enabled the incident and see what we could do better in the future.

To make incident retrospectives effective, consider incorporating the following practices:

- Assume the reader doesn't have prior knowledge of your service. You write a retrospective first of all to share knowledge, so write clearly enough that others can understand.
- Define the impact of the incident. This helps calibrate the amount of effort to invest in the follow-up measures.
- Reserve the deep process for relatively severe incidents; don't normalize retrospectives by holding them for every minor issue that has no potential for lasting impact.
- Don't stop at comfortable answers. Dig deeper without worrying about personal egos. The goal is to improve processes, not to blame individuals or feel guilt.
- Prioritize action items that would have prevented or greatly reduced the severity of the incident. Aim to have as few action items as possible, each with critical priority.

In terms of not stopping at comfortable answers, it's important to identify and address underlying root causes for long-term reliability. Here are a few examples of surface-level diagnoses of service disruptions:

- Human error while pushing configuration.
- An unreliable upstream dependency causing unresponsiveness.
- A traffic spike leading to temporary unavailability of our service.

It can be difficult to come up with action items that improve reliability in the long term based on these diagnoses. Deeper underlying root causes, on the other hand, may sound like:

- Our system allowed an invalid configuration to be deployed to the whole fleet without safety checks.
- Our service didn't handle upstream unavailability and amplified the outage.
- Our service didn't protect itself from excessive traffic.

Addressing underlying root causes can be more challenging, but it is essential for achieving long-term reliability. This is just a brief overview of what we should strive for in terms of operations; there is much more to explore and consider, from incident response protocols to capacity planning, with many nuances and best practices to be aware of.

The Human Factor in System Reliability

While procedures and mechanisms play a vital role in ensuring system reliability, it is ultimately humans who bring them to life. It's not just about having the right tools but also about cultivating the right mindset to breathe life into those mechanisms and make them work effectively. Here are some of the key qualities and habits that contribute to maintaining reliability (and not only reliability):

- Collaboration with other teams and organizations in order to share knowledge and work towards a common goal.
- A degree of humility and an open-minded approach to new information in order to adapt and evolve the system.
- A focus on simplicity and craftsmanship in order to create evolvable and maintainable systems.
- An action-driven and outcome-focused mindset, avoiding stagnation and indecision.
- A curious and experimental approach akin to that of a child, constantly seeking to understand how the system works and finding ways to improve it.

Conclusion

Ensuring reliability in a system is a comprehensive effort that involves figuring out the right metrics, designing with resilience in mind, and implementing reliability testing and operations. With a focus on availability, functionality, and serving the most important needs, organizations can better translate customer value and adjust risks and priorities accordingly. Building and maintaining a system that can handle even the toughest conditions not only helps drive business success and please customers but also brings a sense of accomplishment to those who work on it.

Reliability is a continuous journey that requires attention, skill, and discipline. By following best practices, continuously challenging the system, and fostering a resilient mindset, teams and organizations can create robust and reliable systems that can withstand any challenges that come their way.
Service meshes are becoming increasingly popular in cloud-native applications as they provide a way to manage network traffic between microservices. Istio, one of the most popular service meshes, uses Envoy as its data plane. However, to maintain the stability and reliability of modern web-scale applications, organizations need more advanced load management capabilities. This is where Aperture comes in. It offers several features, including:

- Prioritized load shedding: drops traffic that is deemed less important to ensure that the most critical traffic is served.
- Distributed rate-limiting: prevents abuse and protects the service from excessive requests.
- Intelligent autoscaling: adjusts resource allocation based on demand and performance.
- Monitoring and telemetry: continuously monitors service performance and request attributes using a built-in telemetry system.
- Declarative policies: provides a policy language that enables teams to define how to react to different situations.

These capabilities help manage network traffic in a microservices architecture, prioritize critical requests, and ensure reliable operations at scale. Furthermore, the integration with Istio for flow control is seamless and requires no application code changes.

In this blog post, we will dive deeper into what a service mesh is, the role of Istio and Envoy, and how they work together to provide traffic management capabilities. Finally, we will show you how to manage load with Aperture in an Istio-configured environment.

What Is a Service Mesh?

Since the advent of microservices architecture, managing and securing service-to-service communication has been a significant challenge. As the number of microservices grows, the complexity of managing and securing the communication between them increases. In recent years, a new approach has emerged that aims to address these challenges: the service mesh. The concept of a service mesh was first introduced in 2016, when a team of engineers from Buoyant, a startup focused on cloud-native infrastructure, released Linkerd, an open-source service mesh for cloud-native applications. Linkerd was designed to be lightweight and unobtrusive, providing a way to manage service-to-service communication without requiring significant changes to the application code.

Istio as a Service Mesh

(Source: Istio)

Istio is an open-source service mesh designed to manage and secure communication between microservices in cloud-native applications. It provides a number of features for managing and securing service-to-service communication, including traffic management, security, and observability. At its core, Istio works by deploying a sidecar proxy alongside each service instance. This proxy intercepts all incoming and outgoing traffic for the service and provides a number of capabilities, including traffic routing, load balancing, service discovery, security, and observability.

When a service sends a request to another service, the request is intercepted by the sidecar proxy, which then applies a set of policies and rules defined in Istio's configuration. These policies and rules dictate how traffic should be routed, how load should be balanced, and how security should be enforced. For example, Istio can be used to implement traffic routing rules based on request headers, such as routing requests from a specific client to a specific service instance.
Istio can also apply circuit-breaking and rate-limiting policies to ensure that a single misbehaving service does not overwhelm the entire system. Istio also provides strong security capabilities, including mutual TLS authentication between services, encryption of traffic between services, and fine-grained access control policies that can restrict access to services based on user identity, IP address, or other factors.

What Is Envoy?

Envoy is an open-source, high-performance edge and service proxy developed by Lyft. It was created to address the challenges of modern service architectures, such as microservices, cloud-native computing, and containerization. Envoy provides a number of features for managing and securing network traffic between services, including traffic routing, load balancing, service discovery, health checks, and more. Envoy is designed to be a universal data plane that can be used with a wide variety of service meshes and API gateways, and it can be deployed as a sidecar proxy alongside service instances. One of the key benefits of Envoy is its high-performance architecture, which makes it well suited for managing large volumes of network traffic in distributed systems. Envoy uses a multi-threaded, event-driven architecture that can handle tens of thousands of connections per host with low latency and high throughput.

Traffic Management in Istio

Istio's traffic management capabilities let you control how network traffic flows between microservices. In Istio, traffic management is primarily done through the Envoy sidecar proxies deployed alongside your microservices. Envoy proxies are responsible for handling incoming and outgoing network traffic and can perform a variety of functions such as load balancing, routing, and security. Let's talk about EnvoyFilter and how it is used in Aperture.

Aperture's EnvoyFilter

One way to customize the behavior of Envoy proxies in Istio is through Envoy filters. Envoy filters are a powerful feature of Envoy that allow you to modify, route, or terminate network traffic based on various conditions such as HTTP headers, request/response bodies, or connection metadata. In Istio, Envoy filters can be added to your Envoy sidecar proxies by creating custom EnvoyFilter resources in Kubernetes. These resources define the filter configuration, the filter type, and the filter order, among other parameters. There are several types of filters that can be used in Istio, including HTTP filters, network filters, and access log filters. Here are some of the Envoy filter types that Aperture uses:

- HTTP filters: modify or route HTTP traffic based on specific criteria. For example, you can add an HTTP filter to strip or add headers, modify request/response bodies, or route traffic to different services based on HTTP headers or query parameters.
- Network filters: modify network traffic at the transport layer, such as TCP or UDP. For example, you can use a network filter to add or remove SSL/TLS encryption or to redirect traffic to a different IP address.

Filters are loaded dynamically into Envoy and can be applied globally to all traffic passing through the proxy or selectively to specific services or routes. Aperture uses EnvoyFilter to implement its flow control capabilities.
The Aperture Agent is integrated with Envoy using EnvoyFilter, which allows the agent to use the External Authorization API (we will learn more about this in the next section). This API allows Aperture to extract metadata from requests and make flow control decisions based on that metadata. With EnvoyFilter, Aperture can intercept, inspect, and modify the traffic flowing through Envoy, providing more advanced and flexible flow control capabilities.

External Authorization API

The External Authorization API is a feature provided by Envoy Proxy that allows external authorization services to make access control decisions based on metadata extracted from incoming requests. It provides a standard interface for Envoy to call external authorization services when making authorization decisions, which allows for more flexible and centralized authorization control for microservices.

Using Envoy's External Authorization API enables Aperture to make flow control decisions based on a variety of request metadata beyond basic authentication and authorization. By extracting and analyzing data from request headers, paths, query parameters, and other attributes, Aperture can gain a more comprehensive understanding of the flow of traffic between microservices. This capability allows Aperture to prioritize critical requests over others, ensuring reliable operations at scale. Moreover, by utilizing Aperture flow control, applications can degrade gracefully in real time, because Aperture can prioritize the different workloads. Aperture uses Envoy's External Authorization definition to describe request metadata, specifically the AttributeContext. However, the ability to extract values from the request body depends on how External Authorization in Envoy was configured.

Access Logs

In addition to flow control, Aperture also provides access logs that developers can use to gain insight into the flow of network traffic in their microservices architecture. These logs capture metadata about each request, such as the request method, path, headers, and response status code, which can be used to optimize application performance and reliability. By analyzing traffic patterns and identifying potential issues or performance bottlenecks, developers can make informed decisions to improve their microservices architecture. Aperture extracts fields from access logs containing high-cardinality attributes that represent key attributes of requests and features. You can find all these extracted fields here. These fields allow for a detailed analysis of system performance and behavior and provide a comprehensive view of individual requests or features within services. Additionally, this data can be stored and visualized using FluxNinja ARC, making it available for querying using an OLAP backend (Druid).

Let's jump into the implementation of Aperture with Istio.

Flow Control Using Aperture With Istio

Load management is important in web-scale applications to ensure stability and reliability. Aperture provides flow control mechanisms such as weighted fair queuing, distributed rate-limiting, and prioritization of critical features to regulate the flow of requests and prevent overloading. These techniques help manage the flow of requests and enable load management.
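To make the idea of metadata-based prioritization concrete, here is a purely illustrative sketch of the kind of decision described above. This is not Aperture's API or policy language; the thresholds, tiers, and the user-type header (which appears in the demo traffic later in this post) are just stand-ins.

```python
# Conceptual sketch only: maps request metadata to a workload priority and
# sheds lower-priority traffic as load rises. NOT Aperture's actual API.
PRIORITY = {"subscriber": 0, "guest": 1, "crawler": 2}  # lower number = more important

def admit(request_headers: dict, load: float) -> bool:
    """Admit or shed a request based on its metadata and current load (0..1)."""
    tier = PRIORITY.get(request_headers.get("user-type", "guest"), 1)
    if load < 0.7:
        return True        # plenty of headroom: admit everything
    if load < 0.9:
        return tier <= 1   # moderate overload: shed best-effort traffic first
    return tier == 0       # severe overload: keep only critical traffic

# Example decisions under heavy load:
print(admit({"user-type": "subscriber"}, load=0.95))  # True
print(admit({"user-type": "guest"}, load=0.95))       # False
```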
Demo Prerequisites

The following are prerequisites for the Aperture flow control integration with Istio:

- A Kubernetes cluster: for the purposes of this demo, you can use a Kubernetes-in-Docker (kind) environment.
- Aperture Controller and Aperture Agent installed in your Kubernetes cluster.
- Istio installed in your Kubernetes cluster.

Step 1: Install Aperture Controller and Aperture Agent

To install the Aperture Controller and the Aperture Agent (running as a DaemonSet) in your Kubernetes cluster, follow the instructions provided in the Aperture documentation.

Step 2: Install Istio

To install Istio in your Kubernetes cluster, follow the instructions provided in the Istio documentation.

Step 3: Enable Sidecar Injection

To enable Istio sidecar injection for your application deployments in a specific namespace, run the following command:

```
kubectl label namespace <namespace-name> istio-injection=enabled
```

This adds the istio-injection=enabled label to the specified namespace, allowing Istio to inject sidecars automatically into its pods. Example:

```
kubectl create namespace demoapp
kubectl label namespace demoapp istio-injection=enabled
```

Step 4: Apply Envoy Filter Patches

To configure Aperture with Istio, four patches have to be applied via EnvoyFilter. These patches serve the following purposes:

- Merge the values extracted from the filter with the OpenTelemetry access log configuration into the HTTP Connection Manager filter for the outbound listener in the Istio sidecar running with the application (a NETWORK_FILTER patch).
- Merge the OpenTelemetry access log configuration with the HTTP Connection Manager filter for the inbound listener in the Istio sidecar running with the application (also a NETWORK_FILTER patch).
- Insert the External Authorization filter before the Router sub-filter of the HTTP Connection Manager filter for the inbound listener in the Istio sidecar running with the application (an HTTP_FILTER patch).
- Insert the External Authorization filter before the Router sub-filter of the HTTP Connection Manager filter, but for the outbound listener in the Istio sidecar running with the application (another HTTP_FILTER patch).

In simpler terms, the patches are applied to the Istio sidecar running alongside the application. They modify the HTTP Connection Manager filter, which manages incoming and outgoing traffic to the application. The patches enable Aperture to extract metadata, such as request headers, paths, and query parameters, to make flow control decisions for each request. This allows Aperture to prioritize critical requests over others, ensure reliable operations at scale, and prevent overloading the application. Refer to the Aperture Envoy configuration documentation to learn more about each patch.

For convenience, Aperture provides two ways to apply these patches:

- Helm: using the Aperture istioconfig Helm chart.
- aperturectl: Aperture's own CLI provides an istioconfig install command targeting the istiod namespace.

Using Helm:

```
helm repo add aperture https://fluxninja.github.io/aperture/
helm repo update
helm upgrade --install aperture-envoy-filter aperture/istioconfig --namespace ISTIOD_NAMESPACE_HERE
```

Using aperturectl:

```
aperturectl install istioconfig --version v0.26.1 --namespace ISTIOD_NAMESPACE_HERE
```

Example:

```
aperturectl install istioconfig --version v0.26.1 --namespace istio-system
```

The default values are aperture-agent for the Aperture Agent service namespace, 8080 for the port, and false for sidecar mode. This means the Aperture Agent target URL is aperture-agent.aperture-agent.svc.cluster.local:8080. If you have installed the Aperture Agent in a different namespace or on a different port, you can create or update the values.yaml file and pass it with the install command.
Please refer to the Aperture documentation for more information on how to configure Aperture with custom values and how to create or update the values.yaml file and pass it with the install command.

Step 5: Verify the Integration

```
kubectl get envoyfilter aperture-envoy-filter -n ISTIOD_NAMESPACE_HERE
```

Example command:

```
kubectl get envoyfilter aperture-envoy-filter -n istio-system
```

Output:

```
NAME                    AGE
aperture-envoy-filter   46s
```

Deploy the Demo App

To check whether the Istio integration is working, install a demo deployment in the demoapp namespace with istio-injection enabled:

```
kubectl create namespace demoapp
kubectl label namespace demoapp istio-injection=enabled
kubectl create -f https://gist.githubusercontent.com/sudhanshu456/10b8cfd09629ae5bbce9900d8055603e/raw/cfb8f0850cc91364c1ed27f6904655d9b338a571/demoapp-and-load-generator.yaml -n demoapp
```

Check that the pods are up and running:

```
kubectl get pods -n demoapp
```

Output:

```
NAME                                  READY   STATUS    RESTARTS   AGE
service1-demo-app-686bbb9f64-rhmpp    2/2     Running   0          14m
service2-demo-app-7f8dd75749-zccx8    2/2     Running   0          14m
service3-demo-app-6899c6bc6b-r4mpd    2/2     Running   0          14m
wavepool-generator-7bd868947d-fmn2w   1/1     Running   0          2m36s
```

Apply a Policy Using aperturectl

To demonstrate control points and their functionality, we will use a basic concurrency limiting policy. Below is the values.yaml file we will apply using aperturectl.

values.yaml:

```yaml
common:
  policy_name: "basic-concurrency-limiting"
policy:
  flux_meter:
    flow_selector:
      service_selector:
        agent_group: default
        service: service1-demo-app.demoapp.svc.cluster.local
      flow_matcher:
        control_point: ingress
  concurrency_controller:
    flow_selector:
      service_selector:
        agent_group: default
        service: service1-demo-app.demoapp.svc.cluster.local
      flow_matcher:
        control_point: ingress
```

Use the command below to apply the policy values.yaml:

```
aperturectl blueprints generate --name=policies/latency-aimd-concurrency-limiting --values-file values.yaml --output-dir=policy-gen --version=v0.26.1 --apply
```

Validate the Policy

Let's check whether the policy was applied:

```
kubectl get policy -n aperture-controller
```

Output:

```
NAME                         STATUS     AGE
basic-concurrency-limiting   uploaded   20s
```

List Active Control Points Using aperturectl

Control points are the targets for flow control decision-making. By defining control points, Aperture can regulate the flow of traffic between services by analyzing request metadata, such as headers, paths, and query parameters. Aperture provides a command-line interface called aperturectl that allows you to view active control points and live traffic samples. Using aperturectl flow-control control-points, you can list the active control points:

```
aperturectl flow-control control-points --kube
```

Output:

```
AGENT GROUP   SERVICE   NAME
default       egress    service1-demo-app.demoapp.svc.cluster.local
default       egress    service2-demo-app.demoapp.svc.cluster.local
default       ingress   service1-demo-app.demoapp.svc.cluster.local
default       ingress   service2-demo-app.demoapp.svc.cluster.local
default       ingress   service3-demo-app.demoapp.svc.cluster.local
```

Live Previewing Requests Using aperturectl

aperturectl provides a flow-control preview feature that enables you to view and visualize live traffic. This feature helps you understand the incoming request attributes, which is an added benefit on top of Istio. You can use it to preview incoming traffic and ensure that Aperture is making the correct flow control decisions based on the metadata extracted from the request headers, paths, query parameters, and other attributes.
```
aperturectl flow-control preview --kube service1-demo-app.demoapp.svc.cluster.local ingress --http --samples 1
```

Output:

```json
{
  "samples": [
    {
      "attributes": {
        "destination": {
          "address": {
            "socketAddress": {
              "address": "10.244.1.11",
              "portValue": 8099
            }
          }
        },
        "metadataContext": {},
        "request": {
          "http": {
            "headers": {
              ":authority": "service1-demo-app.demoapp.svc.cluster.local",
              ":method": "POST",
              ":path": "/request",
              ":scheme": "http",
              "content-length": "201",
              "content-type": "application/json",
              "cookie": "session=eyJ1c2VyIjoia2Vub2JpIn0.YbsY4Q.kTaKRTyOIfVlIbNB48d9YH6Q0wo",
              "user-agent": "k6/0.43.1 (https://k6.io/)",
              "user-id": "52",
              "user-type": "subscriber",
              "x-forwarded-proto": "http",
              "x-request-id": "a8a2684d-9acc-4d28-920d-21f2a86c0447"
            },
            "host": "service1-demo-app.demoapp.svc.cluster.local",
            "id": "14105052106838702753",
            "method": "POST",
            "path": "/request",
            "protocol": "HTTP/1.1",
            "scheme": "http"
          },
          "time": "2023-03-15T11:50:44.900643Z"
        },
        "source": {
          "address": {
            "socketAddress": {
              "address": "10.244.1.13",
              "portValue": 33622
            }
          }
        }
      },
      "parsed_body": null,
      "parsed_path": [
        "request"
      ],
      "parsed_query": {},
      "truncated_body": false,
      "version": {
        "encoding": "protojson",
        "ext_authz": "v3"
      }
    }
  ]
}
```

What's Next?

Generating and applying policies: check out our tutorials on flow-control and load-management techniques here.

Conclusion

In this blog post, we have explored the concept of a service mesh, a tool used to manage the communication between microservices in a distributed system. We have also discussed how Aperture can integrate with Istio, an open-source service mesh framework, to provide enhanced load management capabilities. By following a few simple steps, you can get started with these load management techniques, which include concurrency limiting, dynamic rate limiting, and prioritized load shedding, to improve the performance and reliability of your microservices. To learn more about Aperture, please visit our GitHub repository and documentation site. You can also join our Slack community to discuss best practices, ask questions, and engage in discussions on reliability management.
"Set it and forget it" is the approach most network teams take with their authoritative Domain Name System (DNS). If the system is working and end users can reach the revenue-generating applications, services, and content on your network, then administrators will generally say you shouldn't mess with success.

Unfortunately, the reliability of DNS often causes us to take it for granted. It's easy to write DNS off as a background service precisely because it performs so well. Yet this very "set it and forget it" strategy often creates blind spots for network teams by leaving performance and reliability issues undiagnosed. When those undiagnosed issues pile up or go unaddressed for a while, they can easily metastasize into a more significant network performance problem. The reality is that, like any machine or system, DNS requires the occasional tune-up. Even when it works well, specific DNS errors require attention so minor issues don't flare up into something more consequential. Here are a few pointers for network teams on what to look for when troubleshooting DNS issues.

Set Baseline DNS Metrics

No two networks are configured alike, and no two networks have the same performance profile. Every network has quirks and peculiarities that make it unique. That's why it's important to know what's "normal" for your network before diagnosing any issues. DNS data can give you a sense of average query volume over time. For most businesses, this is going to be a relatively stable number. There will probably be seasonal variations (especially in industries like retail), but these are usually predictable. Most businesses see gradual increases in query volume as their customer base or service volume grows, but this also generally follows a set pattern.

It's also important to look at the mix of query volume. Is most of your DNS traffic going to a particular domain? How steady (or volatile) is the mix of DNS queries among various back-end resources? The answers to these questions will be different for every enterprise and may change based on network team decisions about issues like load balancing, product resourcing, and delivery costs.

Monitor NXDOMAIN Responses

NXDOMAIN responses are a clear indication that something's wrong. It's normal to return at least some NXDOMAINs for "fat finger" queries, standard redirect errors, and user-side issues that are likely outside of a network team's control. The recent Global DNS data report from NS1, an IBM Company, shows that between 3% and 6% of DNS queries receive an NXDOMAIN response for one reason or another. Anything at or near that range is probably to be expected in a "normal" network setup. When you get into double digits, something bigger is probably happening.

The nature of the pattern matters, though. A slow but steady increase in NXDOMAIN responses is probably a long-standing misconfiguration issue that mimics overall traffic volume. A sudden spike in NXDOMAINs could be either a localized (but highly impactful) misconfiguration or a DDoS attack. The key is to keep a steady eye on NXDOMAIN responses as a percentage of overall query volume. Deviation from the norm is usually a clear sign that something is not right; then it becomes a question of why it's not right and how to fix it. In most cases, a deeper dive into the timing and characteristics of the abnormal uptick will provide clues about why it's happening.

NXDOMAIN responses aren't always a bad thing. In fact, they can represent a potential business opportunity: if someone is trying to query a domain or subdomain of yours and coming up empty, that could indicate it's a domain you should buy or start using.
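To make the "deviation from the norm" check concrete, here is a minimal sketch of an NXDOMAIN-rate alert. It assumes you can already count total queries and NXDOMAIN responses for a period (from your DNS provider's analytics or your own query logs); the baseline and tolerance values are placeholders you would tune to your own network.

```python
def nxdomain_alert(total_queries: int, nxdomain_responses: int,
                   baseline_pct: float = 4.5, tolerance_pct: float = 3.0) -> bool:
    """Return True when the NXDOMAIN share drifts well outside the baseline."""
    if total_queries == 0:
        return False
    pct = 100 * nxdomain_responses / total_queries
    return abs(pct - baseline_pct) > tolerance_pct

# A day in the typical 3-6% range: no alert.
print(nxdomain_alert(1_200_000, 55_000))   # ~4.6% -> False
# A sudden spike into double digits: worth investigating.
print(nxdomain_alert(1_200_000, 150_000))  # 12.5% -> True
```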
Watch Out for Exposure of Internal DNS Data

One particularly concerning type of NXDOMAIN response is caused by misconfigurations that expose internal DNS zone and record data to the internet. Not only does this kind of misconfiguration weigh on performance by creating unnecessary query volume, but it's also a significant security issue. Stale URL redirects are often the cause of exposed internal records. In the upheaval of a merger or acquisition, systems sometimes get pointed at properties that fade away or are repurposed for other uses. The systems are still publicly looking for the old connection but not finding the expected answer. The smaller the workload, the more likely it is to go unnoticed.

Pay Attention to Geography

If you set a standard baseline for where your traffic is coming from, it's easier to discover anomalous DDoS attacks, misconfigurations, and even broader changes in usage patterns as they emerge. A sudden uptick in traffic to a specific regional server is a different kind of issue than a broader increase in overall query volume. Tracking your DNS data by geography helps identify the issue you're facing and ultimately provides clues on how to deal with it.

Check SERVFAILs for Misconfigured Alias Records

Alias records are a frequent source of misconfigurations and deserve regular audits in their own right. I've found that an increase in SERVFAIL responses, whether a sudden spike or a gradual rise, can often be traced back to problems with alias records.

NOERROR NODATA? Consider IPv6

NXDOMAIN responses are pretty straightforward: the record wasn't found. Things get a little more nuanced when the response comes back as NOERROR but no answer is returned. While there's no official RFC code for this situation, it's usually known as a NOERROR NODATA response: the answer counter returns "0". NOERROR NODATA means the name was found, but not with the record type that was queried. If you're seeing a lot of NOERROR NODATA responses, in our experience the resolver is usually looking for an AAAA record, and adding support for IPv6 usually fixes the problem.

DNS Cardinality and Security Implications

In the world of DNS, there are two types of cardinality to worry about. Resolver cardinality refers to the number of resolvers querying your DNS records. Query name cardinality refers to the number of different DNS names for which you receive queries each minute. Measuring DNS cardinality is important because it may indicate malicious activity. Specifically, an increase in DNS query name cardinality can indicate a random label attack or mass probing of your infrastructure, while an increase in resolver cardinality may indicate that you are being targeted by a botnet. If you suddenly see an increase in resolver cardinality, it's likely an indication of some sort of attack.

Conclusion

These pointers should help you better understand the impact of DNS query behavior and some steps you can take to get your DNS to a healthy state. Feel free to comment below with any other tips you've learned throughout your career.
The Apollo Router is a powerful routing solution designed to replace the GraphQL gateway. Built in Rust, it offers a high degree of flexibility, loose coupling, and exceptional performance. This self-hosted graph routing solution is highly configurable, making it an ideal choice for developers who require a high-performance routing system. With its ability to handle large amounts of traffic and complex data, the Apollo Router is quickly becoming a popular choice among developers seeking a reliable and efficient routing solution.

The binary distribution of the Apollo Router comes with a built-in telemetry plugin that functions as an OpenTelemetry collector agent. This plugin is responsible for sending traces and logs to the appropriate endpoints, making it an essential component of the performance monitoring process. By leveraging the telemetry plugin, developers can gain valuable insight into the performance of their applications, identify bottlenecks, and optimize their systems accordingly. With this integrated telemetry functionality, the Apollo Router provides a streamlined and efficient performance monitoring solution.

OpenTelemetry Agent and Collector

OpenTelemetry offers an agent (here, the router's telemetry plugin) that collects telemetry data, including metrics and traces, from a host. This agent runs on the host machine and provides a centralized reporting point, which is particularly useful when running multiple instances of the router. The collector enables developers to process metrics and send them to multiple destinations beyond their APM tool, providing flexibility and versatility in performance monitoring.

After the router's telemetry agent collects telemetry data from the host, it forwards that data to a collector. The collector is a separate program that receives and processes the telemetry data, making it easier to analyze and understand. By sending data to the collector, developers gain a more comprehensive view of their application's performance, which can help them identify potential issues and optimize their systems. A collector also lets developers store data in a centralized location, making it easier to access and analyze.

Telemetry plugin configuration for the Apollo Router agent:

OpenTelemetry Collector configuration:

Splunk APM (Application Performance Monitoring)

Splunk APM is a sophisticated tool for application performance monitoring and troubleshooting, especially for cloud-native and microservices-based applications. It is built on open source and OpenTelemetry instrumentation, which enables the collection of data from various programming languages and environments. With its advanced features, Splunk APM provides an efficient and reliable solution for monitoring application performance and identifying and resolving issues quickly.

The OpenTelemetry Collector is the tool used to export data to your Application Performance Monitoring (APM) tool. To set up a basic configuration for the OpenTelemetry Collector, you need both an OpenTelemetry Protocol (OTLP) receiver for the router and an exporter to forward the data to your APM tool. With the OTLP receiver and exporter in place, the Collector can collect and transmit data from various sources to your APM tool, making it an essential component for effective application monitoring and troubleshooting.
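Before wiring up the router, it can be handy to confirm that the Collector's OTLP receiver is reachable and that its export pipeline works. The following is an illustrative sketch, not part of the router setup: it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp Python packages are installed and that a collector is listening for OTLP/gRPC on localhost:4317.

```python
# Emit a single test span to a collector's OTLP/gRPC receiver (assumed endpoint).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "otlp-smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("collector-smoke-test"):
    pass  # the span itself is the payload; look for it in the APM backend

provider.shutdown()  # flush the batch processor before exiting
```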
In conclusion, telemetry plays a crucial role in performance monitoring and optimization. The Apollo Router comes with a built-in telemetry plugin that functions as an OpenTelemetry collector agent, allowing developers to gain valuable insight into their application's performance. OpenTelemetry also offers an agent and collector for collecting telemetry data from a host and sending it to a centralized reporting system. Additionally, the OpenTelemetry Collector is an essential tool for exporting data to an APM tool such as Splunk APM, which provides a sophisticated solution for monitoring application performance and identifying and resolving issues quickly. By leveraging these tools, developers can optimize their system's performance, enhance user experience, and ultimately achieve their business objectives.
Overview If you have ever downloaded a new Linux distribution ISO image, you may have wondered how to access the contents inside the image prior to repartitioning your disk and installing the operating system onto your local disk. This can be done via a loop mount in Linux. In Linux and other UNIX-like systems, it is possible to use a regular file as a block device. A loop device is a virtual or pseudo-device which enables a regular file to be accessed as a block device. Say you want to create a Linux file system but do not have a free disk partition available. In such a case, you can create a regular file on the disk and create a loop device using this file. The device node listing for the new pseudo-device can be seen under /dev. This loop device can then be used to create a new file system. The file system can be mounted, and its contents can be accessed using normal file system APIs. Uses of Loop Device As described above, one of the uses is creating a file system with a regular file when no disk partition is available. Another common use of a loop device is with ISO images of installable operating systems. The contents of ISO images can be easily browsed by mounting the ISO image as a loop device. Creating a Loop Device in Linux These commands require root privilege. 1. Create a large regular file on disk that will be used to create the loop device. # dd if=/dev/zero of=/loopfile bs=1024 count=51200 51200+0 records in 51200+0 records out 52428800 bytes (52 MB, 50 MiB) copied, 0.114882 s, 456 MB/s This command creates a 50 MiB file called /loopfile filled with zeros. If you already have an image file that you want to mount as a loop device, then you can skip this step. 2. Create a loop device with the large file created above. There may be some loop devices already created. Run the following command to find the first available device node. # losetup -f /dev/loop1 So we can safely use /dev/loop1 to create our loop device. Create the loop device with the following command. # losetup /dev/loop1 /loopfile If you see no errors, the regular file /loopfile is now associated with the loop device /dev/loop1. 3. Confirm creation of the loop device. # losetup /dev/loop1 /dev/loop1: [66309]:214 (/loopfile) Creating a Linux Filesystem With the Loop Device You can now create a normal Linux filesystem with this loop device. 1. Create an ext4 filesystem using /dev/loop1. # mkfs -t ext4 -v /dev/loop1 mke2fs 1.45.3 (14-Jul-2019) fs_types for mke2fs.conf resolution: 'ext4', 'small' Discarding device blocks: done Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=0 blocks, Stripe width=0 blocks 12800 inodes, 12800 blocks 640 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=14680064 1 block group 32768 blocks per group, 32768 fragments per group 12800 inodes per group Allocating group tables: done Writing inode tables: done Creating journal (1024 blocks): done Writing superblocks and filesystem accounting information: done 2. Create a mount point for the filesystem. # mkdir /mnt/loopfs 3. Mount the newly created filesystem. # mount -t ext4 /dev/loop1 /mnt/loopfs This command mounts the loop device as a normal Linux ext4 filesystem, on which normal filesystem operations can be performed. 4. Check disk usage of the file system. # df -h /dev/loop1 Filesystem Size Used Avail Use% Mounted on /dev/loop1 45M 48K 41M 1% /mnt/loopfs 5. Use tune2fs to see the filesystem settings.
# tune2fs -l /dev/loop1 tune2fs 1.45.3 (14-Jul-2019) Filesystem volume name: <none> Last mounted on: <not available> Filesystem UUID: b1b13d6e-c544-45dd-a549-5846371fbde6 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 12800 Block count: 12800 Reserved block count: 640 Free blocks: 11360 Free inodes: 12789 First block: 0 Block size: 4096 Fragment size: 4096 Group descriptor size: 64 Reserved GDT blocks: 6 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 12800 Inode blocks per group: 400 Flex block group size: 16 Filesystem created: Sun Mar 19 08:56:47 2023 Last mount time: Sun Mar 19 09:00:52 2023 Last write time: Sun Mar 19 09:00:52 2023 Mount count: 1 Maximum mount count: -1 Last checked: Sun Mar 19 08:56:47 2023 Check interval: 0 (<none>) Lifetime writes: 37 kB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 128 Journal inode: 8 Default directory hash: half_md4 Directory Hash Seed: e489fd33-4003-4235-9347-144c7a5d4d73 Journal backup: inode blocks Checksum type: crc32c Checksum: 0x3b8c797a 6. To unmount the filesystem and delete the loop device, run the following commands. # umount /mnt/loopfs/ # losetup -d /dev/loop1
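As mentioned under "Uses of Loop Device" above, the same mechanism makes it easy to browse an installer ISO image without writing it to any media; a minimal sketch, where the image file name is only an example: # mkdir /mnt/iso # mount -o loop -t iso9660 linux-distro.iso /mnt/iso # ls /mnt/iso # umount /mnt/iso Here the -o loop option tells mount to set up (and later tear down) the loop device automatically, so a separate losetup step is not required.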
I enjoy improving application performance. After all, the primary purpose of computers is to execute tasks efficiently and swiftly. When you consider the fundamentals of computing, it seems almost magical — at its core, it involves simple arithmetic and logical operations, such as addition and comparison of binary numbers. Yet, by rapidly performing countless such operations, computers enable us to enjoy video games, watch endless videos, explore the vast expanse of human knowledge and culture, and even unlock the secrets of the universe and life itself. That is why optimizing applications is so important — it allows us to make better use of our most precious resource: time. To paraphrase a famous quote: In the end, it’s not the number of years you live that matter, but the number of innovations that transpire within those years. However, application performance often takes a backseat until it becomes a pressing issue. Why is this the case? Two primary reasons are at play: prioritization and the outdated notion of obtaining performance improvements for free. Software Development Priorities When it comes to system architecture, engineers typically rank their priorities as follows: 1. Security 2. Reliability 3. Performance This hierarchy is logical, as speed is only valuable if tasks can be executed securely and reliably. Crafting high-quality software demands considerable effort and discipline. The larger the scale of service, the more challenging it becomes to ensure security and reliability. In such scenarios, teams often grapple with bugs, new feature requests, and occasional availability issues stemming from various factors. Consequently, performance is frequently perceived as a luxury, with attention given only if it impacts the bottom line. This pattern is not unique to software development, though it may not be as apparent in other industries. For example, we rarely evaluate a car based on its "reliability" nowadays. In the past, this was not the case — it took several decades for people to shift their focus to aspects like design, comfort, and fuel efficiency, which were once considered "luxuries" or non-factors. A similar evolution is bound to occur within the software industry. As it matures, performance will increasingly become a key differentiator in a growing number of product areas, further underscoring its importance. The End of Free Performance Gains Another factor contributing to the low prioritization of performance is that it was once relatively effortless to achieve. As computers continually became faster, there was little incentive to invest in smarter engineering when a simple hardware upgrade could do the trick. However, this is no longer the case: According to Karl Rupp's microprocessor trend analysis data, CPU frequency has plateaued for over 15 years, with only marginal improvements in single-core speed. Although the number of CPU cores has increased to compensate for stagnating clock speeds, harnessing the power of multiple cores or additional machines demands more sophisticated engineering. Moreover, the law of diminishing returns comes into play rather quickly, emphasizing the need to prioritize performance in software development. Optimizing Application Performance Achieving exceptional results is rarely a matter of chance. If you want your product to meet specific standards and continually improve, you need a process that consistently produces such outcomes. 
For example, to develop secure applications, your process should include establishing security policies, conducting regular security reviews, and performing scans. Ensuring reliability involves multiple layers of testing (unit tests, integration tests, end-to-end tests), as well as monitoring, alerting, and training organizations to respond effectively to incidents. Owners of large systems also integrate stress and failure injection testing into their practices, often referred to as "chaos engineering." In essence, to build secure and reliable systems, software engineers must adhere to specific rules and practices applicable to all stages of a system development life cycle. Performance optimization is no exception to this principle. Application code is subject to various forces — new features, bug fixes, version upgrades, scaling, and more. While most code changes are performance-neutral, they are more likely to hinder performance than enhance it unless the modification is a deliberate optimization. Therefore, it is essential to implement a mechanism that encourages performance improvements while safeguarding against inadvertent regressions: It sounds simple in theory — we run a performance test suite and compare the results to previous runs. If a regression is detected, we block the release and fix it. In practice, it is a bit more complicated than that, especially for large/old projects. There could be different use cases that involve multiple independent components. Only some of them may regress, or such regressions only manifest under particular circumstances. These components communicate over the network, which may have noise levels high enough to hide regressions, especially for tail latency and outliers. The application may show no signs of performance degradation under a low load but substantially slow down under stress. Performance measurements may not capture general tendencies because some pay attention only to specific latency percentiles. On top of that, it’s impossible to model real-life conditions in non-production environments. So, without proper metrics and profiling, it would be difficult to identify and attribute performance issues. In other words, it requires a deep understanding of the system and the infrastructure, as well as having the right tools and knowledge of how to effectively put all of that together. In this post, I will touch only the tip of this iceberg. Essential Factors for Measuring Service Performance When aiming to measure the performance of an isolated network service, it is crucial to establish a well-defined methodology. This ensures that the results are accurate and relevant to real-world scenarios. The following criteria are critical for assessing service performance: Avoid network latency/noise Prevent “noisy neighbor” issues from a Virtual Machine (VM) co-location Achieve latency resolution down to microseconds Control request rate Minimize overhead from the benchmarking tool Saturate CPU up to 100% to determine the limit of your service Provide comprehensive latency statistics Let's delve deeper into each of the essential criteria for effectively measuring service performance. Avoid Network Latency/Noise Network latency is significantly larger than in-memory operations and is subject to various random events or states, such as packet loss, re-transmissions, and network saturation. These factors can impact results, particularly tail latency. 
Prevent “Noisy Neighbor” Issues From VM Co-Location Cloud providers often place multiple virtual machines (VMs) on the same hardware. To ensure accurate measurements, use either a dedicated "bare-metal" instance or a VM without over-subscription, providing exclusive access to all available CPUs. Control Request Rate Comparing different solutions at low, moderate, and high request rates allows for a more comprehensive evaluation. Thread contention and implementation deficiencies may only become evident at certain load levels. Additionally, ramping up traffic linearly helps analyze latency degradation patterns. Achieve Latency Resolution Down to Microseconds Millisecond latency resolution is suitable for measuring end-user latency but may be insufficient for a single service. To minimize noise-contributing factors and measure pure operation costs, a higher resolution is necessary. Although measuring batches of operations in milliseconds can be informative, this approach only calculates the average, losing visibility into percentiles and variance. Saturate CPU up to 100% To Determine the Service Limits The true test for a system is its performance at high request rates. Understanding a system's maximum throughput is crucial for allocating hardware resources and handling traffic spikes effectively. Minimize Overhead From the Benchmarking Tool Ensuring the precision and stability of measurements is paramount. The benchmarking tool itself should not be a significant source of overhead or latency, as this would undermine the reliability of the results. Comprehensive Latency Statistics To effectively measure performance, it is essential to utilize a range of metrics and statistics. The following list outlines some key indicators to consider: p50 (the median) — a value that is greater than 50% of observed latency samples. p90 — the 90th percentile, a value better than 9 out of 10 latency samples. This is typically a good proxy for latency perceivable by humans. p99 (tail latency) — the 99th percentile, the threshold for the worst 1% of samples. Outliers: p99.9 and p99.99 — crucial for systems with multiple network hops or large fan-outs (e.g., a request gathering data from tens or hundreds of microservices). max — the worst-case scenario should not be overlooked. tm99.9 — (trimmed mean), the mean value of all samples, excluding the best and worst 0.1%. This is more informative than the traditional mean, as it eliminates the potentially disproportionate influence of outliers. stddev — (standard deviation) — a measure of stability and predictability. While most of these metrics are considered standard in evaluating performance, it is worth taking a closer look at two less popular ones: outliers and the trimmed mean Outliers (P99.9 and P99.99) In simple terms, p99.9 and p99.99 represent the worst request per 1,000 and per 10,000 requests, respectively. While this may not seem significant to humans, it can considerably impact sub-request latency. Consider this example: To process a user request, your service makes 100 sub-requests to other services (whether sequential or parallel doesn't matter for this example). Assume the following latency distribution: 999 requests — 1ms 1 request — 1,000ms What is the probability your users will see 1,000+ms latency? The chance of a sub-request encountering 1s latency is 0.1%, but the probability of it occurring for at least one of the 100 sub-requests is: (1 — (1–0.001)¹⁰⁰)=0.095 , or 9.5%! 
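The same calculation can be written generally; this is just the standard independence argument, with p the probability that a single sub-request lands in the slow tail and n the number of sub-requests:

P(\text{at least one slow sub-request}) = 1 - (1 - p)^{n}

For p = 0.001 and n = 100, this gives 1 - 0.999^{100} \approx 0.095, the 9.5% quoted above.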
Consequently, this may impact your end-user p90, which is certainly noticeable. The following image and table illustrate how the number of sub-requests translates p99, p99.9, and p99.99 into end-user latency percentiles (x-axis — number of sub-requests; y-axis — the end-user percentile), with the table showing the same data for a few data points. As demonstrated, p99 may not be an accurate measure for systems with a large number of sub-requests. Therefore, it is crucial to monitor outliers like p99.9 or p99.99, depending on your system architecture, scale, and objectives. Trimmed Mean Are percentiles sufficient? The mean is commonly considered a flawed metric due to its sensitivity to outliers, which is why the median, also known as p50, is generally recommended as a better alternative. It's true that the mean can be sensitive to outliers, which can distort the metric. Let's assume we have the following latency measurements with a single outlier; the table shows that the median gives a more precise picture of the general tendency. Indeed, the mean of 1,318.2 is not representative, and the median is much better at capturing the average user experience. However, the issue with percentiles is that they act as fences and are not sensitive to what happens inside the fence. In other words, this metric might not be proportional to the user experience. Let's assume our service degraded (from the top row of the table to the bottom one): the median didn't change, but users noticed the service became slower. That's pretty bad. We may be degrading the user experience, and our p50/p90 thresholds won't detect it. Let's examine how the trimmed mean would handle this. It discards the best and the worst samples and calculates the average: the trimmed mean captures the general tendency better than the median. As shown, it reflects the performance regression while still providing a good sense of the average user experience. With the trimmed mean, we can enjoy the best of both worlds. Practice Exercise Now that we've explored the theoretical aspects of performance measurement, let's dive into a practical exercise to put this knowledge to use. One interesting area to investigate is the baseline I/O performance of TCP proxies implemented in various languages such as C, C++, Rust, Golang, Java, and Python. A TCP proxy serves as an ideal test case because it focuses on the fundamental I/O operations without any additional processing. In essence, no TCP-based network service can be faster than a TCP proxy, making it an excellent subject for comparing the baseline performance across different programming languages. Upon starting my search for a benchmarking tool, I was somewhat surprised by how hard it was to find an appropriate option, especially within the open-source domain. Several existing tools are available for performance measurements and stress testing. However, tools written in Java/Python/Golang may suffer from additional overhead and noise generated by garbage-collection pauses. The Apache benchmarking tool ab doesn't allow the user to control the request rate, its latency statistics aren't as detailed, and it measures latency in milliseconds. To address these issues, I created a performance-gauging tool in Rust that satisfies all of the requirements mentioned earlier.
It can produce output like this: Test: http-tunnel-rust Duration 60.005715353s Requests: 1499912 Request rate: 24996.152 per second Success rate: 100.000% Total bytes: 15.0 GB Bitrate: 1999.692 Mbps Summary: 200 OK: 1499912 Latency: Min : 109µs p50 : 252µs p90 : 392µs p99 : 628µs p99.9 : 1048µs p99.99 : 1917µs Max : 4690µs Mean : 276µs StdDev : 105µs tm95 : 265µs tm99 : 271µs tm99.9 : 274µs This output is useful for a single run. Unfortunately, a single run is never enough if you want to measure performance properly. It is also difficult to use this output for anything but taking a glance and forgetting the actual numbers. Reporting metrics to Prometheus and running a test continuously seems to be a natural solution to gather comprehensive performance data over time. Below, you can see the ramping up of traffic from 10k to 25k requests per second (rps) and the main percentiles (on the left — latency in µs; on the right — the request rate in 1,000 rps): Sending 25,000 requests per second and measuring the latency to µs resolution Why Prometheus? It’s an open-source framework integrated with Grafana and Alertmanager, so the tool can be easily incorporated into existing release pipelines. There are plenty of Docker images that allow you to quickly start Prom stack with Grafana (I installed them manually, which was also quick and straightforward.) Then I configured the following TCP proxies: HAProxy— in TCP-proxy mode. To compare to a mature solution written in C: http://www.haproxy.org/ draft-http-tunnel — a simple C++ solution with very basic functionality (trantor) (running in TCP mode): https://github.com/cmello/draft-http-tunnel/ (thanks to Cesar Mello, who coded it to make this benchmark possible). http-tunnel — a simple HTTP-tunnel/TCP-proxy written in Rust (tokio) (running in TCP mode): https://github.com/xnuter/http-tunnel/ tcp-proxy — a Golang solution: https://github.com/jpillora/go-tcp-proxy NetCrusher — a Java solution (Java NIO). Benchmarked on JDK 11, with G1: https://github.com/NetCrusherOrg/NetCrusher-java/ pproxy — a Python solution based on asyncio (running in TCP Proxy mode): https://pypi.org/project/pproxy/ All of the solutions above use Non-blocking I/O, it’s the best method to handle network communication if you need highly available services with low latency and large throughput. All these proxies were compared in the following modes: 25k rps, reusing connections for 50 requests. Max possible rate to determine the max throughput. 3.5k connections per second serving a single request (to test resource allocation and dispatching). If I were asked to summarize the test result in a single image, I would pick this one (from the max throughput test, the CPU utilization is close to 100%): Left axis: latency overhead. Right axis: throughput The blue line is tail latency (Y-axis on the left) — the lower, the better. The grey bars are throughput (Y-axis on the right) — the higher, the better. As you can see, Rust is indeed on par with C/C++, and Golang did well too. Java and Python are way behind. You can read more about the methodology and detailed results in the perf-gauge's wiki: Benchmarking low-level I/O: C, C++, Rust, Golang, Java, Python, in particular, measuring by these dimensions: Moderate RPS Max RPS Establishing connections (TLS handshake) Closing Thoughts In conclusion, understanding and accurately measuring the performance of network services is crucial for optimizing and maintaining the quality of your systems. 
As the software industry matures and performance becomes a key differentiator, it is essential to prioritize it alongside security and reliability. Historically, performance improvements were easily achieved through hardware upgrades. However, as CPU frequency plateaus and harnessing the power of multiple cores or additional machines demands more sophisticated engineering, it is increasingly important to focus on performance within the software development process itself. The evolution of performance prioritization is inevitable, and it is essential for software development teams to adapt and ensure that their systems can deliver secure, reliable, and high-performing services to their users. By emphasizing key metrics and statistics and applying accurate performance assessment methods, engineers can successfully navigate this shifting landscape and continue to create high-quality, competitive software.
In the initial "How to Move IBM App Connect Enterprise to Containers" post, a single MQ queue was used in place of an actual MQ-based back-end system used by an HTTP Input/Reply flow, which allowed for a clean example of moving from local MQ connections to remote client connections. In this post, we will look at what happens when an actual back-end is used with multiple client containers and explore solutions to the key challenge: how do we ensure that reply messages return to the correct server when all the containers have identical starting configurations? For a summary of the use of MQ correlation IDs; see Correlation ID Solution below. Issues With Multiple Containers and Reply Messages Consider the following scenario, which involves an HTTP-based flow that calls an MQ back-end service using one queue for requests and another for replies (some nodes not shown): The HTTP input message is sent to the MQ service by the top branch of the flow using the input queue, and the other branch of the flow receives the replies and sends them back to the HTTP client.The picture is complicated slightly by HTTPReply nodes requiring a “reply identifier” in order to know which HTTP client should receive a reply (there could be many clients connected simultaneously), with the identifier being provided by the HTTPInput node. The reply identifier can be saved in the flow (using ESQL shared variables or Java static variables) in the outbound branch and restored in the reply branch based on MQ-assigned message IDs, or else sent to the MQ service as a message or correlation ID to be passed back to the reply branch, or possibly sent as part of the message body; various solutions will work in this case.This behaves well with only one copy of the flow running, as all replies go through one server and back to the calling client. If the ACE container is scaled up, then there will be a second copy of the flow running with an identical configuration, and it might inadvertently pick up a message intended for the original server. At that point, it will attempt to reply but discover that the TCPIP socket is connected to the original server: This situation can arise even with only a single copy of the container deployed: a Kubernetes rolling update will create the new container before stopping the old one, leading to the situation shown above due to both containers running at the same time. While Kubernetes does have a “Recreate” deploy strategy that eliminates the overlap, it would clearly be better to solve the problem itself rather than restricting solutions to only one container. Containers present extra challenges when migrating from on-prem integration nodes: the scaling and restarts in the container world are often automated and not directly performed by administrators, and all of the replica containers have the same flows with the same nodes and options. There is also no “per-broker listener” in the container case, as each server has a separate listener. Solutions The underlying problem comes down to MQInput nodes picking up messages intended for other servers, and the solutions come in two general categories: Use correlation IDs for reply messages so that the MQInput nodes only get the messages for their server.Each server has a specific correlation ID, the messages sent from the MQOutput node are specific to that correlation ID, and the back-end service copies the correlation ID into the response message. 
The MQInput node only listens for messages with that ID, and so no messages for other servers will be picked up. This solution requires some way of preserving the HTTP reply identifier, which can be achieved in various ways. Create a separate queue for each server and configure the MQInput nodes for each container to use a separate queue.In this case, there is no danger of messages going back to the wrong server, as each server has a distinct queue for replies. These queues need to be created and the flows configured for this to work with ACE. The second category requires custom scripting in the current ACE v12 releases and so will not be covered in this blog post, but ACE v12 does have built-in support for the first category, with several options for implementing solutions that will allow for scaling and redeploying without messages going to the wrong server. Variations in the first category include the “message ID to correlation ID” pattern and synchronous flows, but the idea is the same. While containers show this problem (and therefore the solutions) nicely, the examples described here can be run on a local server also, and do not need to be run in containers. Scaling integration solutions to use multiple servers is much easier with containers, however, and so the examples focus on those scenarios. Example Solution Scenario Overview Building on the previous blog post, the examples we shall be using are at this GitHub repo and follow this pattern: The previous blog post showed how to create the MQ container and the ACE container for the simple flow used in that example, and these examples follow the same approach but with different queues and ACE applications. Two additional queues are needed, with backend-queues.mqsc showing the definitions: DEFINE QLOCAL(BACKEND.SHARED.INPUT) REPLACE DEFINE QLOCAL(BACKEND.SHARED.REPLY) REPLACE Also, the MQSC ConfigMap shown in a previous MQ blog post can be adjusted to include these definitions. The same back-end is used for all of the client flows, so it should be deployed once using the “MQBackend” application. A pre-built BAR file is available that can be used in place of the BAR file used in the post, "From IBM Integration Bus to IBM App Connect Enterprise in Containers (Part 4b)," and the IS-github-bar-MQBackend.yaml file can be used to deploy the backend service. See the README.md for details of how the flow works. Correlation ID Solution The CorrelIdClient example shows one way to use correlation IDs: This client flow relies on: The back-end flow honoring the MQMD Report settings A unique per-server correlation ID being available as a user variable The HTTP RequestIdentifier being usable as an MQ message ID The MQBackend application uses an MQReply node, which satisfies requirement 1, and requirement 3 is satisfied by the design of the ACE product itself: the request identifier is 24 bytes (which is the size of MQ’s MsgId and CorrelId) and is unique for a particular server. Requirement 2 is met in this case by setting a user variable to the SHA-1 sum of the HOSTNAME environment variable. SHA-1 is not being used in this case for cryptographic purposes but rather to ensure that the user variable is a valid hex number (with only letters from A-F and numbers) that is 24 bytes or less (20 in this case). 
The sha1sum command is run from a server startup script (supported from ACE 12.0.5) using the following server.conf.yaml setting (note that the spaces are very important due to it being YAML): YAML StartupScripts: EncodedHostScript: command: 'export ENCODED_VAR=`echo $HOSTNAME | sha1sum | tr -d "-" | tr -d " "` && echo UserVariables: && /bin/echo -e " script-encoded-hostname: \\x27$ENCODED_VAR\\x27"' readVariablesFromOutput: true This server.conf.yaml setting will cause the server to run the script and output the results: YAML 2022-11-10 13:04:57.226548: BIP9560I: Script 'EncodedHostScript' is about to run using command 'export ENCODED_VAR=`echo $HOSTNAME | sha1sum | tr -d "-" | tr -d " "` && echo UserVariables: && /bin/echo -e " script-encoded-hostname: \\x27$ENCODED_VAR\\x27"'. UserVariables: script-encoded-hostname:'adc83b19e793491b1c6ea0fd8b46cd9f32e592fc' 2022-11-10 13:04:57.229552: BIP9567I: Setting user variable 'script-encoded-hostname'. 2022-11-10 13:04:57.229588: BIP9565I: Script 'EncodedHostScript' has run successfully. Once the user variable is set, it can be used for the MQInput node (to ensure only matching messages are received) and the MQOutput node (to provide the correct ID for the outgoing message). The MQInput node can accept user variable references in the correlation ID field (ignore the red X): The MQOutput node will send the contents of the MQMD parser from the flow, and so this is set in the “Create Outbound Message” Compute node using ESQL, converting the string in the user variable into a binary CorrelId: SQL -- Set the CorrelId of the outgoing message to match the MQInput node. DECLARE encHost BLOB CAST("script-encoded-hostname" AS BLOB); SET OutputRoot.MQMD.CorrelId = OVERLAY(X'000000000000000000000000000000000000000000000000' PLACING encHost FROM 1); SET OutputRoot.Properties.ReplyIdentifier = OutputRoot.MQMD.CorrelId; The ReplyIdentifier field in the Properties parser overwrites the MQMD CorrelId in some cases, so both are set to ensure the ID is picked up. The “script-encoded-hostname” reference is the name of the user variable, declared as EXTERNAL in the ESQL to cause the server to read the user variable when the ESQL is loaded: DECLARE "script-encoded-hostname" EXTERNAL CHARACTER; Other sections of the ESQL set the Report options, the HTTP request identifier, and the ReplyToQ: SQL -- Store the HTTP reply identifier in the MsgId of the outgoing message. -- This works because HTTP reply identifiers are the same size as an MQ -- correlation/message ID (by design). SET OutputRoot.MQMD.MsgId = InputLocalEnvironment.Destination.HTTP.RequestIdentifier; -- Tell the backend flow to send us the MsgId and CorrelId we send it. SET OutputRoot.MQMD.Report = MQRO_PASS_CORREL_ID + MQRO_PASS_MSG_ID; -- Tell the backend flow to use the queue for our MQInput node. SET OutputRoot.MQMD.ReplyToQ = 'BACKEND.SHARED.REPLY'; Deploying the flow requires a configuration for the server.conf.yaml mentioned above, and this must include the remote default queue manager setting as well as the hostname encoding script due to only one server.conf.yaml configuration being allowed by the operator. See this combined file and an encoded form ready for CP4i deployment. Once the configurations are in place, the flow itself can be deployed using the IS-github-bar-CorrelIdClient.yaml file to the desired namespace (“cp4i” in this case): kubectl apply -n cp4i -f IS-github-bar-CorrelIdClient.yaml (or using a direct HTTP URL to the Git repo). 
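Once the YAML has been applied, it can help to confirm the rollout before testing. These are generic checks rather than anything CP4i-specific, the pod name is a placeholder, and the log check assumes the server log is written to the container's stdout: kubectl get pods -n cp4i should show both replicas as Running, and kubectl logs -n cp4i <pod-name> | grep BIP9567I should show the "Setting user variable 'script-encoded-hostname'" message from the startup script output above.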
This will create two replicas, and once the servers are running then curl can be used to verify the flows operating successfully, and the CorrelId field should alternate between requests to show both servers sending and receiving messages correctly: YAML $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage": {"MsgId":"X'455648540000000000000000c6700e78d900000000000000'","CorrelId":"X'20fa1e68cb59a328f559cc306aa52df3e58ffd3200000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage": {"MsgId":"X'455648540000000000000000be900e78d900000000000000'","CorrelId":"X'd7cb7b30c4775d27aabbb4997020ebafd14f775700000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} $ curl http://http-mq-correlidclient-http-cp4i.apps.cp4i-domain/CorrelIdClient {"originalMessage":{"MsgId":"X'455648540000000000000000c6700e78d900000000000000'","CorrelId":"X'20fa1e68cb59a328f559cc306aa52df3e58ffd3200000000'","ReplyToQ":"BACKEND.SHARED.REPLY ","jsonData":{"Data":{"test":"CorrelIdClient message"}},"backendFlow":{"application":"MQBackend"} Variations Other possible solutions exist: CorrelIdClientUsingBody shows a solution storing the HTTP reply identifier in the body of the message rather than using the MsgId field of the MQMD. This avoids a situation where the same MsgId is used twice (once for the request and the second time for the reply), but requires the back-end service to copy the identifier from the request to the reply message, and not all real-life services will do this. Similar flows can be built using RFH2 headers to contain the HTTP reply identifier, with the same requirement that the back-end service copies the RFH2 information. This flow uses a separate CorrelId padding value of “11111111” instead of “00000000” used by CorrelIdClient to ensure that the flows do not collide when using the same reply queue. Like the CorrelIdClient, this solution relies on startup scripts supported in ACE 12.0.5 and later. Sha1HostnameClient is a variant of CorrelIdClient that uses a predefined user variable called “sha1sum-hostname” provided by the server in ACE 12.0.6 and later fixpacks; this eliminates the need for startup scripts to set user variables. SyncClient shows a different approach, using a single flow that waits for the reply message. No user variables are needed, and the back-end service needs only to copy the request MsgId to the CorrelId in the reply (which is the default Report option), but the client flow will block in the MQGet until the reply message is received. This is potentially a very resource-intensive way to implement request-reply messaging, as every message will block a thread for as long as it takes the back-end to reply, and each thread will consume storage while running. Summary Existing integration solutions using MQ request/reply patterns may work unchanged in the scalable container world if they implement mechanisms similar to those described above, but many solutions will need to be modified to ensure correct operation. This is especially true for integrations that rely on the per-broker listener to handle the matching of replies to clients, but solutions are possible as is shown in the example above. Further conversation in the comments or elsewhere is always welcome! 
Acknowledgments: Thanks to Amar Shah for creating the original ACE-with-MQ blog post on which this is based, and for editorial help.
This is an article from DZone's 2023 Software Integration Trend Report.For more: Read the Report Our approach to scalability has gone through a tectonic shift over the past decade. Technologies that were staples in every enterprise back end (e.g., IIOP) have vanished completely with a shift to approaches such as eventual consistency. This shift introduced some complexities with the benefit of greater scalability. The rise of Kubernetes and serverless further cemented this approach: spinning a new container is cheap, turning scalability into a relatively simple problem. Orchestration changed our approach to scalability and facilitated the growth of microservices and observability, two key tools in modern scaling. Horizontal to Vertical Scaling The rise of Kubernetes correlates with the microservices trend as seen in Figure 1. Kubernetes heavily emphasizes horizontal scaling in which replications of servers provide scaling as opposed to vertical scaling in which we derive performance and throughput from a single host (many machines vs. few powerful machines). Figure 1: Google Trends chart showing correlation between Kubernetes and microservice (Data source: Google Trends ) In order to maximize horizontal scaling, companies focus on the idempotency and statelessness of their services. This is easier to accomplish with smaller isolated services, but the complexity shifts in two directions: Ops – Managing the complex relations between multiple disconnected services Dev – Quality, uniformity, and consistency become an issue. Complexity doesn't go away because of a switch to horizontal scaling. It shifts to a distinct form handled by a different team, such as network complexity instead of object graph complexity. The consensus of starting with a monolith isn't just about the ease of programming. Horizontal scaling is deceptively simple thanks to Kubernetes and serverless. However, this masks a level of complexity that is often harder to gauge for smaller projects. Scaling is a process, not a single operation; processes take time and require a team. A good analogy is physical traffic: we often reach a slow junction and wonder why the city didn't build an overpass. The reason could be that this will ease the jam in the current junction, but it might create a much bigger traffic jam down the road. The same is true for scaling a system — all of our planning might make matters worse, meaning that a faster server can overload a node in another system. Scalability is not performance! Scalability vs. Performance Scalability and performance can be closely related, in which case improving one can also improve the other. However, in other cases, there may be trade-offs between scalability and performance. For example, a system optimized for performance may be less scalable because it may require more resources to handle additional users or requests. Meanwhile, a system optimized for scalability may sacrifice some performance to ensure that it can handle a growing workload. To strike a balance between scalability and performance, it's essential to understand the requirements of the system and the expected workload. For example, if we expect a system to have a few users, performance may be more critical than scalability. However, if we expect a rapidly growing user base, scalability may be more important than performance. We see this expressed perfectly with the trend towards horizontal scaling. 
Modern Kubernetes systems usually focus on many small VM images with a limited number of cores as opposed to powerful machines/VMs. A system focused on performance would deliver better performance using few high-performance machines. Challenges of Horizontal Scale Horizontal scaling brought with it a unique level of problems that birthed new fields in our industry: platform engineers and SREs are prime examples. The complexity of maintaining a system with thousands of concurrent server processes is fantastic. Such a scale makes it much harder to debug and isolate issues. The asynchronous nature of these systems exacerbates this problem. Eventual consistency creates situations we can't realistically replicate locally, as we see in Figure 2. When a change needs to occur on multiple microservices, they create an inconsistent state, which can lead to invalid states. Figure 2: Inconsistent state may exist between wide-sweeping changes Typical solutions used for debugging dozens of instances don't apply when we have thousands of instances running concurrently. Failure is inevitable, and at these scales, it usually amounts to restarting an instance. On the surface, orchestration solved the problem, but the overhead and resulting edge cases make fixing such problems even harder. Strategies for Success We can answer such challenges with a combination of approaches and tools. There is no "one size fits all," and it is important to practice agility when dealing with scaling issues. We need to measure the impact of every decision and tool, then form decisions based on the results. Observability serves a crucial role in measuring success. In the world of microservices, there's no way to measure the success of scaling without such tooling. Observability tools also serve as a benchmark to pinpoint scalability bottlenecks, as we will cover soon enough. Vertically Integrated Teams Over the years, developers tended to silo themselves based on expertise, and as a result, we formed teams to suit these processes. This is problematic. An engineer making a decision that might affect resource consumption or might impact such a tradeoff needs to be educated about the production environment. When building a small system, we can afford to ignore such issues. Although as scale grows, we need to have a heterogeneous team that can advise on such matters. By assembling a full-stack team that is feature-driven and small, the team can handle all the different tasks required. However, this isn't a balanced team. Typically, a DevOps engineer will work with multiple teams simply because there are far more developers than DevOps. This is logistically challenging, but the division of work makes more sense in this way. As a particular microservice fails, responsibilities are clear, and the team can respond swiftly. Fail-Fast One of the biggest pitfalls to scalability is the fail-safe approach. Code might fail subtly and run in non-optimal form. A good example is code that tries to read a response from a website. In a case of failure, we might return cached data to facilitate a failsafe strategy. However, since the delay happens, we still wait for the response. It seems like everything is working correctly with the cache, but the performance is still at the timeout boundaries. This delays the processing. With asynchronous code, this is hard to notice and doesn't put an immediate toll on the system. Thus, such issues can go unnoticed. 
A request might succeed in the testing and staging environment, but it might always fall back to the fail-safe process in production. Failing fast includes several advantages for these scenarios: It makes bugs easier to spot in the testing phase. Failure is relatively easy to test as opposed to durability. A failure will trigger fallback behavior faster and prevent a cascading effect. Problems are easier to fix as they are usually in the same isolated area as the failure. API Gateway and Caching Internal APIs can leverage an API gateway to provide smart load balancing, caching, and rate limiting. Typically, caching is the most universal performance tip one can give. But when it comes to scale, failing fast might be even more important. In typical cases of heavy load, the division of users is stark. By limiting the heaviest users, we can dramatically shift the load on the system. Distributed caching is one of the hardest problems in programming. Implementing a caching policy over microservices is impractical; we need to cache an individual service and use the API gateway to alleviate some of the overhead. Level 2 caching is used to store database data in RAM and avoid DB access. This is often a major performance benefit that tips the scales, but sometimes it doesn't have an impact at all. Stack Overflow recently discovered that database caching had no impact on their architecture, and this was because higher-level caches filled in the gaps and grabbed all the cache hits at the web layer. By the time a call reached the database layer, it was clear this data wasn't in cache. Thus, they always missed the cache, and it had no impact. Only overhead. This is where caching in the API gateway layer becomes immensely helpful. This is a system we can manage centrally and control, unlike the caching in an individual service that might get polluted. Observability What we can't see, we can't fix or improve. Without a proper observability stack, we are blind to scaling problems and to the appropriate fixes. When discussing observability, we often make the mistake of focusing on tools. Observability isn't about tools — it's about questions and answers. When developing an observability stack, we need to understand the types of questions we will have for it and then provide two means to answer each question. It is important to have two means. Observability is often unreliable and misleading, so we need a way to verify its results. However, if we have more than two ways, it might mean we over-observe a system, which can have a serious impact on costs. A typical exercise to verify an observability stack is to hypothesize common problems and then find two ways to solve them. For example, a performance problem in microservice X: Inspect the logs of the microservice for errors or latency — this might require adding a specific log for coverage. Inspect Prometheus metrics for the service. Tracking a scalability issue within a microservices deployment is much easier when working with traces. They provide a context and a scale. When an edge service runs into an N+1 query bug, traces show that almost immediately when they're properly integrated throughout. Segregation One of the most important scalability approaches is the separation of high-volume data. Modern business tools save tremendous amounts of meta-data for every operation. Most of this data isn't applicable for the day-to-day operations of the application. It is meta-data meant for business intelligence, monitoring, and accountability. 
We can stream this data to remove the immediate need to process it, and store it in a separate time-series database to alleviate the scaling challenges on the current database. Conclusion Scaling in the age of serverless and microservices is a very different process than it was a mere decade ago. Controlling costs has become far harder, especially with observability costs, which in the case of logs often exceed 30 percent of the total cloud bill. The good news is that we have many new tools at our disposal — including API gateways, observability, and much more. By leveraging these tools with a fail-fast strategy and tight observability, we can iteratively scale the deployment. This is key, as scaling is a process, not a single action. Tools can only go so far, and we often overuse them; in order to grow, we need to review, and even eliminate, optimizations that are no longer applicable.
It’s well-known that Ethereum needs support in order to scale. A variety of L2s (layer twos) have launched or are in development to improve Ethereum’s scalability. Among the most popular L2s are zero-knowledge-based rollups (also known as zk-rollups). Zk-rollups offer a solution that has both high scalability and minimal costs. In this article, we’ll define what zk-rollups are and review the latest in the market, the new ConsenSys zkEVM. This new zk-rollup—a fully EVM-equivalent L2 by ConsenSys— makes building with zero-knowledge proofs easier than ever. ConsenSys achieves this by allowing developers to port smart contracts easily, stay with the same toolset they already use, and bring users along with them smoothly—all while staying highly performant and cost-effective. If you don’t know a lot about zk-rollups, you’ll find how they work fascinating. They’re at the cutting edge of computer science. And if you do already know about zk-rollups, and you’re a Solidity developer, you’ll be interested in how the new ConsenSys zkEVM makes your dApp development a whole lot easier. It’s zk-rollup time! So let’s jump in. The Power of Zero-Knowledge Proofs Zk-rollups depend on zero-knowledge proofs. But what is a zero-knowledge proof? A zero-knowledge proof allows you to prove a statement is true—without sharing what the actual statement is, or how the truth was discovered. At its most basic, a prover passes secret information to an algorithm to compute the zero-knowledge proof. Then a verifier uses this proof with another algorithm to check that the prover actually knows the secret information. All this happens without revealing the actual information. There are a lot of details behind that above statement. Check out this article if you want to understand the cryptographic magic behind how it all works. But for our purpose, what’s important are the use cases of zero-knowledge proofs. A few examples: Anonymous payments—Traditional digital payments are not private, and even most crypto payments are on public blockchains. Zero-knowledge proofs offer a way to make truly private transactions. You can prove you paid for something … without revealing any details of the transaction. Identity protection—With zero-knowledge proofs, you can prove details of your personal identity while still keeping them private. For example, you can prove citizenship … without revealing your passport. And the most important use case for our purposes: Verifiable computation. What Is Verifiable Computation? Verifiable computation means you can have some other entity process computations for you and trust that the results are true … without knowing any of the details of the transaction. That means a layer 2 blockchain, such as the ConsenSys zkEVM, can become the outsourced computation layer for Ethereum. It can process a batch of transactions (much faster than Ethereum), create the proof for the validity of the transactions, and submit just the results and the proof to Ethereum. Ethereum, since it has the proof, doesn’t need the details—nor does it need a way to prove that the results are true. So instead of processing every transaction, Ethereum offloads the work to a separate chain. All Ethereum has to do is apply the results to its state. This vastly improves the speed and scalability of Ethereum. Exploring the New ConsenSys zkEVM and Why It’s Important Several zk-rollup L2s for Ethereum have already been released or are in progress. But the ConsenSys zkEVM could be the king. 
Let’s look at why: Type 2 ZK-EVM For one thing, it’s a Type 2 ZK-EVM—an evolution of zk-rollups. It’s faster and easier to use than Type 1 zk solutions. It offers better scalability and performance while still being fully EVM-equivalent. Traditionally with zk-proofs, it’s computationally expensive and slow for the prover to create proofs, which limits the capabilities and usefulness of the rollup. However, the ConsenSys zkEVM uses a recursion-friendly, lattice-based zkSNARK prover—which means faster finality and seamless withdrawals, all while retaining the security of Ethereum settlements. And it delivers ultra-low gas fees. Solves the Problems of Traditional L2s Second, the ConsenSys zkEVM solves many of the practical problems of other L2s: Zero switching costs - It’s super easy to port smart contracts to the zkEVM. The zkEVM is EVM-equivalent down to the bytecode. So no rewriting code or smart contracts. You already know what you need to know to get started, and your current smart contracts already work. Easy to move your dApp users to the L2 - The zkEVM is supported by MetaMask, the leading web3 wallet. So most of your users are probably already able to access the zkEVM. Easy for devs - The zkEVM supports most popular tools out of the box. You can build, test, debug, and deploy your smart contracts with Hardhat, Infura, Truffle, etc. All the tools you use now, you can keep using. And there is already a bridge to move tokens onto and off the network. It uses ETH for gas - There’s no native token to the zkEVM, so you don’t need to worry about new tokens, third-party transpilers, or custom middleware. It’s all open source! How To Get Started Using the ConsenSys zkEVM The zkEVM private testnet was released in December 2022 and is moving to public testnet on March 28th, 2023. It has already processed 774,000 transactions (and growing). There are lots of dApps already: Uniswap, The Graph, Hop, and others. You can read the documentation for the zkEVM and deploy your own smart contract. Conclusion It’s definitely time for zk-rollups to shine. They are evolving quickly and leading the way in helping Ethereum to scale. It’s a great time to jump in and learn how they work—and building with the ConsenSys zkEVM is a great place to start! Have a really great day!