Containers have changed how we think about building, packaging, and deploying our applications. From a developer’s perspective, they make it easier to package an application with its full set of dependencies and reliably recreate that application on another developer’s workstation. They also allow us to deliver applications more reliably from dev to test to production (possibly in a CI/CD pipeline). Lastly, in this world of microservices, containers help deliver microservices to production at scale.
With the move to cloud-native applications and services architectures like microservices, we gain some advantages in our deployment and management infrastructure, but our applications need to be designed with different principles in mind compared to traditional design: design for failure, horizontal scaling, dynamically changing environments, etc. Interestingly enough, services implemented with these considerations in mind invariably find themselves dealing with more complexity in the interactions between services than in the services themselves.
Can containers help here?
Complexity Has Moved to Service Interaction
First, we should decide what the problem is and how this complexity in the interactions between services manifests itself. Services will need to work with each other in a cooperative way to provide business value and thus will need to communicate. In these cloud architectures, this communication will happen over the network. This is the first source of complexity that traditional applications with colocated components don’t usually have to confront.
Any time a service has to make a call over the network to interact with its collaborators, things can go wrong. In our asynchronous, packet-switched networks, there are no guarantees about what can and will happen. When we put data out onto the network, that data goes through many hops and queues to get to its intended destination. Along the way, data can be dropped completely, duplicated, or slowed down.
Moreover, these behaviors make it difficult to determine whether communication with a collaborator is failing or slow because of the network or because the service on the other end has failed or is slow. This can lead to unsafe consequences: services that cannot serve their customers, collaborations that only partially succeed, data inconsistencies between services, and more.
Closely related to network failure and degradation (real or perceived) are questions like: how does a service find and talk to its collaborators? How does it load balance across multiple instances of them? When we build these cloud-native services with containers, we now need to account for the complexity introduced by communication over the network. We need to implement things like service discovery, load balancing, circuit breakers, timeouts, and retries so that our services stay resilient in the face of this uncertain network behavior.
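To make two of these patterns concrete, here is a minimal Python sketch of a circuit breaker and a retry helper with exponential backoff. It is an illustration under stated assumptions, not how any particular library implements them: the names CircuitBreaker, CircuitOpenError, and call_with_retries are hypothetical, the thresholds are arbitrary, and timeouts are left to whatever client makes the actual network call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down has elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.consecutive_failures = 0
        self.opened_at = None
        return result


def call_with_retries(func, attempts=3, base_delay_seconds=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; re-raise the last error if every attempt fails."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_seconds * (2 ** attempt))
```

A caller might combine the two as call_with_retries(lambda: breaker.call(fetch_orders)), where fetch_orders stands in for any function that makes a network request. Every service, in every language, would need an equivalent of this code.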
This sounds like a lot of responsibility for our applications. We could create reusable libraries to help with this. Indeed, that’s the approach many of the big Internet companies took. Google invested massive engineering work to implement an RPC library that helps with these things (Stubby, now gRPC). Twitter did as well with their Finagle framework. Netflix was even nice enough to open source their efforts with their Netflix OSS libraries like Ribbon, Hystrix, and others.
To make this work, we need to restrict our frameworks and languages to only those for which we can implement and maintain these cross-cutting concerns. We’d need to re-implement these patterns for each language and framework we’d like to support. Additionally, every developer would need the discipline to apply these libraries and idioms consistently across all the code they wrote. In many ways, folks like Netflix had to write these tools because they had no other choice; they were trying to build resilient services on top of IaaS cloud infrastructure. What choices do we have today?
For basic service discovery and load balancing, we should be able to leverage our container platform. For example, if you’re packaging your application as Docker containers and you’re using Kubernetes, load balancing and basic service discovery are baked in. In Kubernetes, we can use the “Kubernetes service” concept to define application clusters (each instance running in a container or Kubernetes “pod”) and assign networking (like virtual IPs) to these clusters. Then we can use basic DNS to discover and interact with the cluster of containers even as the cluster evolves over time (addition of containers, etc.).
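As a rough sketch of what DNS-based discovery looks like from inside a service, the Python below builds the conventional Kubernetes DNS name for a service and resolves it. The service name "orders" and the helper names are hypothetical, and the cluster.local domain is the common default rather than a guarantee; a cluster can be configured differently. The resolver is injectable so the logic can be exercised outside a cluster.

```python
import socket


def service_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Build the conventional in-cluster DNS name for a Kubernetes service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_service(service, port, namespace="default", resolver=socket.getaddrinfo):
    """Return the distinct IP addresses that the service name resolves to."""
    host = service_dns_name(service, namespace)
    infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Only meaningful inside a cluster that actually has a service named "orders".
    print(service_dns_name("orders", namespace="shop"))
```

The calling service never needs to know how many instances sit behind the name or where they run; the platform keeps the name pointing at the current set.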
Service Mesh for Containerized Services
What if we could implement these resilience concerns and more across our services architectures without requiring language and framework-specific implementations? That’s where a “service mesh” fits into the picture. A service mesh sits between our services and solves these issues without having to use frameworks or libraries inside the application.
With a service mesh, we introduce application proxies that handle communicating with other services on behalf of our application. The application or service talks directly to the proxy, and the proxy is configured with appropriate timeouts, retries, budgets, circuit breaking, etc. for communicating with upstream services.
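Of the settings listed above, the budget is perhaps the least familiar: a retry budget caps retries at a fraction of regular traffic so that retries cannot amplify an outage. The Python sketch below shows the idea in its simplest form. The RetryBudget class and its 20 percent ratio are assumptions for illustration; real proxies typically track the ratio over a sliding time window rather than over all requests ever seen.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Hypothetical budget: retries may not exceed `ratio` times the original requests."""

    ratio: float = 0.2
    requests: int = 0
    retries: int = 0

    def record_request(self):
        # Call once for every original (non-retry) request.
        self.requests += 1

    def try_acquire_retry(self):
        # Grant the retry only if it keeps total retries within the budget.
        if self.retries + 1 <= self.requests * self.ratio:
            self.retries += 1
            return True
        return False
```

With the default ratio, ten original requests earn two retries; a third retry is refused until more regular traffic has gone through.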
These proxies can be implemented either as shared proxies (multiple services use a single proxy) or as application-specific “sidecar” proxies. With a sidecar proxy, the proxy is deployed alongside each instance of the service and is responsible for these horizontal concerns; that is, the application gains this functionality without having to instrument its code directly.
Linkerd and Lyft Envoy are two popular examples of proxies that can be used to build a service mesh. Linkerd is an open source project from the startup Buoyant.io, while Envoy is an open source project from the ride-hailing company Lyft. In a container environment, we can implement sidecars either by deploying the proxy in the same container as our application or by deploying it as a separate sidecar container, if we can specify container-affinity rules as we can with Kubernetes pods.
In Kubernetes, a pod is a logical construct that considers an “instance” to be one or more containers deployed together. Implementing sidecar proxies in Kubernetes becomes straightforward. With these sidecar (or shared) proxies in place, we can reliably and consistently implement service discovery, load balancing, circuit breaking, retries, and timeouts regardless of what’s running in the container.
With containers, we abstract away the details of the container for the purposes of uniform deployment and management, and with a service mesh, we can safely introduce reliability between the containers in a uniform way. Since these application proxies are proxying traffic, doing load balancing, retries, etc., we can also collect insight about what happens at the network level between our services. We can expose these metrics to a central monitoring solution (like InfluxDB or Prometheus) and have a consistent way to track metrics. We can also use these proxies to report other metadata about the runtime behavior of our services, such as propagating distributed-tracing data to observability tools like Zipkin.
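Because every call passes through the proxy, the proxy is a natural place to count requests and time them. The following Python sketch shows the kind of per-upstream bookkeeping involved. CallMetrics and its field names are hypothetical, and the export step to a monitoring system is deliberately left out.

```python
import time
from collections import defaultdict


class CallMetrics:
    """Hypothetical per-upstream counters a proxy might keep for each call it forwards."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so timings can be tested deterministically
        self.requests = defaultdict(int)
        self.failures = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def observe(self, upstream, func):
        """Run func, recording the request, its duration, and whether it failed."""
        start = self.clock()
        self.requests[upstream] += 1
        try:
            return func()
        except Exception:
            self.failures[upstream] += 1
            raise
        finally:
            self.total_seconds[upstream] += self.clock() - start

    def snapshot(self):
        """Return plain dictionaries suitable for handing to a metrics exporter."""
        return {
            name: {
                "requests": count,
                "failures": self.failures[name],
                "mean_seconds": self.total_seconds[name] / count,
            }
            for name, count in self.requests.items()
        }
```

The application code being measured does not change at all, which is the point: the same numbers come out for every service regardless of its language or framework.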
Lastly, we can introduce a control plane to help manage these application proxies across the service mesh. For example, a newly announced project, Istio.io, provides just that. With the control plane, not only are we able to understand and report what’s happening between our services, we can control the flow of traffic as well. This becomes useful when we want to deploy new versions of our application and we want to implement A/B style testing or canary releases. With a control plane, we can configure fine-grained interservice routing rules to accomplish more advanced deployments.
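A canary release, at its core, is a weighted routing rule: send most traffic to the current version and a small share to the new one. The Python sketch below illustrates that decision only. The pick_version function, the version labels, and the 90/10 split are assumptions for illustration and do not reflect how any specific control plane expresses its rules.

```python
import random


def pick_version(weights, rng=random.random):
    """Choose a version label with probability proportional to its weight.

    `weights` maps a version label to a non-negative weight and must not be
    empty, e.g. {"v1": 90, "v2": 10} sends roughly 10% of calls to v2.
    """
    total = sum(weights.values())
    threshold = rng() * total
    cumulative = 0.0
    for version, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return version
    return version  # guards against floating-point rounding at the upper edge


if __name__ == "__main__":
    canary_rule = {"v1": 90, "v2": 10}
    sample = [pick_version(canary_rule) for _ in range(1000)]
    print("share routed to v2:", sample.count("v2") / len(sample))
```

Shifting more traffic to the new version is then a matter of changing the weights in one place rather than redeploying or reconfiguring each service.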
Containers enable a new paradigm of cloud-native applications, and container platforms help with the management and deployment of those containers. From a services architecture point of view, however, we need to solve some of the complexity that has now moved into the interactions between our services. Service meshes aim to help with this, and application proxies help remove horizontal, cross-cutting code (and its dependencies) from our application code so that we can focus on business-differentiating services. Containers and container environments help us naturally implement this service-mesh pattern.