While much has been written about DevOps in pure cloud environments and DevOps in the enterprise data center, I thought it would be interesting to describe how we do DevOps in a hybrid cloud environment.
For context, my company offers onsite storage as a service. We install our own purpose-built equipment into our customers’ data centers and utilize the cloud to monitor the equipment behind our customers’ firewalls, remotely mitigate failures, and update software in a non-disruptive fashion. Unlike public cloud providers like Amazon, Microsoft, and Google who manage their fleet of equipment in their own captive data centers, we manage our fleet of equipment dispersed across customer data centers that we cannot touch regularly.
Frequent and predictable software deployments are an essential ingredient for fast innovation and high product quality. Each update accumulates fixes and features, and with them risk; when updates are left to the customer and bring friction, confidence quickly erodes. Following good practices of a SaaS company, we control deployments to a fleet of systems that sit on customers' premises, allowing us to keep the fleet current or fail fast by rolling back updates. I'd like to walk through the mechanics behind how we release software and touch on some of the DevOps practices we exercise as an engineering team.
Our deployment pipeline (re)starts every Wednesday. As good practitioners of CI/CD, Wednesday begins from a known good quality baseline. We then take Wednesday’s build and begin serializing automated deployments with a release candidate in our Dev environment, halting subsequent deployments if an update fails or if deleterious effects of new code running in the system are discovered. We execute synthetic workloads in our Dev environment that mimic customer use cases, while running chaos monkey-like fault injection to continuously vet our H/A architecture. By Wednesday night, all of our Dev environments are running the release candidate, delivering a rich set of data points for end-of-week ship readiness.
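The serialized, halt-on-failure rollout described above can be sketched roughly as follows. This is an illustrative outline, not our actual tooling; the `deploy` and `healthy_after` functions are hypothetical stubs standing in for real deployment and health-verification steps.

```python
def deploy(env, build):
    """Hypothetical stub: push the build to one environment, True on success."""
    return True

def healthy_after(env):
    """Hypothetical stub: post-deploy check for deleterious effects."""
    return True

def serialized_rollout(build, environments):
    """Deploy one environment at a time, halting all subsequent
    deployments at the first failed update or failed health check."""
    completed = []
    for env in environments:
        if not deploy(env, build) or not healthy_after(env):
            # Halt the pipeline: report what succeeded and where we stopped.
            return completed, env
        completed.append(env)
    return completed, None

done, halted_at = serialized_rollout("wed-build", ["dev-1", "dev-2", "dev-3"])
```

The key property is that a single failure stops the pipeline before the release candidate spreads any further.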
Between Thursday and Friday, our Staging environments are updated. Our Staging update is particularly interesting because we eat our own dogfood: we depend on our storage service in Staging as critical infrastructure for our build system, which is constantly building and running tests and storing build-dependent logs and packages. We regularly stress and fault test our Staging environment, often without our end users knowing about it. If any of these events causes a service disruption, we hear about it from our alerting system, our build system, or the affected developers.
On Monday, we ship to our first Production system, which in actuality is a canary system that happens to sit in Production. Like our Dev environment, our Production-level canary is running synthetic workloads that mimic patterns observed in customer workloads. Observing zero alerts on systems running synthetic workloads inspires confidence that releasing on Tuesday will come with no surprises.
Tuesday has arrived. Deployments begin according to time zones. For example, we are based in Seattle, so we parallelize all deployments on the East Coast early in the afternoon as our first batch of updates. By 7 p.m., the fleet is up to date with the latest released bits.
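The east-to-west batching on Tuesday could be expressed as a simple grouping by time zone. This is a hedged sketch with made-up system names and UTC offsets; it only illustrates the ordering, with each batch deployed in parallel and batches proceeding westward.

```python
from collections import defaultdict

# Hypothetical fleet records; names and offsets are illustrative.
fleet = [
    {"system": "cust-nyc-01", "utc_offset": -5},
    {"system": "cust-bos-01", "utc_offset": -5},
    {"system": "cust-chi-01", "utc_offset": -6},
    {"system": "cust-sea-01", "utc_offset": -8},
]

def batches_east_to_west(systems):
    """Group systems by time zone; return batches ordered east to west,
    so the East Coast updates first and each batch runs in parallel."""
    groups = defaultdict(list)
    for s in systems:
        groups[s["utc_offset"]].append(s["system"])
    # Easternmost zones have the least-negative UTC offset, so sort descending.
    return [groups[off] for off in sorted(groups, reverse=True)]

print(batches_east_to_west(fleet))
# [['cust-nyc-01', 'cust-bos-01'], ['cust-chi-01'], ['cust-sea-01']]
```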
How We Achieve Predictability
Because we ship every week, we are heavily incentivized to react quickly to any issues observed in our deployment pipeline. From frontline engineers to the executive layer, everyone in Engineering is expected to take turns participating in troubleshooting, mitigating, and fixing issues that emerge in Production as well as our deployment pipeline. A heightened level of involvement keeps the entire team aware of trends and allows us to continuously harden our service quality by prioritizing action against our learnings.
We use a combination of time-series metrics and events to provide visual dashboards that deliver deep views into system state. When alerts fire, we lean on these dashboards, coupled with centralized logging, to quickly mitigate service disruption. Tied to our alerting system, our health checking delivers both a black-box view (how the customer would perceive service health) and internal checks (how the system perceives its own health), letting us identify and act on issues before the customer does. Finally, even before code is deployed to Dev, we rely heavily on our ability to push quality upstream and catch the majority of issues in our continuous integration framework.
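Combining the two health views might look something like the sketch below. The probe and check functions are hypothetical stubs; the point is that an alert fires if either the customer-facing view or the internal view degrades, since each can fail independently of the other.

```python
def black_box_check():
    """Hypothetical stub: probe the service the way a customer would,
    e.g., a read/write round trip against the external interface."""
    return {"ok": True, "latency_ms": 12}

def internal_checks():
    """Hypothetical stub: per-component self-reported health."""
    return {"frontend": True, "metadata": True, "storage": True}

def service_health():
    """Merge external (customer-perceived) and internal health views."""
    external = black_box_check()
    internal = internal_checks()
    degraded = [component for component, ok in internal.items() if not ok]
    return {
        "customer_visible_ok": external["ok"],
        "degraded": degraded,
        # Alert on either view: internals can degrade before customers
        # notice, and customer-visible failures can precede internal ones.
        "alert": (not external["ok"]) or bool(degraded),
    }
```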
Our system is designed for granular upgradability that is non-disruptive to applications. The top half of our system is completely stateless, running on redundant Intel-based servers. The bottom half of our system, which maintains all state, runs on distributed, ARM-based nano-servers. A nano-server consists of an ARM-based CPU, redundant Ethernet connections, and a single drive. As such, we can update software on individual microservices in the top half and on individual nano-servers in the bottom half without taking the service offline.
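One way to picture the per-node, non-disruptive update is a rolling loop that never lets available capacity fall below a redundancy floor. This is a simplified sketch under assumed names; the `install` step and the 90% healthy floor are illustrative, not our actual parameters.

```python
def rolling_update(nodes, build, min_healthy_fraction=0.9):
    """Update one node at a time, aborting if taking the next node
    out of service would drop below the healthy-capacity floor."""
    healthy = set(nodes)
    updated = []
    for node in nodes:
        healthy.discard(node)  # node leaves service for its update
        if len(healthy) / len(nodes) < min_healthy_fraction:
            healthy.add(node)  # would breach the redundancy floor; stop here
            break
        # install(node, build)  # placeholder for the actual software update
        healthy.add(node)      # node rejoins with the new software
        updated.append(node)
    return updated

nano_servers = [f"nano-{i}" for i in range(10)]
rolling_update(nano_servers, "wed-build")
```

Because only one node is out of service at a time, client traffic keeps flowing through the redundant remainder throughout the rollout.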
We can also hide large chunks of code behind system-level flags, which we then unhide through phases of our deployment pipeline and enable when a new feature is considered production-ready. As we roll changes through the system, our H/A architecture allows client traffic to flow non-disruptively as we update each service. If one of the services targeted for an update fails for any reason, we have the ability to pause and make a decision: roll back, or mitigate and roll forward.
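The phased flag mechanism described above can be sketched as a flag whose "rollout frontier" advances through the pipeline. Flag names, phase names, and the config shape here are all assumptions for illustration, not our actual flag system.

```python
# Pipeline phases in rollout order: a flag enabled "through" a phase
# is on in that phase and every earlier one, and dark beyond it.
PHASES = ["dev", "staging", "canary", "production"]

# Hypothetical flags mapped to the furthest phase where they are enabled.
FLAGS = {
    "new-replication-path": "staging",   # visible in Dev and Staging only
    "faster-rebuild": "production",      # fully unhidden: production-ready
}

def is_enabled(flag, phase):
    """A flag is on in every phase up to and including its rollout frontier."""
    frontier = FLAGS.get(flag)
    if frontier is None:
        return False  # unknown flags default to dark
    return PHASES.index(phase) <= PHASES.index(frontier)
```

Advancing a feature to production is then just moving its frontier forward, with no code redeployment required.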