Back in October of 2016, I was at the OpenStack Summit in Barcelona and ran into a longtime friend, Rob Hirschfeld. He surprised me by talking about a problem domain that we have had discussions about for years, reframing it as “the data center underlay problem.”
His provocative statement was that while OpenStack solves many problems, it doesn’t address the fundamental challenges of how to run things like OpenStack on actual physical infrastructure. This is a problem domain that is being radically redefined by the container ecosystem.
This is a problem that Rob has been tirelessly working on for nearly a decade, and it was interesting to get his perspective on the emerging ecosystem, including OpenStack, Kubernetes, Mesos, containers, and private clouds in general (including Azure Stack). I thought it would be useful to share this with everyone.
For those of you who don’t know, Rob Hirschfeld is a longtime data center automation advocate with a history of blogging about open source, DevOps, and Lean Process at robhirschfeld.com. He’s served four terms on the OpenStack board, started the Kubernetes Cluster Ops SIG and co-founded the open Digital Rebar hybrid provisioning project. He’s also known as @zehicle online, and is currently the CEO of RackN.
Gene Kim: Tell me about the landscape of Docker, OpenStack, Kubernetes, etc. How do they all relate, what’s changed, and who’s winning?
Rob Hirschfeld: Containers have dramatically altered the IT landscape because they combine improved distribution/packaging models with improved performance and efficiency. These dual benefits are highly disruptive to platforms built assuming virtualization (like OpenStack, SDN and SAN vendors).
They are also highly disruptive to platforms that assumed slow or controlled software distribution processes (like Chef, Puppet or Cloud Foundry). Even hardware vendors and cloud providers will be impacted because the smaller footprint of containers drives towards commodity hardware and frustration with the “VM tax.”
While very exciting, containers create a management challenge for users and operators. Since they are short-lived and rapidly regenerated, keeping track of containers through a development pipeline and into production requires sophisticated automation such as Kubernetes, Docker Swarm, Mesos, Rancher or other platforms. These platforms also expose gaps in software defined storage (SDS) and networking (SDN). Ironically, OpenStack’s rocky progress in these areas will likely be scavenged to accelerate its successors.
There are real developer productivity and operational efficiency gains from containers, and the benefits are accelerating. I expect to see a wave of security and monitoring improvements that will make container packaging required in the near term. That progress comes at the expense of VM platforms like OpenStack.
GK: I recently saw a tweet that I thought was super funny, saying something along the lines of “friends don’t let friends build private clouds” — obviously, given all your involvement in the OpenStack community for so many years, I know you disagree with that statement. What is it that you think everyone should know about private clouds that tells the other side of the story?
RH: The world is hybrid, everyone needs to adjust.
Telling everyone to go to public cloud is equally silly — the reality is that each company and workload is different. RackN’s mission is to have our customers not care about which infrastructure runs their application. Part of that is making Google-scale data center automation accessible to everyone AND part is making workloads portable.
[GK note: I love the term for this: “GIFEE”: Google Infrastructure For Everyone Else]
The problem with OpenStack and building private clouds/infrastructure is that we tried to layer complex software on top of inconsistently managed, heterogeneous infrastructure. Even if OpenStack had been amazingly easy to run (and it was far from it), the resulting clouds would have been inconsistent. In my opinion, OpenStack did not solve the real problems that were driving people to AWS.
“Friends don’t let friends build private clouds without strong foundational process and automation” would be a better way to say it. Let’s help people do Ops better before we cloud-shame them.
GK: We talked about how much you loved the book Site Reliability Engineering: How Google Runs Production Systems by Betsy Beyer, which I also loved. What resonated with you, and how do you think it relates to how we do Ops work in the next decade?
RH: I strongly identify with operators who write software to improve Operations. The whole RackN team often feels like we are odd ducks because we care about data center operations and running infrastructure, not just pumping out software. The book really speaks to and embraces that mentality as an important mindset.
The book also establishes great ground rules. Justifying the 50/50 coding/Ops balance is critical because so many operators are underwater and cannot find time to improve. We really underestimate the importance of time and focus to improve. The book also does a good job talking about different types of Ops work and the importance of staying connected to running the applications and infrastructure.
Most importantly, it humanizes Google’s Ops teams. Google made incredible investments in Operations. By sharing their lessons learned, we can build tools that focus on the right priorities. By explaining the value of their approach, we can explain why those tools are worth the investment.
GK: Tell me about the work you did with Crowbar, and how that informs the work you’re currently doing with Digital Rebar?
RH: Crowbar was created at the dawn of OpenStack because my team at Dell had learned from other cloud efforts that installation was a system problem (a lesson repeated in the SRE book). We needed three degrees of control: the full node from BIOS to application, between nodes to build clusters, and outside of nodes to interact with data center services like DNS, DHCP, NTP, and PXE. Crowbar, which is still in use by SUSE, was successful when it could fully control those elements.
Unfortunately, no install ever allowed us to control all the elements, so Crowbar was way too brittle in practice. It also lacked the state machinery needed to do upgrades. It turns out that upgrades are much more important (and harder!) than installs.
When we started over with Digital Rebar, we took a very different approach to the problem. We needed a design that could adapt to different site requirements, track state of a running system and make controlled incremental changes. We also wanted the platform to be much more transparent in how it was working so that mortals could troubleshoot issues in the field. Taken together, those are key SRE values.
The resulting Digital Rebar platform is a mix of specialized orchestration (annealing), functional work units (roles), and API-driven services. These three aspects of the architecture work together to create a composable platform. That allows us to mix physical and cloud infrastructure, mix Chef, Puppet, Ansible and Bash, and mix different operating systems and hardware types. Most critically, the composable nature allows us to make incremental changes needed for upgrades. Of course, the system is API driven and deployed as microservice containers!
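[GK note: To make the annealing idea concrete, here is a minimal sketch in Python. This is NOT the real Digital Rebar API — the `Node` and `anneal` names are hypothetical — it only illustrates the pattern Rob describes: roles as small idempotent work units, applied to nodes one controlled, incremental change at a time, with state tracked so the process can be safely re-run during upgrades.]

```python
# Hypothetical illustration of "annealing": converge each node toward its
# desired role set via small, ordered, idempotent changes. Names are invented
# for this sketch; in the real system each role would invoke Chef, Puppet,
# Ansible, or Bash rather than just recording state.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    applied: set = field(default_factory=set)   # roles already converged
    desired: set = field(default_factory=set)   # roles the operator wants

def anneal(nodes, role_order):
    """Apply missing roles in dependency order, one change at a time."""
    changes = []
    for role in role_order:                     # e.g. bios -> os -> app
        for node in nodes:
            if role in node.desired and role not in node.applied:
                node.applied.add(role)          # one incremental, tracked step
                changes.append((node.name, role))
    return changes

nodes = [Node("rack1-n1", desired={"bios", "os", "app"}),
         Node("rack1-n2", desired={"bios", "os"})]
first_pass = anneal(nodes, ["bios", "os", "dns-client", "app"])
second_pass = anneal(nodes, ["bios", "os", "dns-client", "app"])
# first_pass lists each controlled change; second_pass is empty because
# annealing is idempotent -- re-running against converged state is a no-op.
```

Because state is tracked per node, an upgrade becomes just another anneal run: change the desired role set and let the orchestrator walk each node through the delta, instead of reinstalling from scratch.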
We feel like Digital Rebar provides a solid foundation for SREs to build on. It allows them to (re)use open best-practice automation on their commodity gear. With those advantages, SREs can finally focus on the parts of their job that are actually different from other companies.