Platform Engineering Trends in Cloud Native: Q&A With Thomas Graf
Abstractions help developers move fast, while infra teams ensure security and reliability. eBPF and solutions like Cilium bridge old and new worlds.
The rise of Kubernetes, cloud native, and microservices spawned major changes in architectures and abstractions that developers use to create modern applications. In this multi-part series, I talk with some of the leading experts across various layers of the stack — from networking infrastructure to application infrastructure and middleware to telemetry data and modern observability concerns — to understand emergent platform engineering patterns that are affecting developer workflow around cloud native.
The first participant in our series is Thomas Graf, CTO and co-founder of Isovalent and the creator of Cilium, an open source, cloud-native solution for providing, securing, and observing network connectivity between workloads, fueled by the revolutionary kernel technology eBPF.
Q: We are nearly a decade into containers and Kubernetes (K8s was first released in September 2014). How would you characterize how things look different today than ten years ago, especially in terms of the old world of systems engineers and network administrators and a big dividing line between these operations concerns and the developers on the other side of the wall? What do you think are the big changes that DevOps and the evolution of platform engineering and site reliability engineering have ushered in, especially from the networking perspective?
A: Platform engineering has brought traditional systems engineers and network administrators a lot closer to developers. The rise of containers has simplified deployment not only for developers but also for platform engineering teams. Instead of serving machines, we are finally hosting applications. Unlike serverless, Kubernetes preserved some of the existing infrastructure abstractions, thus offering a more approachable evolutionary step. This allowed systems engineers and network administrators to step up and evolve into platform engineering, and they brought with them decades of experience in how to operate enterprise infrastructure.
At the same time, platform engineering has brought a radical modernization of the networking layer. Application teams look at the network like they look at the internet: a giant, untrusted connectivity plane connecting everyone and everything. This requires platform engineering to rethink network security and bring in micro-segmentation, zero-trust security concepts, and mutual authentication. At the same time, this new, exciting world has to be connected to the world of existing infrastructure, which requires mapping the world of identity-based network security to the old world of virtual networks, MPLS, and ACLs.
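To make the mapping between identity-based security and address-based ACLs more concrete, here is a minimal Python sketch. It is an editorial illustration only, not Cilium's data model or API: the names `Workload`, `IdentityRule`, and `to_acl`, as well as the labels and IP addresses, are hypothetical. The idea is that a rule is written once against workload labels and then expanded into source/destination IP entries of the kind a legacy firewall would understand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Hypothetical model of a running workload: a name, its current IP,
    # and a set of (key, value) labels that make up its identity.
    name: str
    ip: str
    labels: frozenset


@dataclass(frozen=True)
class IdentityRule:
    # Allow traffic from workloads matching `source` to workloads
    # matching `dest` on the given port. No IP addresses appear here.
    source: frozenset
    dest: frozenset
    port: int


def labels(**kv):
    """Build a label set such as labels(app="frontend")."""
    return frozenset(kv.items())


def matches(workload, selector):
    # A workload matches when it carries every label in the selector.
    return selector <= workload.labels


def to_acl(rule, workloads):
    """Expand one identity rule into sorted (src_ip, dst_ip, port) entries."""
    sources = [w for w in workloads if matches(w, rule.source)]
    dests = [w for w in workloads if matches(w, rule.dest)]
    return sorted((s.ip, d.ip, rule.port) for s in sources for d in dests)


# Illustrative inventory; all names and addresses are made up.
workloads = [
    Workload("frontend-1", "10.0.1.10", labels(app="frontend")),
    Workload("frontend-2", "10.0.1.11", labels(app="frontend")),
    Workload("backend-1", "10.0.2.20", labels(app="backend")),
    Workload("batch-1", "10.0.3.30", labels(app="batch")),
]

rule = IdentityRule(
    source=labels(app="frontend"),
    dest=labels(app="backend"),
    port=8080,
)

acl = to_acl(rule, workloads)
for entry in acl:
    print(entry)
```

In this sketch the identity rule stays the same when a frontend workload is rescheduled and receives a new IP, whereas the expanded ACL has to be regenerated. That difference is one reason the two worlds need an explicit translation step.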
Q: The popularity of the major cloud service platforms and all of the thrust behind the last ten years of SaaS applications and cloud native created a ton of new abstractions for the level at which developers are able to interact with underlying cloud and network infrastructure. How has this trend of raising the abstraction for interacting with infrastructure affected developers, specifically?
A: The cost of developing an initial MVP for a new application has decreased enormously. A small team of developers can develop an application to early product maturity within weeks or months. This is achieved with automation on the cloud infrastructure side, managed databases and cloud services, and the composability of microservices. The cost of this shortcut is typically paid later in the form of the cloud bill, challenges in portability across cloud providers, and the inevitable consequence of having to develop a proper multi-cloud strategy as different application teams will start developing on different cloud platforms.
Developers are rightfully not concerned about infrastructure abstraction and infrastructure security early on. Time to market is everything. Platform engineering teams then typically come in and help port the application to a Kubernetes platform to start standardizing the applications to corporate needs by elevating security and monitoring standards, decoupling dependencies, and preparing the new application for scale.
Q: What are the areas where it makes sense for developers to have to really think about underlying systems versus the ones where having a high degree of instrumentation or customization ("shift left") is going to be very important?
A: Every couple of years, there is a new term for what is a pretty logical software development practice: testing early. Iterative development, agile development, test-driven development, and now shift-left. The combination of system-aware development and abstraction-based development has always been crucial, but equally important is to shift concern for resilience and supportability to the left. Looking back at the Apollo mission, all these concepts played a significant role. A lot of the software was obviously written in a system-specific language to be as efficient as possible. The navigation business logic, however, used an abstracted language in order to perform complex vector computations. Last but not least, it was the concept of resilience that allowed the lander to overcome a faulty sensor that overloaded the computer, which would otherwise have prevented required system components from getting enough CPU time.
Q: Despite the obvious immense popularity of cloud native, much of the world's infrastructure (especially in highly regulated industries) is still running in on-prem datacenters. What does the future hold for all this legacy infrastructure and millions of servers humming along in data centers? What are the implications for managing this mixed cloud-native infrastructure together with legacy data centers over time?
A: As with any other transformation, cloud native will take much longer than anticipated, but the benefits are so fundamental that anybody not undergoing the transformation is at risk of being disrupted from a technology perspective. The worlds of cloud native and data centers not only have to get to know each other but have to move in together for the foreseeable future. The typical enterprise-grade data center requirements have already come to cloud native. What we are now seeing is that some of the cloud-native concepts, such as further automation, better declarative approaches, and cleaner abstractions, are flowing from cloud native back into the world of data centers.
Cloud-native solutions will have to learn to live in data centers, running as appliances, and typical data center requirements will have to be met in the world of the cloud in order for the two worlds to be able to talk to each other.
Q: What do you think are some of the modern checklist items that developers care most about in terms of their workflow and how platform engineering makes their lives more productive? Broadly speaking, what are conditions that are most desirable versus least desirable in terms of the build environment and toolchains that modern developers care about?
A: The checklist has changed quite a bit over the last few years. In the beginning, the checklist was all about getting one of each to try and form the best possible stack out of hundreds of possible options. As this usual phase of an early technology stack now matures, the checklist has changed to limit the number of moving pieces and instead focus on core values such as developer efficiency, operational complexity, security risks, and total cost. This has led to a shift to managed offerings, wider platforms covering more aspects, and a focus on day-2 operational aspects and long-term cost aspects over exclusively building the best possible platform. The least desirable outcome in building a platform for developers is to ramp it all up but fail to make it sustainable operationally and economically.
Q: What is the significance of eBPF in this overall context of the evolution of platform engineering and SRE patterns?
A: eBPF has become the amazing hidden helper below deck, making everything better and faster. Its magical value comes from its programmability. The operating system has become a little bit like hardware, which is really hard to change. Making the operating system agile and programmable again allows software infrastructure to keep up with the changing demands of platform engineering technology like Kubernetes. Fundamentally, eBPF is not only able to solve problems really well, but even more importantly, it is a tremendous time-to-market hack for infrastructure and security software. That matters because platform engineering is driven by continued innovation: platforms are still being built up, and requirements keep piling up.
Q: What does Cilium give platform teams beyond the built-in capabilities of eBPF? What is the relationship between the two technologies, and what should platform engineering teams be doing with Cilium today (if they are not already)?
A: eBPF is fundamentally designed for kernel developers, and early adopters of eBPF were typically companies running their own kernel teams. Think of eBPF as you think of FPGAs or GPUs in the context of AI. As an enterprise, you can’t go out there and buy a bunch of FPGAs or GPUs, stuff them into your data center, and then simply benefit from it. You will need people to build something with it. Cilium takes eBPF and utilizes it to implement core networking, security, and observability needs of platform engineering teams. It does so without requiring platform engineering teams to learn how to build it themselves with eBPF.
Kubernetes has created a whole new set of challenges on how to connect and secure workloads inside and outside of Kubernetes. eBPF is incredible at solving the problems of the new world of Kubernetes and equally well capable of translating that new world to the old legacy.
Q: How would you describe the overall evolution of networking, from the old scale-up days to distributed computing and commodity hardware, to virtual machines, then SDNs, and now where we are today? What do you think are the coolest trends in network infrastructure to watch today?
A: Networking has evolved over the years along with the needs of applications. What is interesting is that we are probably in the middle of one of the most significant shifts in networking but haven’t fully realized it yet. Looking back, networking was all about connecting machines physically. With Google and distributed computing, it became obvious that virtualization would play a massive role. As a consequence, software-defined networking was the networking shift that came along with it. But it still connected machines, and it inherited the vast majority of building blocks from the physical networking world. Cloud networking took that network virtualization technology and added APIs and automation in front of it.
With the rise of containers and Kubernetes, networking is now changing fundamentally as we no longer connect machines. We connect applications. To developers, a modern cloud-native networking layer looks more like a messaging bus than a network, but without requiring your applications to change in any way and while continuing to meet the strong security and performance requirements of a typical enterprise network. The cloud-native shift in networking will not just impact the Kubernetes networking layer. It will touch all aspects of connectivity from L3/L4 north-south load balancers, network firewalls, and VPNs all the way up to L7 WAFs and L7 east-west load balancers.