O11y Guide: Who Are the Cloud-Native Observability Players?
Continue on a journey into the world of cloud-native observability: go out onto the playing field to understand who the players are and what teams they form.
The first article in this series covered how developers have to deal with more than just code in a cloud-native world. It shared a look at cloud-native observability (o11y) and touched on what the three pillars are versus the three phases of observability.
This second article takes you out onto the playing field where you need to understand who the players are and what teams they form. It's no longer a world full of developers and operations teams as the cloud-native environments have pushed right on through those traditional walls.
Let's dive right in, shall we?
The basic introduction started in a world without clouds and followed developers as they made the transition to cloud-native development. What does this mean for them, and what challenges do they have to embrace?
The Playing Field
Over time, the traditional developer and operations teams transitioned to different ways of working in the cloud-native world. Developers moved into DevOps teams, where development and operations activities merge and attempts are made at process agility. Organizations have tried DevOps, moved to platform engineering, and then moved to a more mature structure called CloudOps, with a clear focus on cloud infrastructure. Beyond this, we're seeing a role emerge known as the Site Reliability Engineer (SRE), who is part of a team focused on a broader spectrum of modern resource reliability and not just the organization's cloud infrastructure. Finally, at the larger scale of cloud-native operations, there is a new kid on the block, known as the central observability team.
DevOps Teams
DevOps is a first step on the road to cloud-native operations and bridges both development and operations teams. As defined in the article "DevOps vs. CloudOps - What You Need to Know," these teams have a specific mandate:
"DevOps is primarily the automation and optimization of the application development lifecycle, including post-launch fixes and updates. It uses continuous development, integration, testing, and deployment of cloud, computer, and downloadable applications. It also focuses on IT operations as they relate to application performance and availability."
By bringing operations and development closer together to focus on processes and automation, DevOps teams push for agility, reliability, and speed in support of their organization's business goals. Their focus remains on application development and delivery, often because the organization runs more than just cloud-native infrastructure.
Platform Engineering Teams
The next team to appear on the scene is one that takes the lessons learned from the DevOps experience and owns the engineering self-service experience as defined in "What is Platform Engineering?":
"Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era."
The idea is that if developers get a self-service experience with pre-defined infrastructure for deploying engineering projects, then deploying code becomes less time-consuming for them.
CloudOps Teams
The definition given by Professional DevOps.com puts CloudOps at the center of a business operational focus:
"...CloudOps provides organizations with proper (cloud) resource management. In an organization, CloudOps uses DevOps principles and IT operations applied to a cloud-based architecture to speed up the business processes."
This is a shift towards operations focusing specifically on the cloud-native infrastructure rather than on the other infrastructures available in an organization. Once the dependency on past infrastructure choices has been reduced, these teams are scaled up to improve the development architecture (infrastructure in the cloud). They focus on simplifying cloud provisioning and application deployment to the cloud, and they are big users of observability platforms for both applications and infrastructure in the cloud.
Site Reliability Teams
Oscar Wilde once said, "With age comes wisdom, but sometimes age comes alone." As organizations become more active in a cloud-native world and scale up to full CloudOps teams alongside their DevOps teams, another role is emerging to fill a gap left behind. That role is the SRE, whose focus is not limited to the cloud-native infrastructure. As noted by Chris Tozzi:
"...an SRE is an all-purpose role that aims to manage reliability for any type of environment."
SREs have to use both IT operations and development strategies to ensure that there is a focus on one thing, and one thing only: reliability. It's a full-time job avoiding downtime, optimizing the performance of all applications, and supporting infrastructure, regardless of whether it is in the cloud-native world or not. Together with CloudOps teams, SREs are very active players in cloud-native observability and the platforms used to assist them. They have a vested interest in cloud or multi-cloud security, costs, deployment automation, and all things that help observability at scale.
Central Observability Teams
The newest evolution was predicted by Martin Mao back in December 2021:
"This team is responsible for defining observability standards and practices, delivering key data to engineering teams and managing the tooling and storage of observability data, among other things."
This team has become more the norm than the exception over this last year as organizations investing in cloud-native at scale ramp up their observability practices. Their main focus is to define standards and practices that can be used by everyone, thus centralizing observability in their organization.
The following are four functions that the central observability team should own:
- Define: Define monitoring standards and practices.
- Deliver: Provide monitoring data to engineering teams; it must be in a format they are familiar with (e.g., Prometheus).
- Measure: Ensure reliability and stability of monitoring solutions.
- Manage: Manage tooling and storage of metrics data. Make it simple: if it takes a ninja, people won’t use it.
This has been a quick, down-and-dirty look at the teams on the field. Now let’s move on to the game.
The Observability Game
This series has now taken you from the basic introduction through a tour of the o11y playing field, where you've met the players on the teams involved in cloud-native o11y.
Next up, I want to dive deeper into the pillars of monitoring and why at scale you might want to start thinking about the phases of cloud-native o11y instead.
Published at DZone with permission of Eric D. Schabell, DZone MVB.