Top 9 Skills for SREs From Ex-Instacart SRE

In this article, I will share a list of the top nine SRE skills, from incident management to cloud computing, to networking and beyond.

By Quentin Rousseau · Feb. 11, 22 · Opinion

It’s easy to talk at a high level about what Site Reliability Engineers (SREs) do: They ensure that IT systems achieve availability and performance requirements.

But which skills, exactly, do SREs need to do their jobs? That’s a more complicated question.

To answer it, this article walks through the top nine skills that modern SREs (or aspiring SREs) should master. SRE skills vary from one team to the next, depending on the types of systems the team manages and the reliability challenges it faces. Still, virtually all SREs need a core set of skills for understanding and managing the complex, distributed systems that most organizations run today.

Without further ado, here’s a breakdown of top SRE skills.

Networking Expertise

The network plays a pivotal role in connecting modern, distributed environments. As such, it’s often the culprit when something goes wrong -- a lesson that Facebook, for example, learned in October 2021, when a faulty network configuration change took its services offline worldwide.

Situations like this are why SREs should master networking concepts. Even if their organization also employs networking engineers, SREs need a deep understanding of networking themselves to know when the network is the root cause of an incident and how to resolve network-caused issues effectively.
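
As a rough illustration of telling a network-level failure apart from other causes, here is a minimal Python sketch, using only the standard library, that classifies a connection attempt as a name-resolution problem, a TCP-level problem, or reachable. The function name `classify_connectivity` and the returned labels are hypothetical, chosen for this example; real triage would draw on many more signals than this.

```python
import socket


def classify_connectivity(host, port, timeout=3.0):
    """Classify a connection attempt to host:port.

    Returns one of three illustrative labels:
      "dns_failure" - the hostname could not be resolved
      "tcp_failure" - the name resolved, but no address accepted a TCP connection
      "reachable"   - a TCP connection was established
    """
    try:
        # Resolve the hostname to one or more socket addresses.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Try each resolved address until one accepts the connection.
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as conn:
                conn.settimeout(timeout)
                conn.connect(sockaddr)
                return "reachable"
        except OSError:
            # Refused, timed out, unreachable, and so on: try the next address.
            continue

    return "tcp_failure"
```

Calling `classify_connectivity("example.com", 443)` during an incident gives a quick first hint about where to look: a `"dns_failure"` result points toward name resolution, while `"tcp_failure"` points toward routing, firewalls, or the service itself not listening.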

Cloud Computing

Like networking, cloud computing is a category of skill that modern SREs can’t live without.

The reason is almost self-explanatory: Around 90 percent of businesses use the cloud, and you can’t manage reliability for cloud environments very well if you don’t understand cloud architectures, cloud networking, cloud data storage, cloud observability, and so on.

CI/CD Pipelines

SREs don’t typically help to develop software, but they nonetheless need a deep understanding of how software is written and deployed -- which, at most organizations today, is a process that happens via a CI/CD pipeline.

It’s hard to engineer reliability if you don’t know how to address reliability problems that emerge from application source code or deployment processes. Understanding how CI/CD processes work and which tools drive them is key for virtually every SRE today.

Linux and Unix

If you come from a Windows background but you want to be an SRE, there’s no getting around it: You’ll need to learn how to work with Linux and other Unix-like systems in addition to Windows.

That’s because, even at organizations that don’t rely heavily on Linux servers, you’re likely to find that Linux and Unix concepts are deeply embedded within other systems that you have to work with. Most public cloud management tools follow the conventions of Linux CLI tools, for example. So do systems like Docker and Kubernetes, even if you run them in a Windows environment.

Quality Assurance and Software Testing Automation

SREs also don’t usually help to test software pre-deployment. That task falls to developers or quality assurance engineers.

Nonetheless, understanding how software is tested -- and how to use test automation to speed up tests and expand test coverage -- is a vital SRE skill. After all, the more thoroughly and efficiently your team can test software, the greater your chances of catching reliability problems pre-deployment, when they are easier to fix and pose a much lower risk to the business.

Security Engineering and Response

Security is another domain that SREs don’t “own,” but where they nonetheless require significant skills. Indeed, good reliability engineering makes security a priority, and vice versa. SREs who don’t understand security fundamentals are at risk of implementing solutions that are effective from a reliability standpoint, but not necessarily secure.

DevOps

Although SREs are not DevOps engineers, SRE and DevOps are closely related domains. SREs at most organizations today will be expected to understand DevOps concepts and, in many cases, work alongside DevOps teams.

So, plan to master DevOps skills as part of your path to becoming an SRE.

Incident Management

Perhaps the single most important type of skill for SREs to learn is incident management. Although many roles may participate in incident response, SREs usually take the lead in organizing the incident response team, communicating with stakeholders, and devising the best strategy for resolving each incident as quickly as possible.

This means SREs should know how incident response roles are structured and understand incident response concepts. They should also be familiar with incident response platforms that automate the complex processes required to ensure rapid, effective incident resolution.

Postmortems

In addition to overseeing incident response, SREs may be tasked with managing postmortems. Knowing how to run a postmortem -- as well as when a postmortem is necessary, and when it makes sense to use a “blameless” postmortem approach -- is an essential SRE skill.

Conclusion

The list of SRE skills could certainly go on; the ones above are only the most fundamental for most modern environments. But if you’re just starting out on your journey to becoming an SRE, the nine skill domains described above are a good place to begin acquiring the knowledge you’ll need to excel in an SRE career.

Opinions expressed by DZone contributors are their own.
