Cracking the SRE Interview

This article discusses the skills that companies look for in a candidate when interviewing for a Site Reliability Engineering role.

By Krishna Vinnakota · May. 23, 24 · Opinion

This article discusses the skill set that companies expect of Site Reliability Engineers (SREs). I have worked as a Site Reliability Engineer at companies such as Amazon, Microsoft Corporation, and TikTok. I have attended numerous interviews for Site Reliability Engineering roles and have interviewed other engineers for SRE roles at the companies where I worked.

The Site Reliability Engineer role goes by different titles at different companies. For example, Google calls this role Site Reliability Engineering, Microsoft used to call it Service Engineering, Amazon calls it Systems Development Engineer, Meta calls it Production Engineering, and a few other companies call it DevOps. These roles have many requirements in common.

Let's look into various skills that companies, especially the big technology companies, look for while interviewing engineers for these roles.

Coding

One of the important skills that SREs need is coding, since automating repetitive tasks and writing tools to manage infrastructure efficiently are an important part of the SRE job. Companies test the candidate's coding skills through coding interviews. Usually, these interviews are of two types.

The first type of coding interview focuses on standard data structures and algorithms. Coding challenges from websites like LeetCode or HackerRank will help you practice for this type of interview. The second type of coding interview focuses on coding challenges that emulate some of the day-to-day tasks SREs work on, for example, reading data from files and processing it.

Companies are usually open to candidates using any programming language but, based on my experience, coding in Python would be helpful since it is easy to implement solutions in Python and the majority of SREs use Python for day-to-day automation.
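
As an illustration of the second type of coding interview, here is a minimal sketch of a file-processing task: counting HTTP status codes in an access log and computing the share of server errors. The log format, the sample data, and the function names are assumptions made up for this example, not taken from any particular interview.

```python
from collections import Counter
from typing import Iterable


def count_status_codes(lines: Iterable[str]) -> Counter:
    """Count HTTP status codes in access-log lines.

    Assumes a hypothetical space-separated format in which the status
    code is the second-to-last field:
    '<ip> <method> <path> <status> <bytes>'.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        status = fields[-2]
        if status.isdigit():
            counts[status] += 1
    return counts


def error_rate(counts: Counter) -> float:
    """Return the fraction of requests that ended with a 5xx status."""
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code.startswith("5"))
    return errors / total if total else 0.0


# Made-up sample data standing in for a real log file.
SAMPLE_LOG = """\
10.0.0.1 GET /index.html 200 512
10.0.0.2 GET /missing 404 0
10.0.0.1 POST /api/orders 500 87
10.0.0.3 GET /index.html 200 512
"""

counts = count_status_codes(SAMPLE_LOG.splitlines())
print(counts.most_common())
print(f"5xx error rate: {error_rate(counts):.2%}")
```

Because count_status_codes accepts any iterable of lines, the same function works on an open file object, which reads the file line by line instead of loading it into memory at once.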

System Design

The second important skill that an SRE needs is a solid understanding of large-scale distributed systems. Companies assess this knowledge by asking system design questions during the interviews. An example question for a system design interview is "Design a logging service." These questions tend to be vague, and it is important to ask plenty of clarifying questions before coming up with a design. A few key things to focus on as an SRE while designing a system are its scalability, reliability, and security. It is also important to focus on the non-abstract parts of the system, such as capacity planning.
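
To make the capacity planning point concrete, here is a rough back-of-the-envelope sketch for the "Design a logging service" question. All input numbers (event rate, event size, retention, replication factor) are hypothetical assumptions chosen for illustration; in an interview they would come from the clarifying questions.

```python
def logging_capacity(events_per_second: int, avg_event_bytes: int,
                     retention_days: int, replication_factor: int) -> dict:
    """Estimate ingest bandwidth and storage for a logging service.

    Uses decimal units (1 MB = 1e6 bytes, 1 TB = 1e12 bytes) and ignores
    compression and indexing overhead to keep the arithmetic simple.
    """
    ingest_bytes_per_second = events_per_second * avg_event_bytes
    daily_bytes = ingest_bytes_per_second * 86_400  # seconds per day
    stored_bytes = daily_bytes * retention_days * replication_factor
    return {
        "ingest_mb_per_second": ingest_bytes_per_second / 1e6,
        "daily_tb": daily_bytes / 1e12,
        "stored_tb": stored_bytes / 1e12,
    }


# Hypothetical inputs: 100,000 events/s, 500 bytes each,
# kept for 30 days with 3 replicas.
estimate = logging_capacity(100_000, 500, 30, 3)
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

With these assumed inputs the estimate comes to about 50 MB/s of ingest, 4.32 TB of new data per day, and 388.8 TB stored, which is the kind of number that drives decisions about sharding, disk count, and network bandwidth.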

Operating Systems

A deep understanding of operating systems, especially Linux, is invaluable for an SRE. Companies assess this knowledge through interviews focused on the Linux operating system. The questions may cover topics such as popular commands to administer and troubleshoot Linux, the Linux kernel, system calls, troubleshooting performance issues, and the memory, network, disk, and process subsystems of Linux.
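
As a small example of working with the memory subsystem, the sketch below parses text in the format of /proc/meminfo and computes the percentage of memory in use. The function names and the sample values are made up for illustration; on a real Linux machine the text would come from reading /proc/meminfo, and the MemAvailable field is assumed to be present (it is on reasonably recent kernels).

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict) -> float:
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Made-up sample in the same layout as /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB
Buffers:          512000 kB
Cached:          4096000 kB
"""

meminfo = parse_meminfo(SAMPLE_MEMINFO)
print(f"Memory used: {memory_used_percent(meminfo):.1f}%")
```

The sketch uses MemAvailable rather than MemFree because free memory excludes page cache that the kernel can reclaim, a distinction that often comes up in Linux troubleshooting questions.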

Computer Networking

A good understanding of the TCP/IP model and its protocols is a great skill for an SRE, as it helps in troubleshooting production issues and designing infrastructure. A few protocols that are worth understanding in depth are HTTP, TLS, DNS, TCP, UDP, IPv4, IPv6, ARP, and ICMP. It is also useful to know which tools can be used to analyze each of these protocols.
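
One way to deepen protocol knowledge is to decode a header by hand. The sketch below unpacks the fixed 20-byte part of an IPv4 header using Python's struct module. The sample packet is fabricated for the example (documentation-range addresses, checksum left at zero), and the function name is an assumption of this sketch.

```python
import socket
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte portion of an IPv4 header."""
    if len(packet) < 20:
        raise ValueError("IPv4 header needs at least 20 bytes")
    # '!' selects network (big-endian) byte order with no padding.
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    return {
        "version": version_ihl >> 4,
        # IHL counts 32-bit words, so multiply by 4 to get bytes.
        "header_bytes": (version_ihl & 0x0F) * 4,
        "total_length": total_length,
        "ttl": ttl,
        "protocol": protocol,  # 1 = ICMP, 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Fabricated header: version 4, IHL 5, total length 40, TTL 64, TCP.
# The checksum is left as 0 and is not validated by this sketch.
sample_header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
    socket.inet_aton("192.0.2.1"), socket.inet_aton("198.51.100.7"))

print(parse_ipv4_header(sample_header))
```

Packet analysis tools present these same fields in decoded form, so knowing the layout makes their output easier to read during an incident.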

SRE Best Practices

Companies often look for candidates who understand the SRE best practices related to topics such as observability (alerts, metrics, logs, traces, dashboards, etc.), incident management, change management, automation, operational excellence, and capacity planning. The topics may also include concepts such as SLI/SLA/SLO, MTTR/MTTA/MTTI, etc.
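
A short worked example can help with SLO questions. The sketch below computes the error budget implied by an availability SLO and compares it with a list of incident durations. The SLO target, the window, and the incident data are hypothetical, and MTTR is taken here to mean mean time to recovery, which is one of several common readings of the acronym.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def mean_time_to_recovery(durations_minutes: list) -> float:
    """Average incident duration in minutes."""
    return sum(durations_minutes) / len(durations_minutes)


# Hypothetical target and incident data for one 30-day window.
budget = error_budget_minutes(99.9, 30)
incidents = [12, 8, 10]  # minutes of downtime per incident
remaining = budget - sum(incidents)

print(f"Error budget: {budget:.1f} min")
print(f"MTTR: {mean_time_to_recovery(incidents):.1f} min")
print(f"Budget remaining: {remaining:.1f} min")
```

Under these assumptions a 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so three incidents totaling 30 minutes leave roughly 13.2 minutes of budget.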

Work Experience

This category includes questions about the kinds of projects that you have worked on in your current and previous jobs. Interviewers typically ask about a specific project that the candidate worked on and dive deep to understand aspects such as the complexity of the project, the challenges faced and how the candidate overcame them, and what the candidate learned from any failures.

Infrastructure

A key responsibility of SREs is to design, deploy, and maintain infrastructure components such as Kubernetes, SQL databases, NoSQL databases, message queues, load balancers, and Content Delivery Networks. Knowledge of and experience with major cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) is another important aspect that companies look for in a candidate. Depending on the team the position is in, companies may assess the engineer's understanding of one or more of these infrastructure components.

Troubleshooting

Being part of the on-call rotation is an essential part of an SRE's job. Effective troubleshooting skills are important because resolving user-impacting issues under time pressure is critical for maintaining the uptime of the services. SREs combine their knowledge of various technologies and systems with their experience operating services in production to troubleshoot issues. Companies assess troubleshooting skills by asking how the engineer would solve a given hypothetical issue. Approaching the problem methodically and showing an understanding of distributed systems is important in this type of interview.

Behavioral

Every company has its unique culture, values, and leadership principles. Behavioral interviews probe whether the engineer matches the company's culture. The questions tend to focus on how the engineer acted in the past in a similar situation. An example question is "Tell me about a time when you had to disagree with your manager." A popular way to answer such questions is the STAR method, which stands for Situation, Task, Action, and Result.

Conclusion

Site Reliability Engineering is a challenging role that requires a deep understanding of various technologies. By focusing on these key skills, one can become a great Site Reliability Engineer, crack challenging technical interviews, and have a rewarding career. Happy interviewing!

