A Day in the Life of an SRE
We interview Paul Greig about his experiences as a Site Reliability Engineer, uncover some of the day-to-day tasks that an SRE encounters, and discuss what he's regularly reading to stay up to date.
I recently had the opportunity to interview Paul Greig, an SRE (site reliability engineer) who previously worked at both Atlassian and with various groups "in the world of hedge funds." He'll be speaking at the upcoming All Day DevOps virtual conference, where he's hosting the session "Who Wants PIE? A Series of Post Incident Experiments," which is part of the Site Reliability Engineering track.
Below, I talk a bit with Paul about his experiences as an SRE, uncover some of the day-to-day tasks that an SRE encounters, and discuss what he's regularly reading to stay up to date.
Enjoy the interview!
DZone: So, what exactly is site reliability engineering? Can you paint us a picture of your day-to-day—how much coding is involved, how much of it is more ops/admin focused?
Paul: We aim for a minimum of 50% of the team contributing towards the reliability of the services they cover. When teams have less proactive time, we increase our focus on toil reduction. Toil that contributes to a reliable customer experience is invaluable, both for the end user's experience and as an opportunity to reduce SRE time spent through service improvements.
You’ve been an SRE at Atlassian for five years now and prior to that worked in the hedge fund world doing similar work—from your description on ADDO “ensuring the reliability of high-frequency ultra-low latency trading engines and market data from the world’s stock exchanges.” Can you tell us a bit about how your experience in trading technology and reliability prepared you to lead groups ensuring reliability for the JIRA and Confluence Cloud services?
For me, I'm very customer focused. At a hedge fund, my customers were the trading desks, whereas at Atlassian they were the users of our services. Being responsible for a service's reliability has two prerequisites:
1) Understanding your service's or trading engine's current resilience measures
2) Assuming those measures will be insufficient at some future point in time and will need maintenance
What are some of the major differences between these two experiences?
Being familiar with the language of the customers. In trading, we considered the exposure during an incident to be financial, whereas at Atlassian the exposure is to the teams being impacted in their use of the service. These both drive the same urgent response and project rosters from SRE teams, but you first need to know what your customers care about.
Along those same lines, can you tell us a bit about the changes you’ve seen in the SRE world during this time?
Maturity of monitoring and how this has spurred the adoption of service level objectives. Where once we may have relied on basic system metrics, log lines, and a sysadmin team, today we have tracing and extensive instrumentation, companies have centralized observability groups, and everyone is familiar with monitoring and observability concepts.
How have you seen the growth and maturity of DevOps affecting your work, if at all?
DevOps encourages service owners to consider not only the mechanisms for faster releases but also how to protect their teams from rework by monitoring those incremental changes to a service. A service's operational maturity is now considered more regularly as services are built, which allows SREs and DevOps practitioners to share their knowledge with all teams.
Can you tell us about any specific, memorable incidents that caught you and your team off guard?
One that calls back to the example above: know your resiliency measures and assume they will be insufficient in the future. The event itself is well known. Back in 2012, Knight Capital released an unintended trading strategy into production, with an impact of $440 million in 45 minutes. The event led to stronger adoption of kill-switches between brokers and exchanges.
What did you do in the immediate aftermath to respond?
In the days after Knight Capital, we started daily tests of the kill-switches and their effectiveness in removing exposure in the case of an unexpected event.
How did you adapt and learn from this experience?
The event really sticks with me for the importance of ensuring a blameless culture in post-incident reviews and of learning from others' experiences. If teams were to blame only an individual for this change, the likely result would be more process and manual verification for every release. Digging past the person encourages people to accept that such changes will one day arrive in production, and to ask how we can be best placed to mitigate their impact when that happens.
Lastly, we’re a community of readers here on DZone. Can you tell us any people or blogs that you’re following or reading regularly that you think would be helpful for folks hoping to keep up to date and learn more on SRE best practices?
My favorite recent book is Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, PhD, Jez Humble, and Gene Kim.
I regularly follow the articles Mike Amundsen shares on his personal blog.
Finally, I administer an external group called Guild of Reliability Engineers (GoRE) where we are building and sharing SRE experiences and discussions for the Asia Pacific regions. Subscribe to the group on https://groups.google.com/forum/#!forum/guild-of-reliability-engineers-gore.
Thanks for the interview!
All Day DevOps 2018
The free, online conference goes live on October 17th, offering 100 different practitioner-led sessions, each one 30 minutes long. With five separate tracks (CI/CD, Cloud-Native Infrastructure, DevSecOps, Cultural Transformations, and Site Reliability Engineering) and 100 speakers, there's sure to be something for everyone.
And speaking of everyone, if you're part of an organization with 20+ people that want to attend the conference (again, it's free!) then you should consider joining the Club 20 program so that you might get your company logo added to the ADDO site. Check out some of the Club 20 participants here and consider joining them.
Hope to see you online at the show!