Ten years ago, the world of web development was very different from how it is today. The traditional developer flow of design, code, test, deploy has been augmented with a larger focus on infrastructure, thanks to the emergence of platforms like Heroku and Google App Engine. So here’s a quick introduction to what you need to know about developer operations in 2012.
- The Infrastructure Is The Key To Application Adoption
It’s time you realised that there’s no point trying to ignore infrastructure anymore. It’s everything to your application - good design, tests and code count for little if the application doesn’t run well on production servers.
- Everything Is Infrastructure
From the application server to your monitoring applications and virtual machines, everything is part of the infrastructure, and it can break at any time. The more pieces in the puzzle, the more parts that can cause a failure.
- Take Time To Set Up The Hardware
Even the cloud is based on hardware, and you’ll need to be familiar with everything related to having a rack of servers. Know the strengths and weaknesses of all parts of your setup - decisions will be easier to make as a result.
- Know The Operating System
You’ll probably be running on a Unix-based system, so know the fundamentals of these operating systems. Know what swap space is, and know what happens when you run out of memory.
- No Problem Fixes Itself
You’re the operations guy, and you can’t sweep problems under the carpet. If something broke, find out why. Analyze every bit of data available to you in order to get to the bottom of the issue.
- Deploy Early, Deploy Often
Don’t leave deployment until the end, or even halfway through development. The minute you have something that can run, even just a “hello world”, build it, deploy it and see it working for yourself.
Now that you’ve got deployment ready early, automate it. Make it simple to iterate - if you’ve done something twice, and you think you might need to do it again, write a script for it. It pays off in the long run.
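As a rough sketch of what such a script might look like, here is a small Python runner that executes deployment steps in order and stops at the first failure. The step names and commands are placeholders invented for illustration (each one only prints a message); you would swap in your own test, build and release commands.

```python
import subprocess
import sys

# Hypothetical steps: each command only prints a message here.
# Replace them with your real test, build and release commands.
DEPLOY_STEPS = [
    ("test", [sys.executable, "-c", "print('running tests')"]),
    ("build", [sys.executable, "-c", "print('building package')"]),
    ("release", [sys.executable, "-c", "print('releasing to servers')"]),
]


def deploy(steps=DEPLOY_STEPS):
    """Run each step in order and stop as soon as one exits non-zero.

    Returns a tuple: (names of completed steps, success flag).
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"step {name!r} failed with exit code {result.returncode}, aborting")
            return completed, False
        completed.append(name)
    return completed, True
```

Calling `deploy()` runs all three placeholder steps. Stopping at the first failing step is a deliberate choice here: a release should never go out after a failed test run.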
- Beware New Technology
Don’t forget that every piece of the infrastructure introduces a new point of failure. Know what the risks are, and have a fallback plan in case it doesn’t work out. And don’t rush for the latest shiny technology!
- Use Feature Flips
There’s no need to make new features available across the board. You don’t want to frustrate users, and you probably want to test new features with a beta group first. A feature flip means that if something is breaking, it can be turned off without taking down the entire application. This goes back to what we all know about modularity being good.
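To make the idea concrete, here is a minimal sketch of a feature flip in Python. Everything in it is an assumption made for illustration: the `FLAGS` table, the `new_uploader` flag and the `is_enabled` helper are hypothetical names, and a real application would more likely keep its flags in configuration or a database so they can be changed without a deploy.

```python
import hashlib

# Hypothetical in-memory flag table, for illustration only.
FLAGS = {
    "new_uploader": {
        "enabled": True,          # master switch: set to False to turn the feature off
        "rollout_percent": 10,    # share of ordinary users who see the feature
        "beta_users": {"alice"},  # users who always see it while it is enabled
    },
}


def is_enabled(flag_name, user_id, flags=FLAGS):
    """Return True if the feature should be shown to this user."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    if user_id in flag["beta_users"]:
        return True
    # Hash the flag and user together so each user lands in a stable bucket (0-99).
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]
```

In request-handling code this would typically wrap the new code path, for example `if is_enabled("new_uploader", user_id): ...`, with the old path in the `else` branch. Flipping `enabled` to `False` switches the feature off for everyone, beta users included.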
- If You're Going To Fail, Fail Well
Think of the Twitter Fail Whale - if something’s broken, let the user know with helpful messages. Inform them that you’re working on the situation. Simulate failures of the external services you depend on to find out what actually happens.
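One way to picture this is a page renderer that catches a failing dependency and shows a friendly message instead of a stack trace. The sketch below is purely illustrative: `render_timeline`, `broken_service` and the message text are made-up names, and the "outage" is simulated by raising an exception.

```python
FRIENDLY_MESSAGE = (
    "Something went wrong on our side. "
    "We're working on it, please try again in a few minutes."
)


def render_timeline(fetch_items):
    """Render items from an external service, degrading gracefully on failure."""
    try:
        items = fetch_items()
    except OSError:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        return FRIENDLY_MESSAGE
    return "\n".join(items)


def broken_service():
    """Stand-in for an external service that is currently down."""
    raise ConnectionError("simulated outage")
```

Passing `broken_service` to `render_timeline` is a cheap way to rehearse an outage: you see exactly what the user would see before it happens in production.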
- Supervise Everything
Data is everything, so you’ll want to have a good monitoring system in place, with alerts for thresholds. Try to get your levels/alerts right at the start, so you don’t start ignoring them. If a process crashes, there should be procedures in place to restart it or notify you.
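The restart-or-notify part can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for a real process supervisor: `supervise` is a hypothetical helper, the restart limit and back-off are arbitrary, and `notify` defaults to `print` where a real setup would page someone or send an email.

```python
import subprocess
import time


def supervise(command, max_restarts=3, backoff_seconds=1.0, notify=print):
    """Run a command and restart it when it exits with a non-zero code.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return restarts  # clean exit, nothing left to supervise
        if restarts >= max_restarts:
            notify(f"giving up on {command!r} after {restarts} restarts")
            return restarts
        restarts += 1
        notify(f"{command!r} exited with code {exit_code}, restart {restarts}/{max_restarts}")
        time.sleep(backoff_seconds)
```

Capping the number of restarts matters: a process that crashes immediately on every start would otherwise loop forever and bury the real problem.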
- Use Science - Measure
If your application takes too long to execute an operation (think video upload), the user will move to a different provider. Add metric recording to everything, instrumenting your code in key areas. You should have a dashboard where you get a useful overview of the data.
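As a minimal sketch of what instrumenting a key area can look like, the Python snippet below times a block of code and keeps the samples in memory. The names `timed`, `summary` and `TIMINGS` are invented for this example; in practice you would probably ship the numbers to whatever metrics system feeds your dashboard.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store: metric name -> list of durations in seconds.
TIMINGS = {}


@contextmanager
def timed(name, store=TIMINGS):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store.setdefault(name, []).append(time.perf_counter() - start)


def summary(name, store=TIMINGS):
    """Return count, average and maximum duration for one metric."""
    samples = store.get(name, [])
    if not samples:
        return {"count": 0, "avg_seconds": 0.0, "max_seconds": 0.0}
    return {
        "count": len(samples),
        "avg_seconds": sum(samples) / len(samples),
        "max_seconds": max(samples),
    }
```

Usage is a one-line wrapper around the slow operation, for example `with timed("video_upload"): handle_upload(request)`, where `handle_upload` stands for your own code.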
- Use Timeouts Everywhere
Don’t allow requests to pile up. Have timeouts so that calls to external resources don’t block up the application.
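Here is a small illustration using Python's standard `socket` module. The `fetch_with_timeout` helper, the `PING` request and the two-second default are all assumptions made for the example; the point is only that both the connect and the read give up after a bounded time instead of waiting forever.

```python
import socket


def fetch_with_timeout(host, port, request=b"PING\n", timeout_seconds=2.0):
    """Send a request and return the reply, or None if the peer is too slow.

    The timeout applies to connecting as well as to each send and receive.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.sendall(request)
            return conn.recv(1024)
    except socket.timeout:
        # The external resource did not answer in time; let the caller decide
        # whether to retry, use a cached value or show an error.
        return None
```

Other failures, such as a refused connection, still raise their usual exceptions in this sketch. Most HTTP and database client libraries expose a similar timeout setting, and it is worth checking what their default is.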
- Know Your Core - The Database
Know how the index is built, understand what happens when you run a query. The database is key to your application, so you need to know it well, and understand how to get the best performance possible from it.
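A quick way to see what happens when you run a query is to ask the database for its query plan. The sketch below uses SQLite through Python's built-in `sqlite3` module simply because it needs no setup; the `users` table, the `idx_users_email` index and the `query_plan` helper are invented for the example, and other databases have their own `EXPLAIN` variants with different output.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable steps SQLite plans to take for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT * FROM users WHERE email = ?"

# Without an index, the plan is a scan over the whole table.
plan_before = query_plan(conn, lookup, ("user42@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index in place, the plan becomes a search that uses it.
plan_after = query_plan(conn, lookup, ("user42@example.com",))

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the shift from a full scan to an index search is the thing to look for.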
- Love Logging
Even though you shouldn’t expect to get it right first time, logging is one of the most important things in your application. Make sure that you add logging to the key operations. As you find issues, continue to adapt your logging, improving its reliability and relevance.
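As a small example of logging around a key operation, here is a sketch using Python's standard `logging` module. The `uploads` logger name and the `process_upload` function are hypothetical; the pattern is to log when the operation starts, when it finishes, and, with a traceback, when it fails.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("uploads")


def process_upload(filename, size_bytes):
    """Pretend to process an upload, logging each key step."""
    logger.info("upload started file=%s size=%d", filename, size_bytes)
    try:
        if size_bytes <= 0:
            raise ValueError("empty upload")
        # ... the real work would happen here ...
    except ValueError:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("upload failed file=%s", filename)
        return False
    logger.info("upload finished file=%s", filename)
    return True
```

Putting the same `file=...` field in every message makes it easy to follow one upload through the logs with a simple search.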
- Get Used To The Command Line
You might have a cool dashboard, but there will come a time when everything else is broken, and you’ll need to revert to the command line. Know your way around Unix, how to set key parameters and how to make your way through the application logs.
- Everything Breaks When You Scale
As you scale out, more parts are added to the infrastructure. And like I said earlier, the more things that are added, the more points of failure that exist.
- Embrace Failure
And speaking of failure, treat it like a good thing. If you know what caused a problem, and you’ve fixed it, then it shouldn’t happen again, right? Analyze everything when you see a failure to zero in on the cause, and put measures in place to stop it from happening again.
All of these points come from Mathias Meyer’s excellent post “Web Operations 101 For Developers”.