Efficient Retry/Backoff Mechanisms
In day to day life as a systems engineer there aren’t too many opportunities to build solutions based purely on theoretical studies or research papers – generally solutions are built for you and you “just” need to make them work, tie up the spaghetti, shave the yak and so on. Entering the world of programming and software development though, a lot of new problems come up that you can find your own solution to. On one hand this can be extremely liberating – speaking from experience, having for the first time in my career being employed as a software engineer for just over a year now. You can let your creativity go wild, dream up whatever crazy solution and then implement it and see if it works.
On the other hand, there is a vast body of existing work out there to build on, almost exclusively by people smarter than you are and who have put a lot of effort and thought into solving a specific problem. Some good examples in recent times are Paxos and similarly Raft consensus algorithms, F1 Distributed Database, BigTable, Map/Reduce and many many more. A great blog that I try to keep on top of is Werner Vogels’ All Things Distributed, where he maintains a mix of AWS-related announcement-style posts and musings on the various research papers he is reading while on flights between engagements. Needless to say that the Google Research Papers are another valuable source of information, as well as a lot of things coming out of the various SIG* organizations (e.g. SIGOPS, SIGGRAPH).
But back to my point. Having recently coded up a few systems which hurl copious amounts of requests at various AWS services, I very quickly found out that you WILL get errors and need to handle them in a smart way. Not retrying is not an option, as is retrying indefinitely with no limit. Linearly-spaced retries with a limit tend to also suffer from abusing the target system beyond what it can tolerate, before giving up entirely and perhaps prematurely. Once you start to implement a system with exponential backoff similar to how most multi-access medium algorithms work (think Ethernet with CSMA/CD), you are finally starting to get somewhere.
Someone at $EMPLOYER recently linked to this paper which details backoff mechanisms utilising a Fibonacci sequence. It is very interesting work, and taking the time to read and understand such research allows you to build stronger, more resilient software with characteristics far exceeding what you could have come up with on your own. This in itself I find fascinating and addictive. I strongly recommend giving it a read and utilizing this approach to building software.