Often you want to automatize something using shell scripting. In a perfect world your script robot works for you without getting tired, without hiccups, and you can just sit at the front of your desk and sip coffee.
Then we enter the real world: Your network is disconnected. DNS goes downs. Your HTTP hooks and downloads stall. Interprocess communication hangs. Effectively this means that even if your script is running correctly from the point of operating system it won’t finish its work before you finish your cup of coffee.
Below is an example how to create timeouts and notifications in a shell script.
Automation must work 100% and you must always know if it doesn’t do that. Otherwise if you cannot trust the automated systems you could have this glorious moment of “oh it has been broken for three months now.” No amount of coffee makes your day after that. Then you spend your nights pondering whether your scripting is running instead of playing Borderlands 2.
Some safety guards around your coffee cup include
- Sane timeout thresholds for commands to avoid hang situations. E.g. your automated “make smallapp” should probably not run for 50 hours straight.
- Get a notification if a fully automatized process does not end up as expected. So that you find out the problem right away, not three months later. Do not use email as the communication channel, it is unreliable and has awful signal-to-noise ratio. Better solutions include instant messengers (Skype), SMS (check twilio.com)
- Get a notification also when a service, which should be running all the time, is restarted.
The shell scripting (sh, zsh, bash) does not offer built-in tools for timeouting commands by default (please correct me if I am wrong, but stackoverflow.com et. al folks did not know any better). In Bash cookbook, there exist a timeout wrapper script example (The orignal SO.com answer).
The trick is that when the timeout wrapper terminates the command, you’ll get the exit code of SIGTERM (143) or SIGKILL (137) to the parent script. This might not be entirely clear from reading the script.
3. Using timeout wrapper and sending failure notifications to Skype
Below is an example how you can hook timeout to your own script. Here we have a simple continous integration script (ghetto-ci) which polls version control repositories and on any change executes the test suite. A test failure is reported back to Skype by ghetto-ci using Sevabot Skype bot. The loop script uses a server specific (Ubuntu) way to set up a headless X server, so that the tests can run Firefox on the server (can one call a headless Firefox server “mittens”?) The continous integration loop is deployed simply as leaving it running on the screen‘ed terminal on a server.
Some protection we do: The script has extra checks to protect against hung processes by killing them using pkill before starting each test run. Selenium’s WebDriver seem to often cause situations where the Python tests don’t quit cleanly, leaving around all kind of zombie processes. In this case, the test command timeouts using the timeout wrapper and we get a notification to Skype. Based on this we can refine the test running logic, timeout delay and such to make the script more robust allowing us to focus more on drinking the coffee and less watching a looping process running in a UNIX screen for potential failures.
Pardon me for possibly ugly shell scripting code. SH is not my primary language. The example code is also on Github. Please see the orignal blog post for syntax colored example.