1. Avoid destructive commands
From time to time I’m working with new recruits and bringing them up to speed in operations. The first thing I emphasize is care with destructive commands.
What do I mean here? Well there are all sorts of them. SQL commands such as DROP table & DROP database. But also TRUNCATE and DELETE are all destructive. They’re easy to execute but harder to undo. Think of all the steps it would take to restore from your backup.
If you are logged in as root there are many many ways to shoot your own foot. I hope you know this right? rm has lots of options that can be very difficult to step back from like -r (recursive) and -f (force). Better to not use the command at all and just move the file or directory you’re working on by renaming it. You can always delete later.
2. Set your command prompts
When working on the command line, your prompt is crucial. You check it over and over to make sure you’re working on the right box. At the OS, your prompt can tell you if you’re root or not, what directory you’re sitting in, and what’s the hostname of the box. With a few different terminals open, it’s very easy to execute a heavy loading command or destructive command on the wrong box. Check thrice, cut once!
You can also set your mysql prompt too. This can provide similar insurance. It can tell you the database schema you’re set at default, and the user you’re logged in as. Hostname or localhost too. It is one more piece in the risk aversion puzzle.
3. Perform backups & test them
I know I know, we’re all doing backups already. Well I sure hope so. But if you’re getting on a system for the first time, it should be your very initial impulse to check and find out what types of backups are being done. If they’re not, you should set them up. I don’t care how big the database is. If it’s an obstacle, you need to sell or educate management on what might happen if. Paint some ugly scenarios. It’s not always easy to see urgency in these things without a good war story or two.
We wrote a guide to using xtrabackup for hotbackups. These can be done online even while your production database is serving customers without table locking or other downtime.
4. Stay off production machines
This may sound funny to some of you, but I live by it. If it ain’t broke, don’t go and try to fix it! You don’t need to be on all these boxes all the time. That goes for other folks too. Don’t give devs access to every production box. Too many hands in the pie so to speak. Also limit root users. But again if those systems are running well, you don’t have to login to them and poke around every five minutes. This just brings more chances for operator error.
5. Avoid change as much as possible
This one might sound controversial but it’s saved me more than once.
I worked at one firm a few years back managing the MySQL servers. The Oracle DBA was going on vacation for a few weeks so I was picking up the reigns for a bit. I met with the DBA for some brain dump sessions, and he outlined the main things that can and do go wrong. He also asked that I avoid any table alterations.
Sure enough ten days into his vacation, a problem arose in the application. One page on the site was failing silently. There was a missing field which needed to be added. I resisted. A fight ensued. Suddenly a lot of money was at stake if this change wasn’t pushed through. I continued to resist. I explained that if such a change were not done correctly, it very likely would break replication, pushing a domino of other things to break and causing an unpredictable mess.
I also knew I only had to hold on for a few more days. The resident dba would be returning and he could juggle the change. You see Oracle was setup to use multi-master replication those changes needed to go through a rather complex process to be applied. Done incorrectly the damage would have taken days to cleanup and caused much more financial damage.
The DBA was very thankful at my resistance and management somewhat magically found a solution to the application & edit problem.
Push back is very important sometimes.
6. Monitor important things
You should monitor your OS syslog and MySQL error log for starters. But also your slow query log for new activity, analyze them and send the reports along to devs. Provide analysis. Monitor your partitions. You don’t ever want disks to fill up. Monitor load average, and have a check that the database login or some other simple transaction can succeed. You can even monitor your backups to make sure they complete without error. Use your judgement to decide what checks satisfy these requirements.
7. Use one or more slaves & checksum
MySQL slave databases are a great way to provide insurance. You can use a lagging slave to provide insurance against operator error, or one of those destructive commands we mentioned above. Have it lag a few hours behind so you’ll have that much insurance. At night this slave may be fresh enough to use for backups.
Also since mysql uses statement based replication, data can get out of sync over time. Those problems may or may not flag errors. So use a tool to compare your master and slave for data consistency. We wrote a howto on using checksums to do just that.
8. Be very careful of automatic failover
Automation is wonderful when it works. We dream of a data center that works like clockwork, with robots that never sleep. We can work towards this ideal, and in some cases get close. But it’s important to also understand that failure is by nature *not* what we predicted. The myriad ways that complex systems can fail boggles the mind, and surprises even seasoned veterans of operations. So maintain a heathy suspicion of this type of automation. Understand that if you automate things to happen in this crucial time, you can potentially put yourself in an even *more* compromised position than simply failing.
Sometimes monitoring, alerting, and manual intervention are the more prudent path. Your mileage may vary of course.
9. Be paranoid
It takes many years of doing ops to realize you can never be paranoid enough. Already checked that you’re on the right host, and about to execute some command? Quit the shell prompt and check again. Go back and ask the team if that table really needs to be dropped. Try to rephrase what you’re about to do in different words. Email out again to the team and wait some time before you pull the trigger. Check one more time that you have a fresh backup.
Delay that destructive command as long as you possibly can.
10. Keep it simple
I know I know, we all want to use that new command or tool, or jump on the latest hardware and take it for a spin. We want to build beautiful architectures that perform great feats of magic. But the fewer moving parts, the less things that can go wrong. And in ops, your job is stability and availability. Can you avoid using multi-master replication and go with just basic master-slave replication in MySQL? That’s simpler. Can you have fewer schemas or fewer filter rules? Can you skip the complicated HA layer, and use monitoring and manual failover?
Made it this far? Grab our newsletter.