Originally Written by Tom Diederich
With 1.28 billion active users, Facebook MySQL database engineers are active and extremely valuable contributors to the global MySQL community. So naturally they are also active participants of Percona Live MySQL conferences! And next week’s Percona Live London 2014 (Nov. 3-4) is no exception. (Register now and use the promotional code “Facebook” to save £30!)
I spoke with Facebook database engineers Yoshinori “Yoshi” Matsunobu and Shlomo Priymak about their upcoming sessions along with what’s new at Facebook since our last conversation back in April.
Tom: Yoshi, last year Facebook deployed MySQL 5.6 on all production environments – what have you and your team learned since doing that? And do you have a few best practices you could share? I realize you’ll be going into detail during your session in London (MySQL 5.6 and WebScaleSQL at Facebook), but maybe a few words on a couple of the bigger ones?
Yoshi: MySQL 5.6 has excellent replication enhancements to use in large-scale deployments. For example, crash safe slave makes it possible to recover without rebuilding a slave instance on server crash. This can greatly minimize slave downtime, especially if your database size is large. There are many other new features such as GTID, multi-threaded slave, streaming mysqlbinlog and we actively use them in production.
For InnoDB, Online DDL is a good example to ease operations. Many MySQL users are doing schema changes by switching masters. This can minimize downtime but requires operational efforts. Online DDL made things much easier.
Tom: Facebook is an active and extremely valuable part of the overall MySQL community and ecosystem – what are some of the key features and improvements you’ve contributed in the past year since moving to MySQL 5.6?
Yoshi: For InnoDB, I think online defragmentation and faster full table scan are the most valuable contributions from Facebook in 5.6. I have received very positive feedback about faster InnoDB full table scan (Logical ReadAhead). My colleague Rongrong will speak about something interesting regarding online defragmentation at Percona Live London. For Replication, we have done many optimizations to make GTID and MTS work without pain. Semi-Synchronous mysqlbinlog and backported Loss-Less semisync from MySQL 5.7 are very useful when you use Semi-Synchronous replication.
Tom: Shlomo, your sesson, “MySQL Automation at Facebook Scale,” will be of great interest to DBAs at large and growing organizations considering that Facebook has one of the world’s largest MySQL database clusters. What are the two or three most significant things that you’ve learned as a database engineer operating a cluster of this size? And has anything surprised you along the way (so far)?
Shlomo: This is a great question! We like to speak of “10x” at Facebook when thinking of scaling. For example, what would you do differently if the number of servers you had was 10x more than what it is? This type of mental exercise is surprisingly useful when working with systems at scale. If you, or any of the readers, try to extrapolate this about systems you manage, there will be things you’ll be imagining about how a system like this would be – and you won’t be too far from our reality in many aspects.
You’d imagine that we automate much of the single units of work, like master/slave failover, upgrades and schema changes. You’d suspect we have automated fault detection, self managing systems, good alarming and self remediation. You’d presume that if you’re used to running a command on 100 machines, you’ll now be running it on 1000. At least that’s what I thought to myself, so these are not the things that surprised me. There are a few fundamental shifts in one’s thinking when you get to these sizes, which I didn’t foresee.
The first one is that there is absolutely no such thing as “one-off.” If there is a server somewhere that hits a problem every three years, and you have 1000 servers, this will be happening daily! Take it to 10,000 servers, and you can see absolutely nothing is a “one-off”. We can’t write things off as “worst case, I’ll get an SMS.” Whatever it is, we have to chase it down and fix it. Not just that – to deploy a fix at scale can require writing fairly large amounts of code, a fix that could be deployed manually by a DBA in smaller environments.
The second one is adapting to constraints which are very pragmatic and tangible. If you’re on AWS, you’re pretty much isolated from things like worrying where your servers are physically located, when they go over their lifetime, and if the firmware on the switch in the rack needs to be upgraded. If you’re a small shop and have a few racks up in a co-lo, hardware maintenance is just not as frequent, but it becomes more painful as you grow.
At Facebook, we run our own datacenters! We need to work around interesting challenges, such as running datacenters that have highly variable compositions of server hardware. Since we have so many servers, something is always going on. Racks of servers need to be moved. Whole clusters need to be rebuilt or refreshed, to be made better, faster, stronger. New datacenters are constructed, others decommissioned.
Tom: And this is where automation comes into the picture, right?
Shlomo: We have had to build a lot of automation to make these operations seamless, and we work closely with the Site Ops teams on the ground to coordinate these logistically complicated processes.
Another thing my team does in this space is planning capacity and hardware purchases. Since we build our own servers, the turnaround time between ordering and getting machines is quite long, so proper planning is paramount. Buy too much, and you’ve wasted millions of dollars. Buy too few servers, and there won’t be space for user growth and upcoming projects. The sheer scale makes these decisions more complicated and involved.
These things have actually made my job much more interesting, and I think I’d find it hard to adjust to a smaller environment.
Tom: Last April Facebook announced a move to the newly created WebScaleSQL. Yoshi, do you have an update on where WebScaleSQL is today? And I know it’s early, but has there been any impact on Facebook yet?
Yoshi: WebscaleSQL is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale. Collaboration is nothing new to the MySQL community. The intent is to make this collaboration more efficient.
We are based on the latest upstream (currently MySQL-5.6.21), and added many features. We added patches to improve InnoDB performance around compression LRU flushing, locking, NUMA Support, and doublewrite. We statically link Semi-Sync based on lessons learned at Facebook environments (plugin-lock caused hot mutex contentions). We have many upcoming features such as async clients.
We will continue to track the upstream branch that is the latest, production-ready release (currently MySQL 5.6). We are continuing to push the generally useful changes we have from all of the participants. If you think you have something to contribute, get in touch!
Tom: I remember being surprised earlier this year when you told me there was usually just one MySQL Operations team member on call at any given time thanks to “robots.” How many robots did your team build and what do they do? Oh, and should rank-and-file DBAs around the world be worried about losing their day jobs to these robots?
Shlomo: Instead of becoming obsolete as some fear, our team is shifting its focus from smaller to larger problems, as we rise higher in the levels of abstraction. Our team has progressed with the requirements of the role. From being a team of DBAs that automate some of their work, we have become more like Production Engineers. We design, write and maintain MySQL/Facebook-specific automation that does our work for us.
While we build these software “robots” to do our work, we also have to maintain them. The job of the oncall is to fix these robots when they malfunction, and that can sometimes be difficult due to the size of our codebase.
In regards to employment concerns, I’d say our work has become more interesting, and the amount has increased. It definitely did not decrease, so if Facebook is indicative of other companies, jobs are not at risk just yet. Speaking of jobs – if what we’re doing sounds interesting, we’re hiring!
Oh, and as for details about these “robots” – that’s the topic of my talk next week in London. Come and hear me speak if you want to know more!
Tom: Yoshi, you also will host a session titled “Fast Master Failover without Data Loss.” I don’t want to give too much away, but how did you get failover to work at scale – across vast datacenters?
Yoshi: Master failure is a norm at Facebook, because of the large amount of servers. Without automation, it is not realistic for a limited number of people to manage. We have a very interesting infrastructure to automate failure handling at Facebook scale. To automate stuff, reliability is important. Unreliable automation makes engineers spend lots of time fixing things manually, and that increases downtime. It is also important to define what to automate and what we shouldn’t automate. Define failure scenarios and write good test cases and continuously integrate. There are multiple failure scenarios like the ones below and you’ll hear about each in detail at my session:
– mysqld crash
– mysqld stalls
– kernel panic and reboot
– error spikes caused by H/W failure
– error spikes caused by bad application logic
– rack switch down
– multiple rack switches down
– datacenter down
Tom: What other sessions, keynotes or events are you looking forward to at Percona Live London 2014? And are you guys planning on attending the MySQL Community Dinner?
Yoshi: “MySQL 5.7: Performance and Scalability Benchmark” (led by Oracle MySQL performance architect Dimitri Kravtchuk). And yes, we’re looking forward to meeting with people at MySQL Community Dinner!
Tom: Thanks again Yoshi and Shlomo for taking the time to speak with me and I look forward to seeing you both in London next week!
And readers, I invite you to register now for Percona Live London using the promotional code “Facebook” to save £30. I also hope to see you at the MySQL Community Dinner next Monday (Nov. 3). Space is limited so be sure to reserve your spot now and join us aboard our private double-decker bus to the restaurant.
I’d also like to thank the Percona Live London 2014 Conference Committee for putting together a terrific event this year! The conference committee includes:
- Dailymotion’s Cédric Peintre, conference chairman
- Percona’s David Busby
- MariaDB’s Colin Charles
- ebay Classifieds Group’s Luis Motta Campos
- Booking.com’s Nicolai Plum
- Oracle’s Morgan Tocker
- Spil Games’ Art van Scheppingen