Data migrations: Oh I can do this in an hour. Two tops. Two days later:
— Swizec (@Swizec) August 20, 2015
On Thursday we pushed to production. Three months worth of code went from no users, to thousands of users. Or hundreds. Depends how you count.
My point is, our code went through the ultimate integration test — the real world.
The whole thing was much less intense than expected. We pushed in the middle of the night, ran our migrations, waited 36 minutes for them to run, ate some pizza while waiting for the fires to start, and went home when they didn’t. An uneventful deploy is the best kind of deploy.
Well okay, I say no fires started, but that’s not quite true. The whole engineering team has spent the last couple of days fixing known bugs that slipped through, squashing unknown bugs that users on strange computers in strange lands discovered by clicking something they shouldn’t, manually tweaking the production database to get people out of unforeseen corner cases, and implementing some features we forgot we needed.
Such is life without QA. You forget things.
As David Heinemeier Hansson once said: "It’s not worth fixing if a user doesn’t complain about it."
I kid, we knew hiccups would happen. Hair-on-fire mode lasting just three days after a major deploy is good. Very good.
That moment when workers run while you’re importing data and duplicate ids and your db doesn’t get primary keys and rails gets confused.
— Swizec (@Swizec) August 21, 2015
But the week before was the most frustrating week of my life. I was this close to tearing my hair out, punching my computer in the face, and taking up goat farming. This close. *Does the "this close" finger gesture*
I was doing a test run of the database migrations we accumulated in those three months. Shouldn’t take more than two hours, right? Wrong. It took a week.
Some of our database migrations are also data migrations. Mistake no. 1.
None of our new migrations that we’ve had for three months have ever been run against the full database. Mistake no. 2.
Our whole database is under 500,000 lines. The entire data dump falls well under 200 megabytes. Our database is tiny. What could possibly go wrong?
Oh so very much could.
Most migrations ran in a fraction of a second. Many took a second or two. A few beasts took ten minutes.
Ten whole minutes for a single migration on a tiny database. The mind recoils in horror, unable to understand what could be taking so long.
When we wrote a migration, we tested it on our local machines. 2.3 gigahertz computers with six or eight CPU cores, 16 gigabytes of memory, a solid state drive, and a database that could fit on a floppy drive.
“It was fast when I tried it”. Of course it was. Your computer is a monster running on bare metal. Heroku dynos are paltry “computers” running on many layers of virtualization. They’re meant to zerg rush a problem, not handle a big thing on their own.
You run into problems when your migrations look like this:
This migration adds a
type column to
promotional_tokens, then goes through existing tokens — of which each user has one — and updates them. Students get a new token, everybody else has theirs removed. Because that’s what business wants.
There are 90,000 users in the system.
Let’s play "spot the problem."
.allthat loads the whole table into memory
- We run a DB query for every token to fetch the user
- Then we run a DB query to update the token
- Then we run a DB query to get the token again, but with the correct object type
- Then we run a DB query to move the token reference to
- If the user isn’t a student, we run a destroy query instead
There’s only 200 users who aren’t students. We can ignore those.
For every user we have, this migration hits the database four times. 360,000 database queries. And just to make it more fun, each is made using ActiveRecord instead of raw SQL.
It took 10 minutes to run through the whole dataset on my local machine. It flat out crashed Heroku’s one-off dyno. Out of memory.
And if it didn’t crash, who knows. Our database runs on a separate server. Maybe in the same datacenter, maybe not. Heroku knows, we don’t.
That makes migrations even slower.
That moment when your laptop is so much stronger than Heroku dynos that you decide to migrate data locally.
— Swizec (@Swizec) August 21, 2015
I’m not kidding. I legit considered snapshotting the production database, running migrations locally, then swapping production with the updated snapshot. That could screw things up, especially permissions, but at least I could run it.
We salvaged this migration after many struggles. Like this:
Doesn’t look like much, but an order of magnitude less queries hit the database. The magic of scoping and
200 queries run to delete non-student tokens. Nothing you can do about that. A single query updates the type of all tokens still in the system. And a single query per token, moves the token reference to
Some 90,003 queries in total. Takes a minute or two.
In school they teach you that O(4n+1) is the same as O(n+3). It’s just linear complexity, who cares. Reality cares. 4 times 90,000 is a lot more than 90,000. Especially when you’re talking to a database across TCP/IP.
All in all it took myself and the genius who sits across from me an entire week to get our migrations running in under 20 minutes on a MacBook Pro. We removed a lot of
.alls, reduced many factors of linear complexity, and in one epic instance even turned linear complexity into constant complexity. That was cool.
We solved the rest by using
--size=px. That ran our migrations with Heroku’s beefiest dyno. 2.5 gigs of memory, dedicated machine, and 12x of computing. Whatever that means.
It took 36 minutes to run our migrations on production.
It took 36 seconds to test them in our dev environment.
You should follow me on twitter, here.