Originally authored by Jonathan Owens
A few weeks ago, we switched the New Relic website to run on Ruby 1.9.3. This was an enormous project spanning many months and required the effort of nearly every engineer in the company. But the results were excellent – improved speed, reduced memory usage and an infrastructure ready for future Ruby versions.
Moving such a large and long-lived Rails 2.3 application as New Relic required a very careful and thorough approach. Nearly every aspect of how our site is tested, deployed and run was affected. We learned a lot during the process and want to share the most important lessons.
It’s Not Just Debt
On a codebase as large as New Relic, the engineering time required to upgrade to 1.9 was large enough to treat as a serious feature. As we later learned, the performance improvements were significant enough to take it very seriously indeed.
Work to get the New Relic code ready for 1.9 began years ago, before I joined the company. Taking it from exploratory to production ready took eight months of calendar time, distributed among several engineers for different upgrade tasks.
When you get started with such a migration, it’s important to assign at least one engineer the task of chasing down every dependency and weird bug until the upgrade has shipped.
Use a Ruby Version Management Tool
Using a Ruby version management tool helped us in all stages of the upgrade. Both RVM and rbenv are excellent tools for managing and installing Ruby versions all the way from the laptop to the server. We went with rbenv on our servers for its simple installation and used the puppet-rbenv module to set it up. It’s worth deciding which tool to use early in the process and getting everyone on the team comfortable with it.
Once you’ve picked one, you can use it to do cross version testing on laptops, set up multi-version build configurations on your CI server, and easily change patchlevels or major versions on your production servers.
During our upgrade process, there was a long season where the codebase had to be cross compatible between 1.8 and 1.9. By making version switching easy, we ensured that it wasn’t too huge a burden.
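As an illustration of what cross-compatible code can look like, here is a minimal sketch. The module and method names are hypothetical and not taken from the New Relic codebase. It shows two common approaches: an explicit version check, and feature detection with respond_to?, which is often preferable because it tests for the behavior rather than the version number.

```ruby
# Hypothetical compatibility helpers; the names are illustrative only.
module RubyCompat
  # Returns true for Ruby 1.9 and later, false for 1.8.x.
  # The version argument defaults to the running interpreter.
  def self.ruby19_or_later?(version = RUBY_VERSION)
    major, minor = version.split('.').map { |part| part.to_i }
    major > 1 || (major == 1 && minor >= 9)
  end

  # Feature detection: String#force_encoding only exists on 1.9+,
  # where strings carry an encoding. On 1.8 the text is returned as-is.
  def self.utf8(text)
    if text.respond_to?(:force_encoding)
      text.dup.force_encoding('UTF-8')
    else
      text
    end
  end
end
```

Keeping checks like these in one place makes them easy to find and delete once the 1.8 code path is no longer needed.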
Make Your Test Server Do Most of the Work
We use Jenkins to build our code every time a push occurs. So to get started, we simply added a build that ran our tests in Ruby 1.9. With the first build we had over 200 failing tests. But that gave us a target. We held some bug bashes to help drive down the failures, then made fixing the ones that remained a sustaining task like any other. This wasn’t a fast process, but it was sustainable. And it allowed the upgrade to fit into our existing bug tracking and test process.
But Tests Aren’t Everything
It was a very exciting day when our 1.9 test job went green. The first thing I did was switch my laptop to 1.9 and try to run the site. It didn’t work at all. Whoops!
Turns out there’s a lot more to running the code than just the tests. We had lots of development-mode-only code that set everything up to run on a laptop, none of which was exercised by our CI tasks. This meant several more days of chasing down errors we had no idea existed.
Partial, Reversible Deploys Are Essential
We had several preproduction environments in which to test our 1.9 performance. But none of them receive even a fraction of the traffic our production site does, nor do they have even a fraction of the dataset to work with. So when the time came to deploy the upgrade, we decided to do one server at a time to see how they fared.
We quickly discovered two things. First, 1.9 was performing about 80% slower than 1.8. And second, our load balancers didn’t think this was a problem and gave it just as much traffic as the other servers. Then things started to get ugly.
We scrambled to fix the load balancer by switching from round robin to least-connections as our balance strategy. This reduced the load on the now poorly performing server so we could troubleshoot the performance problem.
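To see why the balancing strategy matters, here is a toy model of the two strategies. It is only a sketch: the Backend struct, the class names and the connection counts are assumptions made for illustration, and real load balancers track connections themselves.

```ruby
# Hypothetical model of a backend server and its open connections.
Backend = Struct.new(:name, :active_connections)

# Round robin hands out servers in a fixed rotation,
# regardless of how busy each one is.
class RoundRobin
  def initialize(backends)
    @backends = backends
    @next_index = 0
  end

  def pick
    backend = @backends[@next_index]
    @next_index = (@next_index + 1) % @backends.size
    backend
  end
end

# Least-connections picks whichever server currently has
# the fewest open connections.
class LeastConnections
  def initialize(backends)
    @backends = backends
  end

  def pick
    @backends.min_by { |backend| backend.active_connections }
  end
end
```

A slow server holds each connection open for longer, so its count of active connections climbs. Given a fast backend with 2 open connections and a slow one with 40, RoundRobin keeps alternating between them, while LeastConnections keeps choosing the fast one until the counts even out.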
After many hurried code changes, we discovered that our own Ruby agent had a poorly performing garbage collection instrumentation strategy under 1.9, which we patched right away and later released as version 3.5. With the patched agent, 1.9 went from 80% slower to 30% faster. High fives were had all around.
We would have got that 30% improvement much more quickly if we had actually done the fire drill of taking a server out of rotation before introducing the Ruby version change.
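If you want a rough idea of how much time garbage collection takes around a piece of code on 1.9, the standard library's GC::Profiler can help. The following is a minimal sketch with a hypothetical helper name. It is not how the New Relic agent instruments garbage collection, and the profiler adds overhead of its own, so treat it as a diagnostic aid and not something to leave enabled in production.

```ruby
# Hypothetical helper: runs a block and reports GC activity during it.
# Requires Ruby 1.9 or later, where GC::Profiler and GC.count exist.
def with_gc_stats
  GC::Profiler.enable
  GC::Profiler.clear
  runs_before = GC.count
  result = yield
  stats = {
    :gc_runs    => GC.count - runs_before,     # collections during the block
    :gc_seconds => GC::Profiler.total_time     # time spent in GC, in seconds
  }
  [result, stats]
ensure
  # Always turn the profiler back off, even if the block raises.
  GC::Profiler.disable
end

result, stats = with_gc_stats { Array.new(1000) { |i| i.to_s }.size }
```

Comparing numbers like these between Ruby versions, on one server that has been taken out of rotation, is a cheaper way to spot a regression than discovering it under production load.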
It Really Works
We have a bias for measurement here, and when you make a big change such as this one, having some charts on your side can be a tremendous help. Especially when you’re trying to make the project about more than just debt, load charts that go down and throughput charts that go up are a real asset. In our case, 1.9 was so much faster that it was like getting a free web server.
This machine is delivering more traffic with less CPU:
We can look back now and see that this was an upgrade that delivered real user happiness. We can reliably serve pages to our users in less than two seconds, any time of the day.
The switch to Ruby 1.9.3 represented a major feature upgrade to New Relic. While it was an enormous project, it has improved every aspect of how our code is run and managed. We hope our lessons learned help you achieve the results you’re looking for.