Beer, Bravado, and Bitbucket: Using Data to Drive Code Decisions
The core message is this: use the power beer gave us to make data-driven code decisions.
Product teams and marketing teams continually use data to drive decisions. But what about us as software engineers? What can we do with metrics, besides just pulling them in for somebody else?
“You mentioned beer. I was led to believe there would be beer.”
People like Guinness… a lot. By 1914, their annual output was almost a billion pints.
That’s going to need a lot of quality control. So, clever Mr. Claude Guinness decided to hire smart graduates to do this work, sort of like tech companies do now.
Let's talk about Billy — or William Sealy Gosset to his less-intimate friends — “An energetic — if slightly loony — 23 year-old scientist.”
To help maintain quality over a billion pints, our friend Billy invented a way to answer important questions by taking small samples of beer and using advanced math, nowadays known as statistics. The methods invented by our clever Billy became the fundamental basis for how doctors decide if a new drug will save you (or kill you), how Facebook decides how to manipulate you, and a ton of other out-of-this-world uses.
What Does This Have to Do With Code?
Bitbucket (and the rest of the company) has a big push on performance. On the frontend, we’re making a number of changes to boost performance of the Pull Request page. In particular, we want to speed up rendering the code diffs.
Rendering the diff uses a Chunks component, which renders a great many separate CodeLine components. I wondered, could we speed this up by merging the CodeLine markup directly into Chunks?
But that reduces modularity and, possibly, maintainability. Since there’s a cost, I want to know if there’s a real benefit to rendering time. But how do I tell? All I have so far is an idea.
Sure, I could load the Pull Request page a couple times before and after the change. But that’s hardly a rigorous check, for many, many reasons. And it’s going to be hard to tell if the improvement is small (but real).
Now, Tell Me More
I needed something better than refreshing my browser a couple times. So, I used Lighthouse, an awesome tool for measuring frontend performance and collecting metrics. And it’s now available for the command line.
I wrote a ‘custom audit’ that let me measure diff rendering times for the Pull Request page and a batch tool for executing multiple runs. Hooray for reliable frontend metric gathering!
In this instance, the audit measures our diff rendering time. This is shared for everyone’s enjoyment (everyone loves code, right?), but bear in mind that this was written as an internal dev tool. Production code would have things like nice error handling. Also note that this is for v3.2.1 of Lighthouse.
First, the code you are measuring should mark and measure User Timing events for Lighthouse to pick up:
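As a minimal sketch of what marking and measuring can look like (not the original snippet), the example below wraps a stand-in render function in User Timing marks. The `measureRender` helper, the `renderDiff` stand-in, and the `diff-render` label are all hypothetical names chosen for illustration. In the browser, `performance` is a global; the `perf_hooks` import is only there so the sketch runs under Node.

```javascript
// Node exposes the User Timing API through perf_hooks; in a browser the
// global `performance` object offers the same mark/measure calls.
const { performance } = require('perf_hooks');

// Hypothetical helper: wraps a render function in start/end marks and
// records a named measure between them.
function measureRender(label, render) {
  performance.mark(`${label}:start`);
  const result = render();
  performance.mark(`${label}:end`);
  performance.measure(label, `${label}:start`, `${label}:end`);
  return result;
}

// Stand-in for the real diff rendering work (assumption for illustration).
function renderDiff() {
  let lines = 0;
  for (let i = 0; i < 100000; i++) lines += 1;
  return lines;
}

const rendered = measureRender('diff-render', renderDiff);
const [entry] = performance.getEntriesByName('diff-render');
console.log(`${entry.name}: ${entry.duration.toFixed(2)} ms`);
```

Using one label for both the marks and the measure makes the event easy to find by name later.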
Next, the custom audit.
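The sketch below shows roughly what a Lighthouse v3-style custom audit looks like. It is not the original audit. It assumes a hypothetical `DiffRenderTime` artifact (in milliseconds) supplied by a custom gatherer, and the `MAX_RENDER_MS` threshold is made up. A real audit would extend the `Audit` base class that Lighthouse exports; a plain class is used here so the example runs without Lighthouse installed.

```javascript
// Assumption: in a real setup this class would extend the Audit base class
// from Lighthouse. It is a plain class here so the sketch runs standalone.
const MAX_RENDER_MS = 1000; // hypothetical pass/fail threshold

class DiffRenderAudit {
  static get meta() {
    return {
      id: 'diff-render-audit',
      title: 'Diff rendered within the time budget',
      failureTitle: 'Diff rendering exceeded the time budget',
      description: 'Time taken to render the pull request diff.',
      // Hypothetical artifact, assumed to come from a custom gatherer.
      requiredArtifacts: ['DiffRenderTime'],
    };
  }

  // Receives the gathered artifacts and returns the raw timing plus a
  // pass/fail score (1 when under the threshold, 0 otherwise).
  static audit(artifacts) {
    const renderTime = artifacts.DiffRenderTime;
    return {
      rawValue: renderTime,
      score: Number(renderTime <= MAX_RENDER_MS),
    };
  }
}

module.exports = DiffRenderAudit;
```

For the experiment described here, the raw value is what matters; the score is only there because audits are expected to report one.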
Create a config file for Lighthouse that runs the above audit.
Run the above audit once.
Next, use an example node script to batch run the Lighthouse audit.
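A batch runner can be quite small. The sketch below is an assumption about its general shape rather than the original script: `runBatch` and `fakeRun` are hypothetical names, and `fakeRun` stands in for whatever actually launches Lighthouse and reads the audit value from the report.

```javascript
// Runs an async measurement function `runs` times, one after another,
// and resolves with the collected values.
async function runBatch(runOnce, runs) {
  const results = [];
  for (let i = 0; i < runs; i++) {
    // Sequential on purpose: parallel runs compete for CPU and skew timings.
    results.push(await runOnce(i));
  }
  return results;
}

// Placeholder measurement (assumption). A real version would launch
// Lighthouse and pull the audit's value out of its report.
async function fakeRun(i) {
  return 100 + i;
}

const batch = runBatch(fakeRun, 5);
batch.then((times) => console.log(times.join(', ')));
// prints "100, 101, 102, 103, 104"
```

Running the measurements in sequence keeps each run from fighting the others for resources, which would add noise to the very numbers being collected.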
Finally, run the batch script.
Running my new custom Lighthouse audit gave me these two sets of numbers (rendering times in milliseconds):
So, the average time is lower by ~5 percent. Hooray!! All is well!
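For reference, the comparison of averages is simple arithmetic. The sketch below uses made-up numbers chosen only to illustrate a roughly 5 percent drop; they are not the measurements from this experiment.

```javascript
// Arithmetic mean of an array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Percent change in the mean from `before` to `after`.
// A negative result means the "after" runs were faster on average.
function percentChange(before, after) {
  return ((mean(after) - mean(before)) / mean(before)) * 100;
}

// Made-up render times in milliseconds, for illustration only.
const beforeMs = [200, 210, 190, 205, 195];
const afterMs = [190, 195, 185, 192, 188];

const change = percentChange(beforeMs, afterMs);
console.log(`Mean changed by ${change.toFixed(1)}%`);
```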
“But not so fast… the average isn’t a lot lower, and the rendering times are all a little different. How do you know this shows a real improvement? And also, get back to the beer.”
You got me! And well spotted! This is the core of the question! The rendering times are all a bit different. I need to be confident the improved rendering time isn’t just a random quirk of this sample.
Luckily, our friend Billy the beer brewer gave us a tool called the t-test. This is how Billy worked out which supply of hops would give him the best beer!
It allowed me to ask: given the amount of ‘noise’ in my example rendering times, is this difference between the averages likely to be a real difference or just a random fluctuation? The phrasing you often hear is: is it significant?
Using Billy’s tool, I get these two values:
t = 10.5457, p < .0001
This tells me that, if the change made no real difference, the chance of seeing a gap this large through random variation alone is less than one in ten thousand. I can conclude that the code change gives a speed improvement that averages only ~5 percent, but it’s highly likely to be a real 5 percent**.
** Caveat: I’m glossing over a LOT of details around good experimental design. This is just a highly simplified example! And FYI, results in the real, non-dev world (medicine, etc.) are never as neat as in this example. They show a lot more noise.
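For the curious, here is a minimal sketch of what a two-sample t-test computes. It uses Welch's variant, which is an assumption on my part; the numbers above may have come from a different variant or from an online calculator. The sample data is made up for illustration. Turning `t` and `df` into a p-value needs the t distribution, which is best left to a statistics library or a lookup table.

```javascript
// Arithmetic mean of an array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(values) {
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0) / (values.length - 1);
}

// Welch's two-sample t statistic and approximate degrees of freedom
// (Welch-Satterthwaite). Does not assume equal variances.
function welchT(a, b) {
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}

// Made-up render times in milliseconds, for illustration only.
const before = [200, 210, 190, 205, 195];
const after = [190, 195, 185, 192, 188];

const result = welchT(before, after);
console.log(`t = ${result.t.toFixed(3)}, df = ${result.df.toFixed(2)}`);
```

The bigger the gap between the means relative to the noise within each sample, the larger `t` gets, and the less plausible it is that the difference is a fluke.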
This post is about opening developers' eyes to the connection between statistics and coding (it’s not meant to be a detailed how-to).
The core message is: use the power beer gave us to make data-driven code decisions.
Many small improvements in performance can add up to big improvements — as long as those small improvements are real and not just random quirks. To make sure they’re real:
- Collect metrics from your own code.
- Run significance tests to see if you’re making a real difference.
“So, can I have a beer now?”
Published at DZone with permission of Christian Doan. See the original article here.