It happens to us all, you receive the new A/B testing results and need to verify the data. Or you are turning your latest funnel analysis into a data application so you can continually reap the benefits of your hard work. As you scrutinize your work, you notice that the numbers aren’t quite adding up. Data verification is a part of life for anyone who works closely with data. It also seems analogous to tracking down and debugging code. Both can lead to frustration and seemingly lost work hours. Using real-world examples, I’ll give some of my tips and tricks so that you can quickly identify errors in your data analyses.
Don’t Assume Anything
Just because it seems right, doesn’t mean it is right. Our human brains like to trick us. I’ve noticed this line of thinking happens when analysts are rerunning or productizing an analysis. Even though, initially the query or script looks the same, a deeper investigation proves otherwise.
Let’s take a common issue I’ve seen people run into: changing an aggregation query.
Take the following two queries:
SELECT Month, Group1, Group2, Group3, CONCAT(Group1, “-”, Group2) as NewGroup, SUM(Usage) as total_usage FROM usage GROUP BY 1, 2, 3, 4, 5
SELECT Month, CONCAT(Group1, “-”, Group2) as NewGroup, SUM(Usage) as total_usage FROM usage GROUP BY 1, 2
Initially, many would look at these two queries and say that they are effectively the same. The left query is just including a few extra columns, right? But that’s not at all what’s going on. There are five levels of aggregation in the query on the left and there are only two on the right. The left query will return much smaller totals since the groups are more refined. Depending on what you are doing further down the line in your analysis, like window functions or even filtering, those extra groups could cause havoc. If you just drop them in future queries in the pipeline, you no longer have distinct groupings.
Aggregation mistakes are some of the most common causes of data errors. Even if it looks correct to you, do a double take.
It’s a Snap(shot)
Which leads me to point out another common data errors that I have encountered over the last four years, both as an analyst and as a teacher: the snapshot table.These are data tables that on a given time period (monthly, weekly, daily), take a digital snapshot in time.
For whatever reason, these types of tables trip many people up. First, they tend to be very poorly identified. What I mean is that for a new consumer of that table, it is not immediately identifiable as a snapshot table, causing the user to mishandle the data. An easy solution is prevention such as naming the table to reflect its internal structure.
What if you suspect a snapshot table and how to do you work with them? The biggest identifier that you may be working with in a snapshot table is that all your metrics are overly inflated. Did you pull a week’s worth of days and got results that look about 5-7x too large? Luckily, it’s a simple fix to work with these tables. You can either narrow down to a single day, like the last day in your time period, or take the max value. See below for examples:
Selecting a single day:
SELECT TD_TIME_FORMAT(time, ‘yyyy-MM’) as MONTH, category, usage FROM usage_snapshot WHERE TD_TIME_RANGE(time, ‘2016-04-01’)
Finding the max:
SELECT TD_TIME_FORMAT(time, ‘yyyy-MM’) as MONTH, category, MAX(usage) as total_max_usase FROM usage_snapshot
Being consistent with how you decide to work with snapshot tables is key. Depending on the context and goal, either method of dealing with them is valid.
Look for Patterns
When investigating an issue with data validation, I find it useful to try to find patterns in what is going wrong. Questions like:
- Is all the data affected?
- Is the affected data all from the same groups?
- Are the differences proportional to each other or do they look random?
- Are there any patterns around dates?
Help you narrow down a potential cause. If all of the data is affected, I know that tends to mean that the culprit is usually in the script or query, rather than in the data itself. However, if I notice a certain month or day is noticeably lower, I would investigate the underlying data. It could mean a problem with data collection occurred during that time period.
If the data you are verifying tends to be proportionally off compared to the original data, it could mean that some data is continually not being captured in your aggregations. Fundamental logic errors tend to present as “random”, meaning there seems to be no discernable pattern. However, this happens due to edge cases not being correctly dealt with or simply, faulty logic.
Start at the Beginning With a Fine-Tooth Comb
You’ve tried everything else and you still can’t figure it out. Now, it’s time to start digging. While it is tempting to start with the wrong answers and work backwards or start at the place you think it’s all going wrong, you need to start from the beginning.
Errors in data can begin benignly, but as you work your way through the analysis, the problem gets exponentially worse. It’s like a math problem: start at the beginning, find where it goes wrong. This part can be tedious and time-consuming, but after exhausting other techniques, going back to the beginning and watch how your data is changing from step to step will help show the point at which the query broke at what point the query broke.
Do have additional tips and tricks that you’ve learned to ease data verification pains? Share them here in the comments below.