Data Dictionaries and the Big Data Lifeline
Attempting to comprehend big data sets without a big data dictionary is no easy task, and you'll likely end up with a headache.
Data dictionaries. Sounds like a blast from the past, right? Wrong. This simple, long-standing tool is more relevant today than it has ever been. Working with data people every day, I know what it takes for our analysts and data engineers to work through the data and make it easier to analyze (by a commonly cited estimate, about 80% of their time goes to preparing and managing data for analysis). And as more data is collected and stored, data dictionaries will become a much-needed lifeline in this ever-growing sea of data.
So what are data dictionaries, and how can they help us with all this complex data? A data dictionary, as defined by the UC Merced Library, is a "collection of names, definitions, and attributes about elements that are being used or captured in a database." Essentially, it is a communications tool that defines the critical information in a business-focused way, typically displayed in a spreadsheet format.
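To make the definition concrete, here is a minimal sketch of a data dictionary expressed in Python rather than a spreadsheet. The `FieldEntry` class, the `lookup` helper, and the customer-survey columns are all hypothetical, invented for illustration; a real data dictionary would list whatever fields its dataset actually contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEntry:
    """One row of a data dictionary: a column and what it means."""
    name: str                   # column name as it appears in the dataset
    definition: str             # plain-language, business-focused meaning
    dtype: str                  # expected type, e.g. "int" or "str"
    allowed_values: tuple = ()  # valid values, if the field is coded
    notes: str = ""             # units, source, caveats


# Hypothetical entries for an illustrative customer-survey dataset.
SURVEY_DICTIONARY = [
    FieldEntry("respondent_id", "Unique ID assigned to each respondent", "int"),
    FieldEntry("region", "Sales region of the respondent", "str",
               allowed_values=("EMEA", "APAC", "AMER")),
    FieldEntry("satisfaction", "Overall satisfaction score", "int",
               allowed_values=(1, 2, 3, 4, 5),
               notes="1 = very satisfied, 5 = very dissatisfied"),
]


def lookup(dictionary, column):
    """Return the entry for a column, or None if it is undocumented."""
    for entry in dictionary:
        if entry.name == column:
            return entry
    return None


entry = lookup(SURVEY_DICTIONARY, "satisfaction")
print(f"{entry.name}: {entry.definition} ({entry.notes})")
```

Each entry carries the same three things the definition mentions: a name, a definition, and attributes such as type, allowed values, and notes.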
In the past, data dictionaries were used primarily by database developers, data scientists, and administrators — the people who were building the infrastructure that supports analysis. Now, however, usage has shifted: data collectors, analysts, and business users have also recognized the value of data dictionaries.
Data dictionaries help establish what's in a dataset and where it initially came from, without having to download and search through the whole thing first. In other words, you can tell if the dataset is relevant to your analysis.
Without a data dictionary, you'll likely waste time sifting through tons of data to dig out what you need. You'll also struggle to identify what your data contains or to spot problems such as duplicate records and inconsistencies right away, and combing through it all by yourself costs even more of your valuable time.
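As a rough illustration of how a data dictionary helps surface such problems early, the sketch below checks a handful of rows against a dictionary of allowed values. The dictionary contents, the sample rows, and the `check_rows` function are hypothetical, and the checks are deliberately simple: exact duplicates, undocumented columns, and values outside the documented set.

```python
# Hypothetical dictionary: column name -> allowed values (None = any value).
DICTIONARY = {
    "respondent_id": None,
    "region": {"EMEA", "APAC", "AMER"},
    "satisfaction": {1, 2, 3, 4, 5},
}

# Hypothetical sample rows, as they might come out of a CSV reader.
ROWS = [
    {"respondent_id": 1, "region": "EMEA", "satisfaction": 4},
    {"respondent_id": 2, "region": "emea", "satisfaction": 2},  # inconsistent casing
    {"respondent_id": 1, "region": "EMEA", "satisfaction": 4},  # exact duplicate
    {"respondent_id": 3, "region": "APAC", "satisfaction": 9, "misc": "x"},
]


def check_rows(rows, dictionary):
    """Return a list of (row_index, problem) tuples."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Column names are unique, so sorting the items orders by name only.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
        for column, value in row.items():
            if column not in dictionary:
                problems.append((i, f"undocumented column: {column}"))
                continue
            allowed = dictionary[column]
            if allowed is not None and value not in allowed:
                problems.append((i, f"unexpected value in {column}: {value!r}"))
    return problems


for index, problem in check_rows(ROWS, DICTIONARY):
    print(f"row {index}: {problem}")
```

None of these checks is possible without something that states what each column is supposed to hold, which is exactly the role the data dictionary plays.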
Curious, I sat down with Zoe Haimovitch, our Head of Content (who also moonlights as our Chief Analyst for these reports), to see if she actually uses the data dictionary files that come with datasets. And sure enough, this is what she had to say:
Me: Hey Zoe, I opened the dataset on baby names from Kaggle, and I saw a zip file named Data_Dictionary_PBN_Final.xlsx. Did you use that at all?
Zoe: Absolutely. It's the first thing I looked at. I learned the hard way that data dictionaries are the holy grail for any dataset. When a dataset has a good data dictionary, I review it to understand the data I will be downloading before I start any analysis.
Data analysts who assume what the data represents (like 1=low and 5=high when it is the other way around) are going to make big mistakes that will cost them time, money, and their professional reputation.
It's bad practice to make assumptions based only on the column names, without any explanation of what is behind them. You need to be precise with the data and understand its meaning. It could really be anything.
One time I was analyzing data from different networks for a cellular company, and I didn't notice that the data was taken from different time zones. The insights on browsing behavior seemed very strange, and after days of crunching the wrong numbers, I had to go back and re-align all the data. That information was clearly stated in the data dictionary, which could have saved me lots of time (and a few headaches) had I only looked.
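The two pitfalls Zoe describes, a reversed scale and mixed time zones, can be sketched in a few lines. Everything here is an assumption made for illustration: the network names, the fixed UTC offsets, the rating labels, and the `to_utc` helper stand in for whatever a real data dictionary would document.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical notes taken from a data dictionary: each network logs local
# time at a fixed UTC offset (hours), and the rating scale runs high-to-low.
UTC_OFFSET_HOURS = {"network_a": 2, "network_b": -5}
RATING_LABELS = {1: "high", 2: "medium-high", 3: "medium", 4: "medium-low", 5: "low"}


def to_utc(local_timestamp, network):
    """Interpret a naive ISO timestamp using the network's documented offset."""
    tz = timezone(timedelta(hours=UTC_OFFSET_HOURS[network]))
    local = datetime.fromisoformat(local_timestamp).replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Two events that look seven hours apart in the raw data...
a = to_utc("2024-03-01T09:00:00", "network_a")
b = to_utc("2024-03-01T02:00:00", "network_b")

# ...land on the same instant once both are aligned to UTC.
print(a.isoformat(), b.isoformat())
print(a == b)

# A rating of 1 means "high" here, the opposite of the common assumption.
print(RATING_LABELS[1])
```

Fixed offsets keep the sketch simple. Real network logs would more likely need named time zones with daylight saving rules, which the data dictionary would also have to spell out.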
Me: What's it like if you don't have a data dictionary?
Zoe: You feel somewhat blindfolded. You feel much less confident in the data because you have to guess what the data means. I feel less secure analyzing data without a dictionary because I want to make sure I understand the numbers in front of me. Sometimes I even try to reach out to the person who uploaded the data and ask them questions. Having a data dictionary would save lots of time, energy, and guesswork.
Me: Have you ever created your own data dictionary?
Zoe: I always include explanations of the data for any analysis I do. I consider it a good investment of my time as there are always other people looking at the data and you need to make sure the right meaning is conveyed. For instance, when I ask a dashboard builder and designer to build a dashboard from the analysis, I always include my new data dictionary with it. How else are they supposed to display the data correctly if they don't understand what it represents?
Me: Are data dictionaries here to stay?
Zoe: You bet ya! Take embedded analytics, for example. Now that so many SaaS companies are sharing data with their customers through embedded analytics, they will also need to share the meaning of this data. That's where data dictionaries can come in. They're a simple, inexpensive way to define data standards and consistency across multiple projects, teams, and, now, customers. As a person who analyzes data all the time, you want to know the source of truth for this data, and data dictionaries are a short synopsis of this.
It's Not the End, Just a Beginning
Here at Sisense, we are always proclaiming the benefits of a single source of truth. Data dictionaries bring us one step closer to this by ensuring consistent usage of data elements — no matter who is looking at the data. And, just like Zoe, once you begin to work with datasets that have data dictionaries, you won't know how you lived so long without them.
Published at DZone with permission of Dana Liberty, DZone MVB.