Do We Need Data Normalization Anymore?
Data normalization is supposed to be the way you optimize databases to ensure their integrity. But with many of today's databases, normalization may be redundant.
Many different roles in the technology world come into contact with data normalization as a routine part of many projects. Developers, database administrators, domain modelers, business stakeholders, and many more work through the normalization process as naturally as breathing. And yet, can something that seems so integral become obsolete?
As the database landscape becomes more diverse and hardware becomes more powerful, we might wonder if the practice of data normalization is required anymore. Should we be fretting over optimizing data storage and querying so that we return the minimum amount of data? Or if we should, do certain data structures make it more vital to solve those problems than others?
In this article, we will review the process of data normalization and evaluate when this is needed, or if it is still a necessary part of digitally storing and retrieving data.
What Is Data Normalization?
Data normalization is the process of optimizing data structures in a relational database to ensure data integrity and query efficiency. It reduces redundancy and improves accuracy by putting the data through a series of steps to normalize the structure (normal forms). At its core, data normalization helps avoid insert, update, and delete anomalies. These anomalies occur when creating new data, updating existing data, or deleting data, and they make it challenging to keep data values in sync (integrity). We will talk more about this when we step through the normalization process.
The steps entail verifying keys (links to related data), separating unrelated entities into other tables, and inspecting the rows and columns as a unified data object. While the full list of normal forms is rather rigorous, we will focus on those most commonly applied in business practice: the 1st, 2nd, and 3rd normal forms. Other normal forms are mostly used in academic and statistical settings. The normal form steps must be done in order, and we cannot move to the next normal form until the previous one is complete.
How Do We Do Data Normalization?
Since we have three normal forms to take our data through, we will have three steps to this process. They are as follows:
- 1st normal form (1NF)
- 2nd normal form (2NF)
- 3rd normal form (3NF)
A database professor from college taught my class to memorize the three normal forms as “the key, the whole key, and nothing but the key” (like the courtroom oath for the truth). I had to refresh some of the normal form details for this article, but that basic phrase has always stuck with me. Hopefully, it might help you remember them, too.
I recently came across a coffee shop data set that seems to be a good fit for us to use as an example of normalizing a data set. With a bit of adjustment for our examples here, we can step through the process.
Denormalized Data
| transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1,2 | Jamaican Coffee River Sm, Oatmeal Scone | 1,1 | 2.45,3.00 | N,N |
| 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1,2 | Cappuccino Lg, Jumbo Savory Scone | 2,1 | 4.25,3.75 | N,N |

The data contains sales receipts for the company and was originally published on the Kaggle Coffee Shop sample data repository, though I have also created a GitHub repository for today’s post. The data displayed above shows sales made to a customer for ordered products.
The data contains sales receipts for the company and was originally published on the Kaggle Coffee Shop sample data repository, though I have also created a GitHub repository for today’s post. The data displayed above shows sales made to a customer for ordered products.
Why is this data a problem? Earlier, we mentioned normalization to solve insert, update, and delete anomalies. If we try to insert a new row to this data, we could create a duplicate, or worse yet, have to gather all of the information on customers, products, and receipt date/time in order to create it. If we needed to update or delete the products purchased on the receipt, we would need to sort through the list in each product column to search for the value. So let’s see how to improve redundancy and integrity by normalizing this data.
1st Normal Form: The Key
For the first step in our “the key, the whole key, and nothing but the key,” a table should have a primary key (a single column or set of columns) that ensures each row is unique. Each column in a row should also contain only a single value; i.e., no nested tables.
Our example data set needs some work to get it to 1NF. While we can get unique rows with a combination of date/time or maybe date/time/customer, it is often much simpler to reference rows with a generated unique value of some sort. Let’s do that by adding a `transaction_id` field to our receipt table.
There are also several rows that have more than one item ordered (transaction_ids 156 and 199), so a few columns hold more than one value per row. We can correct this by splitting rows with multiple values into separate rows.
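The 1NF split described above can be sketched in a few lines of Python. This is a hypothetical illustration (the helper name `to_1nf` and the dictionary shape are my own, not from the original data set), assuming the multi-value columns are comma-separated strings as shown in the denormalized table:

```python
# Split a denormalized row (comma-separated line items) into 1NF rows.
# Each output row repeats the transaction-level fields and carries
# exactly one value per column.

def to_1nf(row):
    """Expand one denormalized receipt row into one row per line item."""
    items = zip(
        row["line_item_id"].split(","),
        row["product"].split(", "),
        row["quantity"].split(","),
        row["unit_price"].split(","),
        row["promo_item_yn"].split(","),
    )
    return [
        {
            "transaction_id": row["transaction_id"],
            "transaction_date": row["transaction_date"],
            "line_item_id": int(line_id),
            "product": product,
            "quantity": int(qty),
            "unit_price": float(price),
            "promo_item_yn": promo,
        }
        for line_id, product, qty, price, promo in items
    ]

denormalized = {
    "transaction_id": 156,
    "transaction_date": "2019-04-01",
    "line_item_id": "1,2",
    "product": "Jamaican Coffee River Sm, Oatmeal Scone",
    "quantity": "1,1",
    "unit_price": "2.45,3.00",
    "promo_item_yn": "N,N",
}

rows = to_1nf(denormalized)
for r in rows:
    print(r["transaction_id"], r["line_item_id"], r["product"])
```

Transaction 156 expands into two rows, one per line item, with the transaction-level values repeated on each.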
1NF Data
| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 2 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 2 | Jumbo Savory Scone | 1 | 3.75 | N |
With this data, a composite (multi-column) key uniquely identifies a row by the combination of `transaction_id` and `line_item_id`, as a single receipt cannot contain multiple line item #1s. Here is a look at the data if we boil the table down to just those primary key values.
| transaction_id | line_item_id |
| --- | --- |
| 150 | 1 |
| 156 | 1 |
| 156 | 2 |
| 165 | 1 |
| 199 | 1 |
| 199 | 2 |
Each combination of those two values is unique. We have applied the first normal form to our data, but we still have some potential data anomalies. If we wanted to add a new receipt, we might need to create multiple lines (depending on how many line items it contained), and duplicate transaction ID, date, time, and other information on each row. Updates and deletes cause similar problems because we would need to ensure we get all the rows affected for data to be consistent. This is where the second normal form comes into play.
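To make the composite key concrete, here is a minimal sketch using Python’s built-in `sqlite3` module, with an assumed subset of the columns. The database itself rejects a second row with the same `(transaction_id, line_item_id)` pair:

```python
import sqlite3

# Minimal sketch: enforce the 1NF composite key with SQLite.
# The (transaction_id, line_item_id) pair must be unique per row.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE receipt_1nf (
        transaction_id INTEGER,
        line_item_id   INTEGER,
        product        TEXT,
        quantity       INTEGER,
        unit_price     REAL,
        PRIMARY KEY (transaction_id, line_item_id)
    )
    """
)
conn.executemany(
    "INSERT INTO receipt_1nf VALUES (?, ?, ?, ?, ?)",
    [
        (156, 1, "Jamaican Coffee River Sm", 1, 2.45),
        (156, 2, "Oatmeal Scone", 1, 3.00),
    ],
)

# A duplicate composite key is rejected by the database.
try:
    conn.execute(
        "INSERT INTO receipt_1nf VALUES (?, ?, ?, ?, ?)",
        (156, 1, "Cappuccino Lg", 2, 4.25),
    )
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print("duplicate allowed:", duplicate_allowed)
```

The attempted insert of a second line item #1 for transaction 156 fails with an integrity error, which is exactly the uniqueness guarantee 1NF relies on.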
2nd Normal Form: The Whole Key
The second normal form ensures that each non-key column is fully dependent on the whole key. This is more of a concern for tables with more than one column as the primary key (like our coffee receipt table). Here is our data again in its first normal form:
| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 2 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 2 | Jumbo Savory Scone | 1 | 3.75 | N |
We will need to evaluate each non-key field to see if we have any partial dependencies; i.e., whether the column depends on only part of the key rather than the whole key. Since `transaction_id` and `line_item_id` make up our primary key, let’s start with the `transaction_date` field. The transaction date does depend on the transaction ID, as the same transaction ID could not be used again on another day. However, the transaction date doesn’t depend on the line item ID at all. Line items can be reused across transactions, days, and even customers.
Ok, so we have already found that the table does not follow the second normal form, but let’s check another column. What about the customer column? The customer is not dependent on both the transaction ID and the line item ID. If someone gave us a transaction ID, we would know which customer made the purchase, but if we were given only a line item ID, we wouldn’t know which single customer that receipt belonged to. After all, multiple customers could have ordered one, two, or six items on their receipts. The customer is linked to the transaction ID (assuming multiple customers cannot split a receipt), but the customer is not dependent upon the line item. We need to fix these partial dependencies.
The most direct solution is to create a separate table for order line items, leaving the columns that are only dependent on `transaction_id` in the receipt table. The updated data in the second normal form looks like the tables below.
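The decomposition can be sketched in plain Python. This is a minimal, hypothetical illustration (the column subsets and variable names are my own) that collapses the duplicated transaction-level data into one receipt row while keeping one line item row per product:

```python
# Decompose 1NF rows into a Receipt table (transaction-level columns)
# and a Receipt Line Item table (columns dependent on the whole key).

rows_1nf = [
    {"transaction_id": 156, "transaction_date": "2019-04-01",
     "customer": "Griffith Lindsay", "line_item_id": 1,
     "product": "Jamaican Coffee River Sm", "quantity": 1, "unit_price": 2.45},
    {"transaction_id": 156, "transaction_date": "2019-04-01",
     "customer": "Griffith Lindsay", "line_item_id": 2,
     "product": "Oatmeal Scone", "quantity": 1, "unit_price": 3.00},
]

TRANSACTION_COLS = ("transaction_id", "transaction_date", "customer")
LINE_ITEM_COLS = ("transaction_id", "line_item_id", "product",
                  "quantity", "unit_price")

# One receipt row per transaction_id: the duplicated transaction-level
# data collapses to a single entry.
receipts = {}
line_items = []
for row in rows_1nf:
    receipts[row["transaction_id"]] = {c: row[c] for c in TRANSACTION_COLS}
    line_items.append({c: row[c] for c in LINE_ITEM_COLS})

print(len(receipts), "receipt(s);", len(line_items), "line item(s)")
```

The two 1NF rows for transaction 156 become one receipt row plus two line item rows, and the customer name now lives in exactly one place.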
Receipt
| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num |
| --- | --- | --- | --- | --- | --- |
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 |
Receipt Line Item
| transaction_id | line_item_id | product_id | product | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- | --- |
| 150 | 1 | 28 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 1 | 34 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2 | 77 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 1 | 54 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 1 | 41 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2 | 79 | Jumbo Savory Scone | 1 | 3.75 | N |
Now let’s test that our change fixed the issue and follows the second normal form. For our Receipt table, `transaction_id` becomes the sole primary key. The transaction date is unique based on the `transaction_id`, as is `transaction_time`; i.e., there can be only one date and time for a transaction ID. An order is placed either in-store or not (never both), so the `instore_yn` value is dependent upon the `transaction_id`. Since customers cannot split a receipt, a transaction also identifies a unique customer. Finally, given a transaction ID, we can identify the single customer loyalty number attached to it.
Next is the Receipt Line Item table. Line items are dependent upon the transaction (receipt) with which they are associated, so we retained the transaction ID on our line item table. The combination of `transaction_id` and `line_item_id` becomes our composite key on the line item table. `product_id` and `product` are determined by the transaction and line item together. A single transaction ID wouldn’t tell us which product (if the receipt contains multiple products purchased), and a single line item ID wouldn’t tell us which purchase was being referenced (different receipts could order the same products). This means the `product_id` and `product` values are dependent on the whole key.
We can also associate a quantity from the `transaction_id` and `line_item_id`. Quantities could be the same across receipts or line item IDs, but the combination of both keys gives us a single value for quantity. We also cannot uniquely identify our `unit_price` or `promo_item_yn` column values without both the transaction ID and line item ID fields together.
Although we have satisfied the second normal form, some data anomalies still exist. If we tried to create a new product or a new customer, we couldn’t create them in our current tables because we may not have receipts tied to them yet. If we needed to update a product or customer (for a typo or a name change), we would need to update all the line item rows containing those values. If we wanted to delete a product or customer, we couldn’t unless we also removed the receipts or line items that referenced them. To solve these problems, we can move to the third normal form.
3rd Normal Form: And Nothing but the Key
The third normal form ensures that non-key fields are dependent on nothing but the key. In other words, they are not dependent on other non-key fields, which would create a transitive dependency. Let’s review our 2NF data again:
Receipt
| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num |
| --- | --- | --- | --- | --- | --- |
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 |
Receipt Line Item
| transaction_id | line_item_id | product_id | product | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- | --- |
| 150 | 1 | 28 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 1 | 34 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2 | 77 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 1 | 54 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 1 | 41 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2 | 79 | Jumbo Savory Scone | 1 | 3.75 | N |
On our Receipt table, we need to check the non-key fields (everything except `transaction_id`) to see if the values depend on other non-key fields. The values for transaction date, time, and in-store do not change based on each other, nor on the associated customer or loyalty number, so they are properly dependent on nothing but the key.
But what about customer info? The value of the loyalty number could change if the customer changes. For instance, if we needed to delete or update the customer who made the purchase, the loyalty number would also need to be deleted or updated along with it. So, the loyalty number is dependent on the customer, which is a non-key field. This means our Receipt table is not in the third normal form.
What about our Receipt Line Item table? Quantity, unit price, and promo item values don’t vary based on the values of each other, nor on the product information, because the three fields record the value of an item at the time of purchase. However, the product is dependent on the `product_id` because the value would change based on which product ID was referenced. So this table also needs some updates to comply with the third normal form.
Again, the best method to solve these issues is to pull the related columns into separate tables and leave a reference ID (foreign key) to link the original tables with the new ones. This eliminates the insert, update, and delete anomalies, reduces data redundancy, and improves efficiency for storage and querying.
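As a sketch of the final 3NF design, here is a minimal example using Python’s built-in `sqlite3` module. The table and column names follow the article; the data types and foreign-key declarations are my own assumptions:

```python
import sqlite3

# Sketch of the 3NF schema: Product and Customer get their own tables,
# and the Receipt and Receipt Line Item tables reference them by ID.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        customer    TEXT,
        loyalty_num TEXT
    );
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        product    TEXT
    );
    CREATE TABLE receipt (
        transaction_id   INTEGER PRIMARY KEY,
        transaction_date TEXT,
        transaction_time TEXT,
        instore_yn       TEXT,
        customer_id      INTEGER REFERENCES customer(customer_id)
    );
    CREATE TABLE receipt_line_item (
        transaction_id INTEGER REFERENCES receipt(transaction_id),
        line_item_id   INTEGER,
        product_id     INTEGER REFERENCES product(product_id),
        quantity       INTEGER,
        unit_price     REAL,
        promo_item_yn  TEXT,
        PRIMARY KEY (transaction_id, line_item_id)
    );
    """
)
conn.execute("INSERT INTO customer VALUES (32, 'Griffith Lindsay', '769-005-9211')")
conn.execute("INSERT INTO product VALUES (77, 'Oatmeal Scone')")
conn.execute("INSERT INTO receipt VALUES (156, '2019-04-01', '12:30:00', 'N', 32)")
conn.execute("INSERT INTO receipt_line_item VALUES (156, 2, 77, 1, 3.00, 'N')")

# Reassemble a full receipt view by joining the normalized tables.
row = conn.execute(
    """
    SELECT c.customer, p.product, li.unit_price
    FROM receipt r
    JOIN customer c ON c.customer_id = r.customer_id
    JOIN receipt_line_item li ON li.transaction_id = r.transaction_id
    JOIN product p ON p.product_id = li.product_id
    """
).fetchone()
print(row)
```

Each fact now lives in one table, and the joins put the receipt back together on demand; renaming a product or customer touches exactly one row.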
Receipt
| transaction_id | transaction_date | transaction_time | instore_yn | customer_id |
| --- | --- | --- | --- | --- |
| 150 | 2019-04-01 | 12:24:53 | Y | 604 |
| 156 | 2019-04-01 | 12:30:00 | N | 32 |
| 165 | 2019-04-01 | 16:44:46 | Y | 127 |
| 199 | 2019-04-01 | 14:24:55 | Y | 112 |
Receipt Line Item
| transaction_id | line_item_id | product_id | quantity | unit_price | promo_item_yn |
| --- | --- | --- | --- | --- | --- |
| 150 | 1 | 28 | 1 | 2.00 | N |
| 156 | 1 | 34 | 1 | 2.45 | N |
| 156 | 2 | 77 | 1 | 3.00 | N |
| 165 | 1 | 54 | 2 | 2.50 | N |
| 199 | 1 | 41 | 2 | 4.25 | N |
| 199 | 2 | 79 | 1 | 3.75 | N |
Product
| product_id | product |
| --- | --- |
| 28 | Columbian Medium Roast Sm |
| 34 | Jamaican Coffee River Sm |
| 77 | Oatmeal Scone |
| 54 | Morning Sunrise Chai Rg |
| 41 | Cappuccino Lg |
| 79 | Jumbo Savory Scone |
Customer
| customer_id | customer | loyalty_num |
| --- | --- | --- |
| 604 | Camille Tyler | 102-192-8157 |
| 32 | Griffith Lindsay | 769-005-9211 |
| 127 | Stuart Nunez | 796-362-1661 |
| 112 | Allistair Ramirez | 253-876-9471 |
Data Normalization Outside Relational Databases
So does this process of data normalization make sense outside of relational databases? Is it needed for document, columnar, key-value, and/or graph databases?
From my perspective, the goals of data normalization (reducing redundancy, improving data integrity, and increasing query performance) are still highly valuable no matter which database you are working with. However, the process and rules for the normal forms in relational data normalization are likely not a one-for-one match with other data models. Let’s see some examples using our memory key of “the key, the whole key, and nothing but the key” for three main categories of databases: relational, document, and graph.
The relational model’s goal was to optimize for assembling data into various meaningful sets by joining tables in SQL queries. We already stepped through the normalization process from this perspective, so the benefits for redundancy, efficiency, and data integrity hopefully are clear from our earlier discussion.
In document databases, the model is optimized for grouping related information together in a single document, so that looking up a single customer retrieves all of their receipts and any other details. This is already in tension with our data redundancy goal because we could potentially duplicate, or allow inconsistencies in, product information in order to store that data with a customer. Primary keys for documents that will be queried still make sense to avoid multiple lookups, but additional normalization steps may or may not conflict with the goals of the document model itself.
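To make the contrast concrete, here is a hypothetical sketch of the same receipt in a document shape (the field names and nesting are illustrative assumptions, not tied to any particular document database):

```python
# Sketch of a receipt in document form: related data is embedded in
# one document, at the cost of duplicating product details across
# every document that mentions them.
receipt_doc = {
    "transaction_id": 156,
    "transaction_date": "2019-04-01",
    "customer": {"name": "Griffith Lindsay", "loyalty_num": "769-005-9211"},
    "line_items": [
        {"product": "Jamaican Coffee River Sm", "quantity": 1, "unit_price": 2.45},
        {"product": "Oatmeal Scone", "quantity": 1, "unit_price": 3.00},
    ],
}

# One lookup returns the customer and every line item together; no
# joins needed, but renaming "Oatmeal Scone" would mean touching every
# document that embeds it.
total = sum(i["quantity"] * i["unit_price"] for i in receipt_doc["line_items"])
print(round(total, 2))
```

The receipt total is computed from a single document read, which is exactly the access pattern the document model optimizes for.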
Graph databases balance the data integrity provided by relational databases with the pre-baked relationship data provided by documents, creating a model optimized for assembling data relationships without adding data redundancy. Unique entities via a primary key are still important for query and storage efficiency, but joins are stored as separate entities (relationships), naturally teasing related data apart without analyzing each field for partial or non-key dependencies. Normalization exists here, but it feels more organic and less process-driven.
Wrap Up
In summary, we covered the process of data normalization as it pertains to the traditional relational database world. We discussed each step of the three normal forms and applied each one to a coffee shop receipt data set.
Finally, we looked at what data normalization looks like in other types of databases (document and graph) and which normal forms made sense based on the structure of each database model.
Published at DZone with permission of Jennifer Reif, DZone MVB. See the original article here.