Data Model Design Best Practices (Part 1)
Data Model Design Best Practices (Part 1)
Looking back at the history of data modeling can be enlightening, so I did some research to refresh myself. Check it out!
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Business applications, data integration, master data management, data warehousing, big data, data lakes, and machine learning — these all have (or should have) a common and essential ingredient: a data model.
The data model is the backbone of almost all high-value, mission-critical business solutions — from e-commerce and point-of-sale to financial, product, and customer management and all the way through to business intelligence and IoT. Without a proper data model, where is the business data? Probably lost!
Data models and data modeling methodologies have been around since the beginning of time (well, since the beginning of computing, anyway). Data needs structure in order to make sense and provide a way for computers to deal with its bits and bytes. Sure, we deal with unstructured and semi-structured data, too, but for me, this simply means that we've evolved to more sophisticated paradigms than our computing predecessors had to deal with. The data model therefore remains and provides the basis upon which we build highly advanced business applications.
Looking back at the history of data modeling may enlighten us, so I did some research to refresh myself.
A Brief History Lesson on the Data Model
In the Computing Dark Ages, we used flat record layouts/arrays: all data saved to tape or large disk drives for subsequent retrieval. However, in 1958, J. W. Young and H. K. Kent described modeling information systems as “a precise and abstract way of specifying the informational and time characteristics of a data processing problem.” Soon after, in 1959, CODASYL (Conference/Committee on Data Systems Languages) was formed by the Charles Babbage Institute at the University of Minnesota, which led to standard programming languages like COBOL and the Integrated Data Store (IDS), an early database technology designed in the 1960’s at GE/Honeywell by Charles Bachman. IDS proved difficult to use, so it evolved to become the Integrated Database Management System (IDMS) developed at B.F. Goodrich (a U.S. aerospace company at the time, and yes. the tire company we know today) and marketed by Cullinane Database Systems (now owned by Computer Associates). These two data modeling methodologies — called the hierarchal data model and the networkdata model respectively — were both very common across mainframe computing for the next 50 years. You may still find them in use today!
In the late 1960s, while working at IBM, E.F. Codd in collaboration with C.J. Date (author of An Introduction to Database Systems), mapped Codd’s innovative data modeling theories resulting in the Relational Model of Data for Large Shared Data Banks publication in 1970. Codd’s campaign to ensure vendors implemented the methodology properly published his famous Twelve Rules of the Relational Model in 1985. Actually, thirteen rules numbered zero to twelve; Codd was clearly a computer geek of his day. The relational model also introduced the concept of normalization with the definition of the Five Normal Forms. Many of us talk about 3NF (Third Normal Form), but do you know how to define it?
The next significant data modeling methodology arrived in 1996, proposed by Ralph Kimball (retired) in his groundbreaking book co-authored by Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. Kimball’s widely adopted star schema data model applied concepts introduced in the data warehouse paradigm first proposed in the 1970s by W.H. (Bill) Inmon (named in 2007 by Computerworld as one of the ten most influential people of the first 40 years in computing). Inmon’s Building the Data Warehouse, published in 1991, has become the defacto standard for all data warehouse computing. While there has been some history of disagreement between Inmon and Kimball over the proper approach to data warehouse implementation, Margy Ross of the Kimball Group presents a fair and balanced explanation for your worthy consideration in her article Differences of Opinion.
Recently, a new data modeling methodology has emerged as a strong contender: the data vault! Its author and inventor, Dan Linsdedt, first conceived the data vault in 1990 and released a publication to the public domain in 2001. The data vault model resolves many competing Inmon and Kimball arguments, incorporating historical lineage of data, and offering a highly adaptable, auditable, and expandable paradigm. I invite you to read my article, What is “The Data Vault” and why do we need it? Linstedt’s data vault proved invaluable on several significant DOD, NSA, and Corporate projects. In 2013, Linsdedt released Data Vault 2.0 addressing Big Data, NoSQL, unstructured, semi-structured data integration coupled with SDLC best practices on how to use it. Perfect timing, I’d say. So here we are…
A Data Model Summary
A quick summary of the different data modeling methodologies:
- Flat model: A single, two-dimensional array of data elements.
- Hierarchical model: Records containing fields and sets defining a parent/child hierarchy.
- Network model: Similar to the hierarchical model allowing one-to-many relationships using a junction ‘link’ table mapping.
- Relational model: Collection of predicates over a finite set of predicate variables defined with constraints on the possible values and combination of values.
- Star schema model: Normalized fact and dimension tables removing low cardinality attributes for data aggregations.
- Data vault model: Records long-term historical data from multiple data sources using hub, satellite, and link tables.
Database Development Lifecycle: DDLC
Today’s dialogue seems to focus entirely on complexity and sheer volume of data. Important, sure, but again I’d like to remind you that the data model should be an important part of the discussion. As requirements evolve, the schema (a data model) must follow along — or even lead the way. Regardless, it needs to be managed. Therefore, I present to you the Database Development Life Cycle!
For every environment where data is involved, developers need to accommodate and adapt code to its inevitable structural mutation. Similar to the Software Development Life Cycle (SDLC), a database should embrace appropriate data model design and best practices. Of the many data models that I have designed, clear precepts have emerged, which include:
- Adaptability: Creating schemas that withstand enhancement or correction.
- Expandability: Creating schemas that grow beyond expectations.
- Fundamentality: Creating schemas that deliver on features and functionality.
- Portability: Creating schemas that can be hosted on disparate systems.
- Exploitation: Creating schemas that maximize a host technology.
- Efficient Storage: Creating optimized schema disk footprint.
- High Performance: Creating optimized schemas that excel.
These design precepts incorporate the essence of any chosen modeling methodology — some in contradiction with others. In my experience, regardless of these dichotomies, a data model has just three stages of life:
- A fresh install: Based upon the current version of the schema.
- Apply an upgrade: Drop/create/alter database objects upgrading one version to the next.
- Data migration: Where a disruptive upgrade occurs (like splitting tables or platform).
Designing the data model can be a labor of love entailing both the tedious attention to detail tempered with the creative abstraction of ambiguity. Personally drawn to challenging schemas, I look for cracks and crevices to correct, which often present themselves in various ways. For example:
- Composite primary keys: Avoid them, rarely effective or appropriate; there are some exceptions depending upon the data model
- Bad primary keys: Usually datetime and/or strings (except a GUID or Hash) are inappropriate
- Bad indexing: Either too few or too many.
- Column datatypes: When you only need an Integer don’t use a Long (or Big Integer), especially on a primary key.
- Storage allocation: Inconsiderate of data size and growth potential
- Circular references: Where table A has a relationship with table B, table B has a relationship with table C, and table C has a relationship with table A; this is simply bad design.
Let's consider then a database design best practice: The design and release process of a data model. I believe that when crafting a data model one should follow a prescribed process similar to this:
Self-explanatory to most perhaps, yet let me emphasize the importance of adopting this process. While schema changes are inevitable, getting a solid data model early in any software development project is essential. Undoubtedly minimizing the impact to application code is desirable for delivering successful software projects. Schema changes can be an expensive proposition so understanding the database life cycle and its role becomes very important. Versioning your database model is critical. Use graphical diagrams to illustrate the designs. Create a data dictionary or glossary and track lineage for historical changes. It is a higher discipline, but it works!
Data Modeling Methodologies
Understanding the history of the data model and the best process under which to design them is only the starting point. As a database architect for both transactional (OLTP) and analytical (OLAP) models, I have discovered that the first three steps illustrated above represent about 80% of the work. Let's consider that next.
Sometimes, data models are easy, usually due to simplicity and/or small stature. Data models can also be very hard, usually due to complexity, diversity, and/or sheer size and shape of the data and the many places throughout the enterprise where it is used. I believe we should understand as early as possible the full extent of what and where data is, how it is affected by, or affects the applications and systems using it, and why it is there in the first place. Getting your head around who needs what and how to deliver it is the challenge. Mapping it out to ensure a solid Data Model is the goal. Choosing the right data modeling methodology is paramount.
In Part 2 of this series, I will illustrate and examine the basics and value of the logical and physical data model. I will also propose an expansion on the way we differentiate our data: holistically first, then separating out the conceptual details, before we even attempt a logical or physical design. These help us better understand the data, model the data, and validate the model of our database design.
Until then, ponder on the information presented here, and feel free to leave any comments, questions, and/or debate the principles presented. Cheers!
Published at DZone with permission of Dale Anderson , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.