What is a data model? As Talend developers, we see them every day, and we think we know what they are:
- A structural definition of a business system data.
- A graphical representation of business data.
- A data foundation upon which to build business solutions.
These may all be true statements but I'd like to suggest that they are all extraneous definitions because separately, they don't reach the root, purpose, or goal of what a data model really is.
OK then, what is a data model? I think it is many things, and yet a specific thing. For me, a data model is a structural foundation represented as a well-defined graphical characterization of a business information system.
What do you think? That’s the same as the definitions above, right? Not really. This definition encompasses all the elements into a single purpose: a means to identify, structurally, information about a business use case, not just its data.
In Part 1 of this blog series, I condensed a 50-year history of data modeling into about four short paragraphs. Sure, I left out a few bits, but I do believe understanding how we arrived at what we know about data modeling is a result of lessons learned and improvements achieved from our predecessors. Today, most companies use data models to help validate requirements — a true business value — but I often wonder if they understand how to do it right. In many cases, the illusion of a durable data model is presumed by the mere fact that there is one, without knowing or validating for sure if it is right.
As a practitioner of data architecture and database design, I have seen so many bad data models that I am compelled to suggest that most data models are probably wrong to some extent. I have seen many good ones, yet how do you know if a data model is good or bad? Why should we care? So long as data gets into and out of it, isn’t that good enough? The answer is a resounding no! For data models must be good, or great, they must ensure the success of the business systems running against and/or in cooperation with them. The data model is the essence of the business and therefore must be comprehensive, unimpeachable, and resilient.
The motivation of having a good data model is therefore apparent. Once you begin putting data in and getting data out with ETL/ELT tools like Talend Studio, this becomes clear (to most of us). I think a data model is one of three essential technical elements of any software project. The other two being the application code and the user interface.
You also read in Part 1 about the Database Development Life Cycle (DDLC) methodology, which every data model I design follows. This methodology has served me well and I highly recommend it for any serious database development team. Understanding and adopting this process can streamline, automate, and improve any implementation and maintenance of a data model. Yet there is more to this process which we need to explore. Let me share some additional best practices that can promote a reliable, pliable, and accurate data model for your business.
Data Modeling Best Practices
Many data models are designed using a process where the modeler creates a logical and then a physical model. Typically, logical models describe entities and attributes, and the relationships that bind them providing a clear representation of the business purpose of the data. Physical models then implement the logical model as tables, columns, datatypes, and indexes along with concise data integrity rules. These rules define primary and foreign keys and default values. Additionally, views, triggers, and stored procedures can be defined to support the implementation as required. The physical model also defines the storage allocation on-disk based on specific configuration options provided by most host systems (like Oracle, MS SQL Server, MySQL, etc.).
Fair enough, right? Yet many times I have been engaged in heated debate over the difference between a Logical model and a conceptual model. Many suggest to me that they are the same, both presenting entities and attributes of the business data. I couldn’t disagree more! The conceptual model aims to provide context as to the business understanding of data, not a technical one. All stakeholders can understand a conceptual model and many struggles with Entities and Attributes. I believe the conceptual model, done right, is the best tool for communication about the business data for everyone involved. I prefer to use aspects of the unified modeling language (UML) as my way to diagram a conceptual model and to keep it simple, not getting bogged down with details. I’ll leave that for the logical and physical models in which those details are essential and refined.
The enterprise business, usually having large numbers of application systems, introduces a higher level of concern when modeling data. I have found that even the conceptual, logical, and physical models are simply not enough. So, I introduce to you: the holistic data model (or at least my adaptation of it!).
The intent of the holistic data model is to identify and abstract data silos across a business enterprise, thus describing what exists or is needed, where they relate to each other, and how to organize them for the most effective use at the highest level.
The Four Data Modeling Process Layers
Given the potential for four different types of data models in an enterprise, I propose the following data modeling process to be followed as layers, top-down, for the definition, refinement of understanding, and specific design features. Key roles in each level identify who and where they become involved in the process.
Holistic Data Model
The holistic layer represents an abstract landscape of data silos across an enterprise. This data model creates the opportunity to establish widespread business data governance thus enabling a better understanding of all data relationships inherent to the enterprise. They are intended to incorporate data from any application, internal or external. I use a bubble chart to diagram the holistic data model. Here is how I do it.
Bubble Charts: Data Silos
The bubble chart is a composition of simple bubbles representing unique data silos. Lines (called links) connecting two bubbles (and only two) indicate that some relationship(s) exists between them. Fundamentally, each collection of bubbles (often designed with a center "hub" having radiating "spokes"), embodies a particular set of data silos identified across the enterprise; nothing more, nothing less. Here are some specification details:
The solid blue links indicate direct relationships between two data silos. The dashed red links indicate indirect relationships between two data silos. The dotted green links indicate extended relationships between two data silos. Use these links subjectively, as they may represent multiple relationships (to be defined in the conceptual layer). Simply, they define that a relationship exists.
Bubble charts define particular collections of business information. The objective is to identify, simplify, and consolidate information absent of any application, implementation, or technical details it may support.The holistic data model advantage is that all audiences can understand the enterprise data landscape in one comprehensive yet simplistic view offering a flexible starting point for the identification and insertion of any new data into the model with limited or perhaps no disruption to the underlying data models (discussed below).
Here is an example of what a fully defined holistic data model might look. Print them on the big printer and put them up on a wall. Many productive conversations can be had in examining these and they can become an effective, valuable asset to your business.
Conceptual Data Model
UML Information Architecture
- Protected, where values are pre-determined.
- Public, where values are mutable.
- Private, where values have restricted use.
Element objects connected directly to each other are deemed to have some "association" indicated with a solid gray link and purposeful labels. These associations, using the diamond symbol on the Parent element, present relationships that are either:
- Simple (no diamond).
- Shared (open diamond).
- Composite (solid diamond).
A child element may also be "navigatable," which is indicated by an arrow symbol further identified with a relational cardinality (0.* = zero to many, etc.).
Completing the UML diagram, elements can have self-joining associations, which are specific characteristics that extend the definition of a parent object and/or "associations" between specific characteristics. The specific extensions do not represent a class or a generalization but identify pertinent characteristics that are called out for the purpose of better understanding of the abstracted data silo. The connection of specific characteristics to an element is indicated by a solid red link and a purposeful label. Additionally, element characteristics can connect to other element characteristics of the same parent object indicated with solid green links similar to related generalizations.These relationships may also be "navigatable," which is indicated with an optional, open arrow symbol, then further identified with a relational cardinality (0.* = zero to many, etc.).
The conceptual data model describes particular data elements using a class-based metaphor, best diagramed using UML, which further explains abstracted holistic data silos. This objective is to define, refine, and mitigate business information, still agnostic to any application, implementation rules, or technical details, and also to encapsulate details left out of the holistic model.
Again, print it out big and note that this model represents a common interface against which application code can be written without the logical or physical data models that follow. This advantage can also present a validation point before which those subsequent data models are crafted. Validation of the UML model with both software engineering and stakeholders is a key milestone in the data modeling process. Here is an example of what a selection of a conceptual data model might look like. Note that this model has sub-elements that define particular aspects of the main element, clarifying unique and recurring characteristics.
Logical Data Model
The logical layer represents an abstract structure of semantic information organized in terms of domain entities (logical data objects), their attributes, and specific relationships between them. This data model is derived from element objects of the conceptual model and defines pertinent details (keys/attributes) plus relationships between entities without regard to any specific host storage technology. Entities can represent a single element, part of an element, or multiple elements as necessary to encapsulate suitable data structures. The logical data model encapsulates the structural entities and record sets identified in the conceptual model, adding specific attributes thus enabling a better understanding of the data involved. Here is how I do it:
Entity relationship diagrams (ERDs) describe uniquely identifiable entities capable of an independent existence which in turn require a minimal set of unique identifying attribute called a primary key (PK). Where a child entity is linked to some parent entity, referential data integrity should be enforced through the use of an identifying attribute in the child entity that matches the parent PK attribute(s) called a foreign key (FK). But you all know about these.
Cardinality has only two rules: the minimum and the maximum number of rows for each entity that can participate in a relationship where the notation closest to the entity is the maximum count. Specifying cardinality for a record set also suggests that the relationship is optional or mandatory, which assists with the design for the physical data model.
Notice a few things here. I’ve used colors to represent different functional areas which can map up to the conceptual and holistic models. I’ve also incorporated a "virtual" relationship between ENTITY_D and ENTITY_C (shown as a light gray link). This identifies that a logical relationship exists; however, the construct between these two entities plus ENTITY_B represents a circular reference, which is something to avoid completely in the physical model. Also, note that there are a few attributes that define an array of values. In the logical model, this is okay, as it simplifies and streamlines the model — just be sure to normalize them in the physical model.
Physical Data Model
The physical layer represents a composition of host system artifacts (physical data objects) derived from a logical data model coupled with its desired storage configuration. This data model incorporates tables, columns, data types, keys, constraints, permissions, indexes, views, and details on the allocation parameters available on the data store (see my blog Beyond the Data Vault for more on data stores). These host artifacts represent the actual data model upon which software applications are built. The physical data model encapsulates all these artifacts from entities and attributes defined in the logical data mode finally enabling an application access to store and retrieve actual data. Here is how I do it.
A schema (physical) design model (SDM) defines specific objects involved in a database information system. I'll presume that most of my readers know more about this data model than the previous three, so I’ll avoid describing the constructs. I prefer calling it an SDM so that it's not confused with the more widely used term ERD (which is not a physical data model). Instead, the SDM provides an engineering reference often chronicled with both the graphical diagram and a data dictionary document. Providing a critical, detailed reference to every database object implemented in the SDM, this document should incorporate their purpose, referential integrity rules, and other important information on any intended behavior. Here's a good structure I use:
- Object name and definition (tables/views)
- SQL object creation/modification filename
- Business domain and functional utilization
- Version/integrity level
- Default values
- Primary keys
- Foreign keys
- Natural Business keys
- Unique constraints
- Check constraints
- Unique and non-unique indexes (clustered and non-clustered)
- Control flows (when extra complexity design/use is involved)
- Useful comments
- Change history
An SDM data dictionary references objects alphabetically by name for ease of use. As most physical data models are highly normalized, referential integrity rules should be called out for each table. I've seen many ways to deal with these rules, in particular when executing SQL object scripts against an existing schema. Simply turning off integrity checks, running the scripts, then turning it back on works. Easy enough, but I'm not a fan of this method, as it's prone to errors.
Instead, I take the time to understand particular references to all tables and assign a level of integrity for each. A table integrity level identifies the hierarchical ordering of parent/child table relationship. In short, a table integrity level is based on any FK reference to the parent table(s). For example:
- A table with no parent tables is an L0 (level 0), the highest level.
- A table with at least one parent table is an L1 (level 1).
- A table with at least one parent table but that parent table has an L0 parent table is an L2 (or level 2).
- A table with multiple parent tables having parent tables of different levels is L+1 (level +1), the lowest level.
- ie: parent A is an L0, parent B is an L1, so child table is an L2or: parent A is an L1, parent B is an L4, so the child table is an L5
- And so on.
Note: L0 is the highest level as there are no parent tables; the lowest level is determined by the physical data model. This method also eliminates the potential for the creation of circular references (a bad data model design practice, IMHO).
The physical data model is the one model that is actually implemented. I prefer to use SQL object creation scripts or SOCS for this implementation. Using this method, I discovered that the DDLC for any physical data model can be decoupled as an independent process which is highly desirable and difficult to achieve. The idea is to create one SOCS file for one primary database object (table, view, trigger, or stored procedure). These scripts contain intelligent checks to determine which SQL statements to apply (drop, create, alter, etc.) and can, through switches and arguments passed to account for the lifecycle discussed in my previous blog, which are:
- A fresh install based on the current version of the schema.
- Apply an upgrade to drop/create/alter DB objects upgrading one version to the next.
- Data migration where a disruptive upgrade occurs (like splitting tables or platform).
These SOCS files incorporate best practices as well which include (yours may differ):
- Consistent naming conventions
- Table names all caps
- Column names all lowercase
- View names all camel case
- SOCS file names incorporate object name
- Single column PKs using appropriately sized integers
- Elimination of redundant/duplicate data (tuples)
- Elimination of all Circular Key References (where a Parent > Child > Parent may occur)
- Single SOCS file per object
- SOCS files contain consistent header/purpose/history sections matching this data dictionary
- SQL formatting provides readability and maintainability
Further details on my implementation of SOCS is outside the scope of this blog. Maybe I can be persuaded to write about this another time. Your feedback and questions is welcome.
Wherefore Art Thou Data Model?
A brief summary of these layers assists in understanding their purpose, how they support and differ from each other in the modeling process. Take a look at this table to see: