Refcard #160

Data Warehousing

Best Practices for Collecting, Storing, and Delivering Decision-Support Data

As a total architecture, data warehousing provides decision-support data that is consistent, integrated, standard, and simply understood. From descriptions to diagrams and integration patterns, this newly updated Refcard walks you through each aspect of data warehousing. Gain a complete understanding of data modeling, infrastructure, relationships, attributes, and speedy history loading and recording with atomic data.


Written By

Roi Avinoam
CTO & Co-Founder, Panoply.io
Alon Brody
Lead Data Architect, Panoply.io
David Haertzen
Principal Enterprise Architect, Allianz Life
Table of Contents
► What Is Data Warehousing? ► Data Warehouse Architecture ► Data ► Data Modeling ► Normalized Data ► Atomic Data Warehouse ► Supporting Tables ► Dimensional Database ► Facts ► Dimensions ► Data Integration
Section 1

What Is Data Warehousing?

Data warehousing is a process for collecting, storing, and delivering decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point-by-point in this Refcard. A data warehouse is one of the artifacts created in the data warehousing process.

William (Bill) H. Inmon has provided an alternate and useful definition of a data warehouse: "A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."

As a total architecture, data warehousing involves people, processes, and technologies to achieve the goal of providing decision-support data that is consistent, integrated, standardized, and easy to understand.

See the book The Analytical Puzzle: Profitable Data Warehousing, Business Intelligence and Analytics (ISBN 978-1935504207) for details.

What a Data Warehouse is and is Not

A data warehouse is a database whose data includes a copy of operational data. This data is often obtained from multiple data sources and is useful for strategic decision-making. It does not, however, contain original data.

"Data warehouse," by the way, is not another name for "database." Some people incorrectly use the term "data warehouse" as if it's a generic name for a database. A data warehouse does not only consist of historic data — it can be made up of analytics and reporting data, too. Transactional data that is managed in application data stores will not reside in a data warehouse.

Section 2

Data Warehouse Architecture

Data Warehouse Architecture Components

The data warehouse's technical architecture includes data sources, data integration, BI/analytics data stores, and data access.


Data Warehouse Tech Stack

Metadata Repository – A software tool that contains data that describes other data. There are two kinds of metadata: business metadata and technical metadata.

Data Modeling Tool – A software tool that enables the design of data and databases through graphical means. This tool provides a detailed design capability that includes the design of tables, columns, relationships, rules, and business definitions.

Data Profiling Tool – A software tool that supports understanding data through exploration and comparison. This tool accesses the data and explores it, looking for patterns such as typical values, outlying values, ranges, and allowed values. It is meant to help you better understand the content and quality of the data.

Data Integration Tools – ETL (extract, transform, and load) tools, as well as real-time integration tools such as ESB (enterprise service bus) software. These tools copy data from place to place and also scrub and cleanse the data.

RDBMS (Relational Database Management System) – Software that stores data in a relational format using SQL (Structured Query Language). This is the database system that stores and maintains the warehouse data, and it is central to the system's scalability.

MOLAP (Multidimensional OLAP) – Database software designed for data mart-type operations. This software organizes data into multiple dimensions, known as "cubes," to support analytics.

Big Data Store – Software that manages huge amounts of data that other types of software, such as relational databases, cannot. This Big Data tends to be unstructured and consists of text, images, video, and audio.

Reporting and Query Tools – Business-intelligence software tools that select data through queries and present it as reports and/or graphical displays. These tools let business users and analysts explore the data and produce the reports and outputs needed to understand it.

Data Mining Tools – Software tools that find patterns in stores of data or databases. These tools are useful for predictive analytics and optimization analytics.

Infrastructure Architecture

The data warehouse tech stack is built on a fundamental framework of hardware and software known as the infrastructure.

Infrastructure

A data warehouse appliance or a dedicated database infrastructure supports the data warehouse and tends to yield the highest performance. The data warehouse appliance is optimized to provide database services using a massively parallel processing (MPP) architecture. It includes multiple tightly coupled computers with specialized functions, plus at least one array of storage devices that are accessed in parallel. Specialized functions include system controller, database access, data load, and data backup.

Data warehouse appliances provide high performance. They can be up to 100x faster than the typical database server. Consider the data warehouse appliance when more than 2TB of data must be stored.

Data Architecture

Data architecture is a blueprint for the management of data in an enterprise. The data architect builds a picture of how multiple sub-domains work. Some of these subdomains are data governance, data quality, ILM (information lifecycle management), data framework, metadata and semantics, master data, and, finally, business intelligence.


Data Architecture Sub-Domains

Data Governance (DG) – The overall management of data and information, including the people, processes, and technologies that improve the value obtained from data and information by treating data as an asset. It is the cornerstone of the data architecture.

Data Quality Management (DQM) – The discipline of ensuring that data is fit for use by the enterprise. It includes obtaining requirements and rules that specify the dimensions of quality required, such as accuracy, completeness, timeliness, and allowed values.

Information Lifecycle Management (ILM) – The discipline of specifying and managing information through its life, from conception to disposal. Information activities that make up ILM include classification, creation, distribution, use, maintenance, and disposal.

Data Framework – A description of data-related systems in terms of a set of fundamental parts and the recommended methods for assembling those parts using patterns. The data framework can include database management, data storage, and data integration.

Metadata and Semantics – Information that describes and specifies data-related objects. This description can include the structure and storage of data, the business use of data, and the processes that act on the data. "Semantics" refers to the meaning of data.

Master Data Management (MDM) – An activity focused on producing and making available a "golden record" of master data and essential business entities, such as customers, products, and financial accounts. Master data is data describing the major subjects of interest that are shared by multiple applications.

Business Intelligence – The people, tools, and processes that support planning and decision making, both strategic and operational, for an organization.

Data Flow

The diagram below displays how data flows through the data warehouse system. Data originates in the data sources, such as inventory systems; is stored in the data warehouse and operational data stores; and is then formatted and exposed in the data marts, which are accessed using BI and analytics tools.


Section 3

Data

Data is the raw material through which we can gain understanding. It is a critical element in data modeling, statistics, and data mining. It is the foundation of the pyramid that leads to wisdom and to informed action.

Data Attribute Characteristics

Name – Each attribute has a name, such as "Account Balance Amount." An attribute name is a string that identifies and describes an attribute. In the early stages of data design, you may just list names without adding clarifying information, called metadata.

Datatype – The datatype, also known as the "data format," could have a value such as decimal(12,4). This is the format used to store the attribute. It specifies whether the information is a string, a number, or a date. In addition, it specifies the size of the attribute.

Domain – A domain, such as Currency Amounts, is a categorization of attributes by function.

Initial Value – An initial value, such as 0.0000, is the default value that an attribute is assigned when it is first created.

Rules – Rules are constraints that limit the values that an attribute can contain. An example rule is "the attribute must be greater than or equal to 0.0000." The use of rules helps to improve data quality.

Definition – A narrative that conveys or describes the meaning of an attribute. For example, "Account Balance Amount is a measure of the monetary value of a financial account, such as a bank account or an investment account."
Section 4

Data Modeling

Three levels of data modeling are developed in sequence:

  1. Conceptual Data Model – a high-level model that describes a problem using entities, attributes, and relationships.
  2. Logical Data Model – a detailed data model that describes a solution in business terms, using entities, attributes, and relationships.
  3. Physical Data Model – a detailed data model that defines database objects, such as tables and columns. This model is needed to implement the models in a database and produce a working solution.

Entities

An entity is a core part of any conceptual and logical data model. An entity is an object of interest to an enterprise — it can be a person, organization, place, thing, activity, event, abstraction, or idea. Entities are represented as rectangles in the data model. Think of entities as singular nouns.


Attributes

An attribute is a characteristic of an entity. Attributes are categorized as primary keys, foreign keys, alternate keys, and non-keys, as depicted in the diagram below.


Relationships

A relationship is an association between entities. Such a relationship is diagrammed by drawing a line between the related entities. The following diagram depicts two entities — Customer and Order — that have a relationship specified by the verb phrase "places" in this way: Customer Places Order.

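As a hedged sketch, the Customer-places-Order relationship can be implemented in SQL with a foreign key; the table and column names here are hypothetical and chosen only for illustration.

-- Customer and Order entities; one customer places many orders.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,      -- primary key
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE order_header (
    order_id    INTEGER PRIMARY KEY,        -- primary key
    customer_id INTEGER NOT NULL,           -- foreign key to customer
    order_date  DATE NOT NULL,
    CONSTRAINT fk_order_customer
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);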

Cardinality

Cardinality specifies the number of entities that may participate in a given relationship, expressed as one-to-one, one-to-many, or many-to-many, as depicted in the following example.


Cardinality is expressed as minimum and maximum numbers. In the first example below, an instance of entity A may have one instance of entity B, and entity B must have one and only one instance of entity A. Cardinality is specified by putting symbols on the relationship line near each of the two entities that are part of the relationship.

In the second case, entity A may have one or more instances of entity B, and entity B must have one and only one instance of entity A.


Minimum cardinality is expressed by the symbol farther away from the entity. A circle indicates that an entity is optional, while a bar indicates that an entity is mandatory (at least one is required).


Maximum cardinality is expressed by the symbol closest to the entity. A bar means that a maximum of one entity can participate, while a crow's foot (a three-prong connector) means that many entities, a large unspecified number, may participate.


Section 5

Normalized Data

Normalization is a data modeling technique that organizes data by breaking it down to its lowest level, i.e. its "atomic" components, to avoid duplication. This method is used to design the atomic data warehouse part of the data warehousing system.

First Normal Form – Entities contain no repeating groups of attributes.

Second Normal Form – The entity is in first normal form, and attributes that depend on only part of a composite key are separated out.

Third Normal Form – The entity is in second normal form, and non-key attributes that represent facts about other non-key attributes are separated out.

Fourth Normal Form – The entity is in third normal form, and two or more independent, multi-valued facts for an entity are separated out.

Fifth Normal Form – The entity is in fourth normal form, and all non-primary key attributes depend on all attributes that make up the primary key.
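As a hedged illustration of third normal form, the sketch below (hypothetical table and column names) separates non-key attributes that describe another non-key attribute (category_name depends on category_code rather than on product_id) into their own entity.

-- Not in third normal form: category_name depends on category_code,
-- a non-key attribute, rather than on the product_id key.
CREATE TABLE product_unnormalized (
    product_id    INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_code CHAR(3),
    category_name VARCHAR(50)
);

-- Third normal form: the category facts move to their own entity.
CREATE TABLE product_category (
    category_code CHAR(3) PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL
);

CREATE TABLE product (
    product_id    INTEGER PRIMARY KEY,
    product_name  VARCHAR(100),
    category_code CHAR(3) REFERENCES product_category (category_code)
);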
Section 6

Atomic Data Warehouse

The atomic data warehouse (ADW) is an area where data is broken down into low-level components in preparation for exporting to data marts. The ADW is designed using normalization and methods that make for speedy history loading and recording.

Header and Detail Entities

The ADW is organized into non-changing data with logical keys and changeable data that supports tracking of changes and rapid load/insert. Use an integer as the primary surrogate key. Then, add the effective date to track changes.

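A minimal sketch of the header/detail pattern, with hypothetical names: the header holds the non-changing logical key, while detail rows carry a surrogate key and an effective date so that history can be loaded rapidly by inserting new versions rather than updating rows in place.

-- Header: non-changing data identified by a logical (business) key.
CREATE TABLE customer_hdr (
    dw_customer_id INTEGER PRIMARY KEY,     -- surrogate key
    customer_nbr   VARCHAR(20) NOT NULL     -- logical key from the source
);

-- Detail: changeable data, tracked by effective date.
CREATE TABLE customer_dtl (
    dw_customer_id    INTEGER NOT NULL REFERENCES customer_hdr (dw_customer_id),
    dw_effective_date TIMESTAMP NOT NULL,
    customer_name     VARCHAR(100),
    credit_limit_amt  DECIMAL(12,4),
    PRIMARY KEY (dw_customer_id, dw_effective_date)
);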

Associative Entities

Track the history of relationships between entities using an associative entity with effective dates and expiration dates.


Atomic DW Specialized Attributes

Use specialized attributes to improve ADW efficiency and effectiveness. Identify these attributes using a prefix of dw_.

dw_xxx_id – Data warehouse assigned surrogate key. Replace 'xxx' with a reference to the table name, such as 'dw_customer_dim_id'.

dw_insert_date – The date and time when a row was inserted into the data warehouse.

dw_effective_date – The date and time when a row in the data warehouse began to be active.

dw_expire_date – The date and time when a row in the data warehouse stopped being active.

dw_data_process_log_id – A reference to the data process log. The log is a record of how data was loaded or modified in the data warehouse.
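A hedged sketch of how these housekeeping attributes might appear on a hypothetical customer dimension table:

CREATE TABLE customer_dim (
    dw_customer_dim_id     INTEGER PRIMARY KEY,   -- surrogate key
    customer_nbr           VARCHAR(20) NOT NULL,  -- logical key from the source
    customer_name          VARCHAR(100),
    dw_insert_date         TIMESTAMP NOT NULL,    -- when the row was inserted
    dw_effective_date      TIMESTAMP NOT NULL,    -- when the row became active
    dw_expire_date         TIMESTAMP,             -- when the row stopped being active
    dw_data_process_log_id INTEGER                -- which load process created the row
);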
Section 7

Supporting Tables

Supporting data is required to enable the data warehouse to operate smoothly. Here is some supporting data:

  • Code management and translation
  • Data source tracking
  • Error logging

Code Translation

Data warehousing requires that codes, such as gender code and units of measure, be translated to standard values aided by code-translation tables like these:

  • Code Set – Group of codes, such as "Gender Code"
  • Code – An individual code value
  • Code Translation – Mapping between code values

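A hedged sketch of how a code-translation table might be used during loading; the table and column names, and the stg_customer staging table, are hypothetical. Each source value is mapped to the warehouse's standard value.

-- Maps source-system code values to the standard warehouse values.
CREATE TABLE code_translation (
    code_set_name VARCHAR(50),   -- e.g., 'Gender Code'
    source_system VARCHAR(50),   -- which source the raw value came from
    source_code   VARCHAR(20),   -- value as it appears in the source
    standard_code VARCHAR(20),   -- standardized warehouse value
    PRIMARY KEY (code_set_name, source_system, source_code)
);

-- Translate gender codes while selecting from a staging table.
SELECT s.customer_nbr,
       t.standard_code AS gender_code
FROM   stg_customer s
JOIN   code_translation t
       ON  t.code_set_name = 'Gender Code'
       AND t.source_system = 'CRM'
       AND t.source_code   = s.gender_code;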

Data-Source Tracking and Logging

Data-source tracking provides a means of tracing where data originated within a data warehouse:

  • Data Source – identifies the system or database
  • Data Process – traces the data-integration procedure
  • Data Process Log – traces each data warehouse load


Message Logging

Message logging provides a record of events that occur while loading the data warehouse:

  • Data Process Log – traces each data warehouse load
  • Message Type – specifies the kind of message
  • Message Log – contains an individual message


Section 8

Dimensional Database

A dimensional database is a database that is optimized for query and analysis and is not normalized like the atomic data warehouse. It consists of fact and dimension tables, where each fact is connected to one or more dimensions.

Sales Order Fact

The sales order fact includes the measures order quantity and currency amount. Dimensions of Calendar Date, Product, Customer, Geo Location, and Sales Organization put the sales order fact into context. This star schema supports looking at orders as a cube, enabling slicing and dicing by customer, time, and product.

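To show what slicing and dicing looks like in practice, here is a hedged SQL sketch against hypothetical star-schema tables (sales_order_fact plus date and customer dimensions): the measure is summed and grouped along whichever dimension attributes are of interest.

-- Sum order amounts by month and customer for a given year.
SELECT d.calendar_year,
       d.calendar_month,
       c.customer_name,
       SUM(f.order_amt) AS total_order_amt
FROM   sales_order_fact f
JOIN   date_dim     d ON d.dw_date_dim_id     = f.dw_order_date_dim_id
JOIN   customer_dim c ON c.dw_customer_dim_id = f.dw_customer_dim_id
WHERE  d.calendar_year = 2024
GROUP  BY d.calendar_year, d.calendar_month, c.customer_name
ORDER  BY d.calendar_month, c.customer_name;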

Section 9

Facts

A fact is a set of measurements. It tends to contain quantitative data that gets presented to users. It often contains amounts of money and quantities of things. Facts are surrounded by dimensions that categorize the fact.

Anatomy of a Fact

Facts are SQL tables that include:

  • Table Name – a descriptive name usually containing the word 'Fact'
  • Primary Keys – attributes that uniquely identify each fact occurrence and relate it to dimensions
  • Measures – quantitative metrics

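Combining those elements, a hedged sketch of an order fact table follows; the table and column names are hypothetical, and each key relates the fact to one of the surrounding dimensions.

CREATE TABLE sales_order_fact (
    -- keys that identify the fact and relate it to dimensions
    dw_order_date_dim_id INTEGER NOT NULL REFERENCES date_dim (dw_date_dim_id),
    dw_customer_dim_id   INTEGER NOT NULL REFERENCES customer_dim (dw_customer_dim_id),
    dw_product_dim_id    INTEGER NOT NULL REFERENCES product_dim (dw_product_dim_id),
    order_nbr            VARCHAR(20) NOT NULL,   -- degenerate dimension
    -- measures
    order_qty            DECIMAL(12,4),
    order_amt            DECIMAL(12,4),
    PRIMARY KEY (dw_order_date_dim_id, dw_customer_dim_id,
                 dw_product_dim_id, order_nbr)
);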

Event Fact Example

Event facts record single occurrences, such as financial transactions, sales, complaints, or shipments.


Snapshot Fact

The snapshot fact captures the status of an item at a point in time, such as a general ledger balance or inventory level.


Cumulative Snapshot Fact

The cumulative snapshot fact adds accumulated data, such as year-to-date amounts, to the snapshot fact.


Aggregated Fact

Aggregated facts provide summary information, such as general ledger totals during a period of time, or complaints per product per store per month.

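As a hedged sketch of how an aggregated fact could be populated from a detail fact (using the hypothetical tables from the earlier sketches, plus an assumed sales_order_month_agg target), the monthly totals are simply grouped and summed:

-- Build a monthly aggregate fact from the detailed order fact.
INSERT INTO sales_order_month_agg (calendar_year, calendar_month,
                                   dw_product_dim_id, order_qty, order_amt)
SELECT d.calendar_year,
       d.calendar_month,
       f.dw_product_dim_id,
       SUM(f.order_qty),
       SUM(f.order_amt)
FROM   sales_order_fact f
JOIN   date_dim d ON d.dw_date_dim_id = f.dw_order_date_dim_id
GROUP  BY d.calendar_year, d.calendar_month, f.dw_product_dim_id;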

Fact-less Fact

The fact-less fact tracks an association between dimensions rather than quantitative metrics. Examples include miles, event attendance, and sales promotions.


Section 10

Dimensions

A dimension is a database table that contains properties that identify and categorize. The attributes serve as labels for reports and as data points for summarization. In the dimensional model, dimensions surround and qualify facts.

Date and Time Dimensions

Date dimensions support trend analysis. Date dimensions include the date and its associated week, month, quarter, and year. Time-of-day dimensions are used to analyze daily business volume.

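A hedged sketch of a date dimension (hypothetical names), carrying the calendar attributes used for trend analysis:

CREATE TABLE date_dim (
    dw_date_dim_id   INTEGER PRIMARY KEY,   -- surrogate key, e.g., 20240131
    calendar_date    DATE NOT NULL,
    calendar_week    INTEGER NOT NULL,
    calendar_month   INTEGER NOT NULL,
    calendar_quarter INTEGER NOT NULL,
    calendar_year    INTEGER NOT NULL,
    day_of_week_name VARCHAR(10) NOT NULL
);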

Multiple-Dimension Roles

One dimension can play multiple roles. The date dimension could play roles of a snapshot date, a project start date, and a project end date.

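One common way to let a single date dimension play several roles is to join it more than once under different aliases; here is a hedged sketch against a hypothetical project_fact table with assumed start-date and end-date keys.

-- The same date_dim table plays the roles of start date and end date.
SELECT p.project_nbr,
       start_dt.calendar_date AS project_start_date,
       end_dt.calendar_date   AS project_end_date
FROM   project_fact p
JOIN   date_dim start_dt ON start_dt.dw_date_dim_id = p.dw_start_date_dim_id
JOIN   date_dim end_dt   ON end_dt.dw_date_dim_id   = p.dw_end_date_dim_id;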

Degenerate Dimension

A degenerate dimension has a dimension key without a dimension table. Examples include transaction numbers, shipment numbers, and order numbers.


Slowly-Changing Dimensions

Changes to dimensional data can be categorized into levels:

SCD Type 0 – Data is non-changing. It is inserted once and never changed.

SCD Type 1 – Data is overwritten in place. The new value replaces the old value, and no history is kept.

SCD Type 2 – A new row is inserted for each change. Effective and expiration dates, such as dw_effective_date and dw_expire_date, track which version of the row was active at any point in time.

SCD Type 3 – A previous-value column is added to the row, so a limited amount of history (typically only the prior value) is retained.
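As a hedged sketch of SCD Type 2 maintenance against the hypothetical customer_dim table shown earlier (the key values and the renamed customer are purely illustrative), the currently active row is expired and a new version is inserted when an attribute changes:

-- Expire the currently active row for the changed customer.
UPDATE customer_dim
SET    dw_expire_date = CURRENT_TIMESTAMP
WHERE  customer_nbr   = 'C-1001'
  AND  dw_expire_date IS NULL;

-- Insert the new version of the row as the active one.
INSERT INTO customer_dim
    (dw_customer_dim_id, customer_nbr, customer_name,
     dw_insert_date, dw_effective_date, dw_expire_date, dw_data_process_log_id)
VALUES
    (10457, 'C-1001', 'Acme Corporation (renamed)',
     CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, NULL, 42);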
Section 11

Data Integration

Data integration is a technique for moving data or otherwise making data available across data stores. The data integration process can include extraction, movement, validation, cleansing, transformation, standardization, and loading.

Extract Transform Load (ETL)

In the ETL pattern of data integration, data is extracted from the data source and then transformed while in flight to a staging database. Data is then loaded into the data warehouse. ETL is strong for batch processing of bulk data.


Extract Load Transform (ELT)

In the ELT pattern of data integration, data is extracted from the data source and loaded to staging without transformation. After that, data is transformed within staging and then loaded to the data warehouse.

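Because the transformation in ELT happens inside the target database, it is often expressed directly in SQL. A hedged sketch follows, with a hypothetical stg_sales_order staging table and the warehouse tables from the earlier sketches: surrogate keys are resolved and a measure is derived while loading.

-- Transform within staging and load into the warehouse in one SQL step.
INSERT INTO sales_order_fact
    (dw_order_date_dim_id, dw_customer_dim_id, dw_product_dim_id,
     order_nbr, order_qty, order_amt)
SELECT d.dw_date_dim_id,
       c.dw_customer_dim_id,
       p.dw_product_dim_id,
       s.order_nbr,
       s.order_qty,
       s.order_qty * s.unit_price              -- derived measure
FROM   stg_sales_order s
JOIN   date_dim     d ON d.calendar_date = s.order_date
JOIN   customer_dim c ON c.customer_nbr  = s.customer_nbr
                     AND c.dw_expire_date IS NULL   -- current dimension row
JOIN   product_dim  p ON p.product_nbr   = s.product_nbr;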

Change Data Capture (CDC)

The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated in near real time to the staging area. This information is then transformed and loaded to the data warehouse.




CDC is a great technique for supporting real-time data warehouses.

Innovation in Data Warehouse Technology

Thanks to the agility offered by today’s cloud-based data warehouse solutions, there are cutting-edge innovations that can automate some of the key aspects of data warehousing. For example, the ETL process described earlier has changed considerably thanks to machine learning and natural language processing, ultimately resulting in its complete automation. What’s more, data warehouse storage and compute have also benefited from automated optimization, saving data analysts time on tasks associated with querying, storage, and scalability, which dramatically cuts down costs, coding time, and resources.

As a modern solution, cloud data warehouse automation saves endless hours of coding and modeling for data ingestion, integration, and transformation. The tasks listed below can be now easily automated and seamlessly connected to third-party solutions, such as business intelligence visualization tools, via the cloud:

  • Automate data source connections.
  • Seamlessly connect to third-party SaaS APIs.
  • Easily connect to the most common storage services.

New technology now exists that automates data schema modeling: an adaptive schema changes in real time along with the data, and those changes are seamless. You only need to upload the data sources; everything else is automated, including the following tasks:

  • Data types are automatically discovered, and a schema is generated based on the initial data structure.
  • Likely relationships between tables are automatically detected and used to model a relational schema.
  • Aggregations are automatically generated.
  • A table history feature allows you to store data uploaded from API data sources, so you can compare and analyze data from different time periods.

Automated Query Optimization

Automated cloud data warehousing technology exists that can re-index the schema and perform a series of optimizations on queries and data structures to improve runtime, based on algorithms that assess usage, so that:

  • Re-indexing happens automatically whenever the algorithm detects changes in query patterns.
  • Data is redistributed automatically across nodes to improve data locality and join performance.

Solving Concurrency Issues
To remedy concurrency issues, new cloud data warehousing technologies can separate storage from compute and add compute nodes based on the number of connections. Consequently, the number of available clusters scales with the number of users and the intensity of the workload, supporting hundreds of parallel queries that are load-balanced across clusters.

Storage Optimization
Data warehouse automation has also vastly improved how data is stored and used. New "smart" data warehouse technologies run periodic processes to mark data and optimize storage based on usage. Smart data warehouse technology scales up and down based on data volume. Scaling happens automatically behind the scenes, keeping clusters available for both reads and writes, so ingestion can continue uninterrupted. When the scaling is complete, the old and new clusters are swapped instantly. In addition, data warehouse maintenance itself has been greatly improved by automating the cleaning and compressing of tables to boost database performance.
