The Executive Guide to Data Warehouse Augmentation


Traditional data warehouses don't scale to tackle the challenges of big data. Here's a high-level look at how to transition to a data lake.


The traditional data warehouse (DW) is constrained in terms of storage capacity and processing power. That's why the overall footprint of the data warehouse is shrinking as companies look for more efficient ways to store and process big data. Many companies still use data warehouses effectively for complex data analytics, but a hybrid architecture that migrates storage and large-scale or batch processing to a data lake lets them cut storage and processing costs and get more value from the data warehouse for business intelligence.

Designing Your Architecture

Getting started with a traditional data warehouse can be a difficult first step. The sections below walk through the traditional DW architecture, its pain points, and the modern data lake architecture.

Offloading storage and extract, transform, and load (ETL) functions to a scale-out architecture such as Hadoop enables enterprises to focus the DW on what it does best: business intelligence (BI). Data can be sent directly to BI and analytical tools that understand Hadoop, or, if needed, the DW can augment the data lake to serve legacy tools. Consumers include BI tools that connect to the DW over JDBC/ODBC and statistical tools such as R and SAS.

Data from the various source systems is sent to an ETL tool, where it is cleansed (e.g., bad records removed), standardized (e.g., consistent data formats), transformed (e.g., lookups applied), joined into facts and dimensions, and finally loaded into the DW.
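As a concrete (if simplified) illustration, the cleanse, standardize, transform, and load flow might look like the following Python sketch. The record fields, date format, and region lookup are hypothetical stand-ins, not any particular ETL tool's API:

```python
from datetime import datetime

# Hypothetical raw records from a source system
raw_records = [
    {"customer_id": "42", "region_code": "NE", "order_date": "03/15/2021", "amount": "100.50"},
    {"customer_id": None, "region_code": "NE", "order_date": "03/16/2021", "amount": "abc"},  # bad record
    {"customer_id": "43", "region_code": "SW", "order_date": "04/01/2021", "amount": "75.00"},
]

# Dimension lookup (region code -> region name)
region_dim = {"NE": "Northeast", "SW": "Southwest"}

def cleanse(records):
    """Drop records with missing keys or unparseable amounts."""
    good = []
    for r in records:
        if r["customer_id"] is None:
            continue
        try:
            float(r["amount"])
        except ValueError:
            continue
        good.append(r)
    return good

def standardize(record):
    """Normalize the date to ISO 8601 and the amount to a float."""
    return {
        **record,
        "order_date": datetime.strptime(record["order_date"], "%m/%d/%Y").date().isoformat(),
        "amount": float(record["amount"]),
    }

def transform(record):
    """Apply a lookup: join the region dimension onto the fact."""
    return {**record, "region_name": region_dim.get(record["region_code"], "Unknown")}

facts = [transform(standardize(r)) for r in cleanse(raw_records)]
# `facts` would then be loaded into the DW fact table
```

In a real pipeline each stage would be a distributed job rather than an in-memory list comprehension, but the shape of the flow is the same.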

Building Your Lake

Now that you know the importance of having the right architecture, think about three key components of a modern data lake: the hydrator, the transformer, and the provisioner. Here's why you need them and how they can help you make sense of your architecture:

Hydrator - Typically implemented through Bedrock managed ingestion, this is the architectural component that brings data from the various source systems into the data lake.

To build this component, you will need:

  • Source Connection manager
  • Source Type, Credentials, Owner
  • Data Feed Configuration
    • Feed Name, Type (RDBMS/File/Streaming)
    • Mode - Incremental/Full/CDC
    • Expected Latency
    • Structure information, PII
  • Reusable Scripts / Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
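To make the feed-configuration idea concrete, here is a hedged Python sketch of a hydrator that dispatches each configured feed to the tool named in the list above (Sqoop for RDBMS, Hadoop file APIs for files, Kafka for streaming). The `DataFeed` fields and command templates are illustrative, not Bedrock's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DataFeed:
    name: str
    feed_type: str          # "RDBMS" | "File" | "Streaming"
    mode: str               # "Incremental" | "Full" | "CDC"
    source: str             # JDBC URL, file path, or topic name
    contains_pii: bool = False

def ingestion_command(feed: DataFeed) -> str:
    """Map a feed configuration to a (simplified) ingestion command."""
    if feed.feed_type == "RDBMS":
        cmd = f"sqoop import --connect {feed.source} --table {feed.name}"
        if feed.mode == "Incremental":
            # A real Sqoop incremental import also needs --check-column / --last-value
            cmd += " --incremental append"
        return cmd
    if feed.feed_type == "File":
        return f"hdfs dfs -put {feed.source} /data/lake/raw/{feed.name}"
    if feed.feed_type == "Streaming":
        return f"kafka-console-consumer --topic {feed.source}"
    raise ValueError(f"Unknown feed type: {feed.feed_type}")

orders = DataFeed(name="orders", feed_type="RDBMS",
                  mode="Incremental", source="jdbc:mysql://db/sales")
```

A managed-ingestion platform layers the operational stats (who ran what, when, failures, SLA breaches) on top of exactly this kind of dispatch.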

Transformer - Consider this component of the architecture the next-generation ETL: it cleans bad data, correlates, and creates enriched insights from raw data.

To build this component, you will need:

  • Application Development Platform
    • Built on Hadoop components: Spark, MapReduce, Pig, Hive
    • Abstract and build reusable workflows for common problems
  • Business Rules Integration
    • The application platform should integrate easily with rules provided by the business
    • For example, an insurance company might have several rules for computing a policy discount. The company should be able to change the rules without IT involvement
  • Workflow Scheduling / Management
    • Workflows should have scheduling, dependency management, and logging
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
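The insurance-discount example above can be sketched as rules externalized as data, so the business can edit the rule list (in practice, a config table or rules file) without changing application code. The rule conditions, discount values, and policy fields here are hypothetical:

```python
# Rules live outside the core application logic (e.g., loaded from a
# config table), so the business can change them without IT involvement.
discount_rules = [
    {"when": lambda p: p["years_insured"] >= 10, "discount": 0.15},
    {"when": lambda p: p["claims_last_3y"] == 0,  "discount": 0.10},
    {"when": lambda p: p["multi_policy"],         "discount": 0.05},
]

def policy_discount(policy: dict) -> float:
    """Sum the discounts from all matching rules, capped at 25%."""
    total = sum(r["discount"] for r in discount_rules if r["when"](policy))
    return min(total, 0.25)

loyal_customer = {"years_insured": 12, "claims_last_3y": 0, "multi_policy": False}
# two rules match here: 15% for tenure plus 10% for no recent claims
```

A production rules engine would parse declarative conditions rather than embed Python lambdas, but the separation of rules from code is the point.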

Provisioner - This component is designed to extract data from the data lake and provide it to the consumers.

Below are key design aspects of this component:

  • Destination Connection manager
  • Destination Type, Credentials, Owner
  • Provisioning Metadata
    • Type (RDBMS/File/Streaming)
    • Filters if applicable
    • Mode - Full/Incremental
    • Frequency: daily/hourly/message
  • Reusable Scripts/Workflows on top of core components
    • Hadoop APIs for files
    • Sqoop for RDBMS
    • Kafka, Flume for streaming
  • Operational Stats
    • What, Who, When, Why
    • Failures and Notifications
    • SLA monitoring
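As with the hydrator, the provisioner can be metadata-driven. This hedged Python sketch applies a hypothetical provisioning spec (mode, filters) to lake records before handing them to a destination; the field names and watermark scheme are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProvisioningSpec:
    destination: str                             # e.g., JDBC URL or file path
    dest_type: str                               # "RDBMS" | "File" | "Streaming"
    mode: str = "Full"                           # "Full" | "Incremental"
    frequency: str = "daily"                     # "daily" | "hourly" | "message"
    row_filter: Optional[Callable[[dict], bool]] = None

def provision(records, spec, last_watermark=None):
    """Select the records a consumer should receive, per the spec."""
    out = records
    if spec.mode == "Incremental" and last_watermark is not None:
        # Only rows updated since the last successful extract
        out = [r for r in out if r["updated_at"] > last_watermark]
    if spec.row_filter:
        out = [r for r in out if spec.row_filter(r)]
    return out

lake_records = [
    {"id": 1, "updated_at": "2021-01-01"},
    {"id": 2, "updated_at": "2021-02-01"},
]
incremental = ProvisioningSpec(destination="jdbc:postgresql://dw/mart",
                               dest_type="RDBMS", mode="Incremental")
```

The same spec object would also drive the operational stats: every run records what was provisioned, for whom, when, and whether it met its SLA.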

Benefits of DW Augmentation

Building a modern data lake architecture has a long list of benefits. Zaloni works closely with enterprises to design the architecture for DW offload and implements Bedrock, the industry's only fully integrated Hadoop data management platform, not only to accelerate deployment but also to significantly improve visibility into the data. Learn how to save millions and enable faster time to insight by downloading our DW Augmentation solution brief.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
