DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
  1. DZone
  2. Data Engineering
  3. Big Data
  4. How to Build a Data-Warehouse in 4 weeks, Part 1

How to Build a Data-Warehouse in 4 weeks, Part 1

Victor Savkin user avatar by
Victor Savkin
·
Aug. 24, 11 · Interview
Like (0)
Save
Tweet
Share
16.66K Views

Join the DZone community and get the full member experience.

Join For Free

First of all, there is no reason to be afraid. Not only big corporations having terabytes of data can use data-warehouses. If you have a complex domain model and a few databases with millions of records here and there it might be useful to create a separate database instance for performing analytical queries and generating reports. Also, don’t believe people who say that you will need years to build a simple DW. You don’t need more than 4 weeks to develop DW that will bring some value to your business.

Architecture

It’s very important to think about the architecture of your DW as it will define tools and procedures that will be used. To make it simple: ‘Inmon VS Kimball‘ is the question you need to answer:

  • In Inmon’s paradigm all information is stored in 3rd normal form. Basically, the schema of your DW will be similar (more or less) to the schema of your transactional database. Inmon’s paradigm is good for creating large centralized data warehouses that will serve the whole enterprise. It’s a top down approach that requires more upfront design. The performance in this model isn’t as good as in Kimball’s model because data is normalized.

  • In Kimball’s paradigm you start building several databases (known as data marts) that will serve individual departments of your business. In this model DW is a combination of all data marts. The dimensional model, that is used to store data, simplifies the schema significantly and, as a result, gives you much better performance. As for me I prefer Kimball’s paradigm as you don’t have to design the whole DW upfront. You can start with creating a single data mart and it can be done by one person in a month.

Read about Kimball’s model

The best thing you can do is to buy this book “The Data Warehouse Toolkit, 2nd Edition: The Complete Guide to Dimensional Modeling”. It’s a good overview of the dimension model, which will help you if you don’t have any experience in building a DW. It helped me a lot. Highly recommend it to everyone interested in the topic.

 

Database

Choosing a DBMS for your DW is a very important step. And it’s not an easy decision to make. There are two options you can choose from. If your dataset is rather small you can choose a familiar row-oriented database (such as Postgres). Just having data denormalized will improve performance. But if your database is huge it’s time to put on big boy’s pants and look at column-oriented storages:

  • Row-oriented database are stable, reliable and familiar to everyone. They are perfect for transactional operations but not so good for analytical queries.
  • Column-oriented databases are not very mature. As a result, they are hard to administrate, hard to find any documentation about, and they are not as reliable as row-oriented databases. It isn’t surprising taking into account the fact that these databases have been developing only for a few years. But they can give you something that MySQL can’t – performance. From my experience, they are AT LEAST 10 TIMES FASTER for complex analytical queries.

My advice would be to load some data into your favorite row-oriented database and check the performance. And ONLY if you are not happy with it go and try MonetDB, LucidDB, Infobright or some other column-oriented DBMS.

 

Column-Oriented DBMS

If you decided to go with a column-oriented database I can give you some advice about two databases I had experience with:

Infobright

The biggest benefit you’ll get if you choose to use Infobright is that fact that it’s a mysql engine. You can use your favorite driver to connect to it (so you can use any language, any platform) and you won’t have to learn a new dialect of SQL. It’s a very solid product but the free version has severe limitations - you can use only bulk upload. No inserts, updates or deletes.

LucidDB

The distribution includes it’s own jdbc driver. If you are a ruby or a python developer you are out of luck. In addition, their site doesn’t contain any decent documentation. It’s just a collection of reference-like wiki page, most of which contain only a brief description of some crazy parameters such as processorMaxBytes or databaseShadowLogIncrementSize. I am 100% sure that most of all luciddb users will never change them. On the other hand, the free version of LucidDB doesn’t have any limitations.

 

From http://victorsavkin.com/post/9209987806/how-to-build-a-data-warehouse-in-4-weeks-part-1

Database Data warehouse Build (game engine)

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Choosing the Right Framework for Your Project
  • Fixing Bottlenecks in Your Microservices App Flows
  • Understanding and Solving the AWS Lambda Cold Start Problem
  • 3 Main Pillars in ReactJS

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: