I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques.

Why You Make Less Money (job tips for geeks)
Nate Silver Gets Real About Big Data (ReadWrite)
Java StringBuilder myth debunked (Java Code Geeks)
Dew Drop – March 29, 2013 (#1,517) (Alvin Ashcraft's Morning Dew)
Generation Mooch? Why 20-somethings have a hard time paying for content (GigaOM)
Double Shot #1096 (A Fresh Cup)
Connecting Talking with Doing (Conversation Agent)
Games Galore: Building Atari with CreateJS (noupe)
Putting People in Boxes (Architects Zone – Architectural Design Patterns & Best Practices)
Do Code Improvements Add Value? (Architects Zone – Architectural Design Patterns & Best Practices)
Cassandra 1.1 – Reading and Writing from SSTable Perspective (Architects Zone – Architectural Design Patterns & Best Practices)
Couchbase NoSQL at Tunewiki: A Billion Documents and Counting (Architects Zone – Architectural Design Patterns & Best Practices)
The Daily Six Pack: March 29, 2013 (Dirk Strauss)
Using Kanban for Scrum Backlog Grooming (Agile Zone – Software Methodologies for Development Managers)
Humming (xkcd.com)
Amazon Acquires Social Reading Site Goodreads, Which Gives The Company A Social Advantage Over Apple (TechCrunch)

I hope you enjoy today’s items, and please participate in the discussions on those sites.
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques.

Getting Visual: Your Secret Weapon For Storytelling & Persuasion (The Future Buzz)
My Clojure Workflow, Reloaded (Hacker News)
Replacing Clever Code with Unremarkable Code in Go (Hacker News)
Unit Test like a Secret Agent with Sinon.js (Web Dev .NET)
Bliki: EmbeddedDocument (Martin Fowler)
How we use ZFS to back up 5TB of MySQL data every day (Royal Pingdom)
IBM to acquire Softlayer for a rumored $2-2.5 billion (Hacker News)
Cloud SQL API: YOU get a database! And YOU get a database! And YOU get a database! (Cloud Platform Blog)
You Should Write Ugly Code (Hacker News)
How many lights can you turn on? (The Endeavour)
Python Big Picture — What's the "roadmap"? (S.Lott-Software Architect)
Salesforce announces deal to buy digital marketing firm ExactTarget for $2.5 billion (The Next Web)
Dew Drop – June 4, 2013 (#1,560) (Alvin Ashcraft's Morning Dew)
New Technologies Change the Way we Engage with Culture (Conversation Agent)
Free Python ebook: Bayesian Methods for Hackers (Hacker News)
How Go uses Go to build itself (Hacker News)
Sustainable Automated Testing (Javalobby – The heart of the Java developer community)
Breaking Down IBM’s Definition of DevOps (Javalobby – The heart of the Java developer community)
Big Data is More than Correlation and Causality (Javalobby – The heart of the Java developer community)
So, What’s in a Story? (Agile Zone – Software Methodologies for Development Managers)
The Real Lessons of Lego (for Software) (Agile Zone – Software Methodologies for Development Managers)
The Daily Six Pack: June 4, 2013 (Dirk Strauss)
Get your mobile application backed by the cloud with the Mobile Backend Starter (Cloud Platform Blog)
Open for Big Data: When Mule Meets the Elephant (Javalobby – The heart of the Java developer community)

I hope you enjoy today’s items, and please participate in the discussions on those sites.
See also:
Part I: When to Build Your Data Warehouse
Part II: Building a New Schema
Part III: Location of Your Data Warehouse
Part IV: Extraction, Transformation, and Load

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. In Part IV, we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even applications such as web sites, off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled-up table than against the OLTP tables.

To demonstrate, I will make a quick pivot table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have shown how to use PowerPivot before on this blog; however, I was usually going against a SQL Server table, a SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a PowerPivot against the OLTP data of Northwind, we will use our new rolled-up fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the time, employee, and product dimension tables as well as the fact table. Once the data is loaded into PowerPivot, I am going to launch a new PivotTable. PowerPivot understands the relationships between the dimension and fact tables and places the tables in the designer shown below.

I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive pivot table. For rows I will choose the category and product hierarchy and sum on the total sales. I’ll make the columns (or pivot on this field) the month from the time dimension to get a sum of sales by category/product by month. I will also drag year and quarter into my vertical and horizontal slicers for interactive filtering. Lastly, I will place the employee field in the report filter pane, giving the user the ability to filter by employee. The results look like this: I am dynamically filtering by 1997, third quarter, and employee name Janet Leverling.

This is a pretty powerful interactive report built in PowerPivot using the four data warehouse tables. If there were no data warehouse, this pivot table would have been very hard for an end user to build. Either they or a developer would have to perform joins to get the category and product hierarchy, as well as more joins to get the order details and the sum of the sales. In addition, the breakout and dynamic filtering by year and quarter, and the display by month, are only possible because of the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those date parts. Just about the only thing the end user could have done without assistance from a developer or a sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join).

Of course, pivot tables are not the only thing you can create from the data warehouse tables. You can create reports, ad hoc query builders, web pages, and even an Amazon-style browse application. (Amazon uses its data warehouse to display inventory and OLTP to take your order.) I hope you have enjoyed this series. Enjoy your data warehousing.
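For readers who want to see roughly what a query behind a pivot like the one above looks like, here is a minimal sketch in Python using the built-in sqlite3 module rather than SQL Server. The table and column names (DimTime, DimProduct, DimEmployee, FactSales and their keys) and the sample rows are assumptions made up for illustration; they are not the actual schema from Part II.

```python
import sqlite3

# Hypothetical star schema: names and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimTime (TimeKey INTEGER PRIMARY KEY, Year INTEGER, Quarter INTEGER, Month INTEGER);
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, Category TEXT, Product TEXT);
CREATE TABLE DimEmployee (EmployeeKey INTEGER PRIMARY KEY, EmployeeName TEXT);
CREATE TABLE FactSales (TimeKey INTEGER, ProductKey INTEGER, EmployeeKey INTEGER, TotalSales REAL);
""")

with conn:  # commit the made-up sample data
    conn.executemany("INSERT INTO DimTime VALUES (?, ?, ?, ?)",
                     [(1, 1997, 3, 7), (2, 1997, 3, 8), (3, 1997, 4, 10)])
    conn.executemany("INSERT INTO DimProduct VALUES (?, ?, ?)",
                     [(1, "Beverages", "Chai"), (2, "Condiments", "Aniseed Syrup")])
    conn.executemany("INSERT INTO DimEmployee VALUES (?, ?)",
                     [(1, "Janet Leverling"), (2, "Nancy Davolio")])
    conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?)",
                     [(1, 1, 1, 100.0), (1, 1, 1, 50.0), (2, 1, 1, 25.0),
                      (2, 2, 1, 10.0), (3, 1, 1, 999.0), (1, 1, 2, 70.0)])


def sales_by_month(conn, year, quarter, employee):
    """Sum sales by category/product/month, filtered like the slicers and report filter."""
    sql = """
        SELECT p.Category, p.Product, t.Month, SUM(f.TotalSales) AS Sales
        FROM FactSales AS f
        JOIN DimTime AS t ON t.TimeKey = f.TimeKey
        JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
        JOIN DimEmployee AS e ON e.EmployeeKey = f.EmployeeKey
        WHERE t.Year = ? AND t.Quarter = ? AND e.EmployeeName = ?
        GROUP BY p.Category, p.Product, t.Month
        ORDER BY p.Category, p.Product, t.Month
    """
    return conn.execute(sql, (year, quarter, employee)).fetchall()


rows = sales_by_month(conn, 1997, 3, "Janet Leverling")
for category, product, month, sales in rows:
    print(category, product, month, sales)
```

The year, quarter, and month come straight from columns on the time dimension, so nothing has to parse date parts out of an order date. Against the OLTP tables, the same result would need the extra joins and date handling described above.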
In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your data warehouse tables.

Let’s look at the location of your data warehouse. Usually, as your system matures, it follows this pattern:

Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
Moving the data warehouse tables to their own physical database
Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, to keep things simple you can put your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this in a variety of ways; the easiest is to use a database schema (like dbo). I usually use dwh as the schema. This way it is easy for your application to access these tables as well as fill them and keep them in sync. The advantage of this is that your data warehouse and OLTP system are self-contained and it is easy to keep the systems in sync.

As your data warehouse grows, you may want to isolate your data warehouse further and move it to its own database. This will add a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings some benefits that make the move worth it. The benefits include implementing a separate security scheme. This is also very helpful if your OLTP database's security scheme locks down all of the tables and will not allow SELECT access, and you don’t want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, so that your data warehouse tables, which tend to be larger, do not slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read only. Lastly, while minor, you will have less table clutter, making the database easier to work with.

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you have less I/O contention on the database server with the OLTP system. Depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition, you can consider different RAID array techniques for the OLTP and data warehouse servers (a write-heavy OLTP system generally does better on RAID 1 or RAID 10, while a read-heavy data warehouse can work well on RAID 5).

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let’s say that you have an OLTP system that works worldwide but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don’t have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important. I suggest that you start small and work your way up as your needs dictate.
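To make the first two stages a little more concrete, here is a minimal sketch in Python with the built-in sqlite3 module. It is only a loose analogy: in SQL Server you would use CREATE SCHEMA to create dwh inside the OLTP database, whereas SQLite has no schemas, so the sketch attaches a second database under the name dwh to get the same dwh.TableName style of access. The Orders and FactSales tables, their columns, and the load_fact_sales helper are all hypothetical.

```python
import sqlite3

# The main connection stands in for the OLTP database.
conn = sqlite3.connect(":memory:")

# Attach a second database under the name "dwh". Swapping ':memory:' for a
# file path would put the warehouse tables in their own physical database.
conn.execute("ATTACH DATABASE ':memory:' AS dwh")

# Hypothetical OLTP table and warehouse fact table.
conn.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ProductID INTEGER, Amount REAL)")
conn.execute("CREATE TABLE dwh.FactSales (ProductID INTEGER PRIMARY KEY, TotalSales REAL)")

with conn:  # commit the made-up sample orders
    conn.executemany("INSERT INTO Orders (ProductID, Amount) VALUES (?, ?)",
                     [(1, 10.0), (1, 5.0), (2, 7.5)])


def load_fact_sales(conn):
    """Rebuild dwh.FactSales from Orders (a simple full reload, not an incremental sync)."""
    with conn:  # delete and insert run in one transaction
        conn.execute("DELETE FROM dwh.FactSales")
        conn.execute("""
            INSERT INTO dwh.FactSales (ProductID, TotalSales)
            SELECT ProductID, SUM(Amount) FROM Orders GROUP BY ProductID
        """)


load_fact_sales(conn)
totals = conn.execute(
    "SELECT ProductID, TotalSales FROM dwh.FactSales ORDER BY ProductID"
).fetchall()
print(totals)
```

Because the load and the reporting queries only ever refer to dwh.FactSales, the place where those tables physically live can change later without touching every query, which is the reason I suggest starting with a separate schema.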