Let's Build a Data Warehouse Part II: Creation of Databases and Tables; Initial Load ETL
Source code for this project can be found in my Git.
Organization of Part II:
In the last section, which was fairly detailed, we went over the first half of the waterfall methodology, which we use here as a framework. We went over the feasbility study, the requirements, architecture, and design of the warehouse. In this section, much of the rubber meets the road as we build what was designed in the last section. While we won't go through every piece of code here and throughout this four-part case study, I will try to share the most important aspects, both here and moving forward. The link to the full source can be found in the link above. In the actual implementation, we use a batch file and psql to create the databases and tables at one time.
Build Stage DB and Tables
Stage database is pretty basic. Not a whole lot going on here, since it is just a temporary "resting" spot for our data, in-between the OLTP and DDS.
use master GO if db_id ('StageAZRetail') is not null drop database StageAZRetail; GO create database StageAZRetail collate SQL_Latin1_General_CP1_CI_AS GO alter database StageAZRetail set recovery simple GO alter database StageAZRetail set auto_shrink off GO alter database StageAZRetail set auto_create_statistics off GO alter database StageAZRetail set auto_update_statistics off GO
Likewise, the tables in stage are not very interesting. In the DDS, we will combine the two tables in Stage into FactOrders.
USE StageAZRetail GO IF EXISTS (SELECT * FROM sys.tables WHERE NAME = 'StageOrderDetails') DROP TABLE StageOrderDetails GO CREATE TABLE StageOrderDetails ( OrderID INT NOT NULL ,[LINENO] SMALLINT NOT NULL ,ProductCode INT NOT NUll ,UnitPrice SMALLMONEY NOT NUll ,UnitCost SMALLMONEY NOT NULL ,QuantitySold int NOT NULL )
Build DDS DB and Tables
The DDS tables and database are much more interesting. After the code is shown for FactOrders, and the DWH, a discussion of the design will follow. There is also an appertaining view created for each dimension and fact table. The view is an important layer that sits between the warehouse and OLAP. These views will allow us to both take advantage of the infrastructure of the DWH, and depart from it.
use master GO if db_id ('AZRetailDWH') is not null drop database AZRetailDWH GO CREATE DATABASE AZRetailDWH COLLATE SQL_Latin1_General_CP1_CI_AS GO USE AZRetailDWH GO alter database AZRetailDWH SET recovery simple GO alter database AZRetailDWH SET auto_shrink off GO alter database AZRetailDWH SET auto_create_statistics on GO alter database AZRetailDWH SET auto_update_statistics on GO
USE [AZRetailDWH] GO IF EXISTS (SELECT * FROM sys.TABLES WHERE name = 'FactOrders') DROP TABLE FactOrders GO CREATE TABLE FactOrders( CustomerKey INT NOT NULL ,SalesDateKey INT NOT NULL ,ShipDateKey INT ,StoreKey INT NOT NULL ,ProductKey INT NOT NULL ,OrderID INT NOT NULL ,LineNumber SMALLINT NOT NULL ,LineSale MONEY NOT NULL ,LineProfit MONEY NOT NULL ,Quantity INT NOT NULL ,IsDeleted BIT NOT NULL DEFAULT 0 ,CONSTRAINT pk_FactOrders PRIMARY KEY CLUSTERED (CustomerKey, SalesDateKey, StoreKey, ProductKey, OrderID, LineNumber) ,Constraint fk_FactOrders_StoreKey FOREIGN key (StoreKey) REFERENCES DimStore(StoreKey) ,Constraint fk_FactOrders_CustomerKey FOREIGN key (CustomerKey) REFERENCES DimCustomer(CustomerKey) ,Constraint fk_FactOrders_ProductKey FOREIGN key (ProductKey) REFERENCES DimProduct(ProductKey) ,Constraint fk_FactOrders_SalesDateKey FOREIGN key (SalesDateKey) REFERENCES DimTime(TimeKey) ,Constraint fk_FactOrders_ShipDateKey FOREIGN key (ShipDateKey) REFERENCES DimTime(TimeKey) ) GO CREATE COLUMNSTORE INDEX idx_fact_orders ON FactOrders (CustomerKey, SalesDateKey, StoreKey, ProductKey, OrderID, LineNumber) GO ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_CustomerKey ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_ProductKey ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_SalesDateKey ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_ShipDateKey ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_StoreKey GO ALTER TABLE dbo.FactOrders NOCHECK CONSTRAINT fk_FactOrders_ProductKey GO
The fact table is where everything happens, so it is no surprise that the script to create this important table is relatively long. Usually, you use the foreign keys from the dimension tables in the fact table. Here, we create a composite primary key on non-null dimension keys, and on OrderID and LineNumber. We turn off the foreign key check contraint, because this will make the performance of the warehouse slower.
Unique to warehousing in MSQL and fact tables in particular are columnstore indexes. These indexes work by searching the column instead of the row. Performance of this index is often better than using traditional clustered indexes on a fact table.
It is important that changes to orders are tracked. The incremental loading techniques will update the appropriate field. But what if an order is deleted in the OLTP database? The stake-holders obviously want to track this, and this is where the IsDeleted column comes in. As we will see in the next section, the incremental load ETL will mark deleted orders with a 1.
Finally, we also create an AZRetailDWHLog table to track SSIS incremental updates, and a stored procedure in AZRetaiLDWH to update DimTime every time incremental loading is performed.
Initial ETL for Stage
The initial load for the stage is pretty straight-forward; we simply copy Orders, and OrderDetails to StageOrder and StageOrderDetails. We directly copy Customers to DimCustomers. However, we need to mark the initial load for these three tables in the OLTP, since we will be incremental loading (CDC) the warehouse based on changes to these tables and to the customer table. The loading of Stage occurs before the DWH load. AZRetail adds 5-10 new customers a day, so we really don't need a Stage table for Customer. Thus, all ETL will be off Customer directly.
Intitial ETL for DDS
The first major thing that needs to be done when loading FactOrders is to lookup the keys for Customer, Location, Time, Product, and Store. However, in order to get the time key for order and delivery date, we need to transform the datetime field from orders into the format: MM/DD/YY, so we can lookup the timekey based on the full date. We need to join the Order and OrderDetails from Stage. Then, we can look up each product key.
Finally, we find the total sales and total profit for each line. Then we can load the data into the warehouse.
In this section we looked at the implementation of what was designed in part I. We started with running the scripts that build the Stage database and tables, and then moved on to creating the DWH database and tables. We then turned our attention to the ETL processes that fill first the Stage database and then finally the warehouse. In the third part of the project, we define the incremental loading methods CDC, SCD and MERGE and then test the loaded data/incremental loading on newly-added data and changes made to the master tables.