Let's Build a Data Warehouse Part II: Creation of Databases and Tables; Initial Load ETL



Source code for this project can be found in my Git.

Organization of Part II:

Introduction

In the last section, which was fairly detailed, we went over the first half of the waterfall methodology, which we use here as a framework. We went over the feasbility study, the requirements, architecture, and design of the warehouse. In this section, much of the rubber meets the road as we build what was designed in the last section. While we won't go through every piece of code here and throughout this four-part case study, I will try to share the most important aspects, both here and moving forward. The link to the full source can be found in the link above. In the actual implementation, we use a batch file and psql to create the databases and tables at one time.

Build Stage DB and Tables

Stage database is pretty basic. Not a whole lot going on here, since it is just a temporary "resting" spot for our data, in-between the OLTP and DDS.

            
            use master
            GO
            
            if db_id ('StageAZRetail') is not null
            drop database StageAZRetail;
            GO
            
            create database StageAZRetail
            collate SQL_Latin1_General_CP1_CI_AS
            GO
            
            alter database StageAZRetail  set recovery simple 
            GO
            alter database StageAZRetail set auto_shrink off
            GO
            alter database StageAZRetail  set auto_create_statistics off
            GO
            alter database StageAZRetail  set auto_update_statistics off
            GO
        
       

Likewise, the tables in stage are not very interesting. In the DDS, we will combine the two tables in Stage into FactOrders.

           
            USE StageAZRetail
            GO
            
            IF EXISTS 
                (SELECT * FROM sys.tables 
                WHERE NAME = 'StageOrderDetails')
            DROP TABLE StageOrderDetails
            GO
            
            CREATE TABLE StageOrderDetails (
                OrderID INT NOT NULL
                ,[LINENO] SMALLINT NOT NULL
                ,ProductCode INT NOT NUll
                ,UnitPrice SMALLMONEY NOT NUll
                ,UnitCost SMALLMONEY NOT NULL
                ,QuantitySold int NOT NULL  
            )
           
       

Build DDS DB and Tables

The DDS tables and database are much more interesting. After the code is shown for FactOrders, and the DWH, a discussion of the design will follow. There is also an appertaining view created for each dimension and fact table. The view is an important layer that sits between the warehouse and OLAP. These views will allow us to both take advantage of the infrastructure of the DWH, and depart from it.

           
            use master
            GO
            
            if db_id ('AZRetailDWH') is not null
            drop database AZRetailDWH
            GO
            
            CREATE DATABASE AZRetailDWH
            COLLATE SQL_Latin1_General_CP1_CI_AS
            GO
            
            USE AZRetailDWH
            GO
            
            alter database AZRetailDWH SET recovery simple 
            GO
            alter database AZRetailDWH SET auto_shrink off
            GO
            alter database AZRetailDWH SET auto_create_statistics on
            GO
            alter database AZRetailDWH SET auto_update_statistics on
            GO
           
       
           
            USE [AZRetailDWH]
            GO
            
            IF EXISTS
                (SELECT * FROM sys.TABLES 
                WHERE name = 'FactOrders')
            DROP TABLE FactOrders
            GO
            
            CREATE TABLE FactOrders(
                CustomerKey INT NOT NULL	
                ,SalesDateKey INT NOT NULL	
                ,ShipDateKey INT 
                ,StoreKey INT NOT NULL
                ,ProductKey INT NOT NULL
                ,OrderID INT NOT NULL
                ,LineNumber SMALLINT NOT NULL	
                ,LineSale MONEY NOT NULL
                ,LineProfit MONEY NOT NULL
                ,Quantity INT NOT NULL	
                ,IsDeleted BIT NOT NULL DEFAULT 0
                
                ,CONSTRAINT pk_FactOrders
                PRIMARY KEY CLUSTERED (CustomerKey, SalesDateKey, StoreKey, ProductKey, OrderID, LineNumber)
                
                ,Constraint fk_FactOrders_StoreKey
                FOREIGN key (StoreKey)
                REFERENCES DimStore(StoreKey)
        
                ,Constraint fk_FactOrders_CustomerKey
                FOREIGN key (CustomerKey)
                REFERENCES DimCustomer(CustomerKey)
                
                ,Constraint fk_FactOrders_ProductKey
                FOREIGN key (ProductKey)
                REFERENCES DimProduct(ProductKey)
        
                ,Constraint fk_FactOrders_SalesDateKey
                FOREIGN key (SalesDateKey)
                REFERENCES DimTime(TimeKey)
                
                ,Constraint fk_FactOrders_ShipDateKey
                FOREIGN key (ShipDateKey)
                REFERENCES DimTime(TimeKey)
                    
            )
            
            GO
            
            CREATE COLUMNSTORE INDEX idx_fact_orders
            
            ON FactOrders (CustomerKey, SalesDateKey, StoreKey, ProductKey, OrderID, LineNumber)
                
            GO
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_CustomerKey
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_ProductKey
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_SalesDateKey
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_ShipDateKey
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_StoreKey
            GO
            
            ALTER TABLE dbo.FactOrders
            NOCHECK CONSTRAINT fk_FactOrders_ProductKey
            GO
                
                                
           
       

The fact table is where everything happens, so it is no surprise that the script to create this important table is relatively long. Usually, you use the foreign keys from the dimension tables in the fact table. Here, we create a composite primary key on non-null dimension keys, and on OrderID and LineNumber. We turn off the foreign key check contraint, because this will make the performance of the warehouse slower.

Unique to warehousing in MSQL and fact tables in particular are columnstore indexes. These indexes work by searching the column instead of the row. Performance of this index is often better than using traditional clustered indexes on a fact table.

It is important that changes to orders are tracked. The incremental loading techniques will update the appropriate field. But what if an order is deleted in the OLTP database? The stake-holders obviously want to track this, and this is where the IsDeleted column comes in. As we will see in the next section, the incremental load ETL will mark deleted orders with a 1.

Finally, we also create an AZRetailDWHLog table to track SSIS incremental updates, and a stored procedure in AZRetaiLDWH to update DimTime every time incremental loading is performed.

Initial ETL for Stage

The initial load for the stage is pretty straight-forward; we simply copy Orders, and OrderDetails to StageOrder and StageOrderDetails. We directly copy Customers to DimCustomers. However, we need to mark the initial load for these three tables in the OLTP, since we will be incremental loading (CDC) the warehouse based on changes to these tables and to the customer table. The loading of Stage occurs before the DWH load. AZRetail adds 5-10 new customers a day, so we really don't need a Stage table for Customer. Thus, all ETL will be off Customer directly.



Intitial ETL for DDS

The first major thing that needs to be done when loading FactOrders is to lookup the keys for Customer, Location, Time, Product, and Store. However, in order to get the time key for order and delivery date, we need to transform the datetime field from orders into the format: MM/DD/YY, so we can lookup the timekey based on the full date. We need to join the Order and OrderDetails from Stage. Then, we can look up each product key.

Finally, we find the total sales and total profit for each line. Then we can load the data into the warehouse.



Conclusion

In this section we looked at the implementation of what was designed in part I. We started with running the scripts that build the Stage database and tables, and then moved on to creating the DWH database and tables. We then turned our attention to the ETL processes that fill first the Stage database and then finally the warehouse. In the third part of the project, we define the incremental loading methods CDC, SCD and MERGE and then test the loaded data/incremental loading on newly-added data and changes made to the master tables.