Let's Build a Data Warehouse Part III: Incremental Loading ETL
Source code for this project can be found in my Git repository.
Organization of Part III:
- Add New Data to OLAP
- Incremental Loading CDC Tables
- Incremental Loading SCD Tables
- Test Incremental Load
In the last section, we filled our warehouse with data that is already in the OLTP database. But what happens when new customers, orders, or locations are added to the AZRetail OLTP? What happens when a store name changes, or a product price changes? One of the major challenges in implementing a data warehouse is figuring out what changes in the source OLTP to capture, the methods to capture them, and how to add new data without duplication.
Generally, in Microsoft SQL Server we have three options for tracking changes into the DWH: Microsoft's implementation of SCDs (slowly changing dimensions) in SSIS, and, specific to Microsoft SQL Server, CDC and the MERGE statement. Strictly speaking, MERGE is not a method to track changes, but it can perform a similar function by comparing source and destination tables and then updating, inserting, or deleting rows depending on the case. Which of these we use, and how, comes down to what changes we want to track and what history we want to keep. A good rule of thumb: master tables like Product and Store are tracked through SCD, while transaction tables are tracked through CDC, the MERGE statement, or both.
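To make the MERGE option concrete, here is a generic sketch of the compare-and-act pattern. The table and column names here are illustrative only, not from AZRetail:

```sql
-- Generic MERGE pattern: compare source to destination and act per case
-- (dbo.DestinationTable / dbo.SourceTable are placeholder names)
MERGE dbo.DestinationTable AS TARGET
USING dbo.SourceTable AS SOURCE
    ON (TARGET.Id = SOURCE.Id)
WHEN MATCHED AND TARGET.Name <> SOURCE.Name
    THEN UPDATE SET TARGET.Name = SOURCE.Name               -- changed rows
WHEN NOT MATCHED BY TARGET
    THEN INSERT (Id, Name) VALUES (SOURCE.Id, SOURCE.Name)  -- new rows
WHEN NOT MATCHED BY SOURCE
    THEN DELETE;                                            -- rows removed from source
```

We will see a real (and narrower) use of this pattern later, where instead of deleting we only flag rows as deleted.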
As mentioned in the functional requirements, AZRetail wants to track changes made to the master tables Store, Location, and Product. Changes should show a start and end date. Customer could be an SCD, but if you are adding customers every day, it doesn't make sense; if AZRetail only added customers every month or so, then SCD might make sense. For Customer, Orders, and OrderDetails, we enabled CDC when we set up the databases and tables, so we will use CDC.
CDC (Change Data Capture) works on the OLTP database: if a change is made to a tracked table, it is recorded in that table's change table. In our case, Customer, Orders, and OrderDetails each have a $tracked_table_name_CDC_State table that records the state of the OLTP table it tracks. In the previous part of the case study, we marked the initial load, and below we see the status of the CDC_State table for Customer as ILEND, which shows the table just passed through its initial load.
When we run the incremental load, as we will go through below, we will see the status change to TFEND, or Trickle Flow End.
The real work in CDC is done in a system table within AZRetailOLTP. A cdc.dbo_Customer_CT table (the change table) exists for each tracked table and captures all changes to the columns enabled in CDC. All of this will make sense as we go through the code below.
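To see what a change table holds, you can query it directly. The snippet below is a sketch against the Customer change table; the `__$operation` codes are standard SQL Server CDC, but the non-metadata columns depend on your tracked table's definition:

```sql
USE AZRetailOLTP
GO

-- Peek at the raw change rows for Customer.
-- __$operation: 1 = delete, 2 = insert, 3 = update (before), 4 = update (after)
SELECT  __$start_lsn,
        __$operation,
        CASE __$operation
            WHEN 1 THEN 'DELETE'
            WHEN 2 THEN 'INSERT'
            WHEN 3 THEN 'UPDATE (before)'
            WHEN 4 THEN 'UPDATE (after)'
        END AS Operation,
        *                          -- remaining columns mirror dbo.Customer
FROM    cdc.dbo_Customer_CT
ORDER BY __$start_lsn;
```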
Add New Data to OLAP
In order to test incremental loading, we need to load some new data. In the AZRetailDWH folder of my Git repository, there is a SQL file called "change_static_tables". Running this file renames a store, renames a product, changes a product category, adds a new location (Tucson), adds a new customer (me), adds a new order with its order details, and deletes an order.
```sql
USE AZRetailOLTP
GO

-- Change Store 115's name
UPDATE dbo.Store
SET description = 'PHOENIX SUR'
WHERE StoreID = 115

-- Change Product with Code 5's name to Monkey Wrench
UPDATE dbo.Product
SET description = 'Monkey Wrench'
WHERE ProductCode = 5

-- Change Product Category 3V's description
UPDATE dbo.ProductCategory
SET description = 'TOOLS'
WHERE Category = '3V'

-- Add new address
INSERT INTO dbo.Address (City, State)
VALUES ('Tucson',   -- City  - varchar(50)
        'Arizona')  -- State - varchar(50)

-- Add new customer
INSERT INTO dbo.Customer (FirstName, LastName, Addressnumber)
VALUES ('John',     -- FirstName - varchar(50)
        'Bellamy',  -- LastName  - varchar(50)
        1001)       -- Addressnumber - int

-- Add new Order
INSERT INTO dbo.Orders (CustomerNumber, OrderDate, ShipDate, DateLastContact, StoreNumber)
VALUES (2001,       -- CustomerNumber - int
        GETDATE(),  -- OrderDate - datetime
        NULL,       -- ShipDate - datetime
        GETDATE(),  -- DateLastContact - datetime
        115)        -- StoreNumber - int

-- Update OrderDetails with the new order
INSERT INTO dbo.OrderDetails (OrderID, [LINENO], ProductNumber, Price, Cost, Quantity)
VALUES (72905,      -- OrderID - int
        1,          -- LINENO - smallint
        5,          -- ProductNumber - int
        533.6552,   -- Price - smallmoney
        444.7127,   -- Cost - smallmoney
        5)          -- Quantity - int

-- Finally, delete Order with ID = 2 (details first, then the order)
DELETE FROM dbo.OrderDetails WHERE OrderID = 2
DELETE FROM dbo.Orders WHERE OrderID = 2
```
Before I go over the details of how CDC is implemented for Orders and OrderDetails, let's look at cdc.dbo_OrderDetails_CT after adding a new order and deleting order 2. We can clearly see our new order number (72905) and the deleted rows from OrderDetails below.
Incremental Loading CDC Tables
Below, we use CDC as a data source to load all changes into StageOrders and StageOrderDetails. Because our fact table is not equivalent to any single table in OLTP (it's a combination of two tables, after all), we can't use the CDC Splitter directly. However, we can still get all changes from CDC. Then, when we load the fact table, we determine whether an order is new by looking it up in FactOrders: if it's not there, we insert; if it is there, we update. For the remaining case, an order that has been deleted from OLTP but still exists in the DWH, we need MERGE. Below is the ETL for each case.
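Outside of SSIS, the lookup-then-insert-or-update logic amounts to roughly the following T-SQL. This is a sketch, not the package itself: the StageOrders column list and the OrderID + LINENO match key are assumptions based on the schema described so far.

```sql
-- Update fact rows that already exist (matched on OrderID + LINENO)
UPDATE  f
SET     f.Quantity = s.Quantity,
        f.Price    = s.Price,
        f.Cost     = s.Cost
FROM    FactOrders f
JOIN    StageOrders s
        ON  s.OrderID  = f.OrderID
        AND s.[LINENO] = f.[LINENO];

-- Insert fact rows that are not yet in the warehouse
INSERT INTO FactOrders (OrderID, [LINENO], CustomerKey, StoreKey, ProductKey,
                        Quantity, Price, Cost)
SELECT  s.OrderID, s.[LINENO], s.CustomerKey, s.StoreKey, s.ProductKey,
        s.Quantity, s.Price, s.Cost
FROM    StageOrders s
WHERE   NOT EXISTS (SELECT 1
                    FROM   FactOrders f
                    WHERE  f.OrderID  = s.OrderID
                      AND  f.[LINENO] = s.[LINENO]);
```

In the package itself, this decision is made by the Lookup transformation's Match and No Match outputs rather than by hand-written SQL.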
The large image shows the ETL for incremental loading of FactOrders; what is inside its data flow task is shown directly above. After this Data Flow task ends, the MERGE statement runs in the Execute SQL control flow task it is connected to. This is where we update the IsDeleted column. Another interesting aspect of this ETL is that we have NULL ShipDateKeys, and we don't want to lose those records to the No Match Lookup output. So, we Union All the Matches and No Matches; kinda odd :)
```sql
-- FactOrders is the MERGE target; OLTP Orders is the source.
-- An order present in the warehouse but missing from OLTP was deleted,
-- so we flag it rather than remove it.
MERGE FactOrders AS TARGET
USING [AZRetailOLTP].[dbo].[Orders] AS SOURCE
    ON (TARGET.OrderID = SOURCE.OrderID)
WHEN NOT MATCHED BY SOURCE
    THEN UPDATE SET IsDeleted = 1;
```
The CDC for DimCustomer is fairly straightforward; here we use the CDC Splitter directly. Like FactOrders, we want to log when a customer is deleted, using MERGE as we did before.
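The corresponding statement would look something like the following; the IsDeleted column on DimCustomer and the CustomerID join key are assumptions that mirror the FactOrders MERGE above.

```sql
-- Flag customers removed from OLTP as deleted in the warehouse
-- (sketch mirroring the FactOrders MERGE; column names are assumptions)
MERGE DimCustomer AS TARGET
USING [AZRetailOLTP].[dbo].[Customer] AS SOURCE
    ON (TARGET.CustomerID = SOURCE.CustomerID)
WHEN NOT MATCHED BY SOURCE
    THEN UPDATE SET IsDeleted = 1;
```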
Finally, in both applications of CDC, we need to update the CDC state table to record how far processing has gone. We do this by defining a processing range.
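Under the hood, a processing range is just a pair of log sequence numbers (LSNs). Outside SSIS, you can read a range of changes with the table-valued functions SQL Server generates for each capture instance; the snippet below is a sketch for the Customer capture instance.

```sql
USE AZRetailOLTP
GO

-- A processing range is a pair of LSNs: from the oldest available change
-- for this capture instance to the newest change in the log
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customer');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- Read every change to Customer within the range
SELECT  *
FROM    cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');
```

The CDC Control Task does the equivalent bookkeeping for us: it gets the processing range before the data flow runs and marks it processed afterwards, persisting the result in the CDC state table.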
Incremental Loading SCD Tables
SCD for the master tables is very straightforward. The SSIS-generated process flow below is very similar for DimProduct, DimStore, and DimLocation. With SCD, if we are tracking historical changes (SCD Type II), we can set a start and end date. These show when each version of a row was current, so we can get historical details later if needed.
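What the SCD transformation does for a Type II change can be expressed in T-SQL roughly as follows. This is a sketch: the StartDate and EndDate column names are assumptions, and in the package this logic is generated by the SCD Wizard rather than written by hand.

```sql
-- SCD Type II for the renamed store (sketch; column names are assumptions)

-- 1. Expire the current row
UPDATE  DimStore
SET     EndDate = GETDATE()
WHERE   StoreID = 115
  AND   EndDate IS NULL;        -- a NULL EndDate marks the current version

-- 2. Insert the new version with a fresh start date
INSERT INTO DimStore (StoreID, description, StartDate, EndDate)
VALUES (115, 'PHOENIX SUR', GETDATE(), NULL);
```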
Test Incremental Load
After running the incremental load package, we can query the database to make sure our updates went through.
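The spot checks look something like the queries below; the exact key column names (StoreID, ProductCode, City) are assumptions based on the schema used so far.

```sql
USE AZRetailDWH
GO

-- Renamed store and product (both versions appear if Type II history is kept)
SELECT * FROM DimStore    WHERE StoreID = 115;
SELECT * FROM DimProduct  WHERE ProductCode = 5;

-- New location
SELECT * FROM DimLocation WHERE City = 'Tucson';

-- New order, and the soft-deleted order 2
SELECT * FROM FactOrders  WHERE OrderID = 72905;
SELECT * FROM FactOrders  WHERE IsDeleted = 1;
```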
We can see that the store and product names have changed in DimStore and DimProduct, the new order shows up in FactOrders, the deleted order 2 shows up in FactOrders with IsDeleted = 1, and the new location shows up in DimLocation; it appears incremental loading is working.
In this section, we covered a lot of ground. We described the CDC, MERGE, and SCD strategies for incremental loading. We loaded a new order and made changes to all the master and transaction tables. We then tested the DWH to see if the changes were reflected. In the next and final part of this project, we build and test an OLAP model and deploy our solution.
A closing note: I said we checked all tables, but I didn't show DimCustomer. How do we know incremental loading worked for DimCustomer? Well, we can see the new CustomerKey in FactOrders, and when we incrementally load FactOrders we look up CustomerKey by CustomerID in DimCustomer. If the new customer (me) weren't there, the new entry in FactOrders would not show up, because the lookup against DimCustomer would fail. We see the new order, so we know the new customer is there as well.