Let's Build a Data Warehouse Part I: Project Management and Modeling
Source code for this project can be found in my Git repository.
Organization of Part I:
- Requirements Gathering
- Functional/Non-Functional Requirements
- Feasibility Study
- Identify Business Process
- Identify Grain
- Identify Dimensions
- Identify Facts
It can be argued that much of the demand for data professionals is for the application of business intelligence. Indeed, being able to compile and curate historical data—data that is the single version of truth—into a format that can be visualized easily and quickly can make the difference between having actionable insight and not.
Our fictional client is AZRetail, a retailer operating in Phoenix and Tucson, Arizona, with some locations in San Diego, California. The retailer does most of its sales shipping to clients across the U.S. The company has come to us asking that we create a solution that will help stakeholders answer important business questions through analysis. The company wants help with an OLAP solution built on the Microsoft BI Stack. We will go through each step here in a four-part case study and offer a fifth section with visuals served on a live connection hosted in Amazon.
Some Caveats and Gotchas
The scope of a data warehouse project is huge and many aspects are never-ending. I wanted this project to be a realistic overview of an OLTP-to-OLAP-to-visualization implementation in a Microsoft BI stack using SSAS, SQL Server, and SSIS. I wanted the project to be as realistic as possible, but some tradeoffs between practicality and realism had to be made. Although things like logging and change capture have been represented in a realistic way, other aspects like data quality have been skipped. Often-necessary steps like partitioning the database have also not been included. Because of my infrastructure limitations, what is usually performed on several machines will be performed on one, though, with the exception of partitioning, all that would change for a multiple-machine infrastructure are the machines we connect to for ETL and OLAP.
About the OLTP data:
I chose Redgate's SQL Data Generator, which makes it easy, even fun, to produce a mocked-up data set. The idea was that the OLTP data should resemble data from a small retailer. Below is a screenshot of the diagram that shows the design of the OLTP database. There are 72,865 orders with 400,000 order details, 2,000 customers, 5 stores, and 3 divisions. There are 200 products in 20 product categories.
The executives and managers at AZRetail want to find insight from sales data. Stores and divisions have quarterly sales goals, which managers and district managers motivate their team to reach. Furthermore, sometimes products are on promotion and there are sales goals for these products as well. The warehouse and the OLAP solution will need to provide insight into these two main areas. Managers and analysts should be able to compare previous year sales to this year, and also look at sales year-to-date. Finally, executives should be able to see sales by state and even city.
Functional requirements cover what the end-users want or need from a piece of software: the "what it does" aspect. GitHub allows for secure source control, Facebook allows people to connect with each other, and the AZRetail DWH will allow AZRetail to gain insight from its own data. The ability for a user to accomplish these outcomes is the functional requirement. We also have non-functional requirements: the physical requirements needed to meet the functional requirements, and the performance requirements.
Many of the functional requirements for a data warehouse project are defined by what will be measured in the fact table and which attributes in the dimension tables describe those measures. Of course, the functional requirements don't come in the shape of dimension and fact tables, but in the questions the user needs to answer with visualizations. The challenge of designing and implementing a warehouse can then be described as turning analytical needs into dimension and fact tables.
The non-functional requirements have to do with hardware and downtime. Do we have the infrastructure we need for the warehouse? What else do we need? Is there a budget for additional infrastructure? How much downtime is tolerated? How often should the OLAP cube and data warehouse be updated? These are all non-functional requirements. Based on stakeholder input, changes made to data in the master tables (Customer, Store, etc.) will need to be tracked in the warehouse.
One of the additional requirements is that the bandwidth for the daily ETL be kept as low as possible. Therefore, for the orders, order details, and customer tables, CDC (change data capture) will be used to update the warehouse. For the ETL to update the warehouse, stage tables are used for orders and order details, but not customers. For the mostly static master tables (product, store, and location) we will use slowly changing dimensions. It is important to the business owners that changes to master tables be recorded historically. For the changing tables (customer, order, and order details) it is important that information be updated and retained. If an order or a customer is deleted, it is important that the deletion can be tracked.
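Since the source tables live in SQL Server, enabling CDC is a matter of two system stored procedures. A minimal sketch, assuming an OLTP database named AZRetailOLTP and a dbo.Orders table with a primary key (both names are illustrative):

```sql
-- Enable CDC at the database level (must be run by a sysadmin).
USE AZRetailOLTP;
EXEC sys.sp_cdc_enable_db;

-- Track changes to the Orders table. SQL Server maintains a change
-- table (cdc.dbo_Orders_CT) that the ETL reads instead of scanning
-- the full source table.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Orders',
    @role_name            = NULL,  -- no gating role in this sketch
    @supports_net_changes = 1;     -- enables the net-changes function
```

The same call would be repeated for the order details and customer tables; the `@supports_net_changes` option requires a primary key or unique index on the source table.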
To close this phase of the project, the functional and non-functional requirements and the project risks should be documented and signed off by stakeholders. This has been done and is available in my Git repository.
For this project, the single-DDS data store architecture was chosen. This configuration consists of two data stores: the stage (an intermediate store that eases the bandwidth load on the OLTP system) and the DDS (the dimensional data store that forms the bones of the warehouse). A diagram of the data flow is below.
The physical architecture in a typical moderately-sized system would involve several machines. As mentioned, I will not be writing the ETL for such an infrastructure, but the only difference would be server names and the addition of partitions for data > 10GB. However, we can imagine an install that might look like the diagram below.
For the actual infrastructure I use later, when I create visualizations with the Power BI app connected to the tabular model, I use an Amazon EC2 instance to host everything, with a gateway running on the instance. Clients (you) will have access to the visuals in an embedded application. I share a visualization of that infrastructure in the previously linked section. Let us imagine that the architecture of AZRetail's warehouse looks like the diagram below.
Identify Business Process
For ease of demonstration, the business owners are only interested in measures from sales transactions. In reality, we would have not only the transactional business data we have here, but perhaps also CRM data, HRM data, marketing data, and so on. Thus, the first step in this whole process is to identify the first business process we are going to tackle. In our case, as the end-users specify, sales data is where we will start. In a real warehouse scenario, where the warehouse is new, our functional requirements would involve many different business processes, facts, and dimensions, which we would tackle one at a time.
Identify the Grain
Here, we need to define a grain, and we usually want to define it at the lowest level available, because the grain defines the shape of the fact table, which in turn defines the shape of the warehouse. For example, if we chose the grain of order total, we would lose the grain of order lines. If we suddenly needed that finer grain we might be in trouble and might even need to create a new fact table. Luckily, in our case the order detail, or order line item, is both the natural and the lowest grain.
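One way to see why the lowest grain is the safe choice: order-level totals can always be rolled up from line items, but line items can never be recovered from order totals. A sketch, assuming a line-grain fact table named FactOrder with Quantity and UnitPrice columns (names are illustrative):

```sql
-- At order-line grain, order-level totals remain recoverable by
-- aggregation; the reverse is not true.
SELECT OrderID,
       SUM(Quantity * UnitPrice) AS OrderTotal
FROM   FactOrder
GROUP  BY OrderID;
```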
Identify the Dimensions
Dimensions are the nouns we measure by: things like location, customer, vendor, campaign, and such. Dimensions store the descriptive data. As with much of the warehousing process, this is often a "go backward to go forward" situation. You might begin by defining the grain, defining the facts, and then defining the dimensions, before going back to define more dimensions and add more facts to the fact table.
Identify the Facts
Most books on the Kimball method say to define dimensions and then facts. This is a little backwards, since we don't know which dimensions we need until we know what measures (facts) we have. Thus this process too is often iterative, and again, we will look to requirements gathering for much of this information.
Now that we know which business process we will tackle first, and the requirements for the project, a question looms: do we even need a warehouse? Do the benefits outweigh the costs and risks? This is where a proposal (or business case), flowing from a feasibility study, answers these questions. Participants in the project (project managers, stakeholders, contractors, architects, developers, etc.) are listed, along with their roles. If the project makes sense and is feasible, it moves on into the project management phase, and the feasibility study becomes part of the proposal. For the sake of demonstration, let's assume that AZRetail, our case study, has made it through the feasibility study.
Data Feasibility Study
Do we have the right data? This portion of the pre-project work answers that question and is focused on loading the warehouse and the availability of measures. Do we have dimension attribute X to measure measure Y? The data feasibility study also qualifies project risks based on the data. For example, in the requirements gathering section, it was mentioned that the ETL should use bare-minimum bandwidth. Thus, to reduce this risk, we will use CDC; we will only extract new orders or orders that have changed.
For this scenario, we can imagine that the company has the infrastructure they need, after testing for throughput, with a new machine for dedicated ETL.
The warehouse design aspect, which follows from the requirements gathering and dimensional modeling, is critical. To pass beyond the project design phase we need to know whether the data we need is available, and we learned in the requirements gathering phase what data that is. Using a mapping document we can show where each attribute in the dimension tables (a descriptive field such as StoreName) will come from. Using the design document and understanding the measures we need, we can determine whether we have the data to move forward. The design phase and the mapping document also give us a roadmap to follow in building the data warehouse.
The design phase is really where we decide how the final warehouse will look, what attributes are included in the dimensions, and what will be measured in the fact table(s). It is here we decide on indexing and other important performance considerations.
The 7 "W's" of Warehouse Design
We can think of our warehouse design problem as the answer to a User Story. What is a story that describes the transaction we are trying to model? Then, using the 7 W's (Who, What, When, Why, Where, How Much, How Many), we can identify the measures and attributes.
In our case it could look something like:
Alice buys one Valspar from the Tucson Downtown store on October 29th, 2018. She pays the total for the order.
Analyzing the above, we see that the who is Alice, the what is buying an item, the when is October 29th, 2018, the where is Tucson Downtown, the how many is the quantity, and the how much is the order total.
The How questions give us our measures and the W questions give us our attributes. With all the requirements gathering and everything else we have done, we are almost ready to create our data warehouse.
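The path from story to schema can be sketched as a star-schema query: the W's land in the GROUP BY as dimension attributes, and the How's become aggregated measures. All table and column names below are assumptions for illustration:

```sql
-- The W's become dimension attributes, the How's become measures.
SELECT c.CustomerName,         -- Who
       s.StoreName,            -- Where
       d.CalendarYear,         -- When
       SUM(f.Quantity)  AS HowMany,
       SUM(f.LineTotal) AS HowMuch
FROM   FactOrder   f
JOIN   DimCustomer c ON c.CustomerKey = f.CustomerKey
JOIN   DimStore    s ON s.StoreKey    = f.StoreKey
JOIN   DimDate     d ON d.DateKey     = f.SalesDateKey
GROUP  BY c.CustomerName, s.StoreName, d.CalendarYear;
```

Every analytical question from the requirements (sales by state, year-to-date, goals versus actuals) is some variation of this shape.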
The dimension tables are of a common design: a surrogate key as primary key.
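For the master-table dimensions, where history must be recorded, a type-2 slowly changing dimension adds validity columns alongside the surrogate key. A sketch of one such dimension; the column names are illustrative:

```sql
-- A representative dimension: surrogate key as primary key, the OLTP
-- business key kept for lookups, and type-2 SCD columns so history on
-- the master tables is preserved.
CREATE TABLE DimProduct (
    ProductKey   INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    ProductID    INT           NOT NULL,        -- business key from OLTP
    ProductName  NVARCHAR(100) NOT NULL,
    CategoryName NVARCHAR(50)  NOT NULL,
    RowStartDate DATETIME      NOT NULL,        -- SCD validity window
    RowEndDate   DATETIME      NULL,            -- NULL = still current
    IsCurrent    BIT           NOT NULL DEFAULT 1
);
```

When a product changes, the current row is closed out (RowEndDate set, IsCurrent flipped to 0) and a new row with a new surrogate key is inserted.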
The fact table has a slightly more complicated design, with a primary key built on the Customer, SalesDate, Product, and Store keys, along with OrderID and LineNumber, since an order can have many items. This makes the OrderDetails table from the OLTP system obsolete, since it is combined with the Order table. This gets rid of the one-to-many relationship that existed in the OLTP database and gives us practically the granularity of the order item, though we only need the granularity of the order. On the FactOrder table, the foreign key constraints were removed for all affected tables and a COLUMNSTORE index was created on CustomerKey, SalesDateKey, StoreKey, ProductKey, OrderID, and LineNumber. Below is the diagram of the warehouse.
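A sketch of the fact table described above; the key columns and the columnstore index follow the text, while the measure columns (Quantity, UnitPrice, LineTotal) are illustrative assumptions:

```sql
CREATE TABLE FactOrder (
    CustomerKey  INT   NOT NULL,
    SalesDateKey INT   NOT NULL,
    StoreKey     INT   NOT NULL,
    ProductKey   INT   NOT NULL,
    OrderID      INT   NOT NULL,
    LineNumber   INT   NOT NULL,   -- distinguishes items within an order
    Quantity     INT   NOT NULL,
    UnitPrice    MONEY NOT NULL,
    LineTotal    MONEY NOT NULL,
    CONSTRAINT PK_FactOrder PRIMARY KEY
        (CustomerKey, SalesDateKey, StoreKey, ProductKey,
         OrderID, LineNumber)
);

-- Nonclustered columnstore index over the key columns to speed up
-- the large aggregations the OLAP layer will issue.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactOrder_CS
    ON FactOrder (CustomerKey, SalesDateKey, StoreKey,
                  ProductKey, OrderID, LineNumber);
```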
We can see that we have a star pattern.
In this first part we went over the feasibility study, requirements gathering, architecture, and design for the project, as well as the documentation we need to move forward. Now that the first phase is over, we can put these ideas to work in Part II, where we develop the actual infrastructure.