I have a question on how to model my customer dimension. We have an existing conformed dimension built from our transactional order processing systems, which service both retail and non-retail customers. We also feed the customer data into a transportation system that handles deliveries.
The transportation system generates its own system IDs. So for the company_cust1 ID with its address and location details, transportation generates a trans_cust1 ID, and so on. The same applies to business units (BU): we feed the business unit data into transportation and it generates its own system ID. All transportation data is related through these transportation-generated IDs.
Now we are building a star schema dimensional model for the transportation data, and we are bringing in the transportation customer data, which looks like this:
ID | Company_ID | Address | State | Zip | etc |
---|---|---|---|---|---|
trans_cust1 | company_cust1 | 123ln | CA | 90011 | etc |
trans_cust2 | company_cust2 | 789ln | TX | 78156 | etc |
Existing Customer Conformed dimension
Company_ID | Address | State | Zip | etc |
---|---|---|---|---|
company_cust1 | 12356ln | CA | 90011 | etc |
company_cust2 | 789ln | TX | 78156 | etc |
Transportation fact table
transaction_sid | cust_sid | company_id | BU_sid | BU | measure1 | measure2 | etc |
---|---|---|---|---|---|---|---|
trans_id1 | trans_cust1 | company_cust1 | trans_bu1 | bu1 | 100 | 50 | etc |
trans_id2 | trans_cust2 | company_cust2 | trans_bu2 | bu2 | 80 | 130 | etc |
Now to bring in the transportation customer data, what is the best approach to model it?
Approach1
Keep the existing conformed customer dimension as is, since it is used for 90% of the reporting. Keep the transportation customer data in a separate, non-conformed customer table. The transportation fact table would join to this non-conformed transportation customer table. If a report needs additional company-level customer attributes, it can reach the conformed customer table through a snowflake join on Company_ID.
Transportation customer dimension
Trans_ID | Company_ID | Address | State | Zip |
---|---|---|---|---|
trans_cust1 | company_cust1 | 123ln | CA | 90011 |
trans_cust2 | company_cust2 | 789ln | TX | 78156 |
With the above approach, the load process is simple, and the SCD2 customer dimension doesn’t grow drastically because the transportation table carries only a limited set of attributes. Joining transactional data with point-in-time customer data will also be less expensive.
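To make Approach 1 concrete, here is a minimal sketch using Python's built-in `sqlite3` module and the sample rows from the tables above. The table names (`dim_customer`, `dim_trans_customer`, `fact_transportation`) are hypothetical, and SCD2 versioning columns are left out to keep the example short.

```python
import sqlite3

# In-memory database; table names are illustrative assumptions, not our real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (            -- existing conformed dimension
    company_id TEXT PRIMARY KEY, address TEXT, state TEXT, zip TEXT);
CREATE TABLE dim_trans_customer (      -- non-conformed transportation dimension
    trans_id TEXT PRIMARY KEY,
    company_id TEXT REFERENCES dim_customer(company_id),
    address TEXT, state TEXT, zip TEXT);
CREATE TABLE fact_transportation (
    transaction_sid TEXT, cust_sid TEXT, measure1 INTEGER, measure2 INTEGER);

INSERT INTO dim_customer VALUES
    ('company_cust1', '12356ln', 'CA', '90011'),
    ('company_cust2', '789ln',   'TX', '78156');
INSERT INTO dim_trans_customer VALUES
    ('trans_cust1', 'company_cust1', '123ln', 'CA', '90011'),
    ('trans_cust2', 'company_cust2', '789ln', 'TX', '78156');
INSERT INTO fact_transportation VALUES
    ('trans_id1', 'trans_cust1', 100, 50),
    ('trans_id2', 'trans_cust2', 80, 130);
""")

# Fact -> transportation customer (star join) -> conformed customer (snowflake join).
rows = conn.execute("""
SELECT f.transaction_sid,
       t.address AS delivery_address,
       c.address AS company_address,
       f.measure1
FROM fact_transportation AS f
JOIN dim_trans_customer  AS t ON t.trans_id   = f.cust_sid
JOIN dim_customer        AS c ON c.company_id = t.company_id
ORDER BY f.transaction_sid
""").fetchall()

for row in rows:
    print(row)
```

Reports that only need transportation attributes stop at the first join; the second join is paid only when company-level attributes are required.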
Approach2
Add the transportation customer attributes to the existing conformed customer attributes.
Conformed dimension
Company_ID | Address | State | Zip | trans_id | trans_address1 | trans_state |
---|---|---|---|---|---|---|
company_cust1 | 12356ln | CA | 90011 | trans_cust1 | 123ln | CA |
company_cust2 | 789ln | TX | 78156 | trans_cust2 | 789ln | TX |
With this approach, we have a true conformed dimension, but it becomes a complex model to load. Tracking point-in-time data will also be difficult, as the customer dimension already has 100+ attributes.
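For comparison, the Approach 2 load could be sketched as below, again with Python's built-in `sqlite3` module and the sample rows above. The staging table names (`src_customer`, `src_trans_customer`) and the target name `dim_customer_wide` are hypothetical, the `trans_id` column is taken to hold the transportation customer ID, and the sketch assumes at most one transportation customer per company customer.

```python
import sqlite3

# In-memory database; staging and target names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_customer (
    company_id TEXT PRIMARY KEY, address TEXT, state TEXT, zip TEXT);
CREATE TABLE src_trans_customer (
    trans_id TEXT PRIMARY KEY, company_id TEXT, address TEXT, state TEXT);

INSERT INTO src_customer VALUES
    ('company_cust1', '12356ln', 'CA', '90011'),
    ('company_cust2', '789ln',   'TX', '78156');
INSERT INTO src_trans_customer VALUES
    ('trans_cust1', 'company_cust1', '123ln', 'CA'),
    ('trans_cust2', 'company_cust2', '789ln', 'TX');

-- Widen the conformed dimension with the transportation attributes.
-- LEFT JOIN keeps customers that have no transportation record yet.
CREATE TABLE dim_customer_wide AS
SELECT c.company_id, c.address, c.state, c.zip,
       t.trans_id,
       t.address AS trans_address1,
       t.state   AS trans_state
FROM src_customer AS c
LEFT JOIN src_trans_customer AS t ON t.company_id = c.company_id;
""")

wide_rows = conn.execute(
    "SELECT * FROM dim_customer_wide ORDER BY company_id"
).fetchall()

for row in wide_rows:
    print(row)
```

Here the customer load has to wait for both sources, and under SCD2 a change to any transportation attribute would version the entire wide row, which is the growth concern described above.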
Please share your thoughts on which approach is best. Some of my users are strongly inclined towards a conformed dimension with a star schema, but given recent technology advancements, I would like to hear your views on a snowflake model that enables parallel loads and faster joins.