Here’s a scenario where I’m wondering whether to denormalize a relational database (MS SQL).
Text description of the requirement
I have users (stored in a users table).
Users belong to Accounts:
- An account is created before its users are created
- Every user must belong to exactly one account (neither no account, nor more than one account)
- Any account may contain many users
- A user’s assignment to an account is permanent: users are never moved from one account to another
The Users table is predefined so I can’t add an “accountId” column to the Users table; so to implement accounts I have:
- A Users table, with a userId key
- An Accounts table, with an accountId key
-
A UserAccounts table, with userId and accountId columns (to identify the account associated with each user);
the UserAccounts table is constrained:- Foreign key relations to Users and Accounts table
- Uniqueness constraint to ensure that a userId appears no more than once
Because I can’t modify the Users table to insert and accountId column in it, I can’t ensure that every user has an accountId. And in fact I will insert into the
UserAccounts table after inserting into the Users table, though probably in the same transaction.
- I have a dozen other tables (e.g. Sales) which define various types of thing which belong to users
- These tables have a userId column (to identify which user each row belongs to)
- The problem is that I also need to know what account each thing belongs to
Graphic description of the requirement
Here’s a reworded version of the same problem (thanks to MichaelT who suggested this rewording).
+-----------+ +--------------+ +--------------+
| Users | | UserAccount | | Account |
+-----------+ +--------------+ +--------------+
+--> userId pk <----+ userId (uniq)| +--> accountId pk |
| | | | accountId +---+ | |
| +-----------+ +--------------+ +--------------+
|
|
| +-----------+
| | Sales |
| +-----------+ (and several other tables like this)
| | saleId pk |
+--+ userId fk |
| |
+-----------+
I can’t modify the Users
table for various reasons: which is why each user’s (single) account is defined using the separate UserAccount table, instead of adding an accountId foreign key to the Users table.
This UserAccount table implements a many-to-one relationship between users and accounts (not a many-to-many relationship). It could be defined using either the following keys:
userId
as the primary key(userId,accountId)
as the primary key plususerId
as a unique key
The question
Now for the question:
When doing a query against a Sales table that needs to include the Account information, I think there are two ways to implement knowing which account each thing belongs to:
-
Join the table to the UserAccounts table (to select the accountId for each userId):
<code>select S.*, UA.accountIdfrom Sales Sjoin UserAccount UA on S.userId = UA.userIdwhereS.something = somethingElse</code><code>select S.*, UA.accountId from Sales S join UserAccount UA on S.userId = UA.userId where S.something = somethingElse </code>select S.*, UA.accountId from Sales S join UserAccount UA on S.userId = UA.userId where S.something = somethingElse
-
Denormalize the Sales table[s], by storing the accountId in it as well as the userId (the accountId can then be retrieved from Sales without a join to UserAccount):
<code>| +--------------+| | Sales || +--------------+ (and several other tables like this)| | saleId pk |+--+ userId fk || accountId fk |+--------------+</code><code>| +--------------+ | | Sales | | +--------------+ (and several other tables like this) | | saleId pk | +--+ userId fk | | accountId fk | +--------------+ </code>| +--------------+ | | Sales | | +--------------+ (and several other tables like this) | | saleId pk | +--+ userId fk | | accountId fk | +--------------+
If I did the latter I could ensure integrity by defining the userId plus accountId pair/combination as a foreign key constraint into the UserAccounts table (to ensure that the pairing of accountId with userId matches the pairing defined in the UserAccounts table).
<code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)REFERENCES UserAccounts (userId,accountId)</code><code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId) REFERENCES UserAccounts (userId,accountId) </code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId) REFERENCES UserAccounts (userId,accountId)
I don’t have vast experience with database design: I learned it from books (by Joe Celko).
I’m inhibited against denormalizing in general, but denormalizing seems appropriate here: why an extra JOIN in every SELECT statement (to get the accountId), when the same could be obtained by storing the accountId in the table with a (compound or composite) foreign key.
Is there a compelling reason to use (or to not use) either of these two possibilities?
If not then I’m tempted to use the 2nd method:
- because it’s simpler to define the select statement (without a join)
- because it’s (presumably) slightly better performance (without a join)
- because it ensures that a corresponding row (with an accountId) exists in the UserAccounts table
16
- You should get rid of the
UserAccounts
table, since the relationship betweenUser
andAccount
is a one-to-many relatioship (given the uniqueness constraint you have onUserAccounts
). There’s no need for the connection table where the relationship between entities is one-to-many. Such a table is only needed when the relationship is many-to-many. - Let’s suposse there’s a political reason why you can’t get rid of
UserAccounts
or modifyUser
, then… - I advice you to go with option 1. The specific denormalization you suggest will bring you headaches. Ask yourself: What if a user changes account?. In that case you willl have to update a lot of tables. Not having to do so is preciselly what RDBMS are for.
- There are some cases when de-normalization makes sense but not here.
- On the argument on join simplicity: some old sage said once: “depend on views, not on tables”. Create as many views as needed with the tables already joined. Then program against those views, not tables, that way you don’t have to worry about that specific join all the time. There should not be any performance issue if you have indexes and FKs. That’s what RDBMS are for.
EDIT: the advantage of the view is that you will no longer have to write the join yourself in every query, which, if I understood well, bothers you, and that, if I understood well, is one of your reasons againts not denormalizing. The view already has the join under the hood, so it will be transparent for you that the AccoountID
is in another table. Is has the advantages of normalization but not the disadvantages. One of the advantages of views is offering a denormalize appearance but without denormalizing. You only write the join once (in the view) rather that everytime you want to know the AccoountID
, since you will be using the view VSales
(which already has an AccoountID
column) instead of the table Sale
(which doesn’t) in queries.
12
In this particular case, it doesn’t sound like denormalizing to me–there is a 1:1 required relationship and the only piece of lookup information is the User key. Adding another table just to store the relationship doesn’t add any value and actually works against you (since as you mentioned, you can’t require the field that way).
As for your reasons why, I wouldn’t worry about simplicity or performance–with proper indexing, you would never even notice the join. The most important aspect is data correctness, and the second way offers that guarantee for you.
1
The point of not denormalizing would be to avoid update anomalies.
Not denormalizing would be good, if and only if:
- Users are moved from one account to another
- When a user is moved, all the Sales records for that user should implicitly move with the user to the user’s new account
In this case, ‘denormalizing’ is OK for three reasons:
- Users are guaranteed to never change accounts
- Even if a user did change accounts, we wouldn’t want the user’s historical Sales rows to change
- userId plus accountId are (or could be) defined together (is it called ‘compound’ or ‘composite’?) as a primary key, so defining both together in the Sales table is not denormalizing (they’re two physical columns but kind of one logical column)
The denormalizing is PREFERABLE for three reasons (in order of decreasing importance):
- It records the historical accountId at the moment when the Sales row is created. The account (not the user) pays for the sale and owns the thing sold, so the mapping from Sales row to Account row must never change in the future even if the user subsequently changes account (which the user never will)
- It guarantees that no Sales are inserted for Users which (by an anomaly) have no associated Account
- It makes the SELECT simpler: no need to JOIN Sales to UserAccount (instead that JOIN is defined implicitly and at INSERT time by the compound/composite FOREIGN KEY contraint)
I vaguely remember reading that denormalizing is normal, standard, best practice when dealing with historical data: and this might be an example of that kind of “historical” data where denormalizing is justified.
1