Effect of denormalizing

Here’s a scenario where I’m wondering whether to denormalize a relational database (MS SQL).

Text description of the requirement

I have users (stored in a users table).

Users belong to Accounts:

An account is created before its users are created
Every user must belong to exactly one account (neither no account, nor more than one account)
Any account may contain many users
A user’s assignment to an account is permanent: users are never moved from one account to another

The Users table is predefined so I can’t add an “accountId” column to the Users table; so to implement accounts I have:

A Users table, with a userId key
An Accounts table, with an accountId key
A UserAccounts table, with userId and accountId columns (to identify the account associated with each user);
the UserAccounts table is constrained:
- Foreign key relations to Users and Accounts table
- Uniqueness constraint to ensure that a userId appears no more than once

Because I can’t modify the Users table to insert and accountId column in it, I can’t ensure that every user has an accountId. And in fact I will insert into the
UserAccounts table after inserting into the Users table, though probably in the same transaction.

I have a dozen other tables (e.g. Sales) which define various types of thing which belong to users
These tables have a userId column (to identify which user each row belongs to)
The problem is that I also need to know what account each thing belongs to

Graphic description of the requirement

Here’s a reworded version of the same problem (thanks to MichaelT who suggested this rewording).

<code> +-----------+ +--------------+ +--------------+

+-----------+ +--------------+ +--------------+

+--> userId pk <----+ userId (uniq)| +--> accountId pk |

| +-----------+ +--------------+ +--------------+

| +-----------+

| | Sales |

| +-----------+ (and several other tables like this)

| | saleId pk |

+--+ userId fk |

| |

+-----------+

</code>

   +-----------+    +--------------+      +--------------+
   | Users     |    | UserAccount  |      | Account      |
   +-----------+    +--------------+      +--------------+
+--> userId pk <----+ userId (uniq)|   +--> accountId pk |
|  |           |    | accountId    +---+  |              |
|  +-----------+    +--------------+      +--------------+
|                                                       
|                                                       
|  +-----------+                                        
|  | Sales     |                                        
|  +-----------+  (and several other tables like this)  
|  | saleId pk |                                        
+--+ userId fk |                                        
   |           |                                        
   +-----------+

I can’t modify the Users table for various reasons: which is why each user’s (single) account is defined using the separate UserAccount table, instead of adding an accountId foreign key to the Users table.

This UserAccount table implements a many-to-one relationship between users and accounts (not a many-to-many relationship). It could be defined using either the following keys:

userId as the primary key
(userId,accountId) as the primary key plus userId as a unique key

The question

Now for the question:

When doing a query against a Sales table that needs to include the Account information, I think there are two ways to implement knowing which account each thing belongs to:

Join the table to the UserAccounts table (to select the accountId for each userId):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>select S.*, UA.accountId
from Sales S
join UserAccount UA on S.userId = UA.userId
where
S.something = somethingElse
</code>
<code>select S.*, UA.accountId from Sales S join UserAccount UA on S.userId = UA.userId where S.something = somethingElse </code>
```
select S.*, UA.accountId
from Sales S
    join UserAccount UA on S.userId = UA.userId
where
    S.something = somethingElse
```
Denormalize the Sales table[s], by storing the accountId in it as well as the userId (the accountId can then be retrieved from Sales without a join to UserAccount):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>| +--------------+
| | Sales |
| +--------------+ (and several other tables like this)
| | saleId pk |
+--+ userId fk |
| accountId fk |
+--------------+
</code>
<code>| +--------------+ | | Sales | | +--------------+ (and several other tables like this) | | saleId pk | +--+ userId fk | | accountId fk | +--------------+ </code>
```
|  +--------------+                                        
|  | Sales        |                                        
|  +--------------+  (and several other tables like this)  
|  | saleId pk    |                                        
+--+ userId fk    |                                        
   | accountId fk |                                        
   +--------------+                                        
```
If I did the latter I could ensure integrity by defining the userId plus accountId pair/combination as a foreign key constraint into the UserAccounts table (to ensure that the pairing of accountId with userId matches the pairing defined in the UserAccounts table).

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)
REFERENCES UserAccounts (userId,accountId)
</code>
<code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId) REFERENCES UserAccounts (userId,accountId) </code>
```
CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)
    REFERENCES UserAccounts (userId,accountId)
```

I don’t have vast experience with database design: I learned it from books (by Joe Celko).

I’m inhibited against denormalizing in general, but denormalizing seems appropriate here: why an extra JOIN in every SELECT statement (to get the accountId), when the same could be obtained by storing the accountId in the table with a (compound or composite) foreign key.

Is there a compelling reason to use (or to not use) either of these two possibilities?

If not then I’m tempted to use the 2nd method:

because it’s simpler to define the select statement (without a join)
because it’s (presumably) slightly better performance (without a join)
because it ensures that a corresponding row (with an accountId) exists in the UserAccounts table

You should get rid of the UserAccounts table, since the relationship between User and Account is a one-to-many relatioship (given the uniqueness constraint you have on UserAccounts). There’s no need for the connection table where the relationship between entities is one-to-many. Such a table is only needed when the relationship is many-to-many.
Let’s suposse there’s a political reason why you can’t get rid of UserAccounts or modify User, then…
I advice you to go with option 1. The specific denormalization you suggest will bring you headaches. Ask yourself: What if a user changes account?. In that case you willl have to update a lot of tables. Not having to do so is preciselly what RDBMS are for.
There are some cases when de-normalization makes sense but not here.
On the argument on join simplicity: some old sage said once: “depend on views, not on tables”. Create as many views as needed with the tables already joined. Then program against those views, not tables, that way you don’t have to worry about that specific join all the time. There should not be any performance issue if you have indexes and FKs. That’s what RDBMS are for.

EDIT: the advantage of the view is that you will no longer have to write the join yourself in every query, which, if I understood well, bothers you, and that, if I understood well, is one of your reasons againts not denormalizing. The view already has the join under the hood, so it will be transparent for you that the AccoountID is in another table. Is has the advantages of normalization but not the disadvantages. One of the advantages of views is offering a denormalize appearance but without denormalizing. You only write the join once (in the view) rather that everytime you want to know the AccoountID, since you will be using the view VSales (which already has an AccoountID column) instead of the table Sale (which doesn’t) in queries.

In this particular case, it doesn’t sound like denormalizing to me–there is a 1:1 required relationship and the only piece of lookup information is the User key. Adding another table just to store the relationship doesn’t add any value and actually works against you (since as you mentioned, you can’t require the field that way).

As for your reasons why, I wouldn’t worry about simplicity or performance–with proper indexing, you would never even notice the join. The most important aspect is data correctness, and the second way offers that guarantee for you.

The point of not denormalizing would be to avoid update anomalies.

Not denormalizing would be good, if and only if:

Users are moved from one account to another
When a user is moved, all the Sales records for that user should implicitly move with the user to the user’s new account

In this case, ‘denormalizing’ is OK for three reasons:

Users are guaranteed to never change accounts
Even if a user did change accounts, we wouldn’t want the user’s historical Sales rows to change
userId plus accountId are (or could be) defined together (is it called ‘compound’ or ‘composite’?) as a primary key, so defining both together in the Sales table is not denormalizing (they’re two physical columns but kind of one logical column)

The denormalizing is PREFERABLE for three reasons (in order of decreasing importance):

It records the historical accountId at the moment when the Sales row is created. The account (not the user) pays for the sale and owns the thing sold, so the mapping from Sales row to Account row must never change in the future even if the user subsequently changes account (which the user never will)
It guarantees that no Sales are inserted for Users which (by an anomaly) have no associated Account
It makes the SELECT simpler: no need to JOIN Sales to UserAccount (instead that JOIN is defined implicitly and at INSERT time by the compound/composite FOREIGN KEY contraint)

I vaguely remember reading that denormalizing is normal, standard, best practice when dealing with historical data: and this might be an example of that kind of “historical” data where denormalizing is justified.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 20:20

Thẻ: data-integrity, database-design, performance

Effect of denormalizing

Here’s a scenario where I’m wondering whether to denormalize a relational database (MS SQL).

Text description of the requirement

I have users (stored in a users table).

Users belong to Accounts:

An account is created before its users are created
Every user must belong to exactly one account (neither no account, nor more than one account)
Any account may contain many users
A user’s assignment to an account is permanent: users are never moved from one account to another

The Users table is predefined so I can’t add an “accountId” column to the Users table; so to implement accounts I have:

A Users table, with a userId key
An Accounts table, with an accountId key
A UserAccounts table, with userId and accountId columns (to identify the account associated with each user);
the UserAccounts table is constrained:
- Foreign key relations to Users and Accounts table
- Uniqueness constraint to ensure that a userId appears no more than once

I have a dozen other tables (e.g. Sales) which define various types of thing which belong to users
These tables have a userId column (to identify which user each row belongs to)
The problem is that I also need to know what account each thing belongs to

Graphic description of the requirement

Here’s a reworded version of the same problem (thanks to MichaelT who suggested this rewording).

<code> +-----------+ +--------------+ +--------------+

+-----------+ +--------------+ +--------------+

+--> userId pk <----+ userId (uniq)| +--> accountId pk |

| +-----------+ +--------------+ +--------------+

| +-----------+

| | Sales |

| +-----------+ (and several other tables like this)

| | saleId pk |

+--+ userId fk |

| |

+-----------+

</code>

   +-----------+    +--------------+      +--------------+
   | Users     |    | UserAccount  |      | Account      |
   +-----------+    +--------------+      +--------------+
+--> userId pk <----+ userId (uniq)|   +--> accountId pk |
|  |           |    | accountId    +---+  |              |
|  +-----------+    +--------------+      +--------------+
|                                                       
|                                                       
|  +-----------+                                        
|  | Sales     |                                        
|  +-----------+  (and several other tables like this)  
|  | saleId pk |                                        
+--+ userId fk |                                        
   |           |                                        
   +-----------+

This UserAccount table implements a many-to-one relationship between users and accounts (not a many-to-many relationship). It could be defined using either the following keys:

userId as the primary key
(userId,accountId) as the primary key plus userId as a unique key

The question

Now for the question:

When doing a query against a Sales table that needs to include the Account information, I think there are two ways to implement knowing which account each thing belongs to:

Join the table to the UserAccounts table (to select the accountId for each userId):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>select S.*, UA.accountId
from Sales S
join UserAccount UA on S.userId = UA.userId
where
S.something = somethingElse
</code>
<code>select S.*, UA.accountId from Sales S join UserAccount UA on S.userId = UA.userId where S.something = somethingElse </code>
```
select S.*, UA.accountId
from Sales S
    join UserAccount UA on S.userId = UA.userId
where
    S.something = somethingElse
```
Denormalize the Sales table[s], by storing the accountId in it as well as the userId (the accountId can then be retrieved from Sales without a join to UserAccount):

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>| +--------------+
| | Sales |
| +--------------+ (and several other tables like this)
| | saleId pk |
+--+ userId fk |
| accountId fk |
+--------------+
</code>
<code>| +--------------+ | | Sales | | +--------------+ (and several other tables like this) | | saleId pk | +--+ userId fk | | accountId fk | +--------------+ </code>
```
|  +--------------+                                        
|  | Sales        |                                        
|  +--------------+  (and several other tables like this)  
|  | saleId pk    |                                        
+--+ userId fk    |                                        
   | accountId fk |                                        
   +--------------+                                        
```
If I did the latter I could ensure integrity by defining the userId plus accountId pair/combination as a foreign key constraint into the UserAccounts table (to ensure that the pairing of accountId with userId matches the pairing defined in the UserAccounts table).

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
<code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)
REFERENCES UserAccounts (userId,accountId)
</code>
<code>CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId) REFERENCES UserAccounts (userId,accountId) </code>
```
CONSTRAINT Sales_FK FOREIGN KEY (userId,accountId)
    REFERENCES UserAccounts (userId,accountId)
```

I don’t have vast experience with database design: I learned it from books (by Joe Celko).

Is there a compelling reason to use (or to not use) either of these two possibilities?

If not then I’m tempted to use the 2nd method:

because it’s simpler to define the select statement (without a join)
because it’s (presumably) slightly better performance (without a join)
because it ensures that a corresponding row (with an accountId) exists in the UserAccounts table

You should get rid of the UserAccounts table, since the relationship between User and Account is a one-to-many relatioship (given the uniqueness constraint you have on UserAccounts). There’s no need for the connection table where the relationship between entities is one-to-many. Such a table is only needed when the relationship is many-to-many.
Let’s suposse there’s a political reason why you can’t get rid of UserAccounts or modify User, then…
I advice you to go with option 1. The specific denormalization you suggest will bring you headaches. Ask yourself: What if a user changes account?. In that case you willl have to update a lot of tables. Not having to do so is preciselly what RDBMS are for.
There are some cases when de-normalization makes sense but not here.
On the argument on join simplicity: some old sage said once: “depend on views, not on tables”. Create as many views as needed with the tables already joined. Then program against those views, not tables, that way you don’t have to worry about that specific join all the time. There should not be any performance issue if you have indexes and FKs. That’s what RDBMS are for.

The point of not denormalizing would be to avoid update anomalies.

Not denormalizing would be good, if and only if:

Users are moved from one account to another
When a user is moved, all the Sales records for that user should implicitly move with the user to the user’s new account

In this case, ‘denormalizing’ is OK for three reasons:

Users are guaranteed to never change accounts
Even if a user did change accounts, we wouldn’t want the user’s historical Sales rows to change
userId plus accountId are (or could be) defined together (is it called ‘compound’ or ‘composite’?) as a primary key, so defining both together in the Sales table is not denormalizing (they’re two physical columns but kind of one logical column)

The denormalizing is PREFERABLE for three reasons (in order of decreasing importance):

It records the historical accountId at the moment when the Sales row is created. The account (not the user) pays for the sale and owns the thing sold, so the mapping from Sales row to Account row must never change in the future even if the user subsequently changes account (which the user never will)
It guarantees that no Sales are inserted for Users which (by an anomaly) have no associated Account
It makes the SELECT simpler: no need to JOIN Sales to UserAccount (instead that JOIN is defined implicitly and at INSERT time by the compound/composite FOREIGN KEY contraint)

Filed under: softwareengineering - @ 20:20

Thẻ: data-integrity, database-design, performance

Thiết kế website giá rẻ

Danh mục

Effect of denormalizing

Text description of the requirement

Graphic description of the requirement

The question

Effect of denormalizing

Text description of the requirement

Graphic description of the requirement

The question