I have seen many designs in which normalization wasn’t the first consideration in the decision-making phase.
In many cases those designs included tables with more than 30 columns, and the main approach was to “put everything in the same place”.
From what I remember, normalization is one of the first and most important considerations, so why is it dropped so easily sometimes?
Edit:
Is it true that good architects and experts choose a denormalized design while inexperienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?
9
What’s interesting about this Q&A thread is that there are actually 3 questions. Everybody has answered a different one, and almost nobody has answered the first one:
- Why aren’t some databases in the wild normalized?
- Why/when should a normalized database be denormalized?
- In what situations is it harmful or unnecessary to normalize in the first place?
Alert readers will note that these are very different questions, and I’ll try to answer each of them separately while avoiding too much detail. By “too much”, I mean that I don’t think this is the appropriate context in which to be carrying out an extended debate over the merits of various arguments in favour of or against normalization; I’m simply going to explain what those arguments are, maybe list a few caveats, and save the philosophy for more specific questions, if they ever come up.
Also, in this answer I am assuming that “normalization” implies “BCNF, 3NF, or at least 2NF”, since that’s the level of normalization that designers generally aim to achieve. It’s rarer to see 4NF or 5NF designs; although they certainly aren’t impossible goals, they concern themselves with the semantics of relationships rather than just their representation, which requires considerably more knowledge about the domain.
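To make that target concrete, here is a minimal sketch – with hypothetical table and column names – of a design that violates 3NF and its normalized equivalent:

```sql
-- One wide table that repeats customer details on every order. CustomerName
-- and CustomerCity depend on CustomerId rather than on the key (OrderId),
-- which is the kind of transitive dependency 3NF forbids.
CREATE TABLE OrdersFlat
(
    OrderId      int PRIMARY KEY,
    OrderDate    date NOT NULL,
    CustomerId   int NOT NULL,
    CustomerName varchar(100) NOT NULL,
    CustomerCity varchar(100) NOT NULL
);

-- The 3NF equivalent: customer attributes are stored once, in their own table.
CREATE TABLE Customers
(
    CustomerId   int PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    CustomerCity varchar(100) NOT NULL
);

CREATE TABLE Orders
(
    OrderId    int PRIMARY KEY,
    OrderDate  date NOT NULL,
    CustomerId int NOT NULL REFERENCES Customers (CustomerId)
);
```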
So, onward and upward:
1. Why aren’t some databases in the wild normalized?
The answer to this could be “because they shouldn’t be”, but making that assumption right off the bat is pretty piss-poor detective work. We wouldn’t make very much progress as a society if we always operated on the assumption that whatever is, ought to be.
The real reasons that databases don’t get normalized in the first place are more complicated. Here are the top 5 that I’ve come across:
- The developers who designed it didn’t know or didn’t understand how to normalize. Strong evidence of this comes in the form of many other accompanying bad design choices, like using varchar columns for everything or having a spaghetti mess of meaningless table and column names. And I assure you, I’ve seen “real” databases that are every bit as bad as those in the TDWTF articles.
- The developers who designed it didn’t care or were actively against normalization on principle. Note, here I am not talking about instances where a deliberate decision was made not to normalize based on contextual analysis, but rather teams or companies where normalization is more-or-less understood but simply ignored or shunned out of habit. Again, surprisingly common.
- The software is/was done as a brownfield project. Many purists ignore this perfectly legitimate business (rather than technical) reason for not normalizing. Sometimes you don’t actually get to design a new database from scratch; you have to bolt on to an existing legacy schema, and attempting to normalize at that point would involve far too much pain. 3NF wasn’t invented until 1971, and some systems – especially financial/accounting systems – have their roots even farther back than that!
- The database was originally normalized, but an accumulation of small changes over a long period of time and/or a widely distributed team introduced subtle forms of duplication and other violations of whatever normal form was originally in place. In other words, the loss of normalization was accidental, and too little time was spent on refactoring.
- A deliberate business decision was made not to spend any time on business analysis or database design and just “get it done”. This is often a false economy and ultimately becomes a mounting form of technical debt, but is sometimes a rational decision, at least based on information that was known at the time – for example, the database may have been intended as a prototype but ended up being promoted to production use due to time constraints or changes in the business environment.
2. Why/when should a normalized database be denormalized?
This discussion often comes up when a database is normalized to start with. Either the performance is poor or there is a lot of duplication in queries (joins), and the team feels, rightly or wrongly, that they’ve gone as far as they can with the current design. It is important to note that normalization improves performance most of the time, and there are several options to eliminate excess joins when normalization appears to be working against you, many of which are less invasive and risky than simply changing to a denormalized model:
- Create indexed views that encapsulate the most common problem areas. Modern DBMSes are capable of making them insertable or updatable (e.g. SQL Server INSTEAD OF triggers). This comes at a slight cost to DML statements on the underlying tables/indexes but is generally the first option you should try, because it is nearly impossible to screw up and costs almost nothing to maintain. (A minimal sketch follows this list.) Of course, not every query can be turned into an indexed view – aggregate queries are the most troublesome. Which leads us to the next item…
- Create denormalized aggregate tables that are automatically updated by triggers (a simplified trigger sketch also follows the list). These tables exist in addition to the normalized tables and form a kind of CQRS model. Another CQRS model, more popular these days, is to use pub/sub to update the query models, which gives the benefit of asynchrony, although that may not be suitable in very rare instances where the data cannot be stale.
- Sometimes, indexed views are not possible, the transaction rates and data volumes are too high to admit triggers with acceptable performance, and the queries must always return realtime data. These situations are rare – I’d hazard a guess that they might apply to things like High-Frequency Trading or law enforcement/intelligence databases – but they can exist. In these cases you really have no option but to denormalize the original tables.
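As a concrete illustration of the first option, here is a minimal sketch of an indexed view in SQL Server syntax; the table, view, and column names are hypothetical, and the underlying normalized tables are assumed to already exist:

```sql
-- Hypothetical normalized tables dbo.Orders and dbo.Customers are assumed.
-- The view pre-joins them so read queries no longer pay for the join;
-- SCHEMABINDING plus a unique clustered index is what materializes it.
CREATE VIEW dbo.OrderSummary
WITH SCHEMABINDING
AS
SELECT o.OrderId, o.OrderDate, c.CustomerId, c.CustomerName
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;
GO

-- The unique clustered index stores the view's rows on disk and keeps them
-- in sync with the base tables automatically.
CREATE UNIQUE CLUSTERED INDEX IX_OrderSummary ON dbo.OrderSummary (OrderId);
```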
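And a simplified sketch of the second option, again with hypothetical names; a production version would also have to handle UPDATE and DELETE on the base table:

```sql
-- Denormalized aggregate table maintained by a trigger on the (assumed)
-- normalized dbo.Orders table, which has OrderDate and Amount columns.
CREATE TABLE dbo.DailySalesTotals
(
    SaleDate    date NOT NULL PRIMARY KEY,
    TotalAmount decimal(18, 2) NOT NULL
);
GO

CREATE TRIGGER dbo.trg_Orders_DailyTotals
ON dbo.Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Fold the newly inserted rows into the per-day totals.
    MERGE dbo.DailySalesTotals AS t
    USING (SELECT CAST(OrderDate AS date) AS SaleDate, SUM(Amount) AS Amount
           FROM inserted
           GROUP BY CAST(OrderDate AS date)) AS i
    ON t.SaleDate = i.SaleDate
    WHEN MATCHED THEN
        UPDATE SET t.TotalAmount = t.TotalAmount + i.Amount
    WHEN NOT MATCHED THEN
        INSERT (SaleDate, TotalAmount) VALUES (i.SaleDate, i.Amount);
END;
```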
3. In what situations is it harmful or unnecessary to normalize in the first place?
There are, in fact, several good examples here:
- If the database is being used only for reporting/analysis. Typically this implies that there is an additional, normalized database being used for OLTP, which is periodically synchronized to the analysis database through ETL or messaging.
- When enforcing a normalized model would require an unnecessarily complex analysis of the incoming data. An example of this might be a system that needs to store phone numbers collected from several external systems or databases. You could break the calling code and area code out into separate columns, but you’d have to account for all of the different possible formats, invalid phone numbers, vanity numbers (1-800-GET-STUFF), not to mention different locales. It’s usually more trouble than it’s worth, and phone numbers are usually just shoved into a single field unless you have a specific business need for the area code on its own.
- When the relational database is primarily there to provide transactional support for an additional, non-relational database. For example, you might be using the relational database as a message queue, or to track the status of a transaction or saga, when the primary data is being stored in Redis or MongoDB or whatever. In other words, the data is “control data”. There’s usually no point in normalizing data that isn’t actually business data.
- Service-Oriented Architectures that share a physical database. This is a bit of an odd one, but in a true SOA, you will occasionally need to have data physically duplicated because services are not allowed to directly query each other’s data. If they happen to be sharing the same physical database, the data will appear not to be normalized – but generally, the data owned by each individual service is still normalized unless one of the other mitigating factors is in place. For example, a Billing service might own the Bill entity, but the Accounting service needs to receive and store the Bill Date and Amount in order to include it in the revenue for that year.
I’m sure there are more reasons that I haven’t listed; what I’m getting at, in essence, is that they are quite specific and will be fairly obvious when they come up in practice. OLAP databases are supposed to use star schemas, SOAs are supposed to have some duplication, etc. If you’re working with a well-known architecture model that simply doesn’t work with normalization, then you don’t normalize; generally speaking, the architecture model takes precedence over the data model.
And to answer the very last question:
Is it true that good architects and experts choose a denormalized design while inexperienced developers choose the opposite? What are the arguments against starting your design with normalization in mind?
No, that is complete and utter B.S. It’s also B.S. that experts always choose a normalized design. Experts don’t just follow a mantra. They research, analyze, discuss, clarify, and iterate, and then they choose whatever approach makes the most sense for their particular situation.
The 3NF or BCNF database is usually a good starting point for analysis because it’s been tried and proven successful in tens of thousands of projects all over the world, but then again, so has C. That doesn’t mean we automatically use C in every new project. Real-world situations may require some modifications to the model or the use of a different model altogether. You don’t know until you’re in that situation.
1
The assumption built into the question and into some of the answers is that normalization is synonymous with good database design. This is in fact often not the case. Normalization is one way of achieving a particular set of design goals, and a requirement if you are relying heavily on the database to enforce “business rules” about the relationships between data elements.
Normalization gives you a few key benefits:
- Minimizes the amount of redundant data.
- Maximizes the extent to which the database’s built-in integrity mechanisms (foreign key constraints, uniqueness constraints) can be leveraged to ensure the integrity of the data (see the sketch after this list).
- Reduces the number of columns per row, increasing the efficiency of IO in some cases. Wide rows take longer to retrieve.
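As a small, hypothetical illustration of that second point – assuming a simple Departments/Employees schema – the normalized design lets the DBMS itself reject duplicates and orphaned rows:

```sql
-- Hypothetical sketch: integrity is enforced declaratively by the database.
CREATE TABLE Departments
(
    DepartmentId   int PRIMARY KEY,
    DepartmentName varchar(100) NOT NULL UNIQUE      -- uniqueness constraint
);

CREATE TABLE Employees
(
    EmployeeId   int PRIMARY KEY,
    EmployeeName varchar(100) NOT NULL,
    DepartmentId int NOT NULL
        REFERENCES Departments (DepartmentId)        -- foreign key constraint
);

-- Rejected by the database itself (no such department), no application code involved:
-- INSERT INTO Employees (EmployeeId, EmployeeName, DepartmentId) VALUES (1, 'Ann', 999);
```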
That said, there are plenty of valid reasons to denormalize:
- Performance, particularly for analytics, can be crippled by normalization. For analysis against relational databases, denormalized dimensional models are the standard approach (a minimal star-schema sketch follows this list).
- The benefit of enforcing data integrity inside of the database is starting to decline. As more and more development is focused on the object-oriented middle-tier that is often enforcing business rules, reliance on relational constraints in the database is less important.
- As others have mentioned, normalization will complicate the queries required to retrieve relevant data.
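For example, a minimal dimensional (star-schema) sketch, with hypothetical names: a central fact table carries the measures, and the surrounding dimension tables are deliberately wide and denormalized so analytical queries only ever join fact to dimension:

```sql
-- Dimensions repeat descriptive attributes on purpose (e.g. Category per product).
CREATE TABLE DimDate
(
    DateKey         int PRIMARY KEY,   -- e.g. 20240131
    FullDate        date NOT NULL,
    CalendarYear    int NOT NULL,
    CalendarQuarter int NOT NULL,
    MonthName       varchar(20) NOT NULL
);

CREATE TABLE DimProduct
(
    ProductKey  int PRIMARY KEY,
    ProductName varchar(100) NOT NULL,
    Category    varchar(50) NOT NULL,
    Subcategory varchar(50) NOT NULL
);

-- The fact table holds only keys and measures.
CREATE TABLE FactSales
(
    DateKey    int NOT NULL REFERENCES DimDate (DateKey),
    ProductKey int NOT NULL REFERENCES DimProduct (ProductKey),
    Quantity   int NOT NULL,
    Amount     decimal(18, 2) NOT NULL
);
```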
It is not clear that normalization is a sign of good design. In some cases, normalization is an artifact of a time when storage space was at a premium and when much of the responsibility for encoding business rules resided in the database (think about 2-tiered client-server applications with most if not all business logic in stored procedures). It may well be that many projects veer away from normalization based on good architectural decisions rather than a poor grasp of database design principles.
The article by Jeff Atwood referenced in the comments above provides some good detailed discussion – “Maybe Normalizing Isn’t Normal”.
8
- A lot of developers don’t know or care about normalization, or about data modeling or databases in general.
- For some jobs it’s really not important.
- Sometimes there’s a really good reason to de-normalize, for example to make a particularly difficult workload perform well.
- Relational database concepts are currently less in fashion than they were in the 1990s and 2000s. Developers tend to be influenced by fashion, even if they claim to be very rational. There’s no point arguing about taste.
Normalization is also, historically, a territory for near religious argument, so I hesitate to say much more.
6
In large projects, and especially those on mainframes, this is not the case. In fact, if you search job sites you will see several positions for data modelers. Also, having many columns in a single table does not by itself go against normalization. Nevertheless, your observation is valid for some projects.
Database design is one of the skills required to build quality systems. Having said that, some developers don’t know enough about database design and still get assigned to the task of data modeling and database design. Some projects even skip data modeling. The focus on many projects is mainly on coding and front-end design.
Another factor in poor database design is the fact that normalization is not a trivial topic, especially when it comes to 4NF, 5NF, and so on. Most books I have seen do not explain those forms clearly; there are usually bad examples and too much theory. This makes the topic less popular than it should be.
Errors in database design are also hard to spot unless you look for them or run into them during testing. Having no standard for database design quality makes such errors more likely.
Add to that the fact that some projects don’t follow a rigorous development methodology (one that promotes database design); as a result, responsibilities get mixed up and tasks get lost between the business analysts, the developers, and the DBAs. Developers talk in OO and UML, whereas DBAs talk in DD and ERDs, and many probably don’t get UML or OO. In short, lack of knowledge, lack of good clear resources, lack of a unified language to describe data, and lack of methodology are all to blame.