I’m designing a system that needs to report basic distances between zip codes. It will use the Google Distance Matrix API, but I want to cache the results in the database so that multiple requests for the same data points don’t result in duplicate API calls.
A basic Distance class might look like this:
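Something along these lines (sketch only; the Miles property is just a placeholder for the cached API value):

public class Distance
{
    public string Zip1 { get; private set; }
    public string Zip2 { get; private set; }

    // Placeholder for the value returned by the Distance Matrix API.
    public double Miles { get; set; }

    public Distance(string zip1, string zip2)
    {
        Zip1 = zip1;
        Zip2 = zip2;
    }
}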
A basic data structure for the cache might look like this:
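For instance (a sketch; the table and column names are assumed, with the two zip codes as the compound primary key):

CREATE TABLE Distance (
    Zip1  char(5) NOT NULL,
    Zip2  char(5) NOT NULL,
    Miles float   NOT NULL,
    PRIMARY KEY (Zip1, Zip2)
);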
Since I’m not concerned with fine-grained detail such as changes due to one-way streets, divided highways, interchanges, etc., it doesn’t matter to me whether I’m getting the data from point A to point B or from point B to point A. But I’m not sure how to express this with full normalization in the database. With this primary key, it would be perfectly legal for the same distance to exist in two separate rows, i.e.
var row1 = new Distance("00001", "00002");
var row2 = new Distance("00002", "00001");
It would probably be a good idea for me to require a SortedList parameter in the constructor, but is there a way I can design it from the database side to enforce full normalization?
If the location is being represented by a zipcode (fine for cases where distance estimates for points within the same zipcode fall within your tolerance), then I’d impose a CHECK constraint on the fields Zip1 and Zip2 that Zip1 <= Zip2.
Of course this imposes a corresponding burden on the middleware logic: the convention that assures only one record per pair is to put the smaller zipcode in Zip1. This is mainly a concern when INSERTing rows, as queries can be agnostic about which field is first, i.e. WHERE (Zip1 = myZip1 AND Zip2 = myZip2) OR (Zip1 = myZip2 AND Zip2 = myZip1).
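In SQL that would look something like this (the constraint name and parameter syntax are only illustrative; column names follow the question’s sketch):

ALTER TABLE Distance
    ADD CONSTRAINT Zip_Order_CK CHECK (Zip1 <= Zip2);

-- Lookups don't need to know about the ordering convention:
SELECT Miles
FROM   Distance
WHERE  (Zip1 = @myZip1 AND Zip2 = @myZip2)
    OR (Zip1 = @myZip2 AND Zip2 = @myZip1);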
However, normalization is not violated by storing both orders under the compound key. That uses somewhat less than twice the number of records, but these look to be fairly small rows, so I’m not sure the logic complexity can be justified on efficiency grounds (least of all in terms of speed).
According to this Zip code FAQ, there are about 43,000 zip codes in the US, and the figure fluctuates by a couple of thousand per year. Note that the value returned by distance(zip1, zip2) may change over time, possibly due to new road construction, zip code boundary changes, etc.
If you store the return values indefinitely, you might wind up with 43,000^2 (approx 1.8 billion) values. In that case, avoiding the inverses (zip2, zip1) makes some sense.
If you let your cache records expire (keep them for a day? a week? a month?) then it may make more sense to not worry about the inverses.
As long as you’re not coding for mobile or embedded devices, disk space is really cheap.
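If you do let records expire, one way to do it (a sketch in SQL Server syntax; the FetchedAt column is an assumption) is to stamp each row when it is fetched and treat anything too old as a cache miss:

ALTER TABLE Distance
    ADD FetchedAt datetime NOT NULL DEFAULT GETDATE();

-- Anything older than 30 days counts as a miss and triggers a fresh API call:
SELECT Miles
FROM   Distance
WHERE  Zip1 = @myZip1 AND Zip2 = @myZip2
  AND  FetchedAt > DATEADD(day, -30, GETDATE());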
Consider departing from normalization. For every departure from one of the normal forms, there is a corresponding anomaly to be overcome. For 2NF through 5NF the anomalies are generally in the area of harmful redundancy.
What makes this kind of redundancy harmful is the possibility that the database will contain mutually contradictory facts, and therefore yield erratic results. You have to program in such a way that these contradictions don’t occur.
For routine database work, it’s usually better to normalize, and thereby obviate contradictions, than to try to be careful with updates.
But if you were to carry this line of reasoning to its logical conclusion, you would never do any caching whatsoever. A cache always contains data that is redundant to the underlying data it caches. Whenever you build a cache, you have to guard against the cache delivering results that are no longer valid. Programming to cope with harmful redundancy involves the same kind of disciplined thinking.
At the end of the day, you may decide to normalize after all. But don’t treat normalization as some sort of holy grail. It isn’t.
There is an alternative. Whenever you add a data point to the cache, arrange the data so that Zip1 < Zip2. Whenever you do a lookup, arrange the two zips the same way. It’s a little extra arithmetic, but it may beat adding every point twice.
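For example (a sketch; the helper name is hypothetical):

// Canonicalise the pair so the smaller zip code always comes first.
static (string, string) OrderPair(string zipA, string zipB)
{
    return string.CompareOrdinal(zipA, zipB) <= 0
        ? (zipA, zipB)
        : (zipB, zipA);
}

// Use it the same way for both inserts and lookups:
var (zip1, zip2) = OrderPair("00002", "00001");   // always ("00001", "00002")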
Preliminary
There are two simple parts to the solution. But there are a few things you need to understand first.
- There is no such thing as a reason (logic) to avoid Normalisation; it is an absolute necessity.
  - Read and understand @WalterMitty’s answer.
- In this order, you need to (a) understand the data properly, which is achieved by (b) modelling the data properly, before you can (c) use it.
  - To that end, you need a data model (before you can produce the Object model),
  - and the best kind is a Relational Data Model.
- The PK for a ZipCode is indeed a ZipCode. The PK for Distance is indeed (ZipCode_1, ZipCode_2).
  - Tony at Google is an idiot. There is no need at all to add a column RouteId or RecordId or DoesNothing, which is an additional index and hotspot.
  - The abstraction merely prevents you from noticing duplicate rows.
- To prevent the duplicate row for (ZipCode_2, ZipCode_1), and also a self-reference, implement this:
CHECK (ZipCode_1 < ZipCode_2)
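Expressed as DDL, a minimal sketch of the above (data types, the Miles column, and the constraint names are assumptions):

CREATE TABLE ZipCode (
    ZipCode char(5) NOT NULL,
    CONSTRAINT ZipCode_PK PRIMARY KEY (ZipCode)
);

CREATE TABLE Distance (
    ZipCode_1 char(5) NOT NULL REFERENCES ZipCode (ZipCode),
    ZipCode_2 char(5) NOT NULL REFERENCES ZipCode (ZipCode),
    Miles     float   NOT NULL,
    CONSTRAINT Distance_PK PRIMARY KEY (ZipCode_1, ZipCode_2),
    -- Rules out both the reversed pair and a zip code paired with itself.
    CONSTRAINT Distance_Order_CK CHECK (ZipCode_1 < ZipCode_2)
);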
Relational Data Model
Note
- The Standard for Relational Data Modelling since 1983 is IDEF1X. For those unfamiliar with the Standard, refer to the short IDEF1X Introduction.