I’m designing a system that needs to report basic distances between zip codes. It will use the Google Distance Matrix API, but I want to cache the results in the database so that multiple requests for the same data points don’t result in duplicate API calls.
A basic Distance class might look like this:
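Something along these lines (sketch only; the Miles property is just a placeholder for the cached API value):

public class Distance
{
    public string Zip1 { get; private set; }
    public string Zip2 { get; private set; }

    // Placeholder for the value returned by the Distance Matrix API.
    public double Miles { get; set; }

    public Distance(string zip1, string zip2)
    {
        Zip1 = zip1;
        Zip2 = zip2;
    }
}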
A basic data structure for the cache might look like this:
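For instance (a sketch; the table and column names are assumed, with the two zip codes as the compound primary key):

CREATE TABLE Distance (
    Zip1  char(5) NOT NULL,
    Zip2  char(5) NOT NULL,
    Miles float   NOT NULL,
    PRIMARY KEY (Zip1, Zip2)
);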
Since I’m not concerned with fine-grained detail such as changes due to one-way streets, divided highways, interchanges, etc., it doesn’t matter to me whether I’m getting the data from point A to point B or from point B to point A. But I’m not sure how to express this with full normalization in the database. With this primary key, it would be perfectly legal for the same distance to exist in two separate rows, i.e.
var row1 = new Distance("00001", "00002");
var row2 = new Distance("00002", "00001");
It would probably be a good idea for me to require a SortedList parameter in the constructor, but is there a way I can design it from the database side to enforce full normalization?
If the location is being represented by a zipcode (fine for cases where distance estimates for points within the same zipcode fall within your tolerance), then I’d impose a CHECK constraint on the fields Zip1 and Zip2 that Zip1 <= Zip2.
Of course this imposes a corresponding burden on the middleware logic: the convention that assures only one record per pair is to put the smaller zipcode in Zip1. This is mainly a concern when INSERTing rows, as queries can be agnostic about which field is first, i.e. WHERE (Zip1 = myZip1 AND Zip2 = myZip2) OR (Zip1 = myZip2 AND Zip2 = myZip1).
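In SQL that would look something like this (the constraint name and parameter syntax are only illustrative; column names follow the question’s sketch):

ALTER TABLE Distance
    ADD CONSTRAINT Zip_Order_CK CHECK (Zip1 <= Zip2);

-- Lookups don't need to know about the ordering convention:
SELECT Miles
FROM   Distance
WHERE  (Zip1 = @myZip1 AND Zip2 = @myZip2)
    OR (Zip1 = @myZip2 AND Zip2 = @myZip1);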
However, normalization is not violated by storing both orders under the compound key. That uses somewhat less than twice the number of records, but these look to be fairly small rows, so I’m not sure the logic complexity can be justified on efficiency grounds (least of all in terms of speed).
According to this Zip code FAQ, there are about 43,000 zip codes in the US, and the figure fluctuates by a couple of thousand per year. Note that the value returned by distance(zip1, zip2) may change over time, possibly due to new road construction, zip code boundary changes, etc.
If you store the return values indefinitely, you might wind up with 43,000^2 (approx 1.8 billion) values. In that case, avoiding the inverses (zip2, zip1) makes some sense.
If you let your cache records expire (keep them for a day? a week? a month?) then it may make more sense to not worry about the inverses.
As long as you’re not coding for mobile or embedded devices, disk space is really cheap.
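If you do let records expire, one way to do it (a sketch in SQL Server syntax; the FetchedAt column is an assumption) is to stamp each row when it is fetched and treat anything too old as a cache miss:

ALTER TABLE Distance
    ADD FetchedAt datetime NOT NULL DEFAULT GETDATE();

-- Anything older than 30 days counts as a miss and triggers a fresh API call:
SELECT Miles
FROM   Distance
WHERE  Zip1 = @myZip1 AND Zip2 = @myZip2
  AND  FetchedAt > DATEADD(day, -30, GETDATE());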
Consider departing from normalization. For every departure from one of the normal forms, there is a corresponding anomaly to be overcome. For 2NF through 5NF the anomalies are generally in the area of harmful redundancy.
What makes this kind of redundancy harmful is the possibility that the database will contain mutually contradictory facts, and therefore yield erratic results. You have to program in such a way that these contradictions don’t occur.
For routine database work, it’s usually better to normalize, and thereby obviate contradictions, than to try to be careful with updates.
But if you were to carry this line of reasoning to its logical conclusion, you would never do any caching whatsoever. A cache always contains data that is redundant to the underlying data it caches. Whenever you build a cache, you have to guard against the cache delivering results that are no longer valid. Programming to cope with harmful redundancy involves the same kind of disciplined thinking.
At the end of the day, you may decide to normalize after all. But don’t treat normalization as some sort of holy grail. It isn’t.
There is an alternative. Whenever you add a data point to the cache, arrange the data so that Zip1 < Zip2. Whenever you do a lookup, arrange the two zips the same way. It’s a little extra arithmetic, but it may beat adding every point twice.
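For example (a sketch; the helper name is hypothetical):

// Canonicalise the pair so the smaller zip code always comes first.
static (string, string) OrderPair(string zipA, string zipB)
{
    return string.CompareOrdinal(zipA, zipB) <= 0
        ? (zipA, zipB)
        : (zipB, zipA);
}

// Use it the same way for both inserts and lookups:
var (zip1, zip2) = OrderPair("00002", "00001");   // always ("00001", "00002")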
Preliminary
There are two simple parts to the solution. But there are a few things you need to understand first.
- There is no such thing as a reason (logic) to avoid Normalisation; it is an absolute necessity.
  - Read and understand @WalterMitty’s answer.
- In this order, you need to (a) understand the data properly, which is achieved by (b) modelling the data properly, before you can (c) use it.
  - To that end, you need a data model (before you can produce the Object model),
  - and the best kind is a Relational Data Model.
- The PK for a ZipCode is indeed a ZipCode. The PK for Distance is indeed (ZipCode_1, ZipCode_2).
  - Tony at Google is an idiot. There is no need at all to add a column RouteId or RecordId or DoesNothing, which is an additional index and hotspot.
  - The abstraction merely prevents you from noticing duplicate rows.
- To prevent the duplicate row for (ZipCode_2, ZipCode_1), and also a self-reference, implement this:
CHECK (ZipCode_1 < ZipCode_2)
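Expressed as DDL, a minimal sketch of the above (data types, the Miles column, and the constraint names are assumptions):

CREATE TABLE ZipCode (
    ZipCode char(5) NOT NULL,
    CONSTRAINT ZipCode_PK PRIMARY KEY (ZipCode)
);

CREATE TABLE Distance (
    ZipCode_1 char(5) NOT NULL REFERENCES ZipCode (ZipCode),
    ZipCode_2 char(5) NOT NULL REFERENCES ZipCode (ZipCode),
    Miles     float   NOT NULL,
    CONSTRAINT Distance_PK PRIMARY KEY (ZipCode_1, ZipCode_2),
    -- Rules out both the reversed pair and a zip code paired with itself.
    CONSTRAINT Distance_Order_CK CHECK (ZipCode_1 < ZipCode_2)
);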
Relational Data Model
Note
- The Standard for Relational Data Modelling since 1983 is IDEF1X. For those unfamiliar with the Standard, refer to the short IDEF1X Introduction.