How do you store “fuzzy dates” into a database?

This is a problem I’ve run into a few times. Imagine you have a record that you want to store into a database table. This table has a DateTime column called “date_created”. This one particular record was created a long time ago, and you’re not really sure about the exact date, but you know the year and month. Other records you know just the year. Other records you know the day, month and year.

You can’t use a DateTime field, because “May 1978” isn’t a valid date. If you split it up into multiple columns, you lose the ability to query. Has anyone else run into this, if so how did you handle it?

To clarify the system I’m building, it is a system that tracks archives. Some content was produced a long time ago, and all that we know is “May 1978”. I could store it as May 1 1978, but only with some way to denote that this date is only accurate to the month. That way some years later when I’m retrieving that archive, I’m not confused when the dates don’t match up.

For my purposes, it is important to differentiate “unknown day in May 1978” with “May 1st, 1978”. Also, I would not want to store the unknowns as 0, like “May 0, 1978” because most database systems will reject that as an invalid date value.

Store all dates in normal DATE field in the database and have additional accuracy field how accurate DATE field actually is.

date_created DATE,
date_created_accuracy INTEGER,

date_created_accuracy: 1 = exact date, 2 = month, 3 = year.

If your date is fuzzy (e.g May 1980) store it at start of period (e.g. May 1st. 1980). Or if your date is accurate to year (e.g. 1980) store it as January 1st. 1980 with corresponding accuracy value.

This way can easily query in a somewhat natural way and still have notion how accurate dates are. For example this allows you to query dates between Jan 1st 1980 and Feb 28th 1981, and get fuzzy dates 1980 and May 1980.

If you don’t need to use this kind of data as regular date-time information, any simple string format would do.

But if you need to keep all the functionality, there are two workarounds I can think of, both requiring additional information stored in the database:

Create min date and max date fields, which have different values for “incomplete” data, but will coincide for accurate dates.
Create types for each kind of inaccurate date (none _ 0, date_missing _ 1, month_missing _ 2, year_missing_4, etc _ so you can combine them). Add a type field to the records and keep which information is missing.

This is really more of a requirements definition than a technical problem — what you need to focus on is “how can we define the dates in the past” and the technical solution will flow.

The times I’ve had to approach something like this we’ve typically:

Define how to map things — like MichaelT suggests, decide that anything that is defined as Month/Day becomes defined as midnight on the 1st of said month. This is typically good enough for most purposes — if the exact date was that important you probably would have a record of it 35 years later, right?
Figure out if you need to track this — IE, do records with slightly made up create dates need a flag saying so? Or is that just a user training issue so folks know and can act accordingly.

Sometimes one needs to do something like make the dates fuzzy — for example, that may one date might need to respond to a query for anything in May 1978. This is doable — just make your create_date 2 fields, old records get a 30 days spread as appropriate, new ones get 2 identical values.

The simplest way to denote if the date is accurate is to create an accuracy field INT(1) with default NULL

If the date is accurate store date-time in “date_created” & leave accuracy NULL

If date is only accurate to month store date-time as 1st of Month with accuracy value 1

If date is only accurate for year store date-time 1st of January with accuracy value 2

You can use different numbers to hold different values such as first quarter etc

In the past I’ve stored dates-with-accuracy as a start date and an end date. The day may21,2012 would be represented as start=12am,may21,2012 and end=12am,may22,2012. The year 2012 would be represented as start=12am,Jan1,2012 end=12am,Jan1,2013.

I’m not sure if I’d recommend this approach. When displaying the information to the user you need to properly detect that a date range exactly covers a day in order to show “may 25” instead of two over-specific endpoints (which means dealing with daylight savings and so forth).

However, when you’re not trying to translate to human, programming with the endpoints is a lot easier than with center+accuracy. You don’t end up with lots of cases. That’s pretty nice.

Why not store two dates.

Created_After and Created_Before. The actual semantics being “created on or after” and “created on or before”

So if you know the exact date then Created_After and Created_Before will be the same date.

If you know it was the first week in May 2000 then Created_After = ‘2000-05-01’ and Created_Before = ‘2000-05-07’.

If you just know May 1999 then the values will be ‘1999-05-01’ and ‘1999-05-30’.

If it’s “Summer of ’42” then the values would be ‘1942-06-01’ and ‘1942-08-31’.

This schema is simple to query with normal SQL, and quite easy for a non technical user to follow.

For instance to find all documents that might have been created in May 2001:

SELECT * FROM DOCTAB WHERE Created_After < '2001-05-31' And Created_Before > 2001-05-01;

Conversely to find all documents that were definitely created in May 2001:

SELECT * FROM DOCTAB WHERE Created_After > '2001-05-01' And Created_Before < 2001-05-31;

ISO 8601 date time format comes with duration definition, e.g.

2012-01-01P1M (read: 2012, January 1st, period: 1 month) is what should be “in January 2012”.

I would use this to store the data. You may need a database field of type String to do so. It’s a different topic how to conduct a sensible search on that.

Generally, I still store them as dates to general query buisness is still possible even if slightly less accurate.

If it is important to know the accuracy I have in the past either stored an accuracy “window” either as a +/- decimal or as a lookup (day, month, year, etc). In other cases instead of the window I just store the original date value as a string and convert what I can to a datetime, possibly 1978-05-01 00:00:00 and “May 1978” for your given example.

If you split it up into multiple columns, you lose the ability to
query.

Says who? Here’s what you do:

Have 3 columns, Day, Month, Year, each of int type, and a fourth column TheDate of DateTime type.
Have a trigger that uses the 3 columns Day, Month, Year to build TheDate if TheDate is left null but one or more of the Day, Month, Year fields has a value.
Have a trigger that populates Day, Month, Year fields when TheDate is supplied but these fields are not.

So if I do an insert like: insert into thistable (Day, Month, Year) values (-1, 2, 2012); then TheDate will become 2/1/2013 but I will know its really an indeterminate date in 2/2012 because of the -1 in the Day field.

If I insert into thistable (TheDate) values ('2/5/2012'); then Day will be 5, Month will be 2, and Year will be 2012 and because none of them is -1 I will know this is the exact date.

I don’t lose the ability to query because the insert/update trigger makes sure my 3 fields (Day, Month, Year) always produce a DateTime value in TheDate which can be queried.

Another option would be to store the dates as integers of the form YYYYMMDD.

You only know the year is 1951: Store as 19510000
You know the month and year is March 1951: Store as 19510300
You know the full date is March 14, 1951: Store as 19510314
A completely unknown date: Store as 0

Benefits

You can store your fuzzy date in one field instead of two date fields or a date and an accuracy as many of the other answers suggest.

Queries are still easy:

all records for the year 1951 – SELECT * FROM table WHERE thedate>=19510000 and thedate<19520000
all records for March 1951 – SELECT * FROM table where thedate>=19510300 and thedate<19510400
all records for March 14, 1951 – SELECT * FROM table where thedate=19510314

NOTES

Your GUI would need a GetDateString(int fuzzyDate) which is pretty easy to implement.
Sorting is easy with the int format. You should know that unknown dates will come first. You could reverse this by using 99 for the ‘padding’ instead of 00 for the month or day.

ISO 8601 also specifies a syntax for “fuzzy dates”. February 12th, 2012 at 3pm would be “2012-02-12T15” and February 2012 could be simply “2012-02”. This extends nicely using standard lexicographic sorting:

$ (echo "2013-03"; echo "2013-03"; echo "2012-02-12T15"; echo "2012-02"; echo "2011") | sort
2011
2012
2012-02
2012-02-12T15
2013-03

Here is my take on this:

Go from fuzzy date to datetime object (which will fit into a database)

import datetime
import iso8601

def fuzzy_to_datetime(fuzzy):
    flen = len(fuzzy)
    if flen == 4 and fuzzy.isdigit():
        dt = datetime.datetime(year=int(fuzzy), month=1, day=1, microsecond=111111)

    elif flen == 7:
        y, m = fuzzy.split('-')
        dt = datetime.datetime(year=int(y), month=int(m), day=1, microsecond=222222)

    elif flen == 10:
        y, m, d = fuzzy.split('-')
        dt = datetime.datetime(year=int(y), month=int(m), day=int(d), microsecond=333333)

    elif flen >= 19:
        dt = iso8601.parse_date(fuzzy)

    else:
        raise ValueError("Unable to parse fuzzy date: %s" % fuzzy)

    return dt

And then a function that takes the datetime object, and moves it back into a fuzzy date.

def datetime_to_fuzzy(dt):
    ms = str(dt.microsecond)
    flag1 = ms == '111111'
    flag2 = ms == '222222'
    flag3 = ms == '333333'

    is_first = dt.day == 1
    is_jan1 = dt.month == 1 and is_first

    if flag1 and is_jan1:
        return str(dt.year)

    if flag2 and is_first:
        return dt.strftime("%Y-%m")

    if flag3:
        return dt.strftime("%Y-%m-%d")

    return dt.isoformat()

And then a unit test. Did I miss any cases?

if __name__ == '__main__':
    assert fuzzy_to_datetime('2001').isoformat() == '2001-01-01T00:00:00.111111'
    assert fuzzy_to_datetime('1981-05').isoformat() == '1981-05-01T00:00:00.222222'
    assert fuzzy_to_datetime('2012-02-04').isoformat() == '2012-02-04T00:00:00.333333'
    assert fuzzy_to_datetime('2010-11-11T03:12:03Z').isoformat() == '2010-11-11T03:12:03+00:00'

    exact = datetime.datetime(year=2001, month=1, day=1, microsecond=231)
    assert datetime_to_fuzzy(exact) == exact.isoformat()

    assert datetime_to_fuzzy(datetime.datetime(year=2001, month=1, day=1, microsecond=111111)) == '2001'
    assert datetime_to_fuzzy(datetime.datetime(year=2001, month=3, day=1, microsecond=222222)) == '2001-03'
    assert datetime_to_fuzzy(datetime.datetime(year=2001, month=6, day=6, microsecond=333333)) == '2001-06-06'

    assert datetime_to_fuzzy(fuzzy_to_datetime('2002')) == '2002'
    assert datetime_to_fuzzy(fuzzy_to_datetime('2002-05')) == '2002-05'
    assert datetime_to_fuzzy(fuzzy_to_datetime('2002-02-13')) == '2002-02-13'
    assert datetime_to_fuzzy(fuzzy_to_datetime('2010-11-11T03:12:03.293856+00:00')) == '2010-11-11T03:12:03.293856+00:00'

There is a corner case where an event that precisely occurred at 2001-01-01T00:00:00.333333 but the system will interpret as being just “2001”, but that seems very unlikely.

I work for a publishing company that deals in lots of old books where we often cannot get the exact dates for things. We typically have two fields for a given date entry, the date and a circa boolean:

date date
dateCirca enum('Y', 'N')

We use the date field to indicate the date of some event, or a date that is “close enough” in the case where we don’t know the true date. In the event that we do not know the true date, we mark the dateCirca field as Y and give a close enough date, that is marked as the “1st”, such as

1st March, 2013  // We don't know the day of the month
1st January, 2013  // We don't know the month/day of the year
1st January, 2000  // We don't know the month/day/year, we only know the century

Overview

There are many possible representations, and thus database schemas, for storing fuzzy date-times (or even just fuzzy dates):

Date-time and code indicating its precision or accuracy
Date-time and interval where there are several possibilities for representing an interval:
1. Represent all intervals as an integer (or other numeric) quantity of some fixed unit, e.g. days, minutes, nanoseconds.
2. Represent an interval as both an integer (or other numeric) quantity and a code indicating its units.
Start and end date-times
String
Probability distribution:
1. Decimal or floating-point quantities for the parameters that specify a specific distribution in a particular family, e.g. mean and standard deviation of a normal distribution.
2. Probability distribution function, e.g. as a (lookup) code (potentially with parameters of specific values), or as an expression in a sufficiently expressive language, format, or representation.

[1], [2], and [3] are all (implicitly) uniform intervals, i.e. a set of (equally) possible points in time.

[4] is the most expressive, i.e. when allowing any possible (or at least arbitrarily long) written language sentences or phrases. But it’s also the hardest to work with. In the limit, human-level AI would be required to handle arbitrary values. Practically, the range of possible values would need to be restricted severely, and alternative ‘structured’ values would probably be preferred for many operations, e.g. sorting, searching.

[5] is probably the most general compact representation that’s (somewhat) practical.

Uniform Intervals

Uniform intervals are the simplest compact way to represent a set of (possible) date-time values.

For [1], portions of the date-time value are ignored, i.e. the portions corresponding to units finer than the indicated precision or accuracy; otherwise this is equivalent to [2] and the precision/accuracy code is equivalent to an interval with the same units (and an implied quantity of 1).

[2] and [3] are expressively equivalent. [1] is strictly less expressive than either as there are effective intervals that cannot be represented by [1], ex. a fuzzy date-time equivalent to a 12 hour interval that spans a date boundary.

[1] is easier for users to input than any other representation and should generally require (at least slightly) less typing. If date-times can be input in various text representations, e.g. “2013”, “2014-3”, “2015-5-2”, “7/30/2016 11p”, “2016-07-31 18:15”, the precision or accuracy could also be inferred automatically from the input.

The accuracy or precision of [1] is also easiest to convert to a form to be conveyed to users, e.g. ‘2015-5 with month accuracy’ to “May 2015”, versus “May 13th 2015 2p, plus or minus 13.5 days” (tho note that the latter can’t be represented by [1] anyways).

Strings

Practically, string values will need to be converted to other representations for querying, sorting, or otherwise comparing multiple values. So while any written natural (human) language is strictly more expressive than [1], [2], [3], or [5], we don’t yet have the means of handling much beyond standard text representations or formats. Given that, this is probably the least useful representation by itself.

One advantage of this representation tho is that values should, in practice, be presentable to users as-is and not require transformation to be easily understandable.

Probability Distributions

Probability distributions generalize the uniform interval representations [1], [2], [3], and (arguably) are equivalent to the (general) string representation [4].

One advantage of probability distributions over strings is that the former is unambiguous.

[5-1] would be appropriate for values that (mostly) conform to an existing distribution, e.g. a date-time value output from a device for which measurements are known (or thought) to conform to a specific distribution.

[5-2] is probably the best (somewhat) practical way to compactly represent arbitrary ‘fuzzy datetime’ values. Of course the computability of the specific probability distributions used matters and there are definitely interesting (and perhaps impossible) problems to be solved when querying, sorting, or comparing different values, but a lot of this is probably already known or solved somewhere in the existing mathematical and statistical literature so this definitely stands as an extremely general and un-ambiguous representation.

I really like James Anderson’s solution – Accurately bounding the dates is the way to get the most flexible query structure. Another way of achieving the same is to use a start, end or even center date plus an interval (available at least in PostgreSQL, Oracle and SQLAlchemy).

I would simply store the exact time for normal dates and make the time portion of the fuzzy date generic like 00:00:00. I would then make all the fuzzy dates the 1st of the month.

When you query, you

check for date ranges where the time is also equal to 00:00:00 (fuzzy)
check for date ranges where the time is NOT equal to 00:00:00 (real)
check for date ranges but ignore the time portion (combined)

There are better solutions than this, but I personally hate metadata (data about my data). It just has a habit of getting out of hand after a while.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 12:36

Thẻ: database, date-format

Thiết kế website giá rẻ

Danh mục