Assume we are getting some statistics from a customer which we analyse and we send our comments to the customer. Now, the customer tells us that the statistic they computed between January and March are based on a wrong methodology and sends us corrected series. We want perform analysis with the wrong and with the correct set of data, which are huge and only differ from January to March, so that we want to avoid data duplication.
Therefore, we need something like synthetic database records implementing the following logic:
synthetic[1] = wrong_data
synthetic[2] = correct_data between Januar and March, wrong_data otherwise
With this, we can easily perform our analyses on synthetic records.
Should such synthetic records be implemented in the application logic or on the side of the database? What are common pitfalls of such an implementation?
Since synthetic records are actually data, I would … um, put them in the database.
The logic for retrieving them is relatively straightforward. You do an outer join on the synthetic table, and substitute the synthetic value if the outer join returns something other than a null. The simplest way is just to replace the entire record.
If you want to “version” the data, you can provide a sequence number field, at the cost of a little more complexity in the SQL to retrieve the most current version of the synthetic data. Since you probably only want the most recent revision in most cases, the most sensible approach is to put the current record data in a “history” table, and modify the original record in place.
Further Reading
How to version control a record in a database
SQL Strategies for Versioned Data
2