I’m designing an enterprise application that produces output based on some input data from another application. In this case, I happen to have also built and thus do control the source application and data, database structure, etc., but this question also applies to cases where I am using input data from a third-party application, so I’ll assume that going forward.
The new application should only produce output once for a given set of input data; it will pull data from the source, produce output, and maintain state so that the same input data is not processed the next time it runs.
What I’m wondering is: what is a good design approach for storing that state, so that I’m not pulling the entire universe of source data when I only need to operate on a small subset of it?
Three options I can think of are:
- Modify the schema of the third-party source database so I can store my state there.
- Query the data via a join between the third-party database and my new database and filter the results within that query based on state that is stored in the new database.
- Warehouse the source data within my new database and perform all queries against the warehoused data.
I have worked with commercial enterprise applications that use all of these approaches, but each of them has its downsides:
With option 1, one obvious downside is that modifying a third-party database is, at best, discouraged by most vendors and, at worst, explicitly forbidden in license agreements. Furthermore, future vendor updates to the schema could wipe out data or fail due to conflicts caused by your modifications. The bottom line for me, though, is that I just don’t want to store data for a “destination” application in a “source” application’s database, regardless of whether I “own” both databases or not, but especially if the source is a third-party application. If the data is not used by the application itself, it doesn’t belong in its database.
With option 2, one downside is that the databases may be migrated to different servers, and if joins are not constructed properly with fully-qualified object names — including dynamically-provided DB server names (because of the aforementioned possibility of them being migrated to different servers) — then the application will break. Another downside is that the source data might not even be stored in the same format as the new database (e.g. a different DBMS) thus precluding the possibility of a join. Most importantly for me, though, is the fact that this approach violates Separation of Concerns because either the data layer has to perform some awkward and arguably-inefficient surgery to rend the two entities into separate domain objects or the domain layer has to have some gnarly beast hybrid of the two disparate entities. Either way, it smells.
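To make option 2 concrete, here is roughly the kind of query I have in mind. This is only a sketch: the linked-server, database, table, and column names are made up, and I’m assuming SQL Server four-part naming and pyodbc purely for illustration.

```python
# Sketch of option 2: a cross-database join filtered by state kept in my new database.
# All object names (SOURCESRV, SourceDb, dbo.Orders, dbo.ProcessedState) are hypothetical.
import pyodbc

def fetch_unprocessed(conn: pyodbc.Connection, source_server: str = "SOURCESRV"):
    # The source table is referenced with a fully-qualified four-part name; the
    # linked-server name comes from configuration so a migration to a different
    # server only requires a config change, not a code change.
    sql = f"""
        SELECT src.OrderId, src.CustomerId, src.Amount
        FROM [{source_server}].[SourceDb].[dbo].[Orders] AS src
        LEFT JOIN dbo.ProcessedState AS st
               ON st.OrderId = src.OrderId
        WHERE st.OrderId IS NULL  -- only rows I have not yet produced output for
    """
    return conn.execute(sql).fetchall()
```

Even written this way, the query hands my data layer a result set that mixes the two concerns, which is exactly the smell I describe above.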
Option 3 sounds the best to me, but it still has the downside that there is overhead involved in populating and maintaining the warehouse. Some “process” has to do this. Where does that process run…from some middle-tier server? How often does it run, and what happens if your application wants data that hasn’t been updated yet? Who administers it?
Does anyone know of another approach I haven’t considered, or can anyone offer any insight into any of the above approaches I have considered (for instance, suggest modifications to the approaches that preclude the downsides or explain to me why one approach is best despite its downsides)?
There are some questions to be asked:
- Does the external data have a timestamp or incremental primary key that would let you know which items you have already processed?
- Do you need to process only new data, or also old data that has been modified?
If the external data has a timestamp or incremental PK, the state you need is simply the last timestamp or key value processed. You can store that in a single table in the destination database.
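A minimal sketch of what that state table and its accessors might look like (the table and column names are invented, and I’m using Python with sqlite3 purely for illustration; any DBMS would do):

```python
import sqlite3

# One row per external feed, holding the last incremental key (or timestamp) processed.
STATE_DDL = """
CREATE TABLE IF NOT EXISTS sync_state (
    source_name   TEXT PRIMARY KEY,  -- which external feed this watermark belongs to
    last_key_seen INTEGER NOT NULL   -- last incremental PK (or timestamp) processed
);
"""

def get_watermark(conn: sqlite3.Connection, source: str) -> int:
    row = conn.execute(
        "SELECT last_key_seen FROM sync_state WHERE source_name = ?", (source,)
    ).fetchone()
    return row[0] if row else 0  # 0 means nothing has been processed yet

def set_watermark(conn: sqlite3.Connection, source: str, last_key: int) -> None:
    conn.execute(
        "INSERT INTO sync_state (source_name, last_key_seen) VALUES (?, ?) "
        "ON CONFLICT(source_name) DO UPDATE SET last_key_seen = excluded.last_key_seen",
        (source, last_key),
    )
    conn.commit()
```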
I would suggest the following approach (a rough sketch of the jobs follows the list):
- A cron job A reads the external database (the consumer should be responsible for pulling the data) and populates a temporary staging table.
- Another cron job B reads that data, which is now in your own database, processes it, and generates the output in the final tables. It then saves the “state” in a table (the last timestamp or incremental key processed from the external database).
- Another cron job C deletes from the staging table any data that has already been processed (you can infer which rows to delete by comparing against the “state” value saved by job B).
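Here is a rough sketch of the three jobs, building on the watermark helpers sketched above. The table names and the assumption of an incremental integer key are mine, so adapt as needed:

```python
import sqlite3

def job_a_pull(source: sqlite3.Connection, dest: sqlite3.Connection) -> None:
    """Cron job A: copy anything newer than the saved watermark into the staging table."""
    last = get_watermark(dest, "orders")  # helper from the sketch above
    rows = source.execute(
        "SELECT order_id, customer_id, amount FROM orders WHERE order_id > ?", (last,)
    ).fetchall()
    dest.executemany(
        "INSERT INTO staging_orders (order_id, customer_id, amount) VALUES (?, ?, ?)", rows
    )
    dest.commit()

def job_b_process(dest: sqlite3.Connection) -> None:
    """Cron job B: produce output from staged rows, then advance the watermark."""
    rows = dest.execute(
        "SELECT order_id, customer_id, amount FROM staging_orders ORDER BY order_id"
    ).fetchall()
    for order_id, customer_id, amount in rows:
        dest.execute(
            "INSERT INTO output_invoices (order_id, customer_id, amount) VALUES (?, ?, ?)",
            (order_id, customer_id, amount),
        )
    if rows:
        set_watermark(dest, "orders", rows[-1][0])
    dest.commit()

def job_c_cleanup(dest: sqlite3.Connection) -> None:
    """Cron job C: drop staged rows at or below the watermark job B has already handled."""
    dest.execute(
        "DELETE FROM staging_orders WHERE order_id <= ?", (get_watermark(dest, "orders"),)
    )
    dest.commit()
```

One thing to watch: if job A can run again before job B finishes, you would want to stage only rows above the highest key already staged (or make the staging insert idempotent) so the same rows are not queued twice.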
This approach works for new data only. If you also need to process old data (data that was previously processed but has since changed at the source), things get more complicated.
Note that this approach differs from your third option (the data warehouse) in that the data is only stored temporarily.
The ideal solution would be for your source application to expose a SOA layer to give external applications access to the data.
Failing that (or where the query volume is large and/or intensive), database replication is a good solution, always assuming the technology is in place to do it reliably.
Failing the above two solutions, I would go for a full data dump! It’s easier to code up (often just using DBMS utilities), and there are no messy re-synchronization, timing, and control issues to deal with. Besides, any solution that attempts to pull only the latest changes will still need a full data dump process for the initial load and for the inevitable re-sync after operational screw-ups.
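As an illustration only, and assuming the source happened to be PostgreSQL, the dump step is little more than shelling out to the stock pg_dump utility (the host, user, database name, and output path below are placeholders):

```python
import subprocess
from datetime import date

# Full dump using the DBMS's own utility (pg_dump here, assuming a PostgreSQL source).
# Host, user, database name and output path are placeholders.
dump_file = f"/var/dumps/source_full_{date.today():%Y%m%d}.sql"
subprocess.run(
    ["pg_dump", "--host", "source-db.example.com", "--username", "readonly",
     "--file", dump_file, "source_db"],
    check=True,
)
```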
I came up with a “hybrid” solution that you may want to give some thought. I understand it’s not desirable to modify a third-party database, and sometimes it isn’t even possible because you don’t have access to the vendor’s sources, right?
In my case, the source data did not have anything I could “reference”, like an “updatedOn” column or any other reliable timestamp or log. I also needed to track all CRUD operations, not just additions.
So what I ended up doing was placing triggers on the tables that need to be monitored, and all these triggers do is populate my own “info” tables with what happened to which record. I poll those tables every 30 seconds, but in your case it could be daily. The point is, you know exactly what changed and don’t process any extra data. After processing, wipe the processed rows from your info tables.
In my case, the only “foreign” objects in the third-party database are the triggers and one additional table for change logging. This approach has survived upgrades from the third party; just run your script to re-create the triggers after an upgrade if needed.
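Roughly, the pieces look like this. The trigger is written in SQL Server syntax, every table and column name is made up, and the polling loop is just a sketch, so adapt it to whatever DBMS and schedule you actually have:

```python
import time
import pyodbc

# The only "foreign" objects added to the third-party database: one change-log table
# and one trigger per monitored table. SQL Server syntax; all names are invented.
CHANGELOG_DDL = """
CREATE TABLE dbo.my_order_changes (
    change_id   INT IDENTITY PRIMARY KEY,
    order_id    INT NOT NULL,
    change_type CHAR(1) NOT NULL,          -- 'I', 'U' or 'D'
    changed_at  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);
"""

TRIGGER_DDL = """
CREATE TRIGGER dbo.trg_orders_changes ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- rows in "inserted" were added or updated; rows only in "deleted" were removed
    INSERT INTO dbo.my_order_changes (order_id, change_type)
    SELECT i.order_id,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted i
    UNION ALL
    SELECT d.order_id, 'D'
    FROM deleted d
    WHERE NOT EXISTS (SELECT 1 FROM inserted i2 WHERE i2.order_id = d.order_id);
END
"""

def process_change(order_id: int, change_type: str) -> None:
    # Placeholder for whatever output the destination application actually produces.
    print(f"would produce output for order {order_id} ({change_type})")

def poll_changes(conn: pyodbc.Connection, interval_seconds: int = 30) -> None:
    """Read accumulated changes, hand them off for processing, then wipe them."""
    while True:
        rows = conn.execute(
            "SELECT change_id, order_id, change_type "
            "FROM dbo.my_order_changes ORDER BY change_id"
        ).fetchall()
        for change_id, order_id, change_type in rows:
            process_change(order_id, change_type)
        if rows:
            conn.execute(
                "DELETE FROM dbo.my_order_changes WHERE change_id <= ?", (rows[-1][0],)
            )
            conn.commit()
        time.sleep(interval_seconds)
```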