Context
I’m designing a database which, simplified, should be able to handle users sending job requests to each other, and after that a job can be started, finished, and reviewed. The design should be scalable (think millions of users).
Approaches I’ve considered:
Gargantuan table
One approach, probably not the best one, would be to simply store ALL jobs in one huge table, jobs. This table would need a state column to represent which state the job is currently in (e.g. ACCEPTED, STARTED, FINISHED, REVIEWED, etc.). The biggest problem with this approach that I can see is that jobs in different states have different types of data that are relevant to them. For example, a job request has a preliminary agreed-upon price, but that could change before the job is started, and change again before the job is finished. This could of course be solved by just adding more columns to the table and naming them properly, but one table containing all the different types of possible data for all the different possible states of a job will probably become a huge performance bottleneck very early.
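For illustration, such a table would end up looking roughly like this (the column names are just examples I made up):

CREATE TABLE jobs (
    job_id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    requester_id     BIGINT NOT NULL,
    receiver_id      BIGINT NOT NULL,
    state            VARCHAR(20) NOT NULL,   -- ACCEPTED, STARTED, FINISHED, REVIEWED, ...
    requested_price  DECIMAL(10, 2),         -- only relevant while the job is a request
    started_price    DECIMAL(10, 2),         -- only relevant once the job has started
    final_price      DECIMAL(10, 2),         -- only relevant once the job is finished
    review_text      TEXT                    -- only relevant once the job is reviewed
);

Most of those state-specific columns would be NULL for most rows.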
Different tables for different states
This approach would be to have multiple tables, for example job_requests, jobs_started, and jobs_finished, which in turn can have sub-states, e.g. job_requests could have the sub-states PENDING and ACCEPTED, while the jobs_finished table would have the sub-states COMPLETED, CANCELLED, and REVIEWED.
With this approach each table only contains data which is relevant to the current job state, but on the other hand some data might be duplicated (for example the user ids of the job requester, and job receiver — on the other hand this information could be stored in yet another table?).
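For example, the first two tables might look roughly like this (again, illustrative names only), with the requester/receiver ids repeated in both:

CREATE TABLE job_requests (
    job_request_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    requester_id    BIGINT NOT NULL,
    receiver_id     BIGINT NOT NULL,
    sub_state       VARCHAR(20) NOT NULL,   -- PENDING, ACCEPTED
    agreed_price    DECIMAL(10, 2)
);

CREATE TABLE jobs_started (
    job_started_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    requester_id    BIGINT NOT NULL,        -- duplicated from job_requests
    receiver_id     BIGINT NOT NULL,        -- duplicated from job_requests
    started_at      TIMESTAMP NOT NULL,
    current_price   DECIMAL(10, 2)
);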
The problem with this approach is that I can't think of a good solution for how to archive all the information when transitioning between states. For example, once a job request has been accepted and then started, it should be deleted from the job_requests table and moved into the jobs_started table. But it's only a matter of time before a stakeholder wants to know, say, the average time between a job request being created and the job being started, at which point I'd need the data from the job_requests table to calculate it.
It feels like this type of problem should be easy to solve, but I really can’t think of any good solution which “feels right”, any solution I come up with feels ugly and I can immediately think of a number of things which makes the solution bad.
Very grateful for any feedback or tips on approaches I could take. Thanks in advance!
Sounds like you have 3 major categories of data you are trying to store:
- General job data (job id, job requester id, job receiver id etc)
- State transitions (job started, job finished)
- State-specific job data
- (optional) job-related events (price changed, job receiver user reassigned etc)
The key is to separate event-like data from everything else.
Schema design
Here are some details:
1. General jobs table
All of the information that’s NOT state-specific goes into the (let’s say) “jobs” table. Auto-generated primary key: job_id
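A minimal sketch, assuming a few common columns (the exact columns and the identity/auto-increment syntax will depend on your RDBMS):

CREATE TABLE jobs (
    job_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    requester_id  BIGINT NOT NULL,
    receiver_id   BIGINT NOT NULL,
    created_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);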
2. State transitions table
All information about state transitions goes into a "job_state_transitions" table, which might have the following columns:
- job_transition_id
- job_id
- created_at
- from_state
- to_state
Ideally this table is append-only. Nothing is ever updated or removed here.
Using such a table, you can find out the latest status of any particular job by selecting the latest row for a given job_id from the job_state_transitions table. You can further denormalize that and introduce a "job_state" column, the contents of which are updated every time a new row is inserted into job_state_transitions (stored procedures might help here if that's your thing).
You can also do all sorts of analysis on state transitions, because the timing data is preserved (created_at is a date/time field that can help with that).
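For example (illustrative SQL only: the identity syntax, LIMIT vs. TOP/FETCH FIRST, and date arithmetic all differ between databases, and the state names are just examples):

CREATE TABLE job_state_transitions (
    job_transition_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_id             BIGINT NOT NULL REFERENCES jobs (job_id),
    created_at         TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    from_state         VARCHAR(20),            -- NULL for the very first transition
    to_state           VARCHAR(20) NOT NULL
);

-- Latest state of a single job: the most recent transition row wins
SELECT to_state
FROM job_state_transitions
WHERE job_id = 42
ORDER BY created_at DESC
LIMIT 1;

-- Example analysis: average time from "request created" to "job started"
SELECT AVG(s.created_at - r.created_at)
FROM job_state_transitions r    -- the "request created" transition
JOIN job_state_transitions s    -- the "job started" transition
  ON s.job_id = r.job_id
WHERE r.to_state = 'PENDING'
  AND s.to_state = 'STARTED';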
3. State-specific job data
All of the state-specific data goes into “[state]-jobs” tables. Primary key: some sequence id. Main index: job_transition_id
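For example, the table for the FINISHED state might look like this (everything beyond the keys is invented for the example):

CREATE TABLE finished_jobs (
    finished_job_id    BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_transition_id  BIGINT NOT NULL REFERENCES job_state_transitions (job_transition_id),
    final_price        DECIMAL(10, 2),
    review_text        TEXT
);

CREATE INDEX finished_jobs_transition_idx ON finished_jobs (job_transition_id);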
4. Optional event data
You can also introduce an "audit trail" table that allows you to keep track of various changes that users might request for each job, like your "change the agreed-upon price" example. This is a generalization of the state transition table: one main table that contains events and one supplemental table for each event type (for example, a price_changes table with job_id, created_at, from_price and to_price columns).
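A sketch of such a supplemental table, using the columns mentioned above (types are assumptions):

CREATE TABLE price_changes (
    price_change_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_id           BIGINT NOT NULL REFERENCES jobs (job_id),
    created_at       TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    from_price       DECIMAL(10, 2) NOT NULL,
    to_price         DECIMAL(10, 2) NOT NULL
);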
Scaling to millions of users
If the main “jobs” table grows to be unwieldy, you can shard it by job_id or requesting_user_id or something like that.
Likewise, the events table should be append-only and can be rotated, or purged of events related to jobs that have already been finished.
How about a hybrid of the two? A large table with a JobID and state (and perhaps other information common to all jobs and states), with other tables to manage the additional state information. That reduces (or eliminates) the duplication of data, but keeps things more manageable.
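Roughly, such a hybrid could look like this (table and column names are only examples):

-- Common data and current state for every job
CREATE TABLE jobs (
    job_id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    requester_id  BIGINT NOT NULL,
    receiver_id   BIGINT NOT NULL,
    state         VARCHAR(20) NOT NULL    -- REQUESTED, STARTED, FINISHED, ...
);

-- One detail table per state, joined on job_id when needed
CREATE TABLE job_request_details (
    job_id        BIGINT PRIMARY KEY REFERENCES jobs (job_id),
    agreed_price  DECIMAL(10, 2)
);

CREATE TABLE job_finished_details (
    job_id        BIGINT PRIMARY KEY REFERENCES jobs (job_id),
    final_price   DECIMAL(10, 2),
    finished_at   TIMESTAMP
);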
You might want to consider using a noSQL database for this specific data.
This way you can just put your object in a record, no matter what the 'columns' are.
Depending on the chosen database system (and implemented solution), you would be able to 'overwrite' a record with different data if you want. Of course, you can also save your historic data and just PUT a new record in the database.
You can start out with this record:
myStartedObject{
id: 1,
state: 'started',
agreedUponPrice: 100.00,
someOtherData: 'x'
}
and over time it would change to something like this (or, if you want to keep the history records, the id would change):
myFinishedObject{
id: 1,
state: 'finished',
finishedPrice: 209.00,
someFinishedData: 'y'
}
Storing all state data inside the object is also an option of course, something like this:
myObject
{
id: 1,
states: [{ state: 'started', agreedUponPrice: 100.00, someOtherData: 'x'},
{ state: 'finished', finishedPrice: 209.00, someFinishedData: 'y'}]
}
Keep in mind, this surely isn't a silver bullet, but it might be something to consider, as you state your 'schema' is quite loose and you don't want to pollute your RDBMS with all of those columns.
Also keep in mind that many RDBMSs can do this as well. I know that in MS SQL Server you have the XML data type. With this, you can store your data in an XML column, which can contain your serialized object, for instance.
It’s not a ‘real’ noSQL solution, but it looks a bit like one.
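In SQL Server, for example, that could look roughly like this (column names are assumptions):

CREATE TABLE jobs (
    job_id      INT IDENTITY PRIMARY KEY,
    state       VARCHAR(20) NOT NULL,
    state_data  XML NULL    -- serialized, state-specific payload
);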
If you treat this problem as a workflow problem, you can employ one of the workflow tools, such as jBPM or Windows Workflow Foundation, focus on the business logic, and let the tool handle the persistence for you.