Imagine we have a CRUD system with products. Each Product
can have Attributes
that represent properties of a product, like size, color, weight, etc. It’s a typical EAV database case (Entity, Attribute, Value).
Keeping it simple, the database scheme is as follows:
product
id | name | description
attributes
id | name | data_type
eav_relations
id | entity_id | attribute_id
decimal_values
id | eav_relation_id | value (column type decimal)
varchar_values
id | eav_relation_id | value (column type varchar)
…_values
Similar as the other values tables but for different data types, like bool_values, file_values, etc.
To give context:
The table eav_relations keeps a connection between a product and its attributes.
The …_values tables contain an eav_relation_id, that points to a row in eav_relations. An attribute can have multiple values (like a collection of ‘tags/keywords’ or multiple files).
When loading a product, we can build the whole product aggregate from the database structure which is obvious.
But when creating/updating a product, the client sends the product representation as follows:
{
"name": "The product name",
"description": "Example product.",
"attribute_groups": [
{
"group_name": "Product propeties",
"attributes": [
{
"attribute_id": 3,
"values": [433.9]
},
{
"attribute_id": 9,
"values": ["wifi", "LAN"]
}
]
},
{
"name": "Shipping",
"attributes": [
{
"attribute_id": 19,
"values": [12.5]
},
{
"attribute_id": 24,
"values": ["ECommercePost Ltd."]
}
]
}
]
}
When creating a product, it’s relatively simple to create the relational data in the database.
But when updating it’s more complex. Attributes could have been added or removed, values of attributes could be changed/added/removed. The easiest way is to just delete and re-create all rows in eav_relations that correspond to the product, and delete and re-create all rows in the …_values tables. When re-creating, auto increment will generate new id’s. The disadvantage is that row ids change on every update of the product while this is technically not very important as long as id<->id couplings stay true.
Otherwise, you have to provide row id’s of all eav_relations and …_values in the JSON of an update request. Then check if the rows with those id’s exist and if so, update them. When there are “null” id’s, it means the attribute is newly added to the product, and if an attribute value has a “null” id, it means the value is new. Also, before processing the request, you have to load the whole product aggregate, and check if there are rows in the database with an id that no more occurs in the request JSON. In that case, the row should be deleted (because an attribute or an attribute value is dropped from the product).
The JSON for an update request would then become something as follows:
{
"name": "The product name",
"description": "Example product.",
"attribute_groups": [
{
"group_name": "Product propeties",
"attributes": [
{
"eav_relation_id": 2, // Existing eav_relation id provided in request, the attribute is preserved.
"attribute_id": 3, // attribute_id is the same. If the attribute id would have been changed, the column will be updated in eav_relations. It's like the attribute would be swapped with some other...
"values": [
{
"id": 291, // Existing decimal_values row id is provided, the row will be kept but the value column will be updated.
"value": 123.9
}
]
},
{
"eav_relation_id": null, // A null id means the attribute is added.
"attribute_id": 98,
"values": [
{
"id": null,
"value": "orange"
},
{
"id": null,
"value": "blue"
}
]
}
// There was an attribute with id 9, (and values "wifi", "LAN") but this is no more present in the
// request. It's considered as dropped and thus the corresponding rows will be deleted.
]
},
// The group "Shipping" is no more present in the request, and so the eav_relation_ids do not occur in the request.
// All eav_relations and ..._values rows for the product, that have a row id that is no more present in the request
// will be deleted.
]
}
It’s obvious that this brings a lot of complexity and it’s even confusing to understand.
My idea is to always keep the product row in products table and thus the product id never changes, because this is the actually important identifier that could be used in an URL for example. The other ‘deeper’ id’s are actually just for keeing relations between data internally.
My question is, is it common practise to just delete and re-create the whole aggregate, or should we update and “recycle” database rows to avoid auto-increment and keep the original row id’s that are internally used to connect rows.
2
Removing all attributes and recreating them is like shooting a sparrow with a cannon. If some other tables or information rely on the records that you have and cascade upon their removal, then you might have data loss. Also, if the script goes wrong midway, then you will have an unfixable state and you will have nothing better than roll back until you figure out what’s wrong.
Instead, you could add a unique identifier that you can easily reproduce at eav_relations
for the product and attribute and you can create temporary tables for attributes
, decimal_values
and varchar_values
where you will insert the current state of things as well as a temporary eav_relations
, where you will also have this new unique identifier (which should match the correspondent value in the actual table if it’s the same attribute of the same product, essentially). So you can remove from eav_relations
and cascade all records whose correspondent is unmatched in the temporary counterpart, perform the updates and insertions.
Also, if you already had decimal or varchar values and the value table where the attribute is stored has changed, then you will need to convert and migrate the attribute values too.
4