We have an entity Item
which has these attributes:
product
type
subId
typeOfPublish
semver
dateTimepublished (when this was released)
and a few more
The key is:
product
type
subId
typeOfPublish
They are all strings. typeOfPublish is an enum in code but a string in DynamoDB.
We also have an entity History, which tracks every time an item is published. We are using single-table design, so the SK prefix tells me whether a row is an item or a history record. The PK for all items and their related history is "<product>#<type>#<subId>".
The SK is the prefix ("item" or "history") followed by:
"#<typeOfPublish>" for an item
"#<typeOfPublish>#<dateTimeTillMillisecondsUtc>" for history
Example: "history#pub#2024-09-01T20:59:47.780886800Z"
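To make the key layout concrete, here is a minimal sketch of how the keys are put together. The helper names (partitionKey, itemSortKey, historySortKey) and the sample values are made up for illustration and are not our real code; it also assumes the timestamp comes from java.time.Instant, whose toString() gives ISO-8601 in UTC.

```kotlin
import java.time.Instant

// Hypothetical helpers; names are illustrative only.
fun partitionKey(product: String, type: String, subId: String): String =
    "$product#$type#$subId"

fun itemSortKey(typeOfPublish: String): String = "item#$typeOfPublish"

// Instant.toString() emits an ISO-8601 UTC timestamp with a trailing 'Z'.
fun historySortKey(typeOfPublish: String, publishedAt: Instant): String =
    "history#$typeOfPublish#$publishedAt"

fun main() {
    val publishedAt = Instant.parse("2024-09-01T20:59:47.780886800Z")
    println(partitionKey("prodA", "typeX", "42"))
    println(itemSortKey("pub"))
    println(historySortKey("pub", publishedAt))
}
```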
In addition to this, we are using amazon-dynamodb-lock-client on the PK to make sure we do not insert duplicate rows. We have done some stress testing with similar requests going to 3 nodes from 4 different client VMs, and it is working.
In most cases this makes the PK-SK combination unique. I was thinking of an edge case: two VMs have slightly different clocks and receive similar requests at different times, but because their clocks are off, both end up generating the same value for "#<typeOfPublish>#<dateTimeTillMillisecondsUtc>". Can that happen?
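To illustrate the edge case I have in mind, here is a small sketch. The stampedSk helper, the 5 ms skew, and the timestamps are all invented for the example; it only shows how two requests arriving at different real times could end up with the same sort key.

```kotlin
import java.time.Duration
import java.time.Instant
import java.time.temporal.ChronoUnit

// Hypothetical helper: the SK a VM would generate at `trueTime`
// if its clock is off by `skew`.
fun stampedSk(typeOfPublish: String, trueTime: Instant, skew: Duration): String =
    "history#$typeOfPublish#${trueTime.plus(skew).truncatedTo(ChronoUnit.MILLIS)}"

fun main() {
    val requestA = Instant.parse("2024-09-01T20:59:47.780Z")
    val requestB = requestA.plusMillis(5) // arrives 5 ms later in real time

    val skA = stampedSk("pub", requestA, Duration.ZERO)         // VM A: accurate clock
    val skB = stampedSk("pub", requestB, Duration.ofMillis(-5)) // VM B: clock 5 ms behind

    println(skA)
    println(skB)
    println(skA == skB) // two distinct requests, one sort key
}
```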
To guard against this, it seems I need a check when inserting the history row so that the write only ever creates a new row and fails if it would be an update (a row with the same PK and SK already exists).
Is the following check enough?
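import software.amazon.awssdk.enhanced.dynamodb.Expression
import software.amazon.awssdk.enhanced.dynamodb.model.PutItemEnhancedRequest
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException

// Only create the row; fail if an item with the same pk and sk already exists.
// The enhanced client takes an Expression object here, not a raw string.
val putItemRequest = PutItemEnhancedRequest.builder(History::class.java)
    .item(history)
    .conditionExpression(Expression.builder().expression("attribute_not_exists(sk)").build())
    .build()
try {
    historyTable.putItem(putItemRequest)
} catch (e: ConditionalCheckFailedException) {
    log("Item with pk: $pk and sk: ${history.sk} already exists", e)
    // TODO retry with a new SK, in a loop say max 5 times, maybe add a random number after the time or a UUID?
}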
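For the TODO, this is roughly the retry I have in mind. It is only a sketch: putWithUniqueSk is a made-up name, and tryPut stands in for the conditional putItem call, returning false where the real code would catch ConditionalCheckFailedException. Here it is exercised against an in-memory set instead of DynamoDB.

```kotlin
import java.util.UUID

// Hypothetical sketch: `tryPut` represents the conditional write and returns
// false when a row with that SK already exists.
fun putWithUniqueSk(baseSk: String, maxAttempts: Int = 5, tryPut: (String) -> Boolean): String {
    var sk = baseSk
    repeat(maxAttempts) {
        if (tryPut(sk)) return sk
        // Collision: keep the timestamp prefix so rows still sort by time,
        // and append a random UUID to make the key unique.
        sk = "$baseSk#${UUID.randomUUID()}"
    }
    error("Could not insert history row after $maxAttempts attempts for $baseSk")
}

fun main() {
    val base = "history#pub#2024-09-01T20:59:47.780886800Z"
    val existing = mutableSetOf(base) // simulate a row that is already there

    val sk = putWithUniqueSk(base) { candidate ->
        existing.add(candidate) // add() returns false if the key is already present
    }
    println(sk) // the base SK followed by "#<random UUID>"
}
```

Appending the suffix after the timestamp keeps the existing "history#<typeOfPublish>#<time>" prefix intact, so prefix queries on the SK should still work.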
Questions:
- Is it realistic for two VMs to have different clock times?
- Is this a good way of handling it?
- Is there any data on how often this happens if we don't take any extra mitigating steps?
- We are using Zulu time (UTC). The mitigations I know of are:
(a) Use the Amazon Time Sync Service, which is available by default on AWS instances.
(b) Regularly check and configure NTP settings (how? which commands? can it be automated?).
Are there any others?
We have over 400 VMs and just 4 DevOps engineers, so we need something automated: a one-time setup plus a monthly check.