Automated notification system: detects outages in the system and, once the outage settles, triggers a notification to all users who were affected during the outage period.
Assumption: in 96% of cases, an outage lasts <= 60 minutes.
So we only support notifying users whose failures fall within the last 60 minutes of the outage window. Users affected earlier than that will not be notified once the system recovers.
Constraints: at most around 2 * 10^5 orders could fail in an hour, which equals the total `order_api` req/hour.
I am planning to use a state-machine approach that tracks the live order failure rate over 5-minute windows using a scheduler (it runs every 5th minute and checks the failure rate of the orders from the last 5 minutes). Based on that, we detect whether there is an outage in the system.
- If at any `5 * nth` minute the order failure rate, `total_5xx_order_api_5m * 100 / total_rps_order_api_5m`, exceeds a certain threshold `t`, we update the order outage state in the DB to `{order_outage: active}`, which represents that the system is down for most users.
- On every 5xx from `order_api` for a user, the system saves `user_id | timestamp` in a table named `failed_orders`, where `timestamp` is the time when `order_api` failed.
- As we support a 60-minute outage window, we want a time-to-live of 60 minutes on every `user_id | timestamp` record, set on insertion.
- Identify if the outage has settled (tricky): the system should be able to identify when the outage has settled. For that I am planning to keep a list of the last 3 order failure rates in memory, for example `[..., r1, r2, r3]`, where `r1, r2, r3 > t`. If at any later `5 * nth` minute the list becomes `[r4, r5, r6]`, where `r4, r5, r6 < t`, the failure rate has been under the threshold for the last 15 minutes. In this case the system assumes the outage has settled and moves the state from active to inactive: `{order_outage: inactive}`.
- As soon as the `order_outage` state moves from active to inactive, the system triggers a notification to all users whose orders failed during this period. We traverse users with the query `select * from failed_orders where timestamp < cur_time - 10min` and trigger a notification to each of them, deleting users from the table in parallel once their notification has been triggered.
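To make the transitions above concrete, here is a minimal Python sketch of the state machine, not a definitive implementation. The class name `OutageStateMachine`, the method `on_tick`, and the threshold value used in the usage note are assumptions for illustration. It also assumes a window with zero requests counts as healthy, and it keeps the last three rates in memory, as described above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def failure_rate(total_5xx: int, total_requests: int) -> float:
    """Percentage of order_api requests that returned 5xx in the window."""
    if total_requests == 0:
        return 0.0  # assumption: an idle window counts as healthy
    return total_5xx * 100 / total_requests


@dataclass
class OutageStateMachine:
    threshold: float            # t, in percent (the value is an assumption)
    settle_windows: int = 3     # 3 windows of 5 minutes = 15 minutes
    state: str = "inactive"     # mirrors {order_outage: active | inactive}
    recent: deque = field(init=False)

    def __post_init__(self) -> None:
        # Only the last `settle_windows` failure rates are kept.
        self.recent = deque(maxlen=self.settle_windows)

    def on_tick(self, total_5xx: int, total_requests: int) -> Optional[str]:
        """Call once per 5-minute scheduler run; returns the transition, if any."""
        rate = failure_rate(total_5xx, total_requests)
        self.recent.append(rate)

        if self.state == "inactive" and rate > self.threshold:
            self.state = "active"
            return "inactive->active"

        settled = (
            len(self.recent) == self.settle_windows
            and all(r < self.threshold for r in self.recent)
        )
        if self.state == "active" and settled:
            self.state = "inactive"
            return "active->inactive"  # the caller would trigger notifications here
        return None
```

With a hypothetical threshold of 20%, `OutageStateMachine(threshold=20.0)` goes active on the first tick above 20% and returns `"active->inactive"` only after three consecutive ticks below it.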
A few design decisions:
- We need to keep inserting users with failed orders into the table even when no outage is detected. If we only started capturing order failures at some `5 * nth` minute, we would have no data for users whose orders failed before that minute but still fell within the outage period.
- Because of the first point, we need a TTL on our rows. Without a TTL, the failed-order data would keep growing indefinitely if the outage threshold is never met.
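The two decisions above (always record failures, and expire rows after 60 minutes) can be pictured with a small sketch. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs on its own; the function names, the integer epoch-second timestamps, and the idea of emulating the TTL with a periodic delete are assumptions, not the setup described in this post.

```python
import sqlite3

TTL_SECONDS = 60 * 60  # 60-minute retention, matching the supported outage window


def open_store() -> sqlite3.Connection:
    """Create an in-memory failed_orders table with an index on timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE failed_orders (user_id TEXT NOT NULL, timestamp INTEGER NOT NULL)"
    )
    # The index keeps range scans and deletes on timestamp cheap.
    conn.execute("CREATE INDEX idx_failed_orders_ts ON failed_orders (timestamp)")
    return conn


def record_failure(conn: sqlite3.Connection, user_id: str, ts: int) -> None:
    """Called on every order_api 5xx, whether or not an outage is active."""
    conn.execute(
        "INSERT INTO failed_orders (user_id, timestamp) VALUES (?, ?)", (user_id, ts)
    )


def purge_expired(conn: sqlite3.Connection, now: int) -> int:
    """Emulate the TTL: delete rows older than TTL_SECONDS, return rows removed."""
    cur = conn.execute(
        "DELETE FROM failed_orders WHERE timestamp < ?", (now - TTL_SECONDS,)
    )
    return cur.rowcount
```

In this sketch `purge_expired` would be called from the same 5-minute scheduler, so the table never holds much more than one hour of failures even if the threshold is never crossed.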
Better alternative suggestions:
- Any suggestions to improve the current design further or make it simpler?
- Could there be an edge case that this system would miss, such as failing to send notifications to some particular type of user?
- We have evaluated Redis (for TTL) + MySQL (for traversal), but the issue is that while we traverse the `failed_orders` table in MySQL, we have to keep checking in Redis (network call 1) whether the key (`user_id`) has expired, and if it has expired we also have to delete the record from MySQL (network call 2). We want to avoid this kind of setup.
- Can anyone suggest a better single-DB alternative that has good TTL support and good performance on range queries (on `timestamp` in this case)? I have heard good things about Bigtable and ScyllaDB, but both seem suited to cases with a lot of data. In our case the data is small, around `2 * 10^5` records, but we need similar TTL and range-query features.