I have a system that checks the status of a number of other systems: whether HTTP is up, whether the host responds to ping, that sort of thing.
I check all of these systems every 5 minutes. If something is broken, I send out an e-mail with a notice. If the problem is cleared I send out an “all clear, problem resolved” notice too.
I want to send out a reminder 30 minutes after a problem is detected if the problem is still ongoing, and then at failure +1h, +2h, +4h, etc. I do not want to flood the user with a message every time the event “problem still ongoing” occurs, which is every 5 minutes.
I could store a “last notification sent at” somewhere and just work with that but I would rather just solve it in the code itself. Am I missing something very obvious or is this the only feasible way?
You could first of all store the status of all systems connected with a given recipient, e.g.
[ {
    'recipient': '[email protected]',
    'situation': [
      { 'system': 'alnitak',
        'failed': [
          { 'subsystem': 'http',
            'failures': [
              { 'code': '80701',
                'text': 'SYN unacknowledged',
                'timestamp': 1375189804,
                'alertno': 1,
                'nextalert': 1375193404 },
              ...
            ]
          },
          ...
        ]
      },
      ...
    ]
  },
  ...
]
Every five minutes you save each recipient’s extant block, and generate a new one.
You now need a function that walks two such blocks (old and new) and sorts all items into three blocks with the same structure as the “status block”:
- present in new only: add new issue to “NEW ISSUES” block.
- present in old only: add old issue to “RESOLVED” block.
- present in both: check nextalert and alertno. If nextalert has not expired, do nothing. If it has expired, the item is added to the “UNRESOLVED” block, alertno is incremented, and its value is used to decide how much to add to the current timestamp to get the new value for nextalert.
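The three-way split above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: it assumes the nested status block has been flattened into a dict keyed by a hypothetical (system, subsystem, code) tuple, and the names diff_status, next_delay and BASE_DELAY are made up for the example. The delay rule is chosen so that reminders land at failure +30m, +1h, +2h, +4h, as asked in the question.

```python
BASE_DELAY = 30 * 60  # seconds; assumption: first reminder at failure + 30 min


def next_delay(alertno):
    """Seconds to wait after alert number `alertno` (1 = the initial notice)."""
    # 1 -> 30m, 2 -> 30m, 3 -> 1h, 4 -> 2h, ... so reminders fall at
    # failure +30m, +1h, +2h, +4h when the check runs on time.
    return BASE_DELAY * 2 ** max(alertno - 2, 0)


def diff_status(old, new, now):
    """Split issues into (new_issues, resolved, unresolved).

    `old` and `new` map an issue key, e.g. (system, subsystem, code), to a
    dict holding at least 'timestamp'. `new` is updated in place with
    'alertno' and 'nextalert' so it can be saved as the next run's `old`.
    """
    new_issues, resolved, unresolved = {}, {}, {}
    for key, issue in new.items():
        if key not in old:
            # Present in new only: a new issue, schedule its first reminder.
            issue["alertno"] = 1
            issue["nextalert"] = now + next_delay(1)
            new_issues[key] = issue
            continue
        # Present in both: carry the scheduling state over from the last run.
        prev = old[key]
        issue["timestamp"] = prev["timestamp"]
        issue["alertno"] = prev["alertno"]
        issue["nextalert"] = prev["nextalert"]
        if now >= issue["nextalert"]:
            issue["alertno"] += 1
            issue["nextalert"] = now + next_delay(issue["alertno"])
            unresolved[key] = issue
        # Otherwise do nothing: no reminder is due yet.
    for key, issue in old.items():
        if key not in new:
            # Present in old only: the issue has been resolved.
            resolved[key] = issue
    return new_issues, resolved, unresolved
```

Each run would call diff_status(old, new, now) with the saved block and the freshly generated one, then persist new for the next run.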
The important thing is the “do nothing” in the above loop. It ensures that when nothing changes, no alert is sent even if there are ongoing issues.
This architecture allows small tweaks. For example, if there are neither new nor resolved issues (this matters, because both may require timely decisions, such as informing a customer that a service is back online), you can make an additional pass over the status blocks and round the “next alert” time in the new block to the nearest multiple of, say, 15 minutes, before calling the function that builds the unresolved block.
This has the effect of postponing alerts so that they go out together at 15-minute intervals. Consider this timeline without the tweak:
10:05 alert 1 goes out, and is scheduled for 15 minutes from now: 10:20
10:10 alert 2 goes out, and is scheduled for 15 minutes from now: 10:25
10:15 nothing happens
10:20 alert 1 is sent ("Still unresolved") and rescheduled for 10:50
10:25 alert 2 is sent ("Still unresolved") and rescheduled for 10:55
With the tweak, both alerts would still go out at their exact detection times the first time, but the two “Still unresolved” reminders would then go out together at 10:30, so three emails are sent instead of four. The reminder for alert 1 is delayed by 10 minutes in this example.
(Other time-tweaking tricks can be played.) Note: if you snooze the nextalert value, remember that a difference in this value does not count when comparing old and new status blocks.
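As a rough sketch of the snoozing idea, the helper below pushes each nextalert to a 15-minute boundary, and a second helper compares issues while ignoring that field. The names align_nextalert and same_issue are hypothetical, timestamps are assumed to be in seconds, and rounding up (rather than to the nearest multiple) is an assumption made here so that a reminder is only ever postponed, never sent early.

```python
def align_nextalert(issues, slot=15 * 60):
    """Round each issue's 'nextalert' up to the next multiple of `slot` seconds."""
    for issue in issues.values():
        # Ceiling division using integer arithmetic only.
        issue["nextalert"] = -(-issue["nextalert"] // slot) * slot
    return issues


def _without(issue, ignore):
    """Copy of `issue` with the ignored fields dropped."""
    return {k: v for k, v in issue.items() if k not in ignore}


def same_issue(a, b, ignore=("nextalert",)):
    """Compare two issue dicts while ignoring scheduling-only fields."""
    return _without(a, ignore) == _without(b, ignore)
```

With times counted in seconds since midnight, reminders due at 10:20 and 10:25 both move to 10:30, which matches the example above.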
Now that we have the three (possibly empty) blocks, a third function can compose an email by walking them and writing out text, or simply return without sending anything if all three blocks are empty:
Dear Mr. Serni,
The following issues are now marked SOLVED:
System: albutain
Reason: fail to respond to PING
Alert : 27 Jul 2013, 20:27:33 UTC
The following issues are NEW and require your attention:
...
The following issues are still unresolved:
...
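A composing function along those lines might look like the sketch below. It assumes each block is a dict keyed by a hypothetical (system, subsystem, code) tuple whose values carry 'text' and 'timestamp'; compose_email and format_issue are illustrative names, and actually sending the message is left out.

```python
from datetime import datetime, timezone


def format_issue(key, issue):
    """Render one issue in the three-line layout used in the sample email."""
    system, subsystem, _code = key
    when = datetime.fromtimestamp(issue["timestamp"], tz=timezone.utc)
    return (f"System: {system}\n"
            f"Reason: {issue.get('text', subsystem)}\n"
            f"Alert : {when:%d %b %Y, %H:%M:%S} UTC\n")


def compose_email(name, new_issues, resolved, unresolved):
    """Return the email body, or None when all three blocks are empty."""
    sections = [
        ("The following issues are now marked SOLVED:", resolved),
        ("The following issues are NEW and require your attention:", new_issues),
        ("The following issues are still unresolved:", unresolved),
    ]
    if not any(block for _, block in sections):
        return None  # nothing to report, so no email is sent
    lines = [f"Dear {name},", ""]
    for heading, block in sections:
        if block:  # skip empty sections entirely
            lines.append(heading)
            lines.extend(format_issue(k, v) for k, v in block.items())
    return "\n".join(lines)
```

Returning None for three empty blocks keeps the "no news, no email" behaviour in one place; the caller only sends when it gets a body back.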
If all three blocks are empty for a given period, you may still want to send an email, just as a reminder that the checking system is alive and well. A subject line might suffice. This periodic email could also be used to report unresolved issues, which would otherwise only be mentioned when an issue is resolved or a new one appears, and would “starve” in the meantime.
I think the most straightforward way of achieving what you want is to store the state of the monitored systems, along with whether a notification has been sent, each time your process runs. Depending on how many systems and metrics you are monitoring, this could be as simple as writing a few entries to a flat text file.
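For instance, a flat-file store could be as small as the sketch below, which keeps the state as JSON. The function names, the file layout and the keys ('status', 'last_notified') are assumptions made for illustration only.

```python
import json
import os


def load_state(path):
    """Return the saved state, or an empty dict on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the state to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    # os.replace avoids leaving a half-written state file behind.
    os.replace(tmp, path)
```

Each five-minute run would call load_state, decide what to send, update the entries and call save_state.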