I’m developing a site to monitor web services. The most basic type of check is sending a ping and storing the response time in a CheckLog object. By default, PingCheck objects are triggered every minute, so in one hour you get 60 CheckLogs and in one day you get 1440 CheckLogs.
That’s a lot of them, and I don’t need to store that level of detail, so I’ve set up a collapsing mechanism that periodically takes the uncollapsed CheckLogs older than 24h and collapses (averages) them into intervals of 30 minutes. So, if you have 360 CheckLogs that were saved from 0:00 to 6:00, after collapsing you retain just 12 of them. The problem, well, is this:
After averaging the response times, the graph changes drastically. What can I do to improve this? I guess one option would be to narrow the interval duration to 15 minutes.
I’ve seen the graphs at the GitHub status page and they do not seem to suffer from this problem.
I’d appreciate any kind of information you could give me about this area.
As @Uri Agassi pointed out, there looks to be quite a bit of variance in your data. Given that, I think you mostly care about the “shape” of the curve rather than the individual points themselves. So the question becomes a different one: how do I preserve the “shape” to within a certain tolerance? Fortunately there are good algorithms for this type of problem, one of the standards being this one, which you can use to reduce the number of segments in the polyline that makes up your graph. The unfortunate reality, though, is that the amount of storage will no longer be deterministic, because it now depends on the variance of the data, unless you pick a different polyline reduction technique.
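To make the idea concrete, here is a minimal sketch of one widely used polyline simplification algorithm, Ramer-Douglas-Peucker. The link in the answer above is not preserved here, so treating it as this particular algorithm is an assumption on my part. The names `perpendicular_distance` and `simplify`, and the `(minute, response_ms)` point layout, are made up for illustration.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from `point` to the straight line through `start` and `end`."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points that deviate from the simplified
    line by no more than `tolerance`. Iterative, so long series do not hit
    Python's recursion limit."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        first, last = stack.pop()
        index, max_dist = -1, 0.0
        # Find the point farthest from the segment first..last.
        for i in range(first + 1, last):
            d = perpendicular_distance(points[i], points[first], points[last])
            if d > max_dist:
                index, max_dist = i, d
        if max_dist > tolerance:
            # That point carries "shape": keep it and refine both halves.
            keep[index] = True
            stack.append((first, index))
            stack.append((index, last))
    return [p for p, k in zip(points, keep) if k]


# Hypothetical data: (minute, response time in ms) with one spike.
points = [(0, 100), (1, 101), (2, 99), (3, 100), (4, 400), (5, 100), (6, 101)]
print(simplify(points, tolerance=5))
# → [(0, 100), (4, 400), (6, 101)]
```

The spike survives while the flat stretches collapse to their endpoints, which is exactly what averaging loses. Note that the distance here mixes minutes and milliseconds, so in practice you would probably want to scale the axes first or measure only the vertical deviation.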
Your graph before averaging shows that you have a lot of variance. This means that the graph before averaging may actually show you less information – mostly noise. To get valuable information about problems you may have (like latency), you might want to keep a running window average, and watch for trends in your data, which might point to problems.
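As a rough illustration of a running window average, the sketch below smooths a series of response times over the last `window` samples. The function name `rolling_mean` and the sample values are hypothetical, not taken from the question.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean of the most recent `window` values at each position.
    The first few outputs average over fewer samples."""
    buf = deque()
    total = 0.0
    out = []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Slide the window: forget the oldest sample.
            total -= buf.popleft()
        out.append(total / len(buf))
    return out


print(rolling_mean([100, 200, 300, 400], window=2))
# → [100.0, 150.0, 250.0, 350.0]
```

A wider window gives a smoother trend line at the cost of reacting more slowly to real changes in latency.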
Of course, keeping track of the minimum, maximum and standard deviation for each interval will also give you valuable information.
You can also use third-party services (NewRelic, for example), which can receive your reports and give you all this information, as well as alarms that notify you about suspicious changes.