I am trying to write an alert for runaway thread growth, and while playing around with PromQL I found that the numbers don’t add up at all.
For confidentiality I can’t write the exact metrics I was using, so I will use placeholders, but it shouldn’t matter because this applies to any gauge metric.
First I tried getting the thread growth rate using idelta, `idelta(metric[$__range])`, which should give me the change between each data point, but apparently 4 - 3 = 2?
(Ignore the "normalized"; I forgot to change the label.)
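To make my expectation concrete, here is a minimal Python sketch of how I understand idelta: the difference between the last two samples inside the range window at a given evaluation time. The names `idelta`, `series`, `eval_time` and `range_s` are my own placeholders, the sample values are made up, and the inclusive window boundary is a simplification, so treat this as an illustration rather than how Prometheus actually implements it.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, gauge value)


def idelta(samples: List[Sample], eval_time: float, range_s: float) -> Optional[float]:
    """Difference between the last two samples in [eval_time - range_s, eval_time].

    Assumes `samples` is sorted by timestamp. The inclusive window is a
    simplification for illustration.
    """
    window = [v for t, v in samples if eval_time - range_s <= t <= eval_time]
    if len(window) < 2:
        return None  # not enough samples to form a difference
    return window[-1] - window[-2]


# Hypothetical gauge scraped every 15s (values are made up).
series = [(0, 3.0), (15, 3.0), (30, 4.0), (45, 4.0)]

print(idelta(series, eval_time=30, range_s=60))  # 4 - 3 -> 1.0
```

Under this reading, a step from 3 to 4 should show up as 1, never 2.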
I then tried doing it manually with `metric - metric offset 15s`, where 15s is our sampling interval. This worked fine, but when I compared it with the idelta function, I was astonished to find that idelta is wrong more often than it is correct.
(Yellow line is true, verified manually; blue line is idelta)
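For reference, this is roughly what I expect the manual version to compute, sketched in Python. It assumes an instant lookup returns the most recent sample at or before the requested time and ignores staleness handling; `value_at`, `offset_diff` and the sample values are hypothetical names and data of mine, not anything from Prometheus itself.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, gauge value)


def value_at(samples: List[Sample], t: float) -> Optional[float]:
    """Most recent sample value at or before time t (staleness is ignored)."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None


def offset_diff(samples: List[Sample], t: float, offset_s: float = 15.0) -> Optional[float]:
    """Sketch of `metric - metric offset 15s` evaluated at time t."""
    now = value_at(samples, t)
    before = value_at(samples, t - offset_s)
    if now is None or before is None:
        return None  # one side of the subtraction has no sample
    return now - before


# Hypothetical gauge scraped every 15s (values are made up).
series = [(0, 3.0), (15, 3.0), (30, 4.0), (45, 2.0)]

print([offset_diff(series, t) for t in (15, 30, 45)])  # [0.0, 1.0, -2.0]
```

With samples on an exact 15s grid, this and the last-two-samples difference should agree at every sample timestamp, which is why the mismatch surprised me.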
Whatever, I thought, at least I have the true metric I wanted. I then proceeded to sum it over time to get the rolling sum of added threads over 1 min (4 data points): `sum_over_time((metric - metric offset 15s)[1m:])`. Come to find out, it’s utter nonsense too!
I guess 1 + 0 + 6 + (-2) = -2 and (-1) + (-18) + (-2) + 2 = -2. It doesn’t even matter if there’s some timing boundary error, because even if I include an extra one or two points on either side, the math still doesn’t work out.
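Spelled out as a small Python sketch, this is the arithmetic I expect. The two four-point windows are the ones quoted above; the `values` series and the `rolling_sum` helper are made up for illustration. It also checks that a sum of four consecutive deltas telescopes to the gauge value now minus the value four samples earlier, which is a handy sanity check.

```python
from typing import List


def rolling_sum(deltas: List[float], window: int = 4) -> List[float]:
    """Sum of the last `window` deltas at each position that has a full window."""
    return [sum(deltas[i - window + 1 : i + 1]) for i in range(window - 1, len(deltas))]


# The two windows quoted in the post.
print(sum([1, 0, 6, -2]))     # 5
print(sum([-1, -18, -2, 2]))  # -19

# Hypothetical gauge values, one per 15s sample.
values = [10, 11, 11, 17, 15, 14]
deltas = [b - a for a, b in zip(values, values[1:])]  # [1, 0, 6, -2, -1]

print(rolling_sum(deltas))  # [5, 3]

# Telescoping: four consecutive deltas sum to values[i + 4] - values[i].
print([values[i + 4] - values[i] for i in range(len(values) - 4)])  # [5, 3]
```

So I would expect 5 and -19 for those two windows, not -2 for both.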
Can someone explain why this happens? Am I just using it completely wrong or something? This looks like the most basic integer algebra to me.