Our service has multiple nodes, and each node produces a request_duration_histogram for each path it serves requests on, so our scrape looks something like this (just using a couple of paths, a couple of nodes and three buckets, but the real implementation has a lot more):
request_duration_histogram_bucket{le="Infinity", path="path1", source="source1"}
request_duration_histogram_bucket{le="Infinity", path="path2", source="source1"}
request_duration_histogram_bucket{le="1", path="path1", source="source1"}
request_duration_histogram_bucket{le="1", path="path2", source="source1"}
request_duration_histogram_bucket{le="0.1", path="path1", source="source1"}
request_duration_histogram_bucket{le="0.1", path="path2", source="source1"}
request_duration_histogram_bucket{le="Infinity", path="path1", source="source2"}
request_duration_histogram_bucket{le="Infinity", path="path2", source="source2"}
request_duration_histogram_bucket{le="1", path="path1", source="source2"}
request_duration_histogram_bucket{le="1", path="path2", source="source2"}
request_duration_histogram_bucket{le="0.1", path="path1", source="source2"}
request_duration_histogram_bucket{le="0.1", path="path2", source="source2"}
request_duration_histogram_count{path="path1", source="source1"}
request_duration_histogram_sum{path="path1", source="source1"}
request_duration_histogram_count{path="path2", source="source1"}
request_duration_histogram_sum{path="path2", source="source1"}
request_duration_histogram_count{path="path1", source="source2"}
request_duration_histogram_sum{path="path1", source="source2"}
request_duration_histogram_count{path="path2", source="source2"}
request_duration_histogram_sum{path="path2", source="source2"}
Now we’re trying to calculate in Grafana the 0.95 quantile (we’re aware of the limitations of quantile calculations starting from a histogram) of the service as a whole (so aggregating across both paths and sources) with the following query:
histogram_quantile(
  0.95,
  sum(
    rate(
      request_duration_histogram_bucket[$__rate_interval]
    )
  ) by (le)
)
but we’re getting the following warning: PromQL info: input to histogram_quantile needed to be fixed for monotonicity (and may give inaccurate results) for metric name ""
Looking at the docs, my understanding is that the message is warning us that, by aggregating across nodes and paths, we’re breaking a requirement on the function’s input.
The rate for a path can decrease between two scrapes, and this breaks the monotonicity of the vector passed to histogram_quantile? This seems to be a possibility for pretty much any metric that rate is applied to, so I’m not sure whether the message is overzealous or whether we actually cannot use histograms for this type of query.
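For reference, this is the inner aggregation on its own (same metric and interval as in the query above); graphing it per le is one way to check whether a bucket with a larger le ever ends up with a smaller value than a bucket with a smaller le:
sum(
  rate(request_duration_histogram_bucket[$__rate_interval])
) by (le)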
Another couple of notes:
- using sum(increase(...)) instead of sum(rate(...)) makes the message go away (see the sketch after this list)
- we cannot add path or source to the by (...) clause, since we want to see the quantile across ALL the paths and nodes
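For completeness, this is a sketch of the increase-based variant mentioned in the first note (same metric and interval as above), which does not produce the warning for us:
histogram_quantile(
  0.95,
  sum(
    increase(
      request_duration_histogram_bucket[$__rate_interval]
    )
  ) by (le)
)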
Should we revert to using a Summary with quantiles for this (even if they would need to be aggregated server-side in Grafana)?
Thank you!