I have a complex Prometheus query that runs successfully and quickly in Thanos but consistently times out in Grafana on big time range like 120 days. The query is as follows:
avg(sum_over_time(
(sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}) >= bool 1)[$__range:]
) / count_over_time(
(sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}))[$__range:]
) * 100)
with this query I want to get the global availability of my applications.
The query works perfectly in Thanos without timeouts.
Grafana still times out despite increasing the timeout settings.
The memory usage of thanos-storegateway is high, but it’s managed within the increased resource limits.
Why does the query time out in Grafana but not in Thanos?
Are there other settings in Grafana or Thanos that I should adjust?
or is there a way to simplify the query ?
I would be really thankful if anyone have any ideas to help me
5