In our organization, we have a number of systems running on Flink 1.16.
We use PrometheusReporterFactory to expose our metrics for Prometheus to scrape.
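For context, a reporter setup of this kind usually looks roughly like the following in `flink-conf.yaml`. This is a minimal sketch rather than our exact configuration: the reporter name `prom` and the port value are assumptions chosen for illustration.

```yaml
# Sketch of a Prometheus reporter setup for Flink 1.16 (flink-conf.yaml).
# "prom" is an arbitrary reporter name; the port is an assumed value.
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249
```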
Because the Flink system metrics carry dynamically defined labels, we are experiencing a cardinality explosion in our Prometheus from the huge number of time series created.
With many operators spread across many TaskManagers and task slots, the number of time series becomes enormous because of dynamic labels such as task_attempt_id, task_id, tm_id and more, most of which are never used or queried by the SRE team.
Is there any way to reduce the cardinality? For example, is there a way to exclude specific labels from being exported by Flink?
Thanks.
We tried to reduce the cardinality by disabling the latency metrics, as presented in this issue,
but this did not produce any significant decrease in cardinality.
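For reference, latency tracking in Flink is controlled by the `metrics.latency.interval` option. The snippet below is a sketch of what that change looks like, assuming it is set cluster-wide in `flink-conf.yaml` rather than per job:

```yaml
# Sketch: latency tracking is disabled when the interval is 0 or negative.
# Assumes the option is set in flink-conf.yaml and not overridden per job.
metrics.latency.interval: 0
```

This only affects the latency histograms, so the dynamic labels on the other task and operator metrics remain unchanged.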