2.2 Disaster Recovery with etcd
This lab guides you through working with etcd in a Kubernetes cluster — from inspection and backup to simulating data loss and recovery.
0. Preparing
Before you begin, make sure you’re on the control plane node (the master node that runs etcd). You’ll need to install the etcd-client package, which provides the command-line tools for interacting with etcd.
Install etcd-client on control plane node
sudo apt install etcd-client
This gives you the etcdctl command you’ll use throughout this lab.
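If you want to confirm the installation worked, you can print the client version (the exact output depends on the package version your distribution ships):
ETCDCTL_API=3 etcdctl version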
1. Inspect etcd
Let’s first check if etcd is healthy and running properly.
Access etcd from the control plane node:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
This command connects to the local etcd service securely using certificates and prints the status in a table format. You should see information like the endpoint address, version, and database size.
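Besides the status table, etcdctl also offers a simple health probe. As an optional extra check, you can run it with the same connection flags:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health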
Now let’s see what’s actually stored in etcd.
List all keys:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get "" --prefix --keys-only
This shows all keys stored in etcd, which represent objects in your Kubernetes cluster (like pods, deployments, etc.). It’s a good way to see how Kubernetes stores its state.
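You can also fetch a single object to see how Kubernetes serializes its state. For example, the default namespace is typically stored under /registry/namespaces/default; expect mostly unreadable output, since the values are binary protobuf:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get /registry/namespaces/default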
2. Create a Backup
Backing up etcd is very important! It contains the full state of your Kubernetes cluster. Let’s create a snapshot (a full backup) of the current etcd database.
Save a snapshot
sudo ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
You’ll now have a backup file saved as /root/etcd-backup.db.
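Before relying on the backup, it’s worth verifying it. etcdctl can report the snapshot’s hash, revision, and size (on newer etcd releases this subcommand may print a deprecation notice pointing to etcdutl, but it still works):
sudo ETCDCTL_API=3 etcdctl snapshot status /root/etcd-backup.db --write-out=table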
OPTIONAL
To protect your backup, you can encrypt it using GPG:
Encrypt the backup with GPG:
sudo gpg -c /root/etcd-backup.db
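GPG will prompt for a passphrase and write /root/etcd-backup.db.gpg. Should you ever need the encrypted copy for a restore, you can decrypt it like this:
sudo gpg -o /root/etcd-backup.db -d /root/etcd-backup.db.gpg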
3. Deploy a new nginx pod (after backup)
Next, let’s make a change to the cluster after the backup, so we can later verify that the restore worked correctly: add one more nginx deployment.
kubectl create deployment nginx-after-backup --image=nginx
kubectl get pods
This creates a new deployment. After we restore the cluster later, we’ll check if this deployment is gone — a good indicator that the restore succeeded.
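If you want to confirm the new deployment is fully up before we break things, you can wait for its rollout to finish:
kubectl rollout status deployment/nginx-after-backup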
4. Simulate etcd Failure
Now we’ll simulate a disaster by removing etcd data and stopping Kubernetes components.
⚠️ Be careful — this will break the cluster temporarily.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/kube-apiserver.yaml.bak
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.bak
sudo mv /var/lib/etcd /var/lib/etcd.bak
Explanation
- The first two commands remove the kube-apiserver and etcd manifests; the kubelet will then stop those static pods.
- The last command simulates etcd data loss by renaming the etcd data directory.
After a few moments, you’ll notice that kubectl no longer works. That’s expected: the kubelet is no longer running the API server.
It can sometimes take a while for the kubelet to notice the removed manifests. To speed up the process, restart the kubelet and check which containers remain on the control plane node.
sudo service kubelet restart
sudo crictl ps
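If the kubelet has picked up the change, etcd and kube-apiserver should no longer appear in the container list. A quick filter makes this easy to confirm (the command should print nothing):
sudo crictl ps | grep -E 'etcd|kube-apiserver'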
5. Restore the Backup
Let’s bring your cluster back to life using the backup you made earlier.
sudo ETCDCTL_API=3 etcdctl snapshot restore /root/etcd-backup.db \
--data-dir /var/lib/etcd
This command recreates the etcd database from the backup and places it in the correct directory.
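As a sanity check, the restored data directory should now contain a member subdirectory with the snap and wal data (the exact layout can vary slightly between etcd versions):
sudo ls /var/lib/etcd/member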
Restart etcd and the API server:
Move the manifest files back into place so the kubelet recreates the control plane components:
sudo mv /etc/kubernetes/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
sudo mv /etc/kubernetes/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
sudo service kubelet restart
Kubelet will automatically restart etcd and the API server once these manifest files are back.
Verify the cluster:
Wait 1–2 minutes for everything to come up again.
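Rather than guessing, you can poll the API server until it responds. This is a simple sketch; press Ctrl+C if it loops for much longer than a couple of minutes:
# Retry every 5 seconds until kubectl can reach the API server
until kubectl get nodes >/dev/null 2>&1; do
echo "waiting for the API server..."
sleep 5
done
kubectl get nodes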
OPTIONAL
If you want to check the startup logs for etcd and kube-apiserver, use these commands:
sudo crictl ps -a | grep etcd
sudo crictl logs <etcd-container-id>
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <apiserver-container-id>
Let’s see if the cluster is back up and running:
kubectl get nodes
kubectl get pods
You should see that the nginx-after-backup deployment is gone, because it wasn’t part of the restored etcd data. This confirms that the restore was successful!
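You can also verify this explicitly; kubectl should report a NotFound error for the deployment:
kubectl get deployment nginx-after-backup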
Congratulations!
You’ve successfully:
- Inspected etcd
- Created a backup
- Simulated a disaster
- Restored your cluster using the backup
This is a valuable skill in real-world Kubernetes operations.
End of Lab