2.2 Disaster Recovery with etcd
This lab guides you through working with etcd in a Kubernetes cluster — from inspection and backup to simulating data loss and recovery.
0. Preparing
Before you begin, make sure you’re on the control plane node (the master node that runs etcd). You’ll need to install the etcd-client package, which provides the command-line tools for interacting with etcd.
Install etcd-client on control plane node
sudo apt install etcd-client
This gives you the etcdctl command you’ll use throughout this lab.
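If you want to confirm the installation worked, you can print the client version (the exact output depends on the package version your distribution ships):
ETCDCTL_API=3 etcdctl version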
1. Inspect etcd
Let’s first check if etcd is healthy and running properly.
Access etcd from the control plane node:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint status --write-out=table
This command connects to the local etcd service securely using certificates and prints the status in a table format. You should see information like the endpoint address, version, and database size.
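Besides the status table, etcdctl also offers a simple health probe. As an optional extra check, you can run it with the same connection flags:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health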
Now let’s see what’s actually stored in etcd.
List all keys:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get "" --prefix --keys-only
This shows all keys stored in etcd, which represent objects in your Kubernetes cluster (like pods, deployments, etc.). It’s a good way to see how Kubernetes stores its state.
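You can also fetch a single object to see how Kubernetes serializes its state. For example, the default namespace is typically stored under /registry/namespaces/default; expect mostly unreadable output, since the values are binary protobuf:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
get /registry/namespaces/default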
2. Create a Backup
Backing up etcd is very important! It contains the full state of your Kubernetes cluster. Let’s create a snapshot (a full backup) of the current etcd database.
Save a snapshot
sudo ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
You’ll now have a backup file saved as /root/etcd-backup.db.
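Before relying on the backup, it’s worth verifying it. etcdctl can report the snapshot’s hash, revision, and size (on newer etcd releases this subcommand may print a deprecation notice pointing to etcdutl, but it still works):
sudo ETCDCTL_API=3 etcdctl snapshot status /root/etcd-backup.db --write-out=table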
OPTIONAL
To protect your backup, you can encrypt it using GPG:
Encrypt the backup with GPG:
sudo gpg -c /root/etcd-backup.db
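GPG will prompt for a passphrase and write /root/etcd-backup.db.gpg. Should you ever need the encrypted copy for a restore, you can decrypt it like this:
sudo gpg -o /root/etcd-backup.db -d /root/etcd-backup.db.gpg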
3. Deploy a new nginx pod (after backup)
Next, let’s make a change to the cluster after the backup, so we can later verify that the restore worked correctly: add one more nginx deployment.
kubectl create deployment nginx-after-backup --image=nginx
kubectl get pods
This creates a new deployment. After we restore the cluster later, we’ll check if this deployment is gone — a good indicator that the restore succeeded.
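If you want to confirm the new deployment is fully up before we break things, you can wait for its rollout to finish:
kubectl rollout status deployment/nginx-after-backup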
4. Simulate etcd Failure
Now we’ll simulate a disaster by removing etcd data and stopping Kubernetes components.
⚠️ Be careful — this will break the cluster temporarily.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/kube-apiserver.yaml.bak
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/etcd.yaml.bak
sudo mv /var/lib/etcd /var/lib/etcd.bak
Explanation
- The first two commands remove the kube-apiserver and etcd manifests; the kubelet will then stop those static pods.
- The last command simulates etcd data loss by renaming the etcd data directory.
After a few moments, you’ll notice that kubectl no longer works. That’s expected: the kubelet is no longer running the API server.
It can sometimes take a while for the kubelet to notice the removed manifests. To speed up the process, restart the kubelet and check which containers remain on the control plane node.
sudo service kubelet restart
sudo crictl ps
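If the kubelet has picked up the change, etcd and kube-apiserver should no longer appear in the container list. A quick filter makes this easy to confirm (the command should print nothing):
sudo crictl ps | grep -E 'etcd|kube-apiserver'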
5. Restore the Backup
Let’s bring your cluster back to life using the backup you made earlier.
sudo ETCDCTL_API=3 etcdctl snapshot restore /root/etcd-backup.db \
--data-dir /var/lib/etcd
This command recreates the etcd database from the backup and places it in the correct directory.
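As a sanity check, the restored data directory should now contain a member subdirectory with the snap and wal data (the exact layout can vary slightly between etcd versions):
sudo ls /var/lib/etcd/member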
Restart etcd and the API server:
Move the manifest files back into place so the kubelet recreates the control plane components:
sudo mv /etc/kubernetes/kube-apiserver.yaml.bak /etc/kubernetes/manifests/kube-apiserver.yaml
sudo mv /etc/kubernetes/etcd.yaml.bak /etc/kubernetes/manifests/etcd.yaml
sudo service kubelet restart
Kubelet will automatically restart etcd and the API server once these manifest files are back.
Verify the cluster:
Wait 1–2 minutes for everything to come up again.
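Rather than guessing, you can poll the API server until it responds. This is a simple sketch; press Ctrl+C if it loops for much longer than a couple of minutes:
# Retry every 5 seconds until kubectl can reach the API server
until kubectl get nodes >/dev/null 2>&1; do
echo "waiting for the API server..."
sleep 5
done
kubectl get nodes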
OPTIONAL
If you want to check the startup logs for etcd and kube-apiserver, use these commands:
sudo crictl ps -a | grep etcd
sudo crictl logs <etcd-container-id>
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <apiserver-container-id>
Let’s see if the cluster is back up and running:
kubectl get nodes
kubectl get pods
You should see that the nginx-after-backup deployment is gone, because it wasn’t part of the restored etcd data. This confirms that the restore was successful!
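You can also verify this explicitly; kubectl should report a NotFound error for the deployment:
kubectl get deployment nginx-after-backup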
Congratulations!
You’ve successfully:
- Inspected etcd
- Created a backup
- Simulated a disaster
- Restored your cluster using the backup
This is a valuable skill in real-world Kubernetes operations.
End of Lab