Kubernetes backup and restore done right

Imagine starting your day as a DevOps engineer with everything calm and under control. You’re sipping your pumpkin spice latte and planning out all the things you want to get done. Suddenly alarms start blaring and all the dashboards flash red.

The control plane of your production Kubernetes cluster is down and services are rapidly failing. Colleagues Slack you left and right with increasing panic, including a message from your CEO that just says “?”, asking what’s going on.

In an instant, backups are your only lifeline. In this blog post, we’ll look at why Kubernetes backups are essential for your clusters, the different types of backup and the popular tools for performing them, and how Palette helps you protect application-critical components and achieve business continuity.

Why back up a cluster if you can recreate it with IaC?

If you are here reading this blog, I doubt that I need to tell you about the importance of backing up your IT systems. Having a robust disaster recovery plan is a well-established best practice for building a fault-tolerant system and maximizing service uptime.

Many adopters of Kubernetes also use Infrastructure-as-Code (IaC) to provision infrastructure, with manifests and Helm charts to deploy to the Kubernetes cluster. IaC enables you to quickly re-create infrastructure after a disaster, so you might wonder if this negates the need to maintain backups. Long story short: it helps, but no.

Using IaC does indeed allow you to quickly recreate clusters and the infrastructure they run on, in a consistent ‘cattle, not pets’ way that cloud native aficionados love. But this is often insufficient to return the cluster to its original state before the disaster struck.

What’s missing from a newly recreated cluster?

To start, IaC is not able to restore the data in stateful applications such as databases or data stored in Persistent Volumes (PV), so if your cluster has any stateful workloads, that data will not be recovered even if you recreate the cluster.

Even completely ephemeral clusters will often have stateful dependencies, which limits your options for the restore process. If you recreate the cluster with IaC instead of restoring it, you will need to tear down all the stateful dependencies and recreate them too, which can be an intractable operation.

Let’s say you have an application running in your Kubernetes cluster that processes incoming orders and saves them to a database. Even if you use IaC to recreate all the infrastructure and the cluster resources following a disaster, you won’t be able to recover all the data in your database: in other words, all your customer order data will be lost.

Even if your Kubernetes application is completely stateless, many services may rely on annotations, labels, or specific configurations that are dynamically generated at runtime by third-party integrations or other services. None of these will be included in the IaC definitions. In addition, your application may integrate with external dependencies such as messaging queues, which won’t be connected to your newly rebuilt cluster, leading to additional manual effort to restore a working application.

By now it’s clear that you can’t rely on IaC alone to provide your disaster recovery plan when system failures happen. IaC doesn’t restore the data in your cluster, nor is it likely to return your cluster to a fully functional state. However, the ability to recreate parts of your workloads is something that you can certainly take advantage of when devising your backup and recovery strategy.

With that out of the way, let’s look at your Kubernetes backup options.

What are your Kubernetes backup solutions?

Generally, there are two types of backups for Kubernetes clusters:

Etcd backups

Etcd is a distributed key-value database that stores the state of the cluster, including configurations, secrets, and API objects. Some lightweight Kubernetes distributions, such as K3s, can use databases other than etcd, but the data stored in the database is the same. In those cases, you would back up whichever backend database the cluster uses, and it will still contain the state of the cluster.

If you are using a managed Kubernetes service, such as Amazon EKS, etcd backups are often handled by the service provider, and you will not have access to the nodes themselves to perform your own etcd backups anyway.

Full cluster backups

Full cluster backups can back up every object in a Kubernetes cluster as well as any persistent volumes. Full cluster backups are often performed using tools that are specifically designed for Kubernetes.

Unlike backups for most other IT systems, a full Kubernetes backup does not take the form of a single file such as a tar file. In most cases, your backup tool queries the Kubernetes API to produce JSON files that represent the state of the cluster, with all its resource objects and metadata.

For persistent volumes, your backup tool communicates with a Container Storage Interface (CSI) driver, or directly with cloud APIs such as the AWS EBS snapshot API, to produce volume snapshots. The JSON files and the volume snapshots are then uploaded to locations of your choosing.
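
To make the idea of "JSON files that represent the state of the cluster" concrete, here is a minimal Python sketch. It is not how Velero or any other tool lays out its backups: the directory layout, the `export_objects` helper, and the sample objects are all invented for illustration, and a real tool would read the objects from the Kubernetes API rather than from a hard-coded list.

```python
import json
import tempfile
from pathlib import Path


def export_objects(objects, out_dir):
    """Write each object to <out_dir>/<namespace>/<kind>/<name>.json (hypothetical layout)."""
    written = []
    for obj in objects:
        meta = obj["metadata"]
        # Cluster-scoped objects such as PersistentVolumes have no namespace.
        namespace = meta.get("namespace", "_cluster")
        target = Path(out_dir) / namespace / obj["kind"] / f"{meta['name']}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(obj, indent=2, sort_keys=True))
        written.append(target)
    return written


# Sample objects shaped like Kubernetes API responses (made up for this example).
sample_objects = [
    {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "orders-config", "namespace": "shop"},
        "data": {"LOG_LEVEL": "info"},
    },
    {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": "pv-orders-db"},
        "spec": {"capacity": {"storage": "20Gi"}},
    },
]

backup_dir = tempfile.mkdtemp(prefix="k8s-backup-")
exported = export_objects(sample_objects, backup_dir)
relative_paths = sorted(p.relative_to(backup_dir).as_posix() for p in exported)
print(relative_paths)
```

The point is simply that the backup is a collection of per-object documents plus separate volume snapshots, which is why it can be filtered and restored selectively.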

What’s the difference and which backup solution is right for you?

The table below summarizes the main differences between etcd backups and full cluster backups.

Aspect	Etcd backup	Full cluster backup
Scope	Only backs up etcd data, which includes cluster state, configuration, and resource definition data.	Backs up entire Kubernetes cluster resources, including pods, services, deployments, and associated data in persistent volumes.
Use case	Restoring etcd in case of data corruption or loss.	Migrating workloads between clusters. Restoring after accidental deletion or corruption of Kubernetes resources.
Restoration target	Typically used for restoring etcd on the same cluster.	Can be used to restore on the same cluster or migrate to a different cluster.
Typical file size	Relatively small (megabytes to low gigabytes).	Usually larger, depending on the size of cluster and volume data (likely high gigabytes).
Availability	Requires direct access to nodes and is often unavailable for managed services.	Requires access to Kubernetes API and is generally available across platforms.

Etcd backups are generally much smaller than a full cluster backup, because they don’t include data from persistent volumes or most other stateful application data. It also takes relatively little time to perform an etcd backup. Even for large clusters, etcd backups rarely take more than 10 minutes to perform.

This means you can make etcd backups relatively frequently, and if a disaster were to occur, you can minimize the amount of data loss. Etcd backups are often only used to restore a failed cluster in place, and are not used to bring up a new cluster.

A full cluster backup, on the other hand, captures the resource definitions and cluster state exposed through the Kubernetes API along with persistent volume backups, so that the entire cluster can be restored, including data, state, secrets, and configurations.

Full cluster backups tend to be more expensive than etcd backups: they take longer to perform and consume much more storage space. It also takes longer to restore a cluster from a full cluster backup than from an etcd backup. Full cluster backups can be used to restore a failed cluster as well as to bring up a new cluster.

Kubernetes backup best practices

If you’ve backed up any IT system before, you’ll know the common best practices:

Schedule backups and optimize your retention policy by data type and cost
Back up to multiple locations for redundancy
Test your backups (it’s not a backup until you’ve restored from it…)
Encrypt your backups for security

Many of these best practices apply to Kubernetes, too. But because of the difference between etcd backups and full cluster backups, and the ability to recreate non-stateful workloads with IaC, you can adopt a few new strategies to optimize your backup and storage plan to get the most bang for your buck.

Backup frequency

Since etcd backups consume less space, and hold critical state data for your clusters, you should perform them more frequently to minimize data loss.

On self-managed clusters, etcd backups are performed by the etcdctl snapshot save command. This command will take a snapshot of the current state of etcd on the node where the command is executed and save it to a location of your choice. You can configure a job to perform the backup at regular intervals. Ideally, you should be backing up etcd at least daily for a production cluster.
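
As a rough sketch of what such a job could run, the Python below assembles an `etcdctl snapshot save` invocation with the standard TLS flags. The endpoint and certificate paths are assumptions (they match a typical kubeadm layout and may differ on your nodes), and `build_snapshot_command`, `snapshot_filename`, and `take_snapshot` are hypothetical helper names, not part of etcd.

```python
import os
import subprocess
from datetime import datetime, timezone


def build_snapshot_command(snapshot_path, endpoint, cacert, cert, key):
    """Return the argv for `etcdctl snapshot save` over TLS."""
    return [
        "etcdctl", "snapshot", "save", snapshot_path,
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
    ]


def snapshot_filename(now):
    """Timestamped file name so successive snapshots never overwrite each other."""
    return f"etcd-{now.strftime('%Y%m%dT%H%M%SZ')}.db"


def take_snapshot(backup_dir, **tls):
    """Run etcdctl on the current node; raises CalledProcessError on failure."""
    path = os.path.join(backup_dir, snapshot_filename(datetime.now(timezone.utc)))
    env = dict(os.environ, ETCDCTL_API="3")  # select the v3 API explicitly
    subprocess.run(build_snapshot_command(path, **tls), check=True, env=env)
    return path


# Assumed kubeadm-style paths; adjust for your own control plane nodes.
command = build_snapshot_command(
    "/var/backups/etcd/etcd-20250101T000000Z.db",
    endpoint="https://127.0.0.1:2379",
    cacert="/etc/kubernetes/pki/etcd/ca.crt",
    cert="/etc/kubernetes/pki/etcd/server.crt",
    key="/etc/kubernetes/pki/etcd/server.key",
)
print(" ".join(command))
```

Calling `take_snapshot` from a cron job or systemd timer on a control plane node would give you the regular interval described above; the example only prints the command, so it is safe to run anywhere.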

The more expensive full cluster backups can happen less frequently to reduce compute and storage costs associated with performing and storing the backups. For production clusters, you may choose to create full cluster backups on a weekly basis.
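
To see the trade-off in numbers, here is a small back-of-the-envelope calculation in Python. The backup sizes and retention periods are placeholders we made up, not measurements, so substitute your own figures.

```python
def worst_case_loss_hours(interval_hours):
    """If disaster strikes just before the next backup, this much change is lost."""
    return interval_hours


def retained_copies(interval_hours, retention_days):
    """How many backups exist at once under a simple time-based retention policy."""
    return (retention_days * 24) // interval_hours


def storage_gb(interval_hours, retention_days, size_gb):
    """Total storage consumed by all retained copies."""
    return retained_copies(interval_hours, retention_days) * size_gb


# Hypothetical plans: sizes and retention periods are illustrative assumptions.
plans = {
    "etcd, daily": {"interval_hours": 24, "retention_days": 14, "size_gb": 0.5},
    "full cluster, weekly": {"interval_hours": 168, "retention_days": 28, "size_gb": 200},
}

for name, plan in plans.items():
    loss = worst_case_loss_hours(plan["interval_hours"])
    total = storage_gb(**plan)
    print(f"{name}: up to {loss}h of changes at risk, about {total} GB stored")
```

With these made-up numbers, the daily etcd plan keeps the exposure window to a day for a few gigabytes of storage, while the weekly full backup accepts a wider window in exchange for storing far fewer large copies.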

Backup granularity

While most tools for full backups allow you to back up everything in your cluster, that doesn’t mean you should. If an object can be easily recreated using IaC, you can consider skipping it in your backup plans.

For example, say you have a three-tier application in your cluster with a frontend web app, an API layer, and a backend database. You can choose to back up the API layer and the database layer to retain the credentials and stateful application data, but you may not need to back up the frontend layer because it can be easily recreated.

This will allow you to reduce the size of your backup, leading to reduced storage costs and better backup performance.
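
Here is a minimal Python sketch of that selection logic, applied to the three-layer example above. The `backup.example.com/skip` label, the `shop` namespace, and the `should_back_up` helper are hypothetical; real backup tools offer their own include and exclude filters.

```python
def should_back_up(obj, namespaces, skip_label="backup.example.com/skip"):
    """Keep objects in the chosen namespaces unless they carry the opt-out label."""
    meta = obj.get("metadata", {})
    if meta.get("namespace") not in namespaces:
        return False
    return meta.get("labels", {}).get(skip_label) != "true"


# Made-up inventory: the frontend is marked as recreatable from IaC.
resources = [
    {"kind": "Deployment", "metadata": {
        "name": "frontend", "namespace": "shop",
        "labels": {"tier": "frontend", "backup.example.com/skip": "true"}}},
    {"kind": "Deployment", "metadata": {
        "name": "api", "namespace": "shop", "labels": {"tier": "api"}}},
    {"kind": "StatefulSet", "metadata": {
        "name": "orders-db", "namespace": "shop", "labels": {"tier": "database"}}},
    {"kind": "Deployment", "metadata": {
        "name": "scratch", "namespace": "sandbox", "labels": {}}},
]

selected = [r["metadata"]["name"] for r in resources if should_back_up(r, {"shop"})]
print(selected)
```

Only the API layer and the database survive the filter; the frontend and anything outside the chosen namespace are left for IaC to recreate.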

Backup locations

In a disaster event, whole data centers or even regions may lose availability. Storing multiple copies of your backup data in multiple, geographically distributed locations helps ensure that a backup is always accessible.

For example, you might keep one copy of your backup in a primary public cloud region, another in a secondary region, and perhaps an additional copy in a separate cloud provider or an on-premises storage solution. This redundancy adds a layer of resilience, protecting your data from localized failures and reducing the risk of total data loss.
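
The sketch below shows the general pattern in Python: copy the backup artifact to every location, then confirm each copy by checksum. Local directories stand in for regions and providers here, and the file name, directory names, and `replicate` helper are all invented; in practice you would use your storage provider’s SDK or your backup tool’s own location settings.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream the file in 1 MiB chunks so large backups don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source, destinations):
    """Copy source into each destination and report whether each copy's checksum matches."""
    expected = sha256_of(source)
    results = {}
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        dest_dir.mkdir(parents=True, exist_ok=True)
        copy_path = dest_dir / Path(source).name
        shutil.copy2(source, copy_path)
        results[dest_dir.name] = sha256_of(copy_path) == expected
    return results


# Local directories stand in for geographically separate backup locations.
workdir = Path(tempfile.mkdtemp(prefix="backup-copies-"))
source = workdir / "etcd-snapshot.db"
source.write_bytes(b"pretend this is a backup artifact")
locations = [workdir / "primary-region", workdir / "secondary-region", workdir / "on-prem"]

status = replicate(source, locations)
print(status)
```

Verifying the checksum after each copy is a cheap way to catch a truncated or corrupted upload before you need the backup.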

Backup testing

A backup is only as good as its ability to restore. Even if you verified at one point that your backup can restore your system, changes to your cluster may impact the viability of your existing backup configurations. To make sure that your backup is always capable of restoring your systems, regularly test it by restoring a backup in a non-production environment.

For example, you can conduct weekly backup testing by restoring the latest backup in a non-production environment. If you discover that the backup fails to restore the cluster to full functionality, you know you have work to do.
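
One simple check you could automate after a test restore is to compare what you expected to come back with what actually did. The Python sketch below does that on two hard-coded inventories; the object names are made up, and in a real test you would collect both lists from the Kubernetes API of the source and restored clusters. It only checks that objects exist, so it complements rather than replaces a functional test of the application.

```python
def inventory(objects):
    """Reduce objects to comparable (kind, namespace, name) keys."""
    return {
        (o["kind"], o["metadata"].get("namespace", ""), o["metadata"]["name"])
        for o in objects
    }


def restore_gaps(expected_objects, restored_objects):
    """Return the objects that should exist after the restore but do not."""
    return sorted(inventory(expected_objects) - inventory(restored_objects))


def obj(kind, name, namespace="shop"):
    """Tiny helper for building sample objects (hypothetical data)."""
    return {"kind": kind, "metadata": {"name": name, "namespace": namespace}}


expected = [
    obj("Deployment", "api"),
    obj("StatefulSet", "orders-db"),
    obj("PersistentVolumeClaim", "orders-db-data"),
]
restored = [
    obj("Deployment", "api"),
    obj("StatefulSet", "orders-db"),
]

gaps = restore_gaps(expected, restored)
if gaps:
    print("Restore test failed, missing:", gaps)
else:
    print("Restore test passed")
```

In this made-up run the persistent volume claim did not come back, which is exactly the kind of gap you want to find in a weekly test rather than during an outage.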

Backup tools for Kubernetes

Plenty of tools exist to help you manage backups for your Kubernetes cluster. When evaluating a backup tool for Kubernetes, make sure that the tool supports the following features:

Full cluster backup.
Scheduled backups.
Integration with your Kubernetes environment.
Granular backup based on namespace and labels.
Backup testing.

Some of the leading enterprise-grade commercial Kubernetes backup tools include Kasten K10 and Portworx Enterprise, while Velero is the leading free open-source option. All three tools offer multi-cloud support, full cluster scheduled backups, granularity based on namespaces and labels, as well as testing.

One reason to choose commercial solutions like Kasten K10 and Portworx is their application-centric approach to backups. Kasten and Portworx back up and group resources together by application to ensure that the application can be restored as a cohesive whole.

Velero, as a free open-source solution, is resource-based. You can use namespaces and labels to group resources together and create backups that contain all the resources needed for an application, but you’ll need to do the legwork yourself.
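
As a rough idea of what that legwork looks like, the Python sketch below assembles Velero CLI commands that scope a backup to one application by namespace and label. It assumes the `velero` CLI is installed and configured for your cluster; the backup names, the `shop` namespace, the `app=orders` label, the cron expression, and the TTL are all made-up values, and the two helper functions are hypothetical.

```python
def velero_backup_command(name, namespaces, selector=None, ttl=None):
    """Build `velero backup create` scoped by namespace and, optionally, label."""
    command = [
        "velero", "backup", "create", name,
        "--include-namespaces", ",".join(namespaces),
    ]
    if selector:
        command += ["--selector", selector]  # label selector, e.g. app=orders
    if ttl:
        command += ["--ttl", ttl]  # how long Velero keeps the backup
    return command


def velero_schedule_command(name, cron, namespaces, selector=None, ttl=None):
    """Build `velero schedule create` with the same scoping flags."""
    scope_flags = velero_backup_command(name, namespaces, selector, ttl)[4:]
    return ["velero", "schedule", "create", name, "--schedule", cron] + scope_flags


one_off = velero_backup_command("orders-manual", ["shop"], selector="app=orders")
weekly = velero_schedule_command(
    "orders-weekly", "0 3 * * 0", ["shop"], selector="app=orders", ttl="720h",
)
print(" ".join(one_off))
print(" ".join(weekly))
```

You would repeat this for each application you want to treat as a unit, which is the grouping that the application-centric commercial tools handle for you.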

If your cluster has complex applications with large stateful workloads, and you want more support in managing the operational overhead of backing up those applications, commercial solutions such as Kasten and Portworx may be worth the investment.

However, if your workloads are already clearly delineated by namespaces and labels, and you have the DevOps capacity to manage the backups for your applications, choosing Velero could yield considerable cost savings.

Kubernetes backups made easy with Palette

Palette has built-in support for both etcd and full cluster backups, as one of the cluster lifecycle management features we provide.

Etcd backups are enabled by default for every cluster, and you can easily adjust the frequency of the backup through the Cluster Profile. Cluster profiles are composed of layers using packs, Helm charts, and custom manifests to meet specific types of workloads on your Palette cluster deployments.

For full cluster backups, Palette leverages the open-source tool Velero. However, Palette abstracts away the complexity of administering Velero manually. From the Palette web user interface, you can freely adjust the frequency of full cluster backups and add multiple backup locations to secure your backup data. Once you do, Palette will do the heavy lifting for you behind the scenes.

The following steps are all you need to create regularly scheduled backups in Palette.

Log in to Palette.
Add a backup location by providing Palette with credentials to your desired backup environment. Palette supports all major cloud providers, as well as MinIO for backup locations.

Screenshot of the Palette UI, adding a backup storage location.

Select the cluster for which you want to configure backups, and click Settings.
Click Schedule Backup.
Fill in the form with your desired backup configuration. You can configure the backup location, backup schedule, backup retention period, whether or not to back up cluster-scoped resources, and which namespaces to back up.

Screenshot of the Palette UI, showing how to schedule a backup.

Click Save Changes.

You can create multiple scheduled backups to multiple locations. Palette also allows you to easily restore a backup to a new cluster, which makes it easy to test your scheduled backups.

What are you waiting for?

Creating a backup strategy for your Kubernetes clusters involves evaluating the needs of both your cluster workloads and your organization.

Of course, it’ll require an upfront investment to set up the disaster recovery infrastructure and backup pipeline. While it’s tempting to leave this work until another day, when alarms blare and your services go down, you’ll be grateful you took the time.

Backups aren’t just about recovery; they’re about resilience and confidence. With a strong backup strategy in place, you can handle a crisis with the calm assurance that you’re prepared for whatever the day has in store for you.

If you’re interested in seeing for yourself how Palette helps with backups and other day 2 operations tasks, check out our docs or get in touch for a quick 1:1 demo.
