Published
May 9, 2024

Two-node HA Kubernetes for edge computing cost savings

Tyler Gillson
Principal Software Engineer

Edge computing is mission critical

Enterprises today appreciate the benefits of edge computing for all kinds of workloads, particularly machine learning and real-time applications. In these scenarios, you need to perform data processing locally. 

The volumes of data involved are high, and network connectivity may be limited. It’s infeasible to deliver fast response times while conducting processing and analysis back in a centralized data center or cloud computing environment.

When you’re running these software applications in edge computing use cases, high availability is a must. Relying on a single node (with a single CPU, disk, etc.) is unacceptably risky.

For example, imagine a busy fast food restaurant’s point of sale application going down due to a cluster failure, costing a day’s revenue (or more). 

Or an edge server running IoT devices on a manufacturer’s factory floor dying, causing production to stall.

Or a computer vision app managing drones in a logistics warehouse suffering a hardware failure, risking worker safety.

High availability on Kubernetes is expensive!

But implementing a high availability architecture for edge applications conventionally requires a jump to three nodes per site. This is because Kubernetes’ default key-value store, etcd, requires a minimum of three nodes to maintain quorum and tolerate the failure of any single node.

And when you consider that edge deployments in sectors like quick-service restaurants and retail are often scaled to tens of thousands of stores, the cost of deploying three-node HA enterprise-wide quickly adds up: hardware, cabling, power, and sensors are all triplicated!

Three dragons in a row (meant to represent nodes). The third dragon is making a goofy face.
It’s all etcd’s fault!

Availability without the cost

What if it were possible to provide cost-efficient, (mostly) highly available Kubernetes clusters at the edge, using two edge nodes instead of three? 

Last fall the Spectro Cloud team set out to answer that question, and along the way we’ve shared our progress both in blog posts and in a presentation at KubeCon Paris (where the demo gods weren’t quite on our side). 

Today we’ll walk you through our latest two-node HA architecture, a truly viable solution that unlocks a ~33% cost reduction for edge Kubernetes applications running at scale in production!

Finally, a caveat: fear not, CAP theorem enthusiasts, no one is claiming to have overcome Brewer’s inviolable constraint. Read to the end for clarification...

Solution overview

Our two-node HA architecture uses Spectro Cloud’s existing, battle-tested edge solution, which builds upon open source components including Kairos, K3s, kube-vip, Harbor, and system-upgrade-controller.

We distribute the solution via immutable, A/B partitioned bootable-OS images (thanks to Kairos). Our images contain a Kubernetes distribution (often K3s), proprietary Go agents, and whatever additional custom software you require on a case-by-case basis. All configuration is specified declaratively in cloud-config syntax and executed by cloud-init.

Edge devices are initially provisioned in registration mode. Subsequently, they’re added to a cluster — either via Spectro Cloud’s Palette platform or a local management GUI — at which point they’re rebooted into installation mode using the desired bootable-OS image. Once Kubernetes is online, any addons are installed and reconciled by a Go agent.

Immutable upgrades are performed by streaming a new image to the node (or via USB for air-gapped use cases), writing it to disk, and rebooting from the B partition. If anything goes wrong, we automatically fail back to the A partition with known-stable configuration.

For two-node support, we layered kine and PostgreSQL into our existing edge solution and introduced a handful of additional mechanisms for lifecycle management, including liveness checks, a liveness client and server, and a finite state machine to ensure correctness of the kine endpoint configuration.

As always, a picture is worth a thousand words:

Two node edge kubernetes architecture. One leader, one follower.
Two node cluster: healthy operation

Two edge hosts are provisioned: one being the leader, the other the follower. Both hosts include a Go agent, configured as a systemd service, which continuously executes a liveness reconciliation with a configurable period. The liveness reconciliation loop is a finite state machine (FSM) that determines whether, and how, to modify each host’s role.

The diagram above indicates separate Kubernetes components for clarity, but in reality the control plane, kubelet, and kube-proxy are all embedded within a single K3s server process on each host. Both K3s servers are configured to use http://localhost:2379 as their datastore endpoint, which is the address of the local kine process. 

Kine and PostgreSQL are configured on each host via systemd; however, the leader’s kine endpoint points to the localhost postgres process, while the follower’s kine endpoint points to the leader’s postgres process.

Lastly, unidirectional logical replication is configured between the two postgres processes to synchronize the content of the kine tables on both hosts. Since both kine processes direct API server traffic to the leader, all writes are first committed to the leader’s database, then replicated to the follower in an eventually consistent manner. A one-shot systemd unit, kine-endpoint-reconciler, executes on each host on boot prior to starting kine and k3s. It reconciles logical replication configuration and ensures that the kine processes are configured with the correct endpoints.
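
To make the wiring concrete, here is a rough sketch of the relevant configuration. The file paths, command lines, and credentials below are illustrative placeholders, not our exact production settings:

# /etc/rancher/k3s/config.yaml (identical on both hosts):
datastore-endpoint: http://localhost:2379

# kine systemd ExecStart on the leader (local database):
#   kine --listen-address 0.0.0.0:2379 --endpoint postgres://kine:<password>@localhost:5432/kine

# kine systemd ExecStart on the follower (leader's database):
#   kine --listen-address 0.0.0.0:2379 --endpoint postgres://kine:<password>@<leader-address>:5432/kine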

Initialization: edge host metadata

When the system is first provisioned, each host initializes a metadata file (hereafter referred to as hostMeta) containing the roles and addresses of both hosts, the control plane endpoint, and a last modified timestamp. The hostMeta content is inferred from a combination of local device configuration (network interface(s)) and edge cluster configuration — which is either fetched from Spectro Cloud’s Palette platform or loaded from an air-gapped export.

YAML metadata indicating the role of each edge host
An example hostMeta file
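
The exact schema isn’t reproduced here, but based on the description above, a hostMeta file contains roughly the following (field names and addresses are illustrative):

hosts:
  - name: edge-host-a
    role: leader
    address: 10.10.0.11
  - name: edge-host-b
    role: follower
    address: 10.10.0.12
controlPlaneEndpoint: 10.10.0.10:6443
lastModified: "2024-05-01T17:23:08Z"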

If the host is in registration mode — or if a particular sentinel file, /oem/.two-node-pause, exists — the liveness reconciliation is a no-op.

Another sentinel file, /oem/.two-node-initialized, is created once the hostMeta has been initialized successfully. Subsequently, only the hostMeta files are referred to for the purposes of role assignment and state transitions, not the edge cluster configuration from the Palette platform. This means that the cluster operates autonomously in a fully disconnected manner.

Kine endpoint reconciler

The kine-endpoint-reconciler first checks for the existence of the /oem/.two-node-initialized file. If it doesn’t exist, the local hostMeta file is derived and /oem/.two-node-initialized is created.

Next, if the host in question is the follower, it executes a series of PostgreSQL operations to verify and/or configure logical replication, consisting of a publication and a replication slot on the leader, and a subscription to the leader’s publication on the follower.
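
Under the hood this boils down to standard PostgreSQL logical replication primitives, roughly along these lines (object names and connection details are illustrative):

-- On the leader: publish the kine table and create a logical replication slot.
CREATE PUBLICATION kine_pub FOR TABLE kine;
SELECT pg_create_logical_replication_slot('kine_slot', 'pgoutput');

-- On the follower: subscribe to the leader's publication, reusing the existing slot.
CREATE SUBSCRIPTION kine_sub
  CONNECTION 'host=<leader-address> port=5432 dbname=kine user=replicator'
  PUBLICATION kine_pub
  WITH (create_slot = false, slot_name = 'kine_slot', copy_data = true);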

Depending on the state of the system, the kine-endpoint-reconciler may perform an edge host replacement.

Finally, local and remote hostMeta files are reconciled. If they are in disagreement, the host’s role may change, i.e., it may be promoted or demoted. If a demotion occurs at this stage, a series of PostgreSQL operations are executed which drop the local Kubernetes database and reconfigure replication. This triggers a full resync of the local Kubernetes database.

Once the kine-endpoint-reconciler completes, the system enters its normal operational mode and control is transferred to the liveness service.

Liveness reconciliation

Every liveness period, each Go agent acquires both hostMeta files: one from the local filesystem and another from the alternate host’s liveness server (retrieved provisionally, if/when the alternate host is available and healthy). Following that, a series of health checks are executed, potentially resulting in a state transition:

  • TCP connection to control plane endpoint
  • TCP connection to alternate host’s Kubernetes API server
  • ICMP ping alternate host
  • Health check script (optional) - an arbitrary shell script to check disk utilization, memory pressure, etc.

For every failed check, a counter is incremented. 
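
As a loose illustration (not the actual agent code), the core of the loop might look something like this in Go. The addresses, liveness period, and counter handling are simplified placeholders:

package main

import (
	"log"
	"net"
	"os/exec"
	"time"
)

// checkTCP reports whether a TCP connection to addr succeeds within a short timeout.
func checkTCP(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// checkPing reports whether the alternate host answers a single ICMP echo request.
func checkPing(host string) bool {
	return exec.Command("ping", "-c", "1", "-W", "2", host).Run() == nil
}

func main() {
	const livenessPeriod = 15 * time.Second // configurable reconciliation period
	const failureThreshold = 4              // promotion is considered after four failed checks

	failures := 0
	for range time.Tick(livenessPeriod) {
		checks := []func() bool{
			func() bool { return checkTCP("10.10.0.10:6443") }, // control plane endpoint (illustrative address)
			func() bool { return checkTCP("10.10.0.11:6443") }, // alternate host's Kubernetes API server
			func() bool { return checkPing("10.10.0.11") },     // ICMP ping to the alternate host
		}
		healthy := true
		for _, check := range checks {
			if !check() {
				failures++ // every failed check increments the counter
				healthy = false
			}
		}
		if healthy {
			failures = 0 // reset behavior is simplified here
		}
		if failures >= failureThreshold {
			// Hand off to the FSM, which decides whether a promotion (or other
			// role change) is warranted based on both hostMeta files.
			log.Println("failure threshold reached; evaluating state transition")
		}
	}
}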

Promotion

If the leader goes down and the follower’s liveness service detects four health check failures, a promotion is initiated:

Diagram indicating a failed leader node
Leader goes down, health checks start to fail

During a promotion, the following series of events occurs on the follower host:

  1. The following files are updated to reflect the change in roles:
    - Local hostMeta file
    - Various cloud-init configuration files
    - The kine endpoint is reconfigured to point to localhost
  2. Logical replication subscription to the leader is dropped
  3. K3s and kine are stopped
  4. The following “deregistration” SQL is executed:
 
DELETE FROM kine
WHERE name LIKE '/registry/masterleases%'
OR name LIKE '/registry/leases/%'
OR name = '/registry/services/endpoints/default/kubernetes';

Deleting the endpoints for the kubernetes service in the default namespace forces the K3s server to drop its websocket tunnel to the impaired host.

Additionally, all K8s leases are deleted from the kine table to accelerate lease acquisition for all local controllers once K3s is restarted.

  5. The following “sequence remediation” SQL is executed:

 
SELECT setval('kine_id_seq', (SELECT MAX(id) FROM kine), true);

This is required, as sequence data is not replicated during logical replication. If this step does not occur, new rows are inserted into the kine table with prev_revision > id, which violates a key invariant assumed by kine and causes database corruption.

  6. K3s and kine are restarted

Diagram indicating the follower node promoting itself
Follower detects the failure and promotes itself

Demotion

If a host that was originally the leader goes offline temporarily and the follower promotes itself in its absence, then when the original leader comes back online, its kine-endpoint-reconciler will detect the discrepancy and initiate a demotion.

During a demotion, the following series of events occurs on the original leader:

  1. The original leader (hereafter referred to as the follower) updates the following files to reflect the change in roles for both hosts:
    - Local hostMeta
    - Various cloud-init configuration files
    - The kine endpoint is reconfigured to point to the new leader, rather than localhost
  2. A publication is created in the current leader’s database (if one does not already exist)
  3. The follower drops the content of its kine table
  4. The follower subscribes to the leader’s publication with copy_data = true, causing the entire kine table to be resynchronized.
    - If an inactive replication slot already exists, it is used, otherwise a new replication slot is created on the leader
  5. K3s and kine are restarted
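
Steps 2 through 4 reduce to a handful of PostgreSQL statements, mirroring the replication setup shown earlier (names are again illustrative):

-- On the current leader, only if the publication does not already exist:
CREATE PUBLICATION kine_pub FOR TABLE kine;

-- On the demoted host (the new follower):
DROP SUBSCRIPTION IF EXISTS kine_sub;
TRUNCATE TABLE kine;
CREATE SUBSCRIPTION kine_sub
  CONNECTION 'host=<leader-address> port=5432 dbname=kine user=replicator'
  PUBLICATION kine_pub
  -- create_slot = false reuses an existing inactive slot; otherwise a slot is created first
  WITH (create_slot = false, slot_name = 'kine_slot', copy_data = true);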

Edge host replacement

In the event that one of the two hosts becomes permanently impaired (e.g., a hard drive failure, NIC failure, etc.), a replacement device will be shipped to the edge site. Once it boots, it will seamlessly replace the degraded follower, which can then be decommissioned.

A nerdy reference pretending that the replacement edge host is a Pokemon

The replacement host boots and accesses cluster configuration indicating the address of the current leader and that it is the current follower. In connected environments, the cluster configuration is fetched from the Palette platform. In air-gapped environments, the new replacement host is pre-initialized with a “cluster configuration export”.

The replacement host’s kine-endpoint-reconciler requests the leader’s hostMeta, sees that it is not considered the follower by the leader, and announces its presence to the leader via the replacement API. This causes the leader to update its own hostMeta to register the new host as a follower.

Finally, the replacement host subscribes to the leader in the same manner as described above under Demotion, which initiates replication of the entirety of the leader’s kine table.

Upgrades

During an upgrade, there are no promotions or demotions (hosts retain their original roles); however, there is mandatory API server downtime while the leader reboots. The following table indicates the sequence of events that occur during an upgrade:

Table indicating the upgrade procedure for the two node architecture

The two hosts are updated via a system-upgrade-controller plan that targets all control plane nodes with a random ordering. Prior to initiating the upgrade, a Go agent pauses the liveness service on each host, which prevents the follower from promoting itself in the event that the leader is upgraded first. Once both hosts are upgraded, their liveness services are resumed.
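
For reference, a system-upgrade-controller Plan along these lines might look roughly like the following; the namespace, image, and label selector are illustrative:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: two-node-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1                      # one host at a time, in no guaranteed order
  serviceAccountName: system-upgrade
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: Exists}
  upgrade:
    image: example.com/edge/os-upgrade:v1.2.3   # illustrative upgrade container image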

Alternative upgrade flow

One might envision an alternative upgrade flow wherein the follower is upgraded first, the hosts’ roles are then swapped (follower promoted, leader demoted), and the original leader is upgraded last. But such a flow is quite complex and error prone: for one, it requires a drop and re-sync of the database due to the demotion. We therefore opted for the simpler flow above.

Note: there is still downtime in the more complex flow, due to the time it takes for pods to rebalance. The following K8s configuration options dictate how long it takes for an unhealthy node to be detected and pods to be evicted from that node:

 
kube-controller-manager-arg:
- node-monitor-period=5s
- node-monitor-grace-period=20s

HA and the CAP theorem: how well did we do?

In the introduction I promised to expand on the CAP theorem and the (mostly) highly available nature of our solution — what I’d call A*P:

  • Since it’s only eventually consistent, we drop the C for consistency.
  • The A for availability gets an asterisk because there is brief API server downtime (benchmarks indicate an average of 4.5 minutes) during promotions, demotions, and upgrades. However, any applications deployed on the follower remain fully available. Therein lies the crux of achieving high availability with this architecture: all applications must either be DaemonSets or have topology spread constraints ensuring that at least one replica runs on each node (see the sketch after this list). Longhorn can easily be installed to provide crash-consistent block storage for stateful applications.
  • The P for partition tolerance is handled using the lastModified timestamp in the hostMeta as a tie-breaker. In the event of a network partition, both hosts would promote themselves. Once the partition is remediated, whichever host was updated most recently “wins”. The losing host will demote itself and copy the content of the winner’s database.
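
To make that concrete, here is a minimal sketch of a Deployment spread across both hosts via a topology spread constraint (names, labels, and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pos-app
spec:
  replicas: 2
  selector:
    matchLabels: {app: pos-app}
  template:
    metadata:
      labels: {app: pos-app}
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname   # spread replicas across the two edge hosts
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: {app: pos-app}
      containers:
        - name: pos-app
          image: example.com/pos-app:1.0.0   # illustrative image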

Without a doubt, three-node Kubernetes clusters provide stronger guarantees with arguably less architectural complexity, yet they impose massive capital expenditure at scale: not only the cost of the boxes themselves, but also cabling, shipping, software, power consumption, and other factors. If you’re looking to optimize costs for an edge compute use case, a two-node solution can instantly cut costs and deliver serious savings.

Finally, I would be remiss not to mention the Synadia team: the work they did in pursuit of a two-node solution using K3s, kine, and NATS inspired our own initiative.

If you made it this far, thank you for sticking with us. If you have questions, feel free to reach out directly at tyler@spectrocloud.com or check out the Spectro Cloud community Slack.

And if you’re working on your own edge computing project with Kubernetes, please take a look at Palette Edge to see if we can help.

Tags:
Edge Computing
Operations
Enterprise Scale