Robustness isn’t something you can bolt on; it must be baked into your edge architecture from the start.
When a box (or hundreds) goes down, taking critical apps with it, suddenly you’ve got angry users and the prospect of flying a specialist engineer out to a remote (or customer) site to troubleshoot. It’s costly, it’s slow and it might make you regret getting yourself into this edge stuff in the first place.
Planning for the “What Ifs”
As our partners at Intel put it, the edge isn’t just the cloud in a different location. It has unique, diverse risks that threaten your service levels.
You can think of these risks in terms of “what ifs”:
- What if a barista spills a hot coffee into the server and kills a node?
- What if a dental receptionist unplugs the router and prompts a refresh of the site’s IP address?
- What if a security patch bricks the cluster?
Except all those “what happens if” scenarios are really “what happens when.” Unexpected events happen daily or weekly when you’re talking about hundreds or thousands of sites.
So when you’re building out your edge stack, look for tools that anticipate and plan for as many of these failure scenarios as possible — whether that’s preventing the situation from occurring, mitigating its impact, or simplifying and reducing the cost of remediation.
You probably know this in your gut already. Our latest research with Kubernetes adopters found broad and growing concern about the challenges of operating edge devices, from security to dealing with disconnected environments to performing rollouts and daily operations tasks without field engineering visits.
So let’s take a tour of some of the “what ifs” and show you some ways to answer them.
Handling Hardware Failures
At the edge, you probably don’t have the luxury of a whole rack of servers with redundant disks, network interface controllers (NICs) and power supply units (PSUs), and the hardware you have deployed may face temperature extremes, dust, vibration and other abuse.
Hardware failure is unavoidable, no matter which ruggedized box you buy, even if you follow best practices such as burn-in testing to weed out early failures.
Consider that annual failure rates of over 2% are typical for disks, and the risks compound with every additional component in the box.
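To see how fast those numbers compound, here is a back-of-the-envelope calculation. The 2% disk rate is from above; the other component rates are illustrative assumptions:

```python
# Back-of-the-envelope: probability that at least one component in a
# single edge box fails within a year. Rates below are illustrative
# assumptions, except the ~2% disk figure cited above.
annual_failure_rates = {
    "disk": 0.02,
    "psu": 0.01,   # assumed
    "nic": 0.005,  # assumed
    "fan": 0.01,   # assumed
}

p_box_ok = 1.0
for rate in annual_failure_rates.values():
    p_box_ok *= (1.0 - rate)

p_box_fails = 1.0 - p_box_ok
print(f"One box: {p_box_fails:.1%} chance of a component failure per year")

# Across a fleet, failures stop being surprises and become routine.
fleet_size = 500
print(f"Fleet of {fleet_size}: ~{fleet_size * p_box_fails:.0f} boxes affected per year")
```

With these assumed rates, each box has roughly a 4.4% chance of losing a component in a year, so a 500-site fleet should expect a couple of dozen incidents annually.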
Make Hardware Replacement Easy
If hardware failure is inevitable, that means hardware replacement is unavoidable too. And what you don’t want is to send an experienced (costly) Kubernetes expert to a remote site every time a box needs swapping.
There they sit with a 200-page runbook, spending two or three days setting up the cluster before moving on to their next job. That’s exactly what one of our customers in the health care sector was doing.
Instead, you want to make the replacement experience as close to plug and play as possible, so a soldier or a store manager or a subcontractor or a general IT person onsite can get the device powered on, connected to the network and onboarded into management within your Kubernetes environment.
We solved this with our low-touch onboarding options, from QR code scans for trusted device onboarding to full preprovisioning at staging sites, enabling a GitOps approach to replacing broken devices.
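To make the GitOps idea concrete, here is a hypothetical sketch of declarative device replacement. The file layout, field names and workflow are illustrative assumptions, not Palette’s actual interface:

```python
# Hypothetical sketch of GitOps-style device replacement: desired
# cluster membership lives in a Git repo as declarative config, so
# swapping failed hardware is a one-line change that a reconciler
# applies when the new device phones home. File layout, field names
# and workflow are illustrative assumptions, not Palette's interface.
import json
import subprocess

def replace_node(site_file: str, old_hw_id: str, new_hw_id: str) -> None:
    """Swap a failed device's hardware ID for its replacement's."""
    with open(site_file) as f:
        site = json.load(f)
    site["nodes"] = [new_hw_id if n == old_hw_id else n for n in site["nodes"]]
    with open(site_file, "w") as f:
        json.dump(site, f, indent=2)
    # Commit and push; a controller watching the repo onboards the
    # replacement device once it appears with a matching hardware ID.
    subprocess.run(["git", "commit", "-am", f"replace {old_hw_id} with {new_hw_id}"], check=True)
    subprocess.run(["git", "push"], check=True)
```

The person onsite just plugs in the new box; the one-line config change happens back at headquarters.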
Keep Apps Up When Boxes Go Down
Even the most streamlined hardware replacement process takes hours to days. And you have to keep critical application services available in the meantime.
While there are plenty of single-node edge Kubernetes deployments out there using K3s and other lightweight distributions, you’ve probably investigated traditional high-availability (HA) Kubernetes architectures and discovered that they mean having three nodes per site, with all the cost, space and power that entails.
For some use cases, that’s a dealbreaker, and it’s why we built the world’s first two-node HA Kubernetes architecture, overcoming some tough engineering challenges along the way. It enables your applications to survive a node failure uninterrupted and costs a third less than conventional three-node HA. Watch the CNCF demo to learn more about how we made this real.
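To appreciate why two-node HA is hard, consider the classic split-brain problem: with only two votes, a node that loses contact with its peer can’t tell whether the peer died or its own link did. The sketch below shows one generic mitigation, an external tiebreaker ping; it is not how Palette solves the problem, just an illustration of the failure mode any two-node design must address:

```python
# Toy illustration of the split-brain dilemma in a two-node cluster.
# One generic mitigation: only take over if the peer is unreachable
# AND the rest of the site network is still reachable. This is NOT
# Palette's implementation, just a sketch of the failure mode.
import subprocess

def reachable(host: str) -> bool:
    """One ICMP ping with a short timeout (Linux ping flags)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True,
    ).returncode == 0

def should_take_over(peer_ip: str, gateway_ip: str) -> bool:
    if reachable(peer_ip):
        return False   # peer is alive; keep the current role
    if not reachable(gateway_ip):
        return False   # our own network is down; promoting would split-brain
    return True        # peer is gone but the site network isn't
```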
Navigating the Wild Wild West of Networks
One of the biggest differences between edge and the cloud or data center is the networking situation.
Air Gap? Time to Go Local
Some edge sites don’t connect to a private WAN or even the internet because they’re in really remote locations or because they have to be isolated — air-gapped — for operational reasons.
For these scenarios, it's not enough to support building fully air-gapped clusters. You also need the freedom to deploy an instance of the management platform itself into the air-gapped environment, as you can with Palette, so that a site like a military base or a factory still gets secure centralized management.
For day-to-day operations in air-gapped settings, Palette offers a local management console for deploying and managing clusters. It comes with a local container registry that allows users to manually import update contents from a USB stick through the local Palette UI.
Need to transition from locally managed to centrally managed, or vice versa? No problem. Palette’s UI fully supports switching between the two management modes, and you can also set a policy to lock down the management mode if required.
DDIL? No Problem
Other edge clusters have low-bandwidth connections by design or are operating in challenging environments. NIST has a phrase — denied, disrupted, intermittent or limited (DDIL) connectivity environments — that fits here. Some people like to use other words like degraded or interdicted, but you get the picture. You can’t count on a packet getting in or out.
If your network goes down unexpectedly, it’s important to know that your edge clusters will continue working. Palette Edge has a decentralized architecture with local policy enforcement, so each cluster can continue to enforce desired state even if it can’t reach the management plane. When the connection comes back up, it’ll seamlessly resync with the management plane.
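In pseudocode terms, each cluster runs something like the reconciliation loop below. The objects and method names are illustrative stand-ins, not Palette’s API; the point is that enforcement never depends on the management plane being reachable:

```python
# Illustrative sketch of local policy enforcement with opportunistic
# resync. `cluster` and `management_plane` are stand-in objects, not
# real Palette APIs; the structure is what matters.
import time

def reconcile_loop(cluster, management_plane, interval_s: int = 30) -> None:
    # Last known-good desired state is cached on local disk, so the
    # loop can run even if the device boots with no connectivity.
    desired = cluster.load_cached_desired_state()
    while True:
        try:
            # When reachable, refresh desired state and report status.
            desired = management_plane.fetch_desired_state()
            cluster.save_cached_desired_state(desired)
            management_plane.report_status(cluster.observed_state())
        except ConnectionError:
            pass  # DDIL conditions: keep enforcing the cached spec
        cluster.apply(desired)  # converge actual state toward desired
        time.sleep(interval_s)
```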
Recognizing that many edge environments operate with limited network resources, we optimized Palette Edge for low-bandwidth conditions, so it stays efficient and reliable even where connectivity is constrained, such as in remote or underserved locations. You can even upload large content locally to the cluster to avoid lengthy network transfers entirely.
Adapt to IP Changes
Even when each edge device has an always-on, high-performing network connection, that doesn’t mean you can relax.
Kubernetes clusters rely heavily on stable IP addressing for the control plane, and they can easily break if node IP addresses change.
A simple power outage or router change can trigger an IP change, and such events are almost impossible to prevent, particularly when you’re deploying clusters at sites with networks you don’t control, like a connected product in a customer’s factory or point-of-sale terminals for retail or restaurant franchisees.
For this, Palette Edge offers a dynamic overlay capability that provides IP stability to the cluster, even if the host IP addresses change. It is completely self-contained, needs no assistance from external systems and doesn't rely on multicast (mDNS), which is commonly disabled on secure networks. It's also self-healing, allows nodes to come and go from the cluster and adapts continuously.
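Conceptually, the overlay adds a level of indirection: cluster components bind to stable virtual IPs, and a small agent on each node keeps those mapped to whatever underlay addresses are current. The sketch below is a toy illustration of that indirection only; a real overlay also handles tunneling, routing and peer discovery without mDNS, none of which is shown here:

```python
# Toy illustration of overlay indirection: stable virtual IPs on the
# inside, changing host IPs on the outside. Hostnames and addresses
# are made up for the example.
import socket

# Stable, cluster-internal addresses: these never change.
OVERLAY = {"node-a": "10.250.0.1", "node-b": "10.250.0.2"}

def refresh_routes() -> dict[str, str]:
    """Re-point each stable overlay IP at the node's current host IP."""
    routes = {}
    for hostname, vip in OVERLAY.items():
        try:
            routes[vip] = socket.gethostbyname(hostname)
        except socket.gaierror:
            pass  # peer unreachable right now; retry on the next refresh
    return routes

# Run periodically on every node: a DHCP renewal or router swap changes
# the underlay address, but the cluster only ever sees 10.250.0.x.
print(refresh_routes())
```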
Managing Patches and Updates
Edge Kubernetes devices in the field usually need regular patches, updates or new software, either to correct security vulnerabilities or add features. But upgrading remote devices can be a real challenge.
No More Progress Bars
How do you get new software versions onto a device when you’re dealing with limited hardware and slow or intermittent connectivity?
We solve this by basing Palette Edge on the open source project Kairos, which uses containers as its transport mechanism. Container images are composed of multiple layers, and each layer can be updated independently. By comparing layer hashes, the software determines which layers need to be pulled from the remote registry and skips downloading those that already match. This can significantly reduce required bandwidth and shorten upgrade windows.
Kairos also supports local registries, so if you need to add a new software package to the stack, its layers may already be on local disk. Having a local repository also means the update content needs to be downloaded only once and can then be used to update every node in the edge cluster.
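To make the layer-diffing idea concrete, here is a minimal sketch that assumes a client able to read layer digests from an image manifest (as the OCI distribution API exposes them). The digests and sizes are made up for illustration:

```python
# Minimal sketch of layer-diff updates: only the layers whose digests
# differ from what's already on disk get downloaded. Digests and sizes
# below are fabricated for the example.
def layers_to_pull(remote_manifest: dict, local_digests: set[str]) -> list[dict]:
    """Return only the layers we don't already have locally."""
    return [
        layer for layer in remote_manifest["layers"]
        if layer["digest"] not in local_digests
    ]

# Example: a 1.2 GB image where only one 40 MB layer changed means a
# 40 MB transfer, not a 1.2 GB one.
remote_manifest = {"layers": [
    {"digest": "sha256:aaa", "size": 800_000_000},
    {"digest": "sha256:bbb", "size": 360_000_000},
    {"digest": "sha256:ccc", "size": 40_000_000},   # the changed layer
]}
local_digests = {"sha256:aaa", "sha256:bbb"}

needed = layers_to_pull(remote_manifest, local_digests)
print(f"Pulling {sum(l['size'] for l in needed) / 1e6:.0f} MB instead of "
      f"{sum(l['size'] for l in remote_manifest['layers']) / 1e9:.1f} GB")
```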
If at First You Don’t Succeed…
Once the package has transferred to the edge box, there’s the hold-your-breath moment as you wait to see if the update was successful or if you bricked your device and have to send out a field engineer.
To avoid this, Palette Edge uses A/B partitioning for system upgrades. Each device has a second partition where upgraded software is downloaded and applied. The device attempts to switch over to the new partition after an update. If that fails, the system reverts automatically to running on the previous partition. The platform’s ability to automatically recover to a known-good state in case of update failures or system issues is a core component of its robustness.
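The flow looks roughly like the in-memory simulation below. Partition names and helpers are illustrative, not Palette’s implementation; the essential property is that the running partition is never modified, so there is always a known-good state to fall back to:

```python
# Simplified in-memory simulation of an A/B update with automatic
# rollback. Partition names and the one-shot boot flag are illustrative,
# not Palette's implementation.
state = {"active": "A", "next_boot": "A", "one_shot": False}

def stage_update(new_image: str) -> None:
    inactive = "B" if state["active"] == "A" else "A"
    print(f"writing {new_image} to partition {inactive}")
    state["next_boot"], state["one_shot"] = inactive, True  # try it exactly once

def boot(health_ok: bool) -> None:
    booted = state["next_boot"]
    if state["one_shot"]:
        state["one_shot"] = False
        state["next_boot"] = state["active"]  # bootloader arms the fallback
    if health_ok:
        state["active"] = state["next_boot"] = booted  # commit the new partition
        print(f"update committed; running on {booted}")
    else:
        print(f"health checks failed on {booted}; next boot reverts to {state['next_boot']}")

stage_update("os-v2.img")
boot(health_ok=False)   # bad update: the device falls back to partition A
```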
Preventing Security Threats
Any discussion of robustness must include security. Robustness means not just shrugging off hardware failure but maintaining service levels under deliberate attack, particularly if your edge use case is covered by compliance requirements like FIPS (which Palette Edge supports right to the edge).
Edge security issues are different from those encountered in the public cloud or data center, including physical theft, tampering and untrusted networking. Our SENA security architecture provides a wide range of capabilities specifically designed to address these issues.
Harden Against Tampering and Theft
When you can’t stop attackers from gaining physical access to the edge device, how can you be sure they haven’t tampered with the software or exfiltrated sensitive data?
We tackle this in a few different ways. First, Palette Edge uses immutable edge operating system and Kubernetes images. This means essential system components, like the kernel, can’t be altered on the device.
It also protects edge nodes using a hardware-verified secure boot process, which checks that each component of the system’s startup sequence carries a valid signature before it runs, preventing unauthorized code execution. This forms the basis for the platform’s security, enabling features like data partition encryption.
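As a toy model of that chain of trust, each boot stage is verified against a trusted key before it is allowed to execute. Real secure boot lives in firmware and uses asymmetric signatures and UEFI key databases; the sketch below substitutes an HMAC purely to show the structure:

```python
# Toy model of a boot-time chain of trust: every stage must carry a
# valid signature before it runs. HMAC stands in for real asymmetric
# signature verification here; the structure is the point.
import hmac, hashlib

TRUSTED_KEY = b"baked-into-hardware"  # in reality: a key in firmware/TPM

def sign(blob: bytes) -> bytes:
    return hmac.new(TRUSTED_KEY, blob, hashlib.sha256).digest()

def verify_and_run(stage_name: str, blob: bytes, signature: bytes) -> None:
    if not hmac.compare_digest(sign(blob), signature):
        raise SystemExit(f"{stage_name}: bad signature, halting boot")
    print(f"{stage_name}: verified, executing")

# Bootloader -> kernel -> OS image, each gated on a valid signature.
for stage, blob in [("bootloader", b"stage1"), ("kernel", b"stage2"), ("os", b"stage3")]:
    verify_and_run(stage, blob, sign(blob))
```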
To cater to diverse security requirements, Palette Edge allows highly customizable partition encryption. This flexibility helps safeguard sensitive data in a manner that aligns with specific security policies and compliance requirements.
For authenticating access to encrypted data, Palette Edge supports local and remote encryption key options. For local encryption, it uses the hardware Trusted Platform Module (TPM), working in tandem with the secure boot process, so data remains practically inaccessible even on a stolen device.
For even greater security, you can require the device to reach a specific key server via the network. This protects against data theft when devices are removed from their intended location.
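This pattern resembles network-bound disk encryption: at boot, the device requests its unlock key from the key server, and if the server is unreachable (say, because the device has been carried off-site), the data partition simply stays locked. The endpoint and API below are hypothetical:

```python
# Illustrative sketch of the remote-key pattern: the device can only
# unlock its data partition when it can reach the key server from its
# expected network. URL and API are hypothetical.
import urllib.request

def fetch_unlock_key(device_id: str) -> bytes | None:
    url = f"https://keys.example.internal/unlock/{device_id}"  # hypothetical endpoint
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read()
    except OSError:
        return None  # off-network (e.g. stolen): partition stays encrypted

key = fetch_unlock_key("edge-0042")
if key is None:
    raise SystemExit("key server unreachable; refusing to unlock data partition")
```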
Can You Afford to Take the Risk?
We believe edge is the future, with almost limitless potential. But it can fulfill that potential only if businesses and users can trust the availability and security of the infrastructure powering their experiences.
Many of our customers have been burned by other vendors who have sold them the promise of edge. Often, those vendors simply haven’t thought through the many inevitable “what if” scenarios that the edge presents, and they don’t have a credible answer.
The best way to see the robustness of Palette Edge is to get started with a demo by one of our experts.