Time for a home truth about doing edge computing: The one thing most likely to kill your edge Kubernetes project is a brown cardboard box.
“Huh?” I hear you say. Well, grab your popcorn and lean in, because things are about to get real.
Does the Cloud Native Community Know Enough About Logistics?
We’ve spent more than a decade containerizing and moving to the cloud (and we’re still at it).
The siren call driving this mission is speed of execution: Focus on your business case and your software, and don’t worry about infrastructure. Software is eating the world; it’s where the value is. It’s where the innovation is. It’s where you should pay the most attention.
Infrastructure — and by that I mean all those servers, networks, storage — that’s all abstracted away, taken care of by hard-working people doing magic that you no longer have to worry about.
We’ve all been so focused on beautiful GitOps, software supply chains, infrastructure as code, and a raft of other things. When it comes to our apps in the cloud, infrastructure is just something at the end of an API. We snap our fingers and it’s there.
For edge, suddenly we have to care about non-abstract things again. The real thing: hardware.
There are groups out there that never forgot how important hardware is, of course.
They’re the communication service providers building base stations on remote Scottish islands and laying undersea cable.
They’re the IoT companies worrying about ingress protection ratings when they’re bolting toughened sensor boxes to oil rigs.
They’re the SD-WAN (software-defined wide-area network) or SASE (secure access service edge) companies stringing cable between racks in local points of presence.
Edge Kubernetes is, in a sense, a marriage of our treasured cloud native paradigms and this good ol’ fashioned iron.
So what can we learn from those groups that know all about the hardware?
Hardware is Hard and Failure is the Norm, Not the Exception
Anyone involved with maintaining servers or fleets of devices at the edge will tell you the same thing: Hardware failures are business as usual.
It boils down to statistics. Take a standard Intel NUC (Next Unit of Computing) or compact server typically targeted for use at the edge. It’s composed of several discrete components: CPU, RAM, hard drive or SSD, fans, power supply and more, each with minimal redundancy, unlike what we might be used to in a data center.
Now imagine a fleet of 10,000 of these small devices. That means 10,000 or more of each of those components, each of which has a rated MTBF (mean time between failures).
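To make that concrete, here’s a back-of-the-envelope sketch in Python. The MTBF figure is an assumption picked purely for illustration, not a number from any spec sheet:

```python
# Illustrative back-of-the-envelope math (the MTBF figure is an assumption
# for the sake of example, not a vendor spec).
FLEET_SIZE = 10_000          # devices in the field
DEVICE_MTBF_HOURS = 300_000  # assumed aggregate MTBF per device (~34 years)
HOURS_PER_WEEK = 24 * 7

# With independent failures, each device fails at a rate of 1/MTBF per hour,
# so the fleet-wide expected number of failures per week is:
expected_failures_per_week = FLEET_SIZE * HOURS_PER_WEEK / DEVICE_MTBF_HOURS
print(f"Expected failures per week: {expected_failures_per_week:.1f}")
# -> roughly 5-6 failed devices every week, even with very reliable hardware
```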
Making the Best of MTBF
MTBF becomes one of the key variables in the equation of whether your edge K8s deployment makes money or loses it, through downtime, hardware costs and labor. So it pays for you to understand how to make the most of it.
Sure, you can procure components with better reliability, but that increases cost and may not be feasible for your design requirements. You can build in redundancy, but that again increases cost and may move you beyond the space, power and cooling envelope for your edge location.
You can do what you can to control the environment. MTBF might be worsened by suboptimal operating conditions in hostile edge locations: maybe the retail store manager likes piling paperwork on top of the mini server, blocking the cooling vents. Tell them not to do that, of course, but humans will be humans.
So you should also plan to monitor mundane but crucial hardware components like fans and power supply units right alongside the applications running on your edge stack. Boxes running hot should generate alerts that trigger proactive measures. Any holistic edge Kubernetes system should include this kind of monitoring out of the box, along with sensible alerting rules.
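If you want a feel for how low-tech that monitoring can be, here’s a minimal sketch that reads the Linux thermal zones and flags a box running hot. The threshold and the alerting hook are assumptions for illustration; in a real deployment this telemetry would feed whatever monitoring stack you already run:

```python
# Minimal sketch: read thermal zones from Linux sysfs and flag a box that is
# running hot. The threshold and alerting hook are illustrative assumptions.
import glob

TEMP_ALERT_THRESHOLD_C = 85.0  # assumed threshold for this example

def read_temperatures_c():
    temps = []
    for zone in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
        with open(zone) as f:
            temps.append(int(f.read().strip()) / 1000.0)  # sysfs reports millidegrees
    return temps

def check_thermals():
    hot = [t for t in read_temperatures_c() if t >= TEMP_ALERT_THRESHOLD_C]
    if hot:
        # Replace with a real alert (Prometheus, webhook, etc.) in production.
        print(f"ALERT: {len(hot)} thermal zone(s) above {TEMP_ALERT_THRESHOLD_C}C: {hot}")

if __name__ == "__main__":
    check_thermals()
```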
Software can affect MTBF, too. Mundane things such as how the Kubernetes stack writes to disk for behaviors like logging can quickly wear out an SSD, and tuning application behavior can only reduce that wear, not eliminate it.
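A little arithmetic shows why. Both numbers below are assumptions chosen for illustration (check your own workload and drive spec), but the shape of the result is the point:

```python
# Illustrative only: how steady log writes eat into SSD endurance.
# The write rate and TBW (terabytes written) rating are assumptions.
LOG_WRITE_RATE_MB_S = 2.0      # assumed sustained writes from logging, metrics, etc.
SSD_ENDURANCE_TBW = 150.0      # assumed endurance rating of a small consumer SSD

bytes_per_day = LOG_WRITE_RATE_MB_S * 1e6 * 86_400
tb_per_day = bytes_per_day / 1e12
days_to_exhaust = SSD_ENDURANCE_TBW / tb_per_day
print(f"{tb_per_day:.3f} TB written per day -> endurance exhausted in ~{days_to_exhaust:.0f} days")
# -> about 0.173 TB/day, so the rated endurance is gone in roughly 2.4 years,
#    before accounting for write amplification.
```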
The upshot is this: Although the odds of a single unit failing on a given day are very low, when there are enough units, you can practically guarantee that one of them will fall over … today, tomorrow and the next day. It would be normal to have several device failures every single week.
So if you expect your edge device deployment to be “one and done,” you’re in for a rude awakening.
Planning for hardware failure, and how to handle that failure efficiently, is the name of the game if your return on investment calculation is to come out looking good.
Logistics and Spares-Holding: More Complex Than You’d Think
Being great at handling failure requires thinking about logistics, and specifically answering one question: When a node fails, how do I get a spare swapped in and running, fast?
It’s obviously not cost effective to keep a full set of spare hardware for every device you’ve deployed, and certainly not on site for a hot swap. Your finance people would be furious at tying that much money up in inventory, and often there isn’t space or security at remote sites.
So we instead need to store a smaller pool of spares in inventory somewhere else — central warehouses or regional distribution centers — and ship them to the affected edge site when needed.
I’m not a Markov model expert, but the optimal number of spares to hold centrally or regionally can be calculated primarily from the anticipated failure rate of the devices and the MTTR (mean time to restore/recovery). A more efficient replacement process lowers the MTTR, which in turn lowers the number of spares you need to keep warehoused.
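For the curious, here’s a deliberately simplified sketch of that calculation, modeling demand for spares during the replacement window as a Poisson process. Every number in it is an assumption for illustration:

```python
# Simplified spares-sizing sketch: model demand for spares while a failed unit
# is being replaced as a Poisson process, and hold enough stock to cover that
# demand with a target service level. All figures below are assumptions.
import math

FLEET_SIZE = 10_000
DEVICE_MTBF_HOURS = 300_000   # assumed aggregate MTBF per device
MTTR_HOURS = 72               # assumed time to ship, swap and restock a spare
SERVICE_LEVEL = 0.99          # probability that a failure finds a spare in stock

# Expected failures across the fleet during one replenishment window.
expected_demand = FLEET_SIZE * MTTR_HOURS / DEVICE_MTBF_HOURS

def poisson_cdf(k, lam):
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

spares = 0
while poisson_cdf(spares, expected_demand) < SERVICE_LEVEL:
    spares += 1

print(f"Expected demand per window: {expected_demand:.2f}, spares to hold: {spares}")
# Halving the MTTR halves the expected demand, which directly shrinks the pool.
```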
However many spares you keep, they can’t be shipped box-fresh from the supplier, either, even though that may be how you handle end-user devices or data center compute and networking equipment, which an on-site team can set up at the destination.
Edge sites like restaurants or wind turbines won’t have an IT expert there to install operating systems and configure the K8s stack on site, and you certainly don’t want your high-salaried K8s ninjas driving for six hours to a different remote facility every day with a keychain full of USB sticks and a keyboard and monitor on the back seat.
No: The hardware needs to be prepared to some extent somewhere centrally, preloaded with an operating system and software, so that when it’s powered on, it can join the K8s cluster and start working.
Still with Me? There’s More
But even here we’re only just scratching the surface of the complexity you need to engineer for when building a spares process for a production edge Kubernetes deployment.
Different deployment destinations (say, a small retail branch vs. a megastore) might have different workloads, meaning they are issued a different device variant and a different software stack.
Indeed, regardless of such hardware variation, certain elements of the device configuration will always be site-specific: secrets, IP addresses and network configurations, location-related user data on the device, etc.
As a result, you simply can’t fully stage the spare device until you know exactly which site that device is going to — and by definition that will be after a mission-critical edge device has already failed, and the pressure is on.
So we’re building up to a list of requirements:
- If failures happen daily, the staging process and tooling for device preparation need to be simple, lightweight and automatable.
- If spare provisioning is only triggered by a mission-critical failure, the whole process has to happen quickly and consistently, with few manual steps.
- If we can’t send engineers to each site, we need the ability to seed devices securely with the information needed for them to join a cluster — and to remotely troubleshoot and observe the status of devices after they power up at a site, before and during attempting to join or form a cluster.
- If we are keeping a ratio of 1:x spares per live site, we need the ability to provide site-specific information as late in the logistics chain as possible, so that every device in storage stays homogeneous in its software configuration (see the sketch after this list).
- If we can’t rely on totally interchangeable hardware (because we have variants), we shouldn’t require pre-provisioning of different roles, such as a master node or a worker node, labels for GPUs, etc. These should be detected automatically and assigned dynamically when a device arrives at a site and is ready to form or join a cluster. Without this, staging or onboarding requires additional human involvement and introduces more room for error.
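To show what that late binding might look like in practice, here’s a hypothetical sketch: every spare ships with the same generic image, and a small site-specific bundle is merged in only at onboarding time. The field names, file path and structure are invented for illustration and aren’t any particular product’s API:

```python
# Hypothetical sketch of "late binding": every spare ships with the same generic
# image, and the site-specific details are merged in only at onboarding time.
# Field names, the bundle path and the structure are illustrative assumptions.
import json

GENERIC_CONFIG = {
    "k8s_version": "v1.29",
    "container_runtime": "containerd",
    "role": None,          # decided dynamically at the site, not at staging
}

def load_site_bundle(path):
    """Read the small site-specific bundle delivered at the last possible moment
    (e.g. scanned from a QR code or fetched after the device phones home)."""
    with open(path) as f:
        return json.load(f)

def render_device_config(site_bundle):
    config = dict(GENERIC_CONFIG)
    config.update({
        "site_id": site_bundle["site_id"],
        "cluster_join_endpoint": site_bundle["join_endpoint"],
        "static_ip": site_bundle.get("static_ip"),   # only some sites need one
        "labels": site_bundle.get("labels", {}),     # e.g. {"gpu": "true"}
    })
    return config

if __name__ == "__main__":
    bundle = load_site_bundle("/run/onboarding/site-bundle.json")  # hypothetical path
    print(json.dumps(render_device_config(bundle), indent=2))
```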
Securing Your Brown Cardboard Boxes
We at Spectro Cloud talk a lot about Kubernetes security, and in particular edge device security. We imagine deliberate tampering with and attacks on edge devices in situ, by masked assailants.
But as anyone who has ordered an expensive present for a loved one will know, the biggest risk is from supply chain “shrinkage”: when a brown box mysteriously drops off the radar somewhere after it’s dispatched.
You need to account for this, and not only by choosing a trustworthy logistics partner to operate your warehousing and courier your parcels.
Your edge devices, especially since they’ve been preconfigured with data, need to be encrypted, with Trusted Platform Modules (TPMs) securing the boot process, remote wipe capabilities and all the other protections you can lay your hands on.
A lost or stolen device should be red-flagged so it can’t join the corporate cluster, and an assailant certainly shouldn’t just be able to carry your data away under their arm.
Wake-up Call Officially Over
If only meatspace were as easy to orchestrate as our cloud native abstractions, eh? But the reality is that edge is very physical. It’s about humming boxes installed in cupboards, under desks, in thousands of forgotten locations that only a whistling delivery driver or a guy in a hi-viz vest will ever see.
If your edge project is ever to move successfully from a pilot to mission-critical production at scale, you need to be thinking about failing fans, piles of boxes in warehouses and everything that needs to happen to get them from A to B when every second counts.
Need Some Help?
We’ve been building features into our Palette Edge Kubernetes management platform to tackle these kinds of scenarios.
For example, Palette offers persistent data encryption and immutable operating systems to address security concerns such as tampering and theft.
Most excitingly, we’ve made big advances in staging and provisioning of edge devices at remote edge sites through features such as auto-registration for headless deployment, low-touch QR code onboarding, and network operations center (NOC)-style remote monitoring. For a quick overview of how all that works, watch my colleague Justin Barksdale walk you through in this 25-minute video. It’s pretty cool.
If you’d like to talk about what we’ve learned, and your own edge projects, give us a shout. Or to read more about edge Kubernetes, check out our edge resource hub.