Published
August 8, 2024

CrowdStrike Outage: What Can Cloud Native Teach Us?

Cornelia Davis
VP, Product

Last week, cybersecurity company CrowdStrike pushed a software update and caused what some are calling “the largest IT failure in history.”

While the magnitude of this particular outage was extreme, software updates cause outages all the time. The annual outage analysis report from the Uptime Institute says that 65% of IT provider issues are caused by a software or configuration error.

Why is this so commonplace? And how can cloud native practices help us avoid outages like these?

There Is No Perfect World, And No Perfect Test

For decades, software teams have relied on extensive pre-release testing as the primary way to avoid issues when deploying code to production. Testing is the lynchpin of a rigorous change-management process that aims to predict and avoid any possible problems.

But no testing suite, no matter how comprehensive, is perfect — not least because it’s impossible to accurately replicate the intricacies of real production environments.

CrowdStrike’s own just-published preliminary post-impact report goes into detail about the many kinds of testing it does, and puts the blame for this specific outage on a bug in its testing software (ironically).

But a lot of what it says it’s going to change in the future essentially boils down to “do more testing.”

How Do We Prevent This From Happening Again?

This huge burden of testing and change control sits uneasily with today’s rapid pace of software updates. Not all the software you use or deploy will need to have such frequent updates as a security tool like CrowdStrike, but many software teams deploy new features, bug fixes, interface updates and other code changes daily or even multiple times per day.

Code velocity is how engineers are measured, rightly or wrongly, and many software companies see speed and frequency of release as key to their competitive advantage and customer satisfaction.

Is “more testing” really going to give us the stability we crave while maintaining that pace? No. So what’s the alternative?

Plan for the Inevitable

Modern software engineering and operational practices take a different approach.

Instead of believing that with the right tests it’s possible to avoid all problems in production environments, they acknowledge that some bugs or unexpected interactions will always find their way through to the real world beyond the test environment.

Acknowledging that bad things will happen, they put safeguards in place to minimize the impact of those inevitabilities when they occur.

What are those safeguards? Let’s go through a few things that every team should consider from an operational perspective, then we’ll finish up with a couple of powerful architectural concepts that we believe change the game.

Slow and Steady Wins the Race

It’s easy to poke fun at CrowdStrike for pushing an update to millions of devices all at the same time, but it’s a security company, and nobody wants to wait 24 hours for new security definitions.

For more pedestrian software updates, there are alternatives: You can push the update to just a small group of devices and wait to see what happens — that’s a canary deployment. Or you can phase the rollout, perhaps across time zones. Both give you the opportunity to stop the rollout when the first bug reports start to hit. CrowdStrike says it plans to do this more.
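
To make that concrete, here is a minimal sketch of a phased rollout loop in Python. The helpers `deploy_update`, `error_rate` and `rollback` are placeholders for whatever fleet-management tooling you use; they are not real CrowdStrike or Kubernetes APIs.

```python
import time

def deploy_update(devices, version):
    """Placeholder: push `version` to `devices` via your own management plane."""

def error_rate(devices):
    """Placeholder: return the observed error rate across `devices`."""
    return 0.0

def rollback(devices, version):
    """Placeholder: reinstate `version` on `devices`."""

def phased_rollout(fleet, new_version, previous_version,
                   waves=(0.01, 0.10, 0.50, 1.0),
                   soak_minutes=30, error_threshold=0.02):
    """Push new_version to progressively larger slices of the fleet,
    halting and reverting if the observed error rate spikes."""
    updated = 0
    for fraction in waves:
        target = int(len(fleet) * fraction)
        deploy_update(fleet[updated:target], new_version)  # next wave only
        updated = target

        time.sleep(soak_minutes * 60)                      # let the canary soak
        if error_rate(fleet[:updated]) > error_threshold:
            rollback(fleet[:updated], previous_version)
            raise RuntimeError(f"rollout of {new_version} halted at {fraction:.0%}")
```

The first wave is the canary; the essential point is simply that nothing reaches 100% of devices until the earlier waves have proven themselves.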

As an operations team, you can make a conscious decision to hold back from software updates, whether by signing up to a “stable” release channel or turning off automatic patches and sticking with n-1 or n-2 version releases (a strategy we discuss in this blog post). CrowdStrike already offers that to some extent and plans to increase the control it offers customers.
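
As a toy illustration of that pinning strategy, the selection rule itself is trivial to express; `releases` here is just a hypothetical, oldest-to-newest list from your vendor’s update feed.

```python
def pick_pinned_release(releases, lag=1):
    """Given releases ordered oldest-to-newest, return the n-1 (lag=1) or
    n-2 (lag=2) version rather than the latest, trading freshness for the
    stability of a release that others have already run in production."""
    if len(releases) <= lag:
        return releases[0]       # not enough history yet; take the oldest available
    return releases[-(lag + 1)]

# Example: pick_pinned_release(["7.14", "7.15", "7.16"]) returns "7.15".
```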

Cloud Native to the Rescue

Those operational practices we just described are nothing new, and they’re certainly not rocket science. You probably follow them already. But in the world of cloud native, there are further, and more fundamental, techniques to protect against the impact of a buggy release.

Declarative Configuration Makes it Easy to Revert

One of the most valuable elements of cloud native technologies like Kubernetes is declarative configuration, where you as the administrator document your desired state of software deployment, and Kubernetes continuously works to ensure that the actual state of the deployment matches that desired state.

When you want to update the software or its configuration, you simply update the declaration of the desired state, and Kubernetes will see to it that the new version is rolled out.

But what if you don’t like something about the new version of the software (say, there’s a nasty bug)? If you’ve done things right, you can quite easily go back to the previous version by simply reinstating the prior declarative configuration and pushing it over the top.

That’s a pretty good safety net, and it’s not something you can easily do in the “imperative” model of traditional IT: Reversing an upgrade is like trying to unbake a cake.
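
If you squint, the whole model fits in a few lines. Here is a deliberately simplified Python sketch of the idea (not Kubernetes’ actual controller code): the desired state is just data, a loop keeps converging reality toward it, and a rollback is nothing more than declaring the old state again.

```python
import time

def observe_actual_state():
    """Placeholder: query what is really running (image tag, replica count, ...)."""
    return {"image": "myapp:1.4", "replicas": 3}

def converge(actual, desired):
    """Placeholder: take whatever actions move the running system toward `desired`."""
    print(f"reconciling {actual} -> {desired}")

# Upgrading means changing the declaration...
desired_state = {"image": "myapp:1.5", "replicas": 3}

# ...and reverting means reinstating the previous declaration. There is no
# separate "downgrade" procedure; the loop below does the work either way.
# desired_state = {"image": "myapp:1.4", "replicas": 3}

def reconcile_forever():
    """What a Kubernetes controller does continuously, reduced to its essence."""
    while True:
        actual = observe_actual_state()
        if actual != desired_state:
            converge(actual, desired_state)
        time.sleep(10)
```

In Kubernetes itself, reinstating the prior configuration usually means reapplying the previous manifest (or, for a Deployment, using `kubectl rollout undo`).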

What About Local Edge Devices?

But the CrowdStrike scenario brings added complexity to the picture because it mostly affected edge computing environments.

We all saw the photos on social media: Windows boxes that no longer boot up. Blue screen of death. Train stations and retail stores ground to a halt. It’s not so easy to push a new configuration to a device that’s bricked.

In the cloud, if I’ve somehow managed to entirely break my “machine” like this, I can simply get a new one with an API call. To a lesser extent, I can do the same thing in the data center, whether it’s by provisioning a new VM or hot-swapping a new server. Modern infrastructure is treated like cattle; each machine is interchangeable.

But edge computing devices, like the ones running Windows behind the airline check-in counter, are not so easily replaced.

And let’s be clear here: the CrowdStrike outage absolutely took down a load of individual users’ laptop and desktop PCs that have a hodgepodge of actual data, applications and configurations on them locally. These are pets, and unless something dramatic changes in the way the world manages personal devices, they will remain pets.

But many of these affected Windows machines are not personal devices: they’re mission-critical edge compute locations acting as the brain for industrial equipment, digital signs, gate check-in apps, etc. They should not be pets. They don’t even really need to be running a desktop OS like Windows, and you absolutely should be able to re-image the device without anyone getting upset about losing their spreadsheets and photos.

So let’s talk about how we can architect to achieve the reversible, declarative upgrade experience within the confines of the edge box itself.

Edge Upgrades Should Be Resilient, Atomic and Failsafe

Here’s an architecture that delivers the failsafe resiliency we’re looking for:

Each edge device is configured with two partitions (A and B), one of which is actively running the current version of the software, while the other is passive.

What we have running in a partition is immutable: It doesn’t change. We don’t “patch in place” when we want to upgrade it. It is the realization of a declarative configuration: the desired state of the OS and applications running over the top. We don’t swap bits in and out of it once it is running.

Instead, during a software upgrade, we deploy a whole new instance of the OS and applications into the passive partition, leaving the active partition unchanged. That is, when we express a new desired state (one with upgraded software and configuration), it is realized in the passive partition.

And then we reboot the system, switching the active and passive partitions.

If all goes well, the system simply continues running as is, with the new arrangement of active and passive partitions. The passive partition is now ready to be used for the next software update.

If, however, the new version fails to boot, the device will roll back automatically, making the passive partition active again: It held the last known good state, which remained entirely untouched by the upgrade process.

The application’s persistent data, which is provisioned on a separate partition, is then mounted to the booted system for the application to consume regardless of whether it was booted from the A or B partition.

This A/B system partition scheme provides immutable updates, easy rollback and a last line of defense for system resiliency.

What if both the A and B partitions fail to boot? The device will then boot into a recovery partition, which can still maintain a network connection to enable remote manual recovery. This saves frazzled field engineers from making time-consuming and expensive visits to every affected site.
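
To show the shape of that logic, here is a toy Python sketch of an A/B update agent. The helpers are placeholders for bootloader and image-writing integration; real implementations live in the bootloader and init system, and this is not how Palette Edge or any particular product is coded.

```python
STATE = {
    "active": "A",       # partition the device last booted successfully from
    "attempts": 0,       # failed boot attempts of the newly staged partition
    "max_attempts": 3,
}

def write_image(partition, image):
    """Placeholder: write the new immutable OS + application image onto `partition`."""

def set_next_boot(partition):
    """Placeholder: tell the bootloader which slot to try on the next boot."""

def passive(state):
    return "B" if state["active"] == "A" else "A"

def apply_update(state, image):
    """Stage the upgrade on the passive partition; the active one stays untouched."""
    target = passive(state)
    write_image(target, image)      # a whole image, never a patch-in-place
    state["attempts"] = 0
    set_next_boot(target)           # a reboot would follow here

def after_update_boot(state, booted_ok):
    """Decide what to do once the device has tried to boot the newly staged slot."""
    if booted_ok:
        state["active"] = passive(state)   # promote the new slot
        state["attempts"] = 0
        # the separate data partition is mounted here, whichever slot booted
        return "running new version"
    state["attempts"] += 1
    if state["attempts"] < state["max_attempts"]:
        return "retry"
    set_next_boot(state["active"])  # fall back to the untouched last-known-good slot
    return "rolled back"            # if that fails too, the recovery partition is next
```

The same basic pattern appears in Android’s A/B (seamless) system updates: the running system is never modified in place, so the last known good state is always there to fall back on.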

Note that this is exactly the architecture we’ve built in our Palette Edge management platform.

Bugs Are Inevitable. Headlines Are Avoidable.

If CrowdStrike, Windows and the enterprise IT landscape had followed the practices we describe above — particularly the use of immutable atomic A/B upgrades — the impact of last week’s ill-fated software update would have been far, far less. In fact, I doubt whether many of us would even have heard about it.

Even without staged canary rollouts or stricter testing, a catastrophic outage would have been avoided. All those corporate edge devices running ticket kiosks, digital signs and smart building controllers would have updated, failed to reboot and flipped back to their previous unchanged OS instance, resulting in only the briefest of outages, and giving CrowdStrike’s engineers an opportunity to fix the code and try again.

In today’s software world, with its unimaginable variety and rapid code changes, the truth is that all of us are experimenting in production.

How we do it will determine whether we make headlines or not.

This blog originally appeared on The New Stack

Tags:
Cloud
Edge Computing
Best Practices
Security