Crowdstrike outage post-mortem: lessons for IT leaders

As the world recovers from the largest IT outage in history, what insights can be drawn from the Crowdstrike incident, and how can IT leaders ensure it doesn’t happen again?


Thousands of flights grounded, payment systems frozen, digital services knocked offline and 8.5 million devices locked in endless loops of the dreaded Blue Screen of Death (BSOD).

These were just some of the disastrous consequences of the largest IT outage of all time, estimated by one insurer to have cost Fortune 500 companies $5.4bn (£4.2bn), and all due to a single buggy update rolled out by security vendor Crowdstrike on 19 July.

Unlike major ransomware or supply chain attacks, there was no malicious culprit. It was all caused by an unfortunate mistake at a previously well-regarded security company, but it had huge repercussions.

So, what exactly went wrong and how can businesses avoid disastrous events such as this in the future?

What caused the Crowdstrike IT outage?

In its preliminary post-incident review, Crowdstrike divided its report into two parts. It highlighted the difference between ‘Sensor Content’ – the code written to “deliver longer-term, reusable capabilities” for threat detection engineers – and ‘Rapid Response Content’, which is delivered via the cloud and is designed to detect anomalous patterns on client devices.

The systems failure, Crowdstrike says, was caused by an automatic software deployment linked to the Rapid Response Content side of its Falcon security suite. A bug in Crowdstrike’s automated testing process meant the software was approved even though the data contained in it was erroneous.

Because the affected product sat in the kernel – the central core of any operating system – the software had access to memory on the client device, along with virtually every other critical system resource.

Just because something’s passed an internal validator, doesn’t mean it’s correct

So, when the faulty update was pushed to client devices, it caused a memory fault that locked systems into a never-ending loop of BSODs. Worse, many devices had to be rebooted manually to remove the corrupted file. The update was also issued on a Friday, ruining the weekends of IT staff everywhere as they raced to undo the damage.

“Crowdstrike sends these rapid updates out in order to configure the product to see some latest threat they’ve detected,” explains Alan Woodward, cybersecurity professor at the University of Surrey. “When they send those out, they’re supposed to validate they’re in the right format.”

Although Crowdstrike did do this, its validator didn’t catch the problem. “That’s a big lesson,” Woodward says. “Just because something’s passed an internal validator, doesn’t mean it’s correct.”
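Woodward’s point is easy to illustrate. The Python sketch below is a deliberately simplified, hypothetical example – the file layout, field semantics and function names are invented for illustration, not Crowdstrike’s – showing how content can satisfy a format-only validator yet still break the component that ultimately loads it.

```python
import struct

# Hypothetical channel-file layout, for illustration only: a 4-byte field
# count followed by that many 4-byte unsigned integers. This is NOT
# CrowdStrike's actual format.
def validate_format(blob: bytes) -> bool:
    """A format-only validator: checks structure, not meaning."""
    if len(blob) < 4:
        return False
    (field_count,) = struct.unpack_from("<I", blob, 0)
    return len(blob) == 4 + field_count * 4

def load_in_staging(blob: bytes) -> list:
    """Simulates the consumer actually using the content.

    A zero field stands in for 'a value the driver cannot handle' --
    the kind of semantic problem a format check never sees.
    """
    (field_count,) = struct.unpack_from("<I", blob, 0)
    fields = list(struct.unpack_from(f"<{field_count}I", blob, 4))
    for field in fields:
        if field == 0:
            raise RuntimeError("consumer cannot process this field value")
    return fields

# A structurally valid file whose contents are still unusable.
bad_update = struct.pack("<I", 2) + struct.pack("<II", 7, 0)

print(validate_format(bad_update))   # True: the validator is satisfied
try:
    load_in_staging(bad_update)      # ...but actually loading it fails
except RuntimeError as err:
    print(f"caught in staging, not in the kernel: {err}")
```

The gap between those two checks is exactly why exercising new content in a staging environment, rather than trusting a validator alone, matters.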

Additionally, because the update was issued by a trusted vendor, it had been digitally signed. For all intents and purposes, this meant the update was trusted by Microsoft, and it therefore passed the software giant’s only check. The update was “perfectly valid” according to Microsoft’s systems, Woodward says.

Crowdstrike IT outage: a split in testing processes?

Although Crowdstrike was quick to publish its preliminary report, there are too few details in the document, according to Conor O’Neill, CEO of penetration-testing firm OnSecurity, who describes it as “a bit of a nothing sandwich”. 

“It didn’t tell us anything that, on the face of it, wasn’t known already,” he adds.

IT teams may have to wait a while for the full root-cause analysis of the incident to be published. 

However, buried in the linguistic choices of the report are some intriguing possibilities, O’Neill suggests, and it’s curious that Crowdstrike pointed to the difference between its Sensor Content and its Rapid Response Content deployment processes.

The preliminary report was a bit of a nothing sandwich

The Sensor Content release process, which was not responsible for the outage, includes automated testing before and after code is merged into the code base, along with unit, integration, performance and stress testing.

Then, there’s a staged rollout process that starts with internal use, followed by early adopters, before being made available to customers. Customers can select their comfort levels with these updates, whether that’s installing the latest release or one version older.

This is all “really thorough and robust,” says O’Neill. “It’s pretty much a best-in-class type of approach.”
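Crowdstrike’s report describes that process in prose rather than code, but it maps onto a familiar ring-based rollout pattern. The Python sketch below is a generic, hypothetical illustration of the idea – internal machines first, then early adopters, then general availability, with customers able to pin themselves one version behind (N-1) – not the vendor’s actual deployment logic.

```python
from dataclasses import dataclass

# Rollout rings in the order a release reaches them.
RINGS = ["internal", "early_adopter", "general"]

@dataclass
class Endpoint:
    name: str
    ring: str          # which rollout ring this machine belongs to
    version_lag: int   # 0 = take the latest release, 1 = stay one behind (N-1)

def version_for(endpoint: Endpoint, rollout_ring: str, releases: list) -> str:
    """Pick the release an endpoint should run while a rollout is in progress.

    `releases` is ordered oldest to newest; `rollout_ring` is how far the
    newest release has progressed through the rings.
    """
    if RINGS.index(endpoint.ring) > RINGS.index(rollout_ring):
        # The rollout hasn't reached this ring yet: stay on the previous release.
        return releases[-2]
    # Within reach of the rollout, honour the customer's chosen version lag.
    return releases[-1 - endpoint.version_lag]

releases = ["7.14", "7.15", "7.16"]
fleet = [
    Endpoint("build-server", "internal", 0),
    Endpoint("pilot-laptop", "early_adopter", 0),
    Endpoint("pos-terminal", "general", 1),   # mission-critical, pinned to N-1
]
for machine in fleet:
    # The new 7.16 release has only reached the early-adopter ring so far.
    print(machine.name, "->", version_for(machine, "early_adopter", releases))
```

The point of the pattern is that a bad release hits a small, recoverable population before it can reach everyone.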

But testing appears to have been “far less stringent” for the offending Rapid Response side of the equation, O’Neill says.

Crowdstrike merely said that updates here are “stress tested across many aspects, such as resource utilisation, system performance impact and event volume”.

“This release process doesn’t sound anywhere near as definitive,” O’Neill says. “If anything, it sounds hand-wave-y.”

Although he admits he can’t be certain, O’Neill says that describing the sensor content as “newly released” suggests it may have been rolled out prior to any robust testing.

Crowdstrike says it will employ several new tests on its Rapid Response Content moving forward and will add more validation checks to the automated content validator. A new check, it says, is in the works to “guard against this type of problematic content from being deployed in the future”.

In hindsight, O’Neill imagines, employees at Crowdstrike now realise the company’s prior approach was a disaster waiting to happen. “I think this is further confirmed by the remediation steps, which you could summarise as: ‘copy that great process we have in place over to Rapid Response releases’,” he says.

How not to botch an update

In now-historic threads on Reddit and Hacker News, thousands of systems administrators were figuratively tearing their hair out in real time as the outage unfolded.

Many commenting in the initial posts couldn’t believe that basic processes – such as avoiding deployments on the cusp of the weekend, and staging rollouts rather than forcing them on all customers at once – did not seem to have been followed.

Unfortunately for the unlucky recipients of the Crowdstrike bug, there’s not a great deal that could have been done to mitigate the impact of the release. That’s because it was issued to all devices simultaneously via the cloud and had immediate access to vital systems. By the time it had corrupted a device’s system memory, it was too late and the painstaking, manual remediation process had to begin.

But the dramatic incident should be a wake-up call about the unsettling vulnerability of digital supply chains. IT infrastructure is so complex today that organisations must “expect the unexpected”, advises Ryan Thornley, security lead at Google Cloud partner Appsbroker CTS.

Organisations can protect themselves by creating robust disaster-recovery processes that improve their resilience, whether in the face of dodgy updates or supply chain attacks. Thornley adds: “Using third-party tooling should factor into disaster recovery scenarios to ensure you cover home-grown issues as well as those that third parties may suffer.”

If you’re sending out something that could have a profound impact on the receiving system, then you need to be double-sure

The Crowdstrike incident also underlines the importance of exercising caution when issuing updates. But there is a balancing act to be struck. Given the rapidly evolving cybersecurity landscape, no organisation should dally either, whether on the issuing or receiving end of an update, according to Thornley.

“Being on the bleeding edge can have unintended consequences,” he says. “This will vary from business to business, but the main thing is to avoid taking a one-size-fits-all approach to your infrastructure.”

Instead, take a proactive approach to updates but refrain from applying them to mission-critical systems as soon as they’re released. While it might sound obvious, automatic updates should be reserved for less important environments and backed by a strong set of processes before changes are rolled out to the rest of the business, Thornley advises.
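What that might look like in practice is sketched below in Python. The environment names, soak windows and policy shape are illustrative assumptions rather than a recommendation, but they capture the idea of letting a release prove itself in low-stakes environments before it touches anything mission-critical.

```python
from datetime import datetime, timedelta, timezone

# A hypothetical per-environment patch policy: auto-update low-stakes
# environments straight away, make everything else wait out a soak period.
POLICY = {
    "dev":        {"auto_update": True,  "soak": timedelta(hours=0)},
    "staging":    {"auto_update": True,  "soak": timedelta(hours=24)},
    "production": {"auto_update": False, "soak": timedelta(days=7)},
}

def should_auto_apply(environment: str, released_at: datetime, now: datetime) -> bool:
    """True if a release may be applied automatically in this environment."""
    rule = POLICY[environment]
    soaked = now - released_at >= rule["soak"]
    # Mission-critical environments never auto-update; they go through
    # change approval even once the release has soaked elsewhere.
    return rule["auto_update"] and soaked

now = datetime.now(timezone.utc)
released_at = now - timedelta(hours=30)   # the release is 30 hours old
for env in POLICY:
    print(env, should_auto_apply(env, released_at, now))
```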

As Crowdstrike acknowledges in its report, organisations should thoroughly test before deployment, especially if their software will have the highest possible access to critical system components.

“If you’re sending out something that could have a profound impact on the receiving system, then you need to be double-sure,” Woodward says.

Although Crowdstrike was the culprit behind the outage, more attention may also need to be paid to the interaction between cybersecurity vendors and Windows systems.

One pressing question is why cybersecurity businesses are granted access to the most critical core components of Windows. Cybersecurity vendors, Woodward explains, say they need this access to detect emerging “rootkit” threats – attacks in which core system components are compromised by threat actors.

Yet this is not the case on Linux or macOS, where cybersecurity vendors usually interact with the kernel via an API rather than tinkering with the system directly. “If something goes wrong, you don’t take everything else down with you, which is what happened here,” he adds.

What next?

Crowdstrike’s shares are slowly recovering and financial analysts are confident this will continue. But the fallout is not yet over. Other cybersecurity vendors are fighting over the scraps from the damage to Crowdstrike’s reputation and taking out ads to sweep up disgruntled ex-customers.

American airline Delta wants more than the $10 Uber Eats vouchers Crowdstrike offered affected clients as an apology. The airline estimates the outage cost it $500m (£389m). Customers may be covered by insurance, but whether those payouts sufficiently account for the financial or reputational damage is another question.

Whatever shape the final root-cause analysis takes when it is released, the cybersecurity industry, and business leaders more generally, would do well to draw lessons from the incident.