What caused the Microsoft IT outage and what might it mean for payroll?

A critical bug in a Crowdstrike update has grounded flights, stopped emergency services and hit payroll providers, putting employees at risk of receiving their salaries late

An example of Microsoft’s so-called blue screen of death

Companies around the world are reeling from what is being described as one of the largest IT outages ever. 

The failure has left many workers struggling to access their computers, grounded flights and knocked banks, hospitals, card payment providers and emergency services offline.

On Downdetector, which reports digital service problems at major companies, many household names, including Natwest, Lloyds, Morrisons, Amazon, Nationwide, Three and Halifax, were experiencing problems.

The Global Payroll Association has also confirmed that the IT outage has impacted a number of its clients and affected businesses’ ability to pay their employees.

This could present a particular problem for HR teams as the end of the month approaches, when salaries are usually paid.

What caused the Microsoft IT outage?

The culprit has been identified as an auto-update from cybersecurity vendor Crowdstrike, which sells threat detection software to enterprise customers across a wide array of sectors, including retail, banking and government.

Speaking with NBC’s Today Show, Crowdstrike CEO George Kurtz ruled out a cyber attack, admitting there had been a mistake in a software update. Kurtz apologised for the failure but conceded it “could be some time” before normality is restored.

The update appears to have corrupted Microsoft Windows system files and locked machines into a “bootloop” cycle of errors, where machines start up only to display the so-called blue screen of death – the infamous error message that is the scourge of system administrators everywhere.

Sysadmins took to social media such as Hacker News and Reddit to discuss the outage as it happened – a real-time piece of digital lore in the making – where they drew comparisons to what the Y2K panic at the turn of the millennium might have looked like had it actually come to pass.

Commenters speculated that the only way to fix the problem, at least on end-devices such as laptops, will be to manually boot every single device and remove the corrupted file in question – a mammoth task that could take days, weeks or even longer to complete, given the size of IT estates. 

The current, laborious workaround for restoring systems involves booting Windows machines in ‘safe mode’, a pared-back diagnostic mode of Windows, and deleting the offending C-00000291*.sys file from the Crowdstrike directory in the Windows system drivers folder.
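For administrators who script the clean-up rather than doing it by hand, the file-removal step might look something like the sketch below. It is a minimal illustration, not Crowdstrike's own tooling: the function names are hypothetical, the folder and file pattern are supplied by the caller, and it defaults to a dry run so that nothing is deleted until the list of matches has been checked.

```python
from pathlib import Path


def find_matching_files(folder: Path, pattern: str) -> list[Path]:
    """Return files directly inside `folder` whose names match the glob `pattern`."""
    return sorted(p for p in folder.glob(pattern) if p.is_file())


def remove_matching_files(
    folder: Path, pattern: str, dry_run: bool = True
) -> list[Path]:
    """Return the matching files, deleting them only when dry_run is False.

    Hypothetical helper: `folder` and `pattern` come from the caller, so the
    script makes no assumption about where the faulty file lives.
    """
    matches = find_matching_files(folder, pattern)
    if not dry_run:
        for path in matches:
            path.unlink()  # permanently deletes the file
    return matches
```

Calling remove_matching_files with the default dry_run=True only reports what would be removed. The machine still has to be started in safe mode first, which is why the fix remains a device-by-device job.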

Theoretically, the incident could have happened to any vendor rolling out unsafe updates in a live environment, said Ilkka Turunen, field CTO at Sonatype, because infosecurity platforms need to interface with core system files.

Businesses lose access to payroll systems

Melanie Pizzey, CEO and founder of the Global Payroll Association, warns that the outage could have “serious implications”, depending on how long it lasts. This is particularly worrying for businesses that pay staff weekly, and it could cause backlogs and delays for workers who are paid monthly. “At best, it will require those managing payroll to work overtime to rectify the issue,” she says.

Pizzey advises that impacted HR and payroll professionals consult their contingency plans. Any businesses that issue salaries via Bacs payments should confirm the submission deadline. “This will give them a clear timeframe to work to when rectifying any issues,” she adds. “They also need to communicate any issues to key stakeholders so that they are both aware of the problem and the process in place to solve it and consider faster payment options if needed.”

Although IRIS Software Group confirmed that its payroll products are “fully operational and unaffected”, it warned that some support services may experience delays as a “limited number” of its employees had been impacted by the outage.

Its senior product director of payroll Fran Williams says the incident “highlights the importance of regularly reviewing your technology infrastructure and business continuity plans” and advises businesses to get regular updates from their IT teams and payroll providers.

Keeping employees informed about any payroll processing issues or delays and using secure back-ups to store any manually processed data are also important, he adds.

In the future, it will be good practice for businesses to develop payroll contingency plans, adds Malc Coton, head of sales for HR, payroll and finance consultancy Phase 3. He says: “It’s impossible to legislate for all eventualities and while today’s events have been rectified quickly, it has put business continuity plans at the top of everyone’s agendas again.”

“There’s always a people problem to mitigate as well as software,” Coton adds.

The one positive for HR teams is that the cause of the outage is not thought to have been a cyber attack. This means that sensitive employee details held within payroll should not have been compromised. However, Pizzey adds that some companies are taking additional precautions to minimise such threats.

Outage highlights supply-chain risks

Friday’s events highlight the interconnected nature of business and the fragility of many of the IT systems the global economy relies on, according to Turunen. “The outage shows that one popular vendor botching an update can have a huge impact on its customers,” he says.

Co-founder of the Government Digital Service, Dafydd Vaughan, warns that the outage also shines a spotlight on supply-chain risk. “Companies and national governments need to be prepared and take mitigating actions to minimise exposure,” says Vaughan, who is also CTO at transformation consultancy Public Digital. 

“Today’s crisis could have been avoided by companies rolling out computer updates on a few machines first to check they work, rather than sending them to all machines at the same time.” 

He adds that governments should consider the risk of so few companies controlling so much essential infrastructure.

To avoid outages such as this in future, being able to control rollouts through testing early is key, says Jeff Watkins, chief product and technology officer at digital consultancy CreateFuture. Organisations should have the ability to control test environments and keep them segregated from the rest of the corporate network, in case of malicious or defective updates.
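The staged approach that Vaughan and Watkins describe can be sketched in a few lines of code. The example below is an illustration only, under stated assumptions: the function name, the apply_update and is_healthy callbacks and the default of two test machines are all hypothetical, and it does not represent how Crowdstrike or any other vendor actually ships updates.

```python
from typing import Callable, Dict, List


def staged_rollout(
    machines: List[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_count: int = 2,
) -> Dict[str, object]:
    """Update a few test machines first; continue only if all stay healthy."""
    canaries = machines[:canary_count]
    remaining = machines[canary_count:]

    # Stage 1: send the update to the small test group only.
    for machine in canaries:
        apply_update(machine)

    failed = [m for m in canaries if not is_healthy(m)]
    if failed:
        # Stop here so the faulty update never reaches the rest of the estate.
        return {"updated": canaries, "failed": failed, "halted": True}

    # Stage 2: the test group is healthy, so update everything else.
    for machine in remaining:
        apply_update(machine)
    return {"updated": canaries + remaining, "failed": [], "halted": False}
```

If any machine in the first group fails its health check, the rollout halts and the remaining machines are left untouched, which limits the damage to a handful of devices rather than the whole estate.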

The outage has also had a significant impact on Crowdstrike. The company was recently valued at $80bn (£62bn) – however, the failure has caused shares to plummet, dropping by 27.8% at time of writing.