Inside the 78 minutes that took down millions of Windows machines

On Friday morning, shortly after midnight in New York, disaster started to unfold around the world. In Australia, shoppers were met with Blue Screen of Death (BSOD) messages at self-checkout aisles. In the UK, Sky News had to suspend its broadcast after servers and PCs started crashing. In Hong Kong and India, airport check-in desks began to fail. By the time morning rolled around in New York, millions of Windows computers had crashed, and a global tech disaster was underway.

In the early hours of the outage, there was confusion over what was going on. How were so many Windows machines suddenly showing a blue crash screen? “Something super weird happening right now,” Australian cybersecurity expert Troy Hunt wrote in a post on X. On Reddit, IT admins raised the alarm in a thread titled “BSOD error in latest CrowdStrike update” that has since racked up more than 20,000 replies.

The problems led major US airlines to ground their fleets and left workers across banks, hospitals, and other major institutions in Europe unable to log in to their systems. And it quickly became apparent that it was all due to one small file.

At 12:09AM ET on July 19th, cybersecurity company CrowdStrike released a faulty update to the Falcon security software it sells to help companies prevent malware, ransomware, and any other cyber threats from taking down their machines. It’s widely used by businesses for important Windows systems, which is why the impact of the bad update was so immediate and felt so broadly.

CrowdStrike’s update was supposed to be like any other silent update, automatically providing the very latest protections for its customers in a tiny file (just 40KB) that’s distributed over the web. CrowdStrike issues these regularly without incident, and they’re fairly common for security software. But this one was different. It exposed a massive flaw in the company’s cybersecurity product, a catastrophe that was only ever one bad update away — and one that could have been easily avoided.

How did this happen?

CrowdStrike’s Falcon protection software operates in Windows at the kernel level, the core part of an operating system that has unrestricted access to system memory and hardware. Most other apps run in user mode and neither need nor get special access to the kernel. Falcon uses a special driver that lets it run at a lower level than most apps so it can detect threats across a Windows system.

Running in the kernel makes CrowdStrike’s software far more capable as a line of defense, but also far more capable of causing problems. “That can be very problematic, because when an update comes along that isn’t formatted in the correct way or has some malformations in it, the driver can ingest that and blindly trust that data,” Patrick Wardle, CEO of DoubleYou and founder of the Objective-See Foundation, tells The Verge.

Kernel access makes it possible for the driver to create a memory corruption problem, which is what happened on Friday morning. “Where the crash was occurring was at an instruction where it was trying to access some memory that wasn’t valid,” Wardle says. “If you’re running in the kernel and you try to access invalid memory, it’s going to cause a fault and that’s going to cause the system to crash.”

CrowdStrike spotted the issues quickly, but the damage was already done. The company issued a fix 78 minutes after the original update went out. IT admins tried rebooting machines over and over, and some came back online when the machine managed to download the fixed update before CrowdStrike’s driver crashed the server or PC. For many support workers, though, the fix has involved manually visiting the affected machines and deleting CrowdStrike’s faulty content update.

While investigations into the CrowdStrike incident continue, the leading theory is that there was likely a bug in the driver that had been lying dormant for some time. It might not have been validating the data it was reading from the content update files properly, but that was never an issue until Friday’s problematic content update.

“The driver should probably be updated to do additional error checking, to make sure that even if a problematic configuration got pushed out in the future, the driver would have defenses to check and detect… versus blindly acting and crashing,” says Wardle. “I’d be surprised if we don’t see a new version of the driver eventually that has additional sanity checks and error checks.”
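To make the idea of "additional sanity checks" concrete, here is a minimal sketch in Python. It is not CrowdStrike's code, and the file layout is invented for illustration: the magic value `CFG1`, the header format, the `MAX_FIELDS` limit, and every function name are assumptions. The point is only that a parser can reject a malformed file and keep the last known-good configuration instead of acting on bad data.

```python
import struct

# Hypothetical file layout, invented for this example only:
#   4-byte magic, 2-byte entry count, then `count` 4-byte little-endian offsets.
MAGIC = b"CFG1"
HEADER = struct.Struct("<4sH")
MAX_FIELDS = 64  # assumed upper bound on entries


class InvalidContentFile(ValueError):
    """Raised when a content file fails validation."""


def parse_content(blob, table_size):
    """Return the validated offsets, or raise InvalidContentFile."""
    if len(blob) < HEADER.size:
        raise InvalidContentFile("file is shorter than the header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidContentFile("unexpected magic value")
    if count > MAX_FIELDS:
        raise InvalidContentFile("too many entries")
    if len(blob) != HEADER.size + 4 * count:
        raise InvalidContentFile("length does not match entry count")
    offsets = list(struct.unpack_from(f"<{count}I", blob, HEADER.size))
    # Every offset must point inside the table it will be used to index.
    for offset in offsets:
        if offset >= table_size:
            raise InvalidContentFile("offset points outside the table")
    return offsets


def load_or_keep_previous(blob, table_size, previous):
    """Use the new content if it validates; otherwise keep the old one."""
    try:
        return parse_content(blob, table_size)
    except InvalidContentFile:
        return previous
```

In a real kernel driver the same checks would be written in C or a similar language, where skipping them can mean reading invalid memory rather than raising an exception.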

CrowdStrike should have caught this issue sooner. It’s a fairly standard practice to roll out updates gradually, letting developers test for any major problems before an update hits their entire user base. If CrowdStrike had properly tested its content updates with a small group of users, then Friday would have been a wake-up call to fix an underlying driver problem rather than a tech disaster that spanned the globe.
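A gradual rollout can be sketched in a few lines of Python. This is a generic illustration, not a description of CrowdStrike's deployment system: the stage percentages, the 1 percent crash threshold, and the function names are all assumptions. Each machine is hashed into a stable bucket, an update reaches only the buckets below the current percentage, and the rollout halts if too many updated machines crash.

```python
import hashlib

# Assumed rollout stages: percentage of machines that receive the update.
STAGES = [1, 10, 50, 100]


def bucket(host_id):
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(host_id, percent):
    """True if this host should receive the update at the given percentage."""
    return bucket(host_id) < percent


def next_stage(stages, current_index, crashed, updated, max_crash_rate=0.01):
    """Return the next stage index, or None to halt the rollout."""
    if updated == 0:
        return current_index  # no telemetry yet, so stay at this stage
    if crashed / updated > max_crash_rate:
        return None  # too many crashes: stop before reaching more machines
    return min(current_index + 1, len(stages) - 1)
```

Under these assumptions, a bad update that crashed most of the first 1 percent of machines would stop at that stage rather than reach everyone.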

Microsoft didn’t cause Friday’s disaster, but the way Windows operates allowed the entire OS to fall over. The widespread Blue Screen of Death messages are so synonymous with Windows errors from the ’90s onward that many headlines initially read “Microsoft outage” before it was clear CrowdStrike was at fault. Now, there are the inevitable questions over how to prevent another CrowdStrike situation in the future — and that answer can only come from Microsoft.

What can be done to prevent this?

Despite not being directly involved, Microsoft still controls the Windows experience, and there is plenty of room for improvement in how Windows handles issues like this.

At the simplest, Windows could disable buggy drivers. If Windows determines that a driver is crashing the system at boot and forcing it into a recovery mode, Microsoft could build in more intelligent logic that allows a system to boot without the faulty driver after multiple boot failures.
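As a rough sketch of what that logic might look like, the Python below counts consecutive failed boots blamed on a third-party driver and skips the driver once a threshold is reached. Nothing here reflects how Windows actually works: the threshold of three, the `BootState` structure, and the driver names are hypothetical.

```python
from dataclasses import dataclass, field

MAX_FAILED_BOOTS = 3  # assumed threshold before a driver is skipped


@dataclass
class BootState:
    # Consecutive failed boots attributed to each driver.
    failed_boots: dict = field(default_factory=dict)
    # Drivers that will be skipped on the next boot.
    disabled: set = field(default_factory=set)


def record_boot(state, crashed_driver, third_party_drivers):
    """Update the state after a boot; crashed_driver is None on success."""
    if crashed_driver is None:
        state.failed_boots.clear()  # a clean boot resets the counters
        return
    if crashed_driver not in third_party_drivers:
        return  # only third-party drivers are candidates for skipping
    count = state.failed_boots.get(crashed_driver, 0) + 1
    state.failed_boots[crashed_driver] = count
    if count >= MAX_FAILED_BOOTS:
        state.disabled.add(crashed_driver)


def drivers_to_load(state, all_drivers):
    """Return the drivers to load, leaving out any that were disabled."""
    return [d for d in all_drivers if d not in state.disabled]
```

The trade-off is that booting without a security driver leaves the machine running unprotected, so a real design would also need to alert administrators.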

But the bigger change would be to lock down Windows kernel access to prevent third-party drivers from crashing an entire PC. Ironically, Microsoft tried to do exactly this with Windows Vista but was met with resistance from cybersecurity vendors and EU regulators.

Microsoft tried to implement a feature known at the time as PatchGuard in Windows Vista in 2006, restricting third parties from accessing the kernel. McAfee and Symantec, the big two antivirus companies at the time, opposed Microsoft’s changes, and Symantec even complained to the European Commission. Microsoft eventually backed down, allowing security vendors access to the kernel once again for security monitoring purposes.

Apple eventually took that same step, locking down its macOS operating system in 2020 so that developers could no longer get access to the kernel. “It was definitely the right decision by Apple to deprecate third-party kernel extensions,” says Wardle. “But the road to actually accomplishing that has been fraught with issues.” Apple has had some kernel bugs where security tools running in user mode could still trigger a crash (kernel panic), and Wardle says Apple “has also introduced some privilege execution vulnerabilities, and there are still some other bugs that could allow security tools on Mac to be unloaded by malware.”

Regulatory pressures may still be stopping Microsoft from taking action here. The Wall Street Journal reported over the weekend that “a Microsoft spokesman said it cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint.” The Journal paraphrases the anonymous spokesperson and also mentions a 2009 agreement to provide security vendors the same level of access to Windows that Microsoft gets.

Microsoft reached an interoperability agreement with the European Commission in 2009 that was a “public undertaking” to allow developers to get access to technical documentation for building apps on top of Windows. The agreement was formed as part of a deal that included implementing a browser choice screen in Windows and offering special versions of Windows without Internet Explorer bundled into the OS.

The deal to force Microsoft to offer browser choices ended five years later in 2014, and Microsoft also stopped producing its special versions of Windows for Europe. Microsoft now bundles its Edge browser in Windows 11, unchallenged by European regulators.

It’s not clear how long this interoperability agreement was in place, but the European Commission doesn’t seem to believe it’s holding back Microsoft from overhauling Windows security. “Microsoft is free to decide on its business model and to adapt its security infrastructure to respond to threats provided this is done in line with EU competition law,” European Commission spokesperson Lea Zuber says in a statement to The Verge. “Microsoft has never raised any concerns about security with the Commission, either before the recent incident or since.”

The Windows lockdown backlash

Microsoft could attempt to go down the same route as Apple, but the pushback from security vendors like CrowdStrike will be strong. Unlike Apple, Microsoft also competes with CrowdStrike and other security vendors that have made a business out of protecting Windows. Microsoft has its own Defender for Endpoint paid service, which provides similar protections to Windows machines.

CrowdStrike CEO George Kurtz also regularly criticizes Microsoft and its security record and boasts of winning customers away from Microsoft’s own security software. Microsoft has had a series of security mishaps in recent years, so it’s easy and effective for competitors to use these to sell alternatives.

Every time Microsoft tries to lock down Windows in the name of security, it also faces backlash. A special mode in Windows 10 that limited machines to Windows Store apps to avoid malware was confusing and unpopular. Microsoft also left millions of PCs behind with the launch of Windows 11 and its hardware requirements that were designed to improve the security of Windows PCs.

Cloudflare CEO Matthew Prince is already warning about the effects of Microsoft locking down Windows further, arguing that Microsoft would favor its own security products if that were to happen. All of this pushback means Microsoft has a tricky path to tread here if it wants to avoid Windows being at the center of a CrowdStrike-like incident again.

Microsoft is stuck in the middle, with pressure from both sides. But at a time when Microsoft is overhauling security, there has to be some room for security vendors and Microsoft to agree on a better system that prevents another worldwide wave of blue screen outages.
