On July 19, millions of Windows users encountered the dreaded “blue screen of death.” A bug in a critical piece of cybersecurity software, made by the company CrowdStrike, was causing the operating system to crash. For some people and companies, the issue is ongoing, and costs are projected to be in the billions.

There’s little we can do to protect against bugs in the software we’re using, says Zakir Durumeric, an assistant professor of computer science. “In general, though, one of the best things that people can do to protect themselves against attacks is to regularly update their computers and phones.” Here, he shares his insights on the outage.

1. In simple terms, what happened?

The outage that started July 19 was caused by a malformed update that was sent to a piece of security software called “CrowdStrike Falcon.” While CrowdStrike may not be a household name, it is a major enterprise security company that builds what we call Endpoint Detection and Response (EDR) software. EDR is the enterprise successor to antivirus – it’s software that continuously runs on every workstation within a company and monitors for abnormal behavior that might indicate the computer has been infected (e.g., with ransomware). EDR is ubiquitous and is thought by many folks in the security industry to be one of the best tools for protecting users’ computers against attacks.

The update that was sent to CrowdStrike software on Friday was malformed, which caused the software to crash every time it started and tried to parse the update. Now, usually, when an application like Google Chrome or Microsoft Word crashes, only that one application crashes. However, a lot of security software – including CrowdStrike Falcon – is special in this regard. Because CrowdStrike needs to detect malicious activity on the whole computer, it runs as part of the Windows operating system instead of on top of it. Unfortunately, this meant that when it crashed, it took the Windows OS down with it.
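
To make the failure mode concrete, here is a minimal Python sketch of the defensive alternative: validate an update before using it, and keep the previous known-good version if parsing fails. The JSON rule format and the names `parse_rules` and `load_update` are assumptions made up for illustration; they do not reflect how CrowdStrike Falcon actually stores or loads its updates.

```python
import json

# Hypothetical update format: a JSON list of objects, each with a string
# "name" and a string "pattern". Illustration only; this is NOT the real
# CrowdStrike Falcon update format.


def parse_rules(raw: bytes) -> list[dict]:
    """Parse and validate an update; raise ValueError if it is malformed."""
    try:
        rules = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"update is not valid JSON: {exc}") from exc

    if not isinstance(rules, list):
        raise ValueError("update must be a list of rules")

    for rule in rules:
        well_formed = (
            isinstance(rule, dict)
            and isinstance(rule.get("name"), str)
            and isinstance(rule.get("pattern"), str)
        )
        if not well_formed:
            raise ValueError(f"malformed rule: {rule!r}")
    return rules


def load_update(raw: bytes, last_good: list[dict]) -> list[dict]:
    """Return the new rules if they parse cleanly, else keep the old ones."""
    try:
        return parse_rules(raw)
    except ValueError:
        # A bad update is rejected instead of crashing the process.
        return last_good
```

In this sketch, a malformed update costs only the new rules: the program keeps running with the last version that passed validation. Code that runs as part of the operating system has far less room for error, which is why the same kind of mistake there can take down the whole machine.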

2. Why was the impact so significant – and why is it taking so long to resolve?

The fix to get CrowdStrike and Windows running again is simple – one just needs to delete the malformed file that was shipped as part of the update. Unfortunately, because the Windows operating system crashes every time it boots, this cannot be done remotely or in any automated fashion. Instead, IT staff need to manually boot Windows machines into a troubleshooting “Safe Mode” to delete the problematic update. Further complicating fixes, when computers use BitLocker Full Disk Encryption, which is strongly recommended, IT staff also need the associated BitLocker recovery keys to apply the fix, and some organizations are realizing they don’t have those keys recorded or accessible.

3. What happened with air travel?

Many organizations, including airlines, use CrowdStrike EDR software to protect their Windows workstations and servers. As a result, computers at some airlines, most notably Delta, stopped booting on Friday. Delta has noted that upwards of half of its systems run Windows and that its crew scheduling system, in particular, was heavily impacted. We don’t yet know why it’s taken Delta longer than other organizations to get these systems back online; the U.S. Department of Transportation has opened an investigation into Delta over the issue.

4. Are there any lessons we can learn from the outage?

This incident serves as a stark reminder of just how reliant we have become on incredibly complex software systems and the large number of dependencies that each system has. While we’re getting better at software development as a field, we are still a long way from being able to guarantee that complex systems won’t have bugs like this. Critical infrastructure providers need to be thinking about how they’re architecting their systems to be resilient against system failures and how they’re going to recover when a system does fail, because this undoubtedly won’t be the last time we see a bug like this.