CrowdStrike Update Sparks Global IT Outage: What Went Wrong?

www.ebryx.com

CrowdStrike Outage: A Technical Analysis and Prevention Guide

We’ll be diving deep into the recent CrowdStrike outage that sent shockwaves across the global IT landscape. Our aim is to shed light on the events that unfolded, exploring what went wrong, why it happened, and, most importantly, how organizations can safeguard themselves against similar disruptions in the future.

Understanding the root cause of such an outage is critical for IT professionals and cybersecurity experts alike, as it offers valuable lessons that can help prevent potential threats and ensure business continuity. So, let’s dissect the CrowdStrike saga and uncover the key takeaways to keep your organization secure and functioning.

What You Need to Know About CrowdStrike

CrowdStrike is a globally recognized leader in cybersecurity, renowned for its expertise in defending against malware and ransomware attacks. Their cutting-edge tools empower organizations to protect themselves from both known and emerging threats, making CrowdStrike a trusted name in the industry.

At the heart of their offerings is the Falcon Endpoint Detection & Response (EDR) software—a flagship product that plays a pivotal role in safeguarding businesses against malicious attacks. However, in the recent global IT outage, Falcon found itself at the center of controversy, as a critical update led to widespread disruptions.

Understanding Falcon EDR and Why It Matters

As cyber threats have evolved, traditional antivirus software has proven insufficient in defending against modern attacks. To counter this, the cybersecurity industry developed Endpoint Detection & Response (EDR) solutions, which collect extensive data from users’ systems (endpoints) and provide a centralized interface for security professionals to analyze this data and respond to threats in real time.

CrowdStrike’s Falcon is a leading EDR solution. It gathers comprehensive telemetry data from endpoints, including network traffic, process activity, file system events, and operating system activities. This data is then made accessible through a web-based interface, allowing cybersecurity teams to quickly identify and remediate potential threats.

At the heart of Falcon's functionality is the Falcon Sensor, a lightweight software agent installed on endpoints. This sensor continuously collects telemetry data and transmits it in real time to CrowdStrike’s servers, enabling swift analysis and response to any detected anomalies or attacks.

How Falcon Sensor Gathers and Transmits Critical Telemetry Data

To effectively protect endpoints, Falcon Sensor requires access to critical system events. However, this level of access isn’t always granted by the Windows operating system to third-party security vendors. To overcome this challenge, CrowdStrike, like many cybersecurity companies, employs a dual-layer approach to data collection.

Falcon Sensor operates through two sets of information-gathering services: one at the user level and the other at the kernel level. The user-level service runs like any other application on the system, while the kernel-level service runs as a device driver, with the same privileges as the Windows kernel itself. This dual approach allows Falcon Sensor to capture a comprehensive range of data, providing visibility into all levels of the operating system.

The user-level service is a straightforward executable, but the kernel-level service is more complex, functioning as a device driver named csagent.sys. Unfortunately, it was an error in this device driver that triggered the recent global IT outage, highlighting the risks associated with such deep-level access.

Falcon Sensor Kernel Driver: Behind the Scenes

On Windows, any driver that runs within the kernel must be digitally signed by Microsoft. This signature process involves rigorous testing to ensure the driver's safety and reliability. However, the process of obtaining a digital signature can take anywhere from a few days to several weeks. In the fast-paced world of cybersecurity, where threats evolve rapidly, waiting for a new signature with every update isn’t always practical.

To address this challenge, CrowdStrike implemented a unique solution for their Falcon Sensor kernel driver. Instead of modifying the driver itself frequently, they designed it to act as an interpreter for proprietary Channel Files. These files contain specific instructions for the driver, guiding it on what data to collect and when to trigger events.

These Channel Files are stored in the %WINDIR%\System32\drivers\CrowdStrike directory and can be identified by their unique naming convention. Each file starts with a "C-" prefix, followed by a number that uniquely identifies the channel file group, and ends with the .sys extension.

This approach allows CrowdStrike to update and adapt the Falcon Sensor’s behavior without needing to go through the lengthy driver signing process each time, ensuring that the software remains agile and responsive to new threats.
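
The naming convention described above can be sketched in a few lines of Python. This is only an illustration of matching file names by channel number, not a CrowdStrike tool: the regular expression, the function names, and the assumption that anything may follow the channel number before the .sys extension (suggested by the wildcard in C-00000291*.sys) are ours.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumed pattern: "C-" prefix, a numeric channel identifier, an optional
# remainder that starts with a non-digit character, then the ".sys" extension.
CHANNEL_FILE_RE = re.compile(r"^C-(\d+)(?:[^\d].*)?\.sys$", re.IGNORECASE)


def channel_number(filename: str) -> Optional[int]:
    """Return the channel number encoded in a file name, or None if the name does not match."""
    match = CHANNEL_FILE_RE.match(filename)
    return int(match.group(1)) if match else None


def find_channel_files(directory: Path, channel: int) -> List[Path]:
    """List the files in `directory` whose names place them in the given channel."""
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and channel_number(path.name) == channel
    )
```

For example, find_channel_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"), 291) would list the Channel File 291 candidates on a Windows endpoint without deleting anything, which makes it usable as a dry run before any cleanup.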

Potential Risks with Falcon Sensor’s Architecture

While CrowdStrike’s architecture for the Falcon Sensor driver effectively meets the company’s needs, it introduces potential security risks. The design, which relies on interpreting Channel Files rather than frequent driver updates, creates a gap in the standard security measures enforced by Microsoft for device drivers.

This approach places the full burden of ensuring the driver’s stability and security on CrowdStrike’s Quality Assurance (QA) team. However, it's important to note that this team was reported to have been reduced in size in 2023 as part of cost-cutting measures. With fewer resources dedicated to QA, the potential for undetected issues or vulnerabilities in the driver increases, raising concerns about the overall security of the endpoint protection system.

Inside the Falcon Sensor Outage: Causes and Impact

The Falcon Sensor's architecture required meticulous testing by CrowdStrike to prevent endpoint crashes. However, on July 19, 2024, at 04:09 UTC, the company pushed out Channel File 291. Despite passing initial validation checks, this file contained invalid data that triggered widespread issues.

The deployment of Channel File 291 caused the infamous Blue Screen of Death (BSoD) on Windows workstations and servers in critical environments, including airports and hospitals.

Channel File 291 reportedly included instructions related to named pipes, a Windows Inter-Process Communication (IPC) mechanism often exploited by threat actors for Command-and-Control (C2) activities. Due to inadequate validation, this problematic file was distributed to endpoints, where it caused the Windows kernel to crash. Because the driver loads during system boot, the crash occurred even before user login.

Fixing the Crash: Removing Channel File 291 Using Safe Mode

To address the crash caused by Channel File 291, the immediate solution was to remove the problematic file from the system. However, given that the system was failing to boot properly, traditional methods were not an option. The workaround in this scenario was to use Safe Mode.

Safe Mode is a diagnostic startup mode in Windows that loads only essential drivers, allowing users to troubleshoot and resolve issues that prevent normal booting. Here’s how to use Safe Mode to remove the problematic Channel File 291:
  1. Power on your PC and immediately press and hold the F8 key on your keyboard, before the Windows logo appears. On Windows 10 and 11, F8 is disabled by default; these systems usually open the Windows Recovery Environment on their own after repeated failed boots, where you can choose Troubleshoot > Advanced options > Startup Settings > Restart.
  2. Select Safe Mode with Command Prompt: with the arrow keys and Enter on the Advanced Boot Options screen, or with the listed number key on the Startup Settings screen.
  3. Once the system boots into Safe Mode and the Command Prompt window appears, type the following command on a single line and press Enter:
    del /q %WINDIR%\System32\drivers\CrowdStrike\C-00000291*.sys
  4. Restart your computer, and it should boot up normally.

CrowdStrike has since addressed this issue with an update. However, a system that is already crashing at boot must first be started with the above workaround so the problematic file can be removed before the fix is downloaded and installed.

Accountability for the Falcon Sensor Incident: Who Is at Fault?

The recent Falcon Sensor outage has sparked a wave of criticism from various quarters, with different parties facing blame for the incident. Non-technical observers are pointing fingers at Microsoft, while technical experts are directing their criticism towards CrowdStrike and Microsoft's handling of driver validation.

CrowdStrike is under scrutiny for not ensuring more rigorous input validation in their driver, which should have prevented crashes from invalid data. Additionally, their Content Validator has been criticized for failing to catch the problematic file during testing.
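
To make the input-validation point concrete, here is a minimal Python sketch of an interpreter that checks a content record before acting on it. Everything in it is hypothetical: the Rule type, the record layout, and the function names are invented for illustration and do not reflect the real Channel File format or CrowdStrike's driver code. In Python a bad index raises an exception; in kernel code the equivalent out-of-range access can take down the whole machine, which is why the check has to happen first.

```python
from dataclasses import dataclass
from typing import List


class InvalidContentError(ValueError):
    """Raised when a content record fails validation."""


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: compare one field of a record against an expected value."""
    field_index: int
    expected: str


def validate_record(fields: List[str], rules: List[Rule], expected_fields: int) -> None:
    """Reject malformed content before any rule reads from the record."""
    if len(fields) != expected_fields:
        raise InvalidContentError(
            f"expected {expected_fields} fields, got {len(fields)}"
        )
    for rule in rules:
        if not 0 <= rule.field_index < len(fields):
            raise InvalidContentError(
                f"rule references field {rule.field_index}, "
                f"but only {len(fields)} fields exist"
            )


def evaluate(fields: List[str], rules: List[Rule], expected_fields: int) -> List[bool]:
    """Evaluate each rule against the record; skip invalid content instead of failing."""
    try:
        validate_record(fields, rules, expected_fields)
    except InvalidContentError:
        return []  # fail safe: ignore the bad record and keep running
    return [fields[rule.field_index] == rule.expected for rule in rules]
```

The design choice worth noting is the fallback in evaluate: a record that fails validation is ignored, so a single bad content file degrades detection coverage rather than halting the system that interprets it.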

On the other hand, Microsoft is facing criticism for allowing such drivers to operate within their kernel. The company’s current API offerings may not fully cater to the unique needs of security vendors, leading to suggestions that Microsoft should provide more tailored solutions and consider moving such drivers out of the kernel.

Microsoft is currently exploring potential improvements in collaboration with major security vendors to address these concerns and enhance the security and stability of their systems.

Lessons Learned from the Global IT Outage

The recent global IT outage has provided several crucial lessons for both IT professionals and organizations. Here’s what we can take away from this incident:

1. Even Major Security Vendors Can Make Mistakes

The global IT outage serves as a stark reminder that even leading security vendors are not infallible. Despite their advanced tools and rigorous testing processes, errors can still occur. This incident highlights the importance of not assuming that any single security solution is flawless. It’s essential to continuously evaluate and update your security measures, even when using products from top-tier vendors.

2. The Importance of Backup Services

One of the key takeaways from this outage is to always have a strong contingency plan. The widespread disruption caused by the outage underscores the necessity of having reliable backup and recovery solutions in place. These services can help mitigate the impact of such incidents, ensuring that your organization can quickly recover and maintain business continuity.

3. Be Cautious with Your Security Partnerships

Trusting your system’s security to any single vendor requires careful consideration. While it’s important to rely on experts, it’s equally crucial to be aware of their potential limitations and vulnerabilities. Regularly review and assess your security partners to ensure they meet your organization’s needs and maintain the highest standards of protection.

4. Don’t Allow Security Vendors to Have Complete Control

Finally, it’s important to maintain a balanced approach to security management. Relying solely on a single security vendor can create risks, especially if that vendor experiences issues or errors. It’s wise to implement a multi-layered security strategy that includes diverse tools and practices. This approach can provide better protection and reduce the risk associated with relying on a single solution.

Ensuring Resilience: Ebryx’s Role in IT Disruption Management

At Ebryx, we understand the critical impact that IT disruptions, such as the recent global IT outage, can have on organizations. Our comprehensive suite of cybersecurity solutions is designed to provide robust protection and quick recovery in the face of such challenges. With our expertise in advanced threat detection, endpoint management and data recovery, we ensure that your systems remain secure and resilient. In the event of an outage or security incident, our dedicated team is ready to assist with immediate response and remediation, helping you minimize downtime and restore normal operations swiftly.