Recovery Steps for EC2 Instances Affected by the CrowdStrike Global Outage

A faulty CrowdStrike update recently caused a global outage that also disrupted Microsoft services. If you run workloads on AWS, this recovery procedure will help you restore EC2 instances affected by the outage.

The outage has specifically affected Windows EC2 instances, and we know it’s critical to restore your systems quickly. Fortunately, there are two main temporary recovery solutions available to get your instances back online.

Scope

This post covers recovery steps for EC2 instances affected by the CrowdStrike global outage. It is for AWS users and IT professionals who want to restore their services quickly.

Purpose

The purpose of this tutorial is to help developers and system administrators:

Understand the CrowdStrike outage and how to recover their EC2 instances.
Reduce downtime and improve their response to future outages.

By the end of this guide, you’ll be able to recover your EC2 instances and mitigate the impact of the CrowdStrike and Microsoft global outage.

Two Potential Solutions to Fix ASAP

Follow the steps in one of these solutions to solve your issue:

Solution 1: Relaunch from a Snapshot or Image

Relaunch from Snapshot or Image:
- If you have snapshots or AMIs (Amazon Machine Images) taken before the issue began at 9:30 PM PDT, you can relaunch your EC2 instance using these. This method is straightforward if you have recent backups.
Verify Update Status:
- Confirm that the problematic update causing the CrowdStrike agent issue is no longer being automatically applied to your instances. This will prevent the issue from reoccurring.
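
To pick a backup for Solution 1, you need a snapshot taken before the cutoff time. The following is a minimal sketch of that filter, not an official AWS procedure. The function name snapshots_before is hypothetical, and the date of the cutoff (July 18, 2024) is an assumption added for illustration, since the text above only gives the time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the guide only states "9:30 PM PDT"; the date below is filled
# in for illustration. PDT is UTC-7.
PDT = timezone(timedelta(hours=-7))
CUTOFF = datetime(2024, 7, 18, 21, 30, tzinfo=PDT)


def snapshots_before(snapshots, cutoff=CUTOFF):
    """Return the snapshots that started before the cutoff, newest first.

    `snapshots` is a list of dicts shaped like the "Snapshots" entries
    returned by the EC2 DescribeSnapshots call, with a "SnapshotId" string
    and a timezone-aware "StartTime" datetime.
    """
    usable = [s for s in snapshots if s["StartTime"] < cutoff]
    return sorted(usable, key=lambda s: s["StartTime"], reverse=True)
```

With boto3 installed, the input list could come from boto3.client("ec2").describe_snapshots(OwnerIds=["self"])["Snapshots"]. The first entry of the result is then the most recent snapshot that predates the issue.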

Solution 2: Manual Recovery Steps

Create a Snapshot:
- Take a snapshot of the EBS (Elastic Block Store) root volume of the affected EC2 instance. This ensures you have a backup of your current state.
Create a New EBS Volume:
- Create a new EBS volume from the snapshot you took in the previous step. Ensure this new volume is in the same Availability Zone as your affected instance.
Launch a New Windows Instance:
- Launch a new Windows instance in the same Availability Zone. Make sure to use a similar version of Windows as your original instance.
Attach the EBS Volume:
- Attach the EBS volume (created from the snapshot) to the new Windows instance as a data volume.
Delete the Problematic File:
- On the new Windows instance, navigate to the directory \windows\system32\drivers\CrowdStrike\ on the attached volume.
- Delete the file matching C-00000291*.sys, which is causing the issue.
Detach the EBS Volume:
- Once the file is deleted, detach the EBS volume from the new Windows instance.
Create a Snapshot of the Clean Volume:
- Create a snapshot of the EBS volume now free from the problematic file.
Replace the Root Volume:
- Replace the root volume of the original EC2 instance with the new snapshot you created.
Start the Original Instance:
- Start your original EC2 instance. It should now be free from the issue caused by the global CrowdStrike outage.
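
The first half of Solution 2 can be scripted. The sketch below is one possible approach under stated assumptions, not the official procedure. Both function names are hypothetical, the ec2 argument is assumed to be a boto3 EC2 client, and the drive letter of the attached volume is an assumption you should check on the rescue instance. The remaining steps (detach, snapshot the clean volume, replace the root volume, start the instance) are left to the console.

```python
from pathlib import Path

# Faulty channel file pattern. Note the hyphen after the "C".
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def attach_copy_of_root_volume(ec2, root_volume_id, rescue_instance_id, zone,
                               device="xvdf"):
    """Run from an admin machine: snapshot the root volume, create a copy,
    and attach the copy to the rescue instance as a data volume.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    snapshot_id = ec2.create_snapshot(
        VolumeId=root_volume_id, Description="Backup before CrowdStrike fix"
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The new volume must be in the same Availability Zone as the instance.
    volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=zone
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=rescue_instance_id, Device=device
    )
    return volume_id


def delete_channel_files(volume_root):
    """Run on the rescue instance: delete the faulty files on the attached
    volume and return their names.

    `volume_root` is where the data volume is mounted, e.g. "D:\\"
    (the drive letter is an assumption).
    """
    driver_dir = (
        Path(volume_root) / "windows" / "system32" / "drivers" / "CrowdStrike"
    )
    deleted = []
    # Materialize the matches first so files are not removed mid-scan.
    for path in list(driver_dir.glob(CHANNEL_FILE_PATTERN)):
        path.unlink()
        deleted.append(path.name)
    return sorted(deleted)
```

Passing the client in as a parameter keeps the sketch easy to dry-run against a stub before pointing it at a real account. Only files matching the pattern are removed; other CrowdStrike files in the directory are left untouched.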

How to avoid this issue again?

Although there are many ways to address this issue, in my view the best approach is to avoid installing security agents such as CrowdStrike Falcon directly on your servers, as they can cause crashes, especially on Windows machines. Instead, consider dockerizing your applications and hosting them on cloud services like AWS ECS Fargate.

Many users have already adopted cloud services. AWS provides a wide range of built-in security features and tools, such as AWS Identity and Access Management (IAM), AWS Shield, AWS WAF (Web Application Firewall), AWS Key Management Service (KMS), and more. These services allow you to secure your applications and data effectively without necessarily relying on third-party cybersecurity platforms that may disrupt your business. With AWS, you can also host or back up your system in multiple regions for disaster recovery.
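
As one illustration of a multi-region backup, an EBS snapshot can be copied into a second region. This is a minimal sketch under assumptions: the function name copy_snapshot_to_region is hypothetical, and dest_ec2 is assumed to be a boto3 EC2 client created for the destination region.

```python
def copy_snapshot_to_region(dest_ec2, source_region, snapshot_id):
    """Copy an EBS snapshot into the region that `dest_ec2` is bound to.

    `dest_ec2` is assumed to be a boto3 EC2 client for the destination
    region, e.g. boto3.client("ec2", region_name="us-west-2").
    Returns the ID of the new snapshot in the destination region.
    """
    response = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return response["SnapshotId"]
```

The copy request is sent to the destination region, which is why the client for that region is the one passed in. The region names here are placeholders.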

Elon Musk's post on X: https://twitter.com/elonmusk/status/1814253189333479522

As Elon Musk tweeted on X, “The antivirus was the virus.” Do you agree? This post gained 2.6k ❤️, which suggests that many people agreed with him. I also agree with Elon. The engineers behind these security platforms are experts in their field and know how their systems operate because they have studied them extensively. However, they can still make significant mistakes that go unnoticed.

By isolating security agents in separate containers, you can limit the impact of a faulty agent update and keep your systems running more smoothly.

Additional Resources

For more information and updates, refer to the AWS Health Dashboard (AWS Health Status).

By following these steps, you should be able to recover your EC2 instances and mitigate the impact of the CrowdStrike and Microsoft global outage. If you need further assistance getting your servers back online, please contact us.

Stay safe and keep your systems secure!