Did you see the blue screen of death on July 19, 2024? If so, you were among the users of the roughly 8.5 million Windows devices affected by CrowdStrike’s faulty software update.
Over the years, software bugs and glitches have been responsible for banking failures, power shortages, and medical device malfunctions, among other incidents. The CrowdStrike-Microsoft outage of July 19, 2024, now joins that list, having cost firms billions as its effects rippled through cloud-dependent systems.
From rail transportation systems to aviation, small businesses, banking, emergency operations and hospitals, the outage resulted in substantial financial losses. In fact, the downtime not only led to direct revenue loss but also incurred costs related to system recovery and manual intervention.
Long-term consequences on customer trust and operational efficiency are expected following the massive global outage.
Impact of Digital Fiasco
On July 19, CrowdStrike released a Rapid Response Content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. Such updates are a regular part of the dynamic protection mechanisms of the Falcon platform, but a bug in this one resulted in Windows system crashes.
Mac and Linux hosts were not affected, and neither, fortunately, was Telecom Review.
CrowdStrike’s security code runs at the kernel level in Microsoft Windows, the core of the operating system. This deep access gives CrowdStrike’s Falcon security software better threat detection across a computer system.
However, as the incident showed, code running at this level can crash Windows computers when it encounters a serious fault. All the devices were reportedly “infected” by the faulty update within a 78-minute window before CrowdStrike reverted the update.
When the outage hit, a lot of establishments went ‘blue’ worldwide. The incident underlined the critical interdependence between cybersecurity solutions like CrowdStrike and operating systems like Microsoft Windows.
The CrowdStrike-Microsoft outage is expected to cost Fortune 500 companies a total of USD 5.4 billion in direct financial losses, averaging USD 44 million per company, according to data from Parametrix. Approximately 70% of Fortune 100 companies were impacted, and industries such as airlines, banking, healthcare, and media experienced a tough blow.
Moreover, a research report estimated that the faulty CrowdStrike Falcon Sensor update and subsequent outage would have a loss ratio impact of roughly 3% to 10% on current global cyber insurance premiums of about USD 15 billion.
A loss on this scale could make the outage the single largest insured loss event in the two-decade history of the cyber insurance industry.
While less than 1% of all Windows devices were reportedly impacted, the broad economic and societal effects reflect how widely CrowdStrike is used by enterprises that run critical services.
In the aftermath, Microsoft is collaborating with other cloud providers and stakeholders, including Google Cloud Platform (GCP) and Amazon Web Services (AWS), to share awareness of the state of impact across the industry.
“This incident demonstrates the interconnected nature of our broad ecosystem—global cloud providers, software platforms, security vendors and other software vendors, and customers. It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist,” noted David Weston, Vice President, Enterprise and OS Security, Microsoft.
Lesson Learned… The Hard Way
Regulators and researchers have identified big tech's cloud service consolidation as the root issue. Notably, Microsoft, a key rival to Amazon, saw its Azure platform's market share reach 25% in Q1 2024, according to Statista.
Along with CrowdStrike, Microsoft also dominates the endpoint security market, which covers cybersecurity for devices like desktops, laptops, and mobile devices. This consolidation contributed to the cascading failures seen on July 19.
The outage highlighted how fragile our global tech infrastructure is and exposed weaknesses in the testing processes of cybersecurity firms. It underscores the urgent need for rigorous testing and accountability in software updates to prevent future failures.
Brands in critical sectors, such as airlines and healthcare, should prioritize strengthening resilience, optimizing vendor management, and ensuring effective contingency plans to protect themselves against future disruptions.
Without a doubt, the incident served as a wake-up call for businesses worldwide, prompting them to re-evaluate their reliance on a single operating system and to diversify their IT infrastructure as a matter of urgency.
In the cybersecurity realm, the outage is expected to have long-lasting impacts on policies and practices, highlighting the vulnerability of even the most secure systems to unexpected issues, and underscoring the necessity for robust incident response plans and fail-safe mechanisms.
There is also a rising concern that the turbulence might prompt some firms to disable their EDR tools, a risky decision as hackers are currently exploiting the situation and targeting affected companies.
Making matters worse is the resulting loss of faith in cyber products. This hesitancy is likely to affect the entire cyber community for months to come. Although cybersecurity is usually thought of as protecting a system from outside threats, the outage showed that an internal flaw can have an equally disastrous outcome. CTOs and CIOs, who are already trying to convince boards to invest more in security tooling, will now feel more pressure.
It's important to recognize that no company is ever completely secure; even the largest and most established organizations must remain vigilant, continually updating and fortifying their systems.
The Way Forward
To prevent outages of this magnitude from occurring again, the software development processes that IT professionals use today are, and should continue to be, proactive and built with zero-trust principles in mind.
Ironically, in 2020, Microsoft and Altran collaborated to develop the Code Defect AI tool, which predicts the likelihood of bugs in developers' source code early in the software development process.
In 2021, Telecom Review spoke to CrowdStrike about the introduction of the Falcon Zero Trust Identity Security solution in the Middle East and ascertained that an identity-centric, zero-trust architecture is key to mitigating cyberthreats targeting the telecommunications sector in the region.
In the same context, Microsoft emphasized the importance of zero-trust architecture to Telecom Review, citing it as the emerging “global standard for enterprises.” The defining principle of a zero-trust strategy is “never trust, always verify”. Hence, every time a user, device or application tries to establish a connection, that attempt should be strictly authenticated and authorized within the system.
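The “never trust, always verify” principle can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the token table, the policy table, and the function names are hypothetical and do not represent Microsoft's or CrowdStrike's products. The point is only that each request is authenticated and authorized on its own, with nothing allowed by default.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory stores. A real deployment would query an identity
# provider and a policy engine instead.
VALID_TOKENS = {"token-alice": "alice", "token-sensor-7": "sensor-7"}
POLICY = {
    ("alice", "billing-db"): {"read"},
    ("sensor-7", "telemetry-api"): {"write"},
}


@dataclass(frozen=True)
class Request:
    token: str     # credential presented with this specific request
    resource: str  # what the caller wants to reach
    action: str    # what the caller wants to do


def authenticate(token: str) -> Optional[str]:
    """Return the identity behind a token, or None if the token is unknown."""
    return VALID_TOKENS.get(token)


def authorize(identity: str, resource: str, action: str) -> bool:
    """Allow only actions that are explicitly granted; deny everything else."""
    return action in POLICY.get((identity, resource), set())


def handle(request: Request) -> bool:
    """Verify every request from scratch; no earlier success is remembered."""
    identity = authenticate(request.token)
    if identity is None:
        return False
    return authorize(identity, request.resource, request.action)
```

In this sketch, a known user asking for an action that was never granted is refused just as an unknown caller is, which is the default-deny behavior the principle implies.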
CrowdStrike has since shared some measures to prevent the scenario from happening again. These include: software resiliency and testing, Rapid Response Content deployment, and third-party validation.
To improve Rapid Response Content testing, various testing types can be utilized, including local developer testing, content update and rollback testing, stress testing, fuzzing, and fault injection.
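As a rough illustration of what fuzzing and fault injection involve, the sketch below feeds randomly corrupted inputs to a small parser and treats any failure other than a clean rejection as a defect. The update format (`<count>|<field>,<field>,...`) and the function names are invented for this example and are not CrowdStrike's actual content format or test tooling.

```python
import random
from typing import Dict, List


def parse_update(blob: bytes) -> List[str]:
    """Parse a hypothetical update of the form b'<count>|<f1>,<f2>,...'.

    Raises ValueError on any malformed input so callers can reject it safely.
    """
    try:
        text = blob.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("not valid UTF-8") from exc
    header, sep, body = text.partition("|")
    if not sep or not (header.isascii() and header.isdigit()):
        raise ValueError("missing or non-numeric field count")
    fields = body.split(",") if body else []
    if len(fields) != int(header):
        raise ValueError("declared field count does not match payload")
    return fields


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Fault injection: overwrite, delete, or insert one random byte."""
    data = bytearray(seed)
    choice = rng.randrange(3)
    pos = rng.randrange(len(data))
    if choice == 0:
        data[pos] = rng.randrange(256)
    elif choice == 1:
        del data[pos]
    else:
        data.insert(pos, rng.randrange(256))
    return bytes(data)


def fuzz(iterations: int = 2000, seed: int = 0) -> Dict[str, int]:
    """Feed mutated inputs to the parser; anything but ValueError is a bug."""
    rng = random.Random(seed)
    valid = b"3|alpha,beta,gamma"
    stats = {"accepted": 0, "rejected": 0}
    for _ in range(iterations):
        blob = mutate(valid, rng)
        try:
            parse_update(blob)
            stats["accepted"] += 1
        except ValueError:
            stats["rejected"] += 1
    return stats
```

Any exception other than `ValueError` would propagate out of `fuzz` and fail the run, which is the signal a harness of this kind is meant to produce before an update ships.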
More importantly, it is imperative to implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
This aligns with security experts' recommendations that "a staged rollout procedure" when publishing Rapid Response Content updates could have helped prevent the issue.
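A staged rollout with a canary wave can be sketched as follows. This is a simplified illustration under stated assumptions, not CrowdStrike's deployment system: the stage fractions, the callback names (`apply_update`, `rollback`, `is_healthy`), and the rule of halting on the first unhealthy host are all hypothetical choices.

```python
import math
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> Tuple[bool, List[str]]:
    """Deploy in widening waves; halt and roll back on the first unhealthy wave.

    Returns (completed, hosts_still_running_the_update).
    """
    updated: List[str] = []
    for fraction in stages:
        # Always target at least one host so the first wave acts as the canary.
        target = max(1, math.ceil(fraction * len(hosts)))
        wave = list(hosts[len(updated):target])
        for host in wave:
            apply_update(host)
            updated.append(host)
        # Check only the hosts updated in this wave before widening further.
        if not all(is_healthy(host) for host in wave):
            for host in reversed(updated):
                rollback(host)
            return False, []
    return True, updated
```

With 100 hosts and the default stages, a defective update would reach only the single canary host before the rollout stops and is rolled back, rather than the whole fleet at once.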
For example, when addressing the fault diagnosis phase, Huawei’s Telecom Foundation Model employs sophisticated chain-of-thought intelligent analysis capabilities, swiftly pinpointing the root causes of issues and significantly reducing the number of alarms.
In the ICT landscape, testing and assurance have indeed become prevalent, particularly among telcos that are building private networks, because they want to make sure that these networks are highly reliable and provide low latency.
Reailize has taken the lead in ensuring efficient and reliable network operations. The company provides CSPs with cutting-edge tools, services and expertise to disrupt traditional means of network management, including AI/ML-powered autonomous network monitoring with anomaly detection, root cause determination and streamlined remediation.
As part of the evolution to an agile network architecture, Huawei and stc Group partnered to implement the first CDCT (Continuous Delivery Continuous Testing) validation under the Telco Cloud Partnership program in KSA in 2023.
The CrowdStrike-Microsoft outage magnified the importance of having an up-to-date business continuity plan that emphasizes communication procedures, which can get complicated if systems are down. It also pushed some leaders to assess whether they have enough contingencies in place to keep operations running should a similar situation occur again.
The Nokia Cybersecurity Dome was highlighted as one of the relevant breakthrough innovations addressing the aforementioned issue during a recent Telecom Review webinar. It is an overarching solution for threat identification, detection, and verification that leverages AI.
MYCOM OSI also works collaboratively with customers to understand their business and economic drivers, processes, people, and ambitions. The company’s AIOps solutions offer proactivity and automated analysis, in line with zero-touch assurance and dark NOC programs.
Organizations are also reassessing their emergency staffing, addressing the need for outsourced help, and rethinking the value of storing key recovery data in multiple locations.
From a consumer perspective, a netizen expressed that the global outage is alarming, comparing it to a war situation. “The West goes down (as it did) because we are all using a Microsoft system. Russia and China stay up and ready to go because they don't use a Microsoft system.”
Food for Thought
The shift to cloud computing has led many companies to rely on big tech giants for their server needs instead of building their own infrastructure. Fortinet's 2023 State of Zero Trust report shows a growing number of organizations adopting zero-trust strategies to secure their cloud environments. However, challenges remain.
Industry experts have pointed out that the CrowdStrike-Microsoft issue highlighted the risks of dependency on just a few major players. True resilience means ensuring operations can continue, even if failures occur repeatedly.
In reality, this resilience can be achieved by installing multiple process controllers in a distributed mesh, similar to a controller data center. Control methods are no longer limited to a single physical controller. This is merely the first step toward autonomous operations, which will predictably appear in a variety of settings and across a wide range of activities and disciplines.