Last week, on 18-19 July, widespread outages struck Microsoft's Windows operating system (OS) across the globe, causing severe disruptions in multiple sectors. Aviation, railways, media broadcasts, banking, stock exchanges, and hospitals were among those most severely affected. The issue manifested as the Blue Screen of Death (BSOD), a critical error screen displaying white text on a blue background. This error occurs when Windows encounters a fatal issue it cannot safely recover from, forcing the system to halt to prevent hardware damage or data corruption. Affected computers shut down and restarted repeatedly, hampering operations.
The disruption has extended beyond 19 July: no new systems are being affected, but many of the systems already hit remain out of action. In India, too, the aviation sector was impacted, resulting in numerous flight cancellations at various airports. Many businesses also experienced significant operational difficulties during this period. Notably, the disruptions impacted only Windows-based systems, not Linux or Apple machines.
The findings so far indicate that the widespread outages were caused by malfunctioning updates to the Falcon Sensor, a product of cybersecurity firm CrowdStrike. This sensor is integrated into various Microsoft systems, including Azure, Microsoft 365, and Windows OS, as part of a business partnership to enhance cybersecurity coverage. CrowdStrike's role involves providing advanced endpoint security for Microsoft environments and assisting organisations in meeting compliance requirements within Microsoft ecosystems.
The Falcon Sensor is designed to be lightweight, minimising its impact on system performance. It collects telemetry data on endpoint activities and events, such as process executions, network connections, and file system changes. This data undergoes real-time analysis to detect potential threats or suspicious activities. The sensor employs behavioural analysis techniques for threat identification, moving beyond traditional signature-based detection methods.
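The difference between signature-based and behavioural detection can be illustrated with a minimal Python sketch. It is not CrowdStrike's implementation: the `Event` fields, the hash list, the threshold, and the sample data are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One telemetry record (hypothetical schema, for illustration only)."""
    kind: str     # "process", "network", or "file"
    process: str  # name of the process that produced the event
    detail: str   # file hash, remote address, or file path


# Hypothetical list of known-bad file hashes.
KNOWN_BAD_HASHES = {"deadbeef"}


def signature_match(event):
    """Signature-based check: flag a process whose hash is already known."""
    return event.kind == "process" and event.detail in KNOWN_BAD_HASHES


def behavioural_match(events, process, file_write_threshold=3):
    """Behavioural check: flag a process that changes several files
    and also opens a network connection, whatever its hash is."""
    writes = sum(1 for e in events if e.process == process and e.kind == "file")
    connects = any(e.process == process and e.kind == "network" for e in events)
    return writes >= file_write_threshold and connects


# Made-up telemetry: an unknown binary that touches files and calls out.
sample_events = [
    Event("process", "tool.exe", "abc123"),
    Event("file", "tool.exe", "C:/docs/a.txt"),
    Event("file", "tool.exe", "C:/docs/b.txt"),
    Event("file", "tool.exe", "C:/docs/c.txt"),
    Event("network", "tool.exe", "203.0.113.5:443"),
]

flagged_by_signature = any(signature_match(e) for e in sample_events)
flagged_by_behaviour = behavioural_match(sample_events, "tool.exe")
```

In this made-up data the hash is not on the known-bad list, so the signature check finds nothing, while the behavioural rule flags the process because of what it does rather than what it is.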
Microsoft and CrowdStrike have assured the public that these outages are not the result of cyber attacks. This is a failure by two top companies that swear by their security practices but were found wanting. The unprecedented scale and impact of this disruption, the largest cyber outage to date, has raised serious questions about the protocols and practices employed for software and patch updates on large, critical systems. Concerns include whether untested updates were pushed into the system at the kernel level, and whether and how Microsoft was excluded from the trust process.
Questions arise about CrowdStrike's liability to its customers and Microsoft's responsibility in vetting qualified partners for such critical systems. The adequacy of self-regulatory service level agreements (SLAs), which both Microsoft and CrowdStrike typically rely on, is now under scrutiny. Were disaster recovery (DR) practices in place and, if so, why did they not function effectively?
This incident has heightened worries about the growing power of big tech companies and their apparent disregard for customer security. It has alarmed governments, businesses, and the tech community, renewing calls for improved security and stability measures in digital technology systems. The event underscores the need for greater cooperation in protecting critical infrastructure, which is increasingly dependent on digital systems. In our interconnected world, this incident is a stark reminder of the vulnerabilities present and the need for robust safeguards.
The immediate tactical focus is on technical mitigation: finding the points of vulnerability and pairing them with the right patching strategies, work that both companies have already begun. However, this is also the time to look at the issue from other angles: the liabilities of the two companies, the software they provide, and how to keep networks functioning during such disasters.
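One widely used patching strategy is a staged rollout, in which an update reaches a small group of machines first and proceeds only if they stay healthy. The Python sketch below is an illustration under stated assumptions, not a description of how either company deploys updates: the ring names, the crash-rate figures, and the 1% threshold are hypothetical.

```python
def staged_rollout(rings, crash_rate_for, max_crash_rate=0.01):
    """Push an update ring by ring and stop at the first unhealthy ring.

    rings          -- ordered ring names, smallest audience first
    crash_rate_for -- callable returning the observed crash rate (0.0 to 1.0)
                      for a ring after the update has been applied to it
    Returns (healthy_rings, halted_at); halted_at is None if every ring passed.
    """
    healthy = []
    for ring in rings:
        rate = crash_rate_for(ring)
        if rate > max_crash_rate:
            # Stop here: later, larger rings never receive the update.
            return healthy, ring
        healthy.append(ring)
    return healthy, None


# Hypothetical observations: the update crashes 20% of canary machines.
observed = {"internal": 0.0, "canary": 0.2, "broad": 0.0}
rollout_result = staged_rollout(["internal", "canary", "broad"], observed.get)
```

With these made-up numbers the rollout halts at the canary ring, so the broad population is never exposed to the faulty update.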
Software companies bear a fundamental responsibility to ensure the reliability and stability of their systems and should be held accountable for outages for several compelling reasons. Such disruptions erode customer trust and damage reputations, and taking responsibility for mishaps demonstrates a commitment to quality and reliability. System failures often result in significant financial losses for clients and end-users, and companies should shoulder some of this cost to incentivise better practices. Accountability encourages investment in robust architecture, thorough testing, and effective disaster recovery plans. These companies should also maintain appropriate insurance coverage and establish clear compensation policies for affected customers.
While SLAs between companies and their clients or partners play a role in maintaining security standards, they are often insufficient. Companies may prioritise cost-cutting or rapid development over comprehensive security measures, with shortcomings becoming apparent only after a breach occurs and is investigated. Without external oversight, vulnerabilities may go unreported or unaddressed. SLAs can vary widely between companies, leading to gaps in overall network security. While Microsoft and CrowdStrike might have a good partnership model to address cybersecurity, it may still fall short of the prudent requirements desired by external bodies. Moreover, SLAs typically focus on performance metrics rather than comprehensive security practices.
Regulatory oversight is crucial to approach risk management more comprehensively. Regulators can establish and enforce minimum security requirements across critical sectors and implement audit protocols. Mandatory disclosure of breaches and vulnerabilities would significantly improve overall industry security. Many companies are reluctant to report flaws to Computer Emergency Response Teams (CERTs), necessitating strict penalties imposed by regulators. Regulatory penalties for non-compliance can incentivise companies to prioritise security. Government oversight provides an additional layer of protection and accountability in an increasingly complex digital landscape.
Several measures are necessary: regular security audits and penetration testing, robust access controls and encryption, clearly defined best practices with employee training in them, adoption of zero-trust security models, and collaboration between the public and private sectors to share threat intelligence. For organisations these are continuous activities, and many nations have a regulatory or government agency that mandates and assesses regular security audit reports. Cyber security is a highly dynamic activity, and large corporations run round-the-clock managed security services and vulnerability assessments. However, technical solutions alone cannot address these risks.
Many of these risks have escalated over the years for multiple reasons. Digital networks have grown massively across the globe and often perform critical functions that need to be fail-safe. Sophisticated cyber attacks backed by artificial intelligence (AI) tools have become more widespread. Many ransomware groups regularly target corporate and government networks and bring them down. At the same time, nations and rogue elements, often backed by state resources, have made many attempts to exploit vulnerabilities and attack critical infrastructure. A significant aspect of recent geopolitical contests is the use of digital technologies as a force multiplier for disruption and debilitation. Wider global cooperation on protection is therefore needed. Efforts so far have been half-hearted, with nations looking to their own interests rather than the global perspective.
Vulnerabilities and risks exist in digital network infrastructure and have to be constantly guarded against at multiple stages, from deployment to patch management to dealing with internal sabotage or external hacking attempts. These outages are a major wake-up call to treat cyber security and stability more seriously. While Microsoft and CrowdStrike have suffered serious damage to their credibility, the incident must be thoroughly investigated to establish whether the SLAs had shortcomings or were simply not followed.
(Subimal Bhattacharjee is a Visiting Fellow at the Ostrom Workshop, Indiana University Bloomington, USA, and a cybersecurity specialist. This is an opinion piece. The views expressed above are the author’s own. The Quint neither endorses nor is responsible for them.)