Cloudflare explains cause of widespread internet outage
The outage was resolved in stages, with Cloudflare deploying a fix around 5:42 pm, and its dashboard services coming back online shortly beforehand.
Cloudflare, the internet infrastructure provider, has attributed Tuesday's outage to a misconfigured bot-traffic query in its systems.
The outage disrupted numerous widely used platforms, including X, ChatGPT, Spotify, Canva, Uber, DownDetector, and multiple cryptocurrency services.
According to CTO Dane Knecht, a latent bug in Cloudflare’s bot mitigation system was triggered after engineers applied a routine configuration change. The failure caused the company’s core proxy infrastructure to crash, preventing traffic from reaching affected websites.
“This was entirely an internal failure, not a cyberattack, as some initially feared. We failed our customers today,” Knecht said.
While the disruption lasted only a few hours, it underscored the critical role Cloudflare plays in maintaining internet access and the fragility of cloud-dependent systems.
Further details revealed that the issue stemmed from a change in the permissions system of a ClickHouse database, not generative AI technology, DNS, or a malicious DDoS attack as initially suspected.
Cloudflare CEO Matthew Prince explained the technical details of the failure in a blog post published on Tuesday. The machine learning model behind Cloudflare’s Bot Management system generates bot scores to help identify automated traffic, and it relies on a frequently updated configuration file.
“A change in our underlying ClickHouse query behaviour that generates this file caused it to have a large number of duplicate ‘feature’ rows,” Prince said.
As the configuration file rapidly grew, it exceeded preset memory limits, crashing the core proxy system responsible for processing traffic related to the bots module.
This caused companies using Cloudflare’s bot-scoring rules to experience false positives, blocking legitimate traffic, while customers not using bot scores in their rules remained online.
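The failure mode Prince describes, duplicate rows pushing a generated file past a fixed limit, can be illustrated with a short Python sketch. This is not Cloudflare's code: the names `FeatureRow`, `dedupe_features` and `load_feature_config`, and the limit of 200, are hypothetical and chosen only for illustration. It shows one defensive pattern, in which duplicates are removed and an oversized file is rejected in favour of the last known-good version rather than crashing the consumer.

```python
from dataclasses import dataclass

# Hypothetical cap on the number of features; an assumption for illustration.
MAX_FEATURES = 200


@dataclass(frozen=True)
class FeatureRow:
    """One row of a generated feature file (hypothetical shape)."""
    name: str
    dtype: str


def dedupe_features(rows):
    """Drop duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def load_feature_config(rows, previous, limit=MAX_FEATURES):
    """Return a validated feature list, or fall back to the last good one."""
    unique = dedupe_features(rows)
    if len(unique) > limit:
        # Reject the oversized file instead of letting the caller crash.
        return previous
    return unique
```

In this sketch a file full of repeated rows shrinks back to its unique entries, and a file that is still too large after deduplication is ignored in favour of the previous configuration.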
Prince emphasised the challenge of safely managing automated bot detection at scale: “The system is designed to quickly adapt and update features to accurately detect bots, but in this case, a routine change exposed an edge case that cascaded through our network. We’re committed to learning from this incident and preventing it from happening again.”
Tuesday’s outage marked the third major cloud disruption in less than a month.
Amazon Web Services (AWS) experienced a bug that prevented automatic repairs, affecting Snapchat, Pinterest, Signal, Zoom, and Slack.
Microsoft Azure also faced disruptions impacting Microsoft 365 and Xbox services.
These incidents collectively highlight the cascading risks posed by centralised cloud infrastructure.
While the outage was temporary, its impact on popular services served as a stark reminder of the vulnerabilities inherent in cloud-dependent internet infrastructure.