Cloudflare explains cause of widespread internet outage

Cloudflare, the internet infrastructure provider, has attributed Tuesday's outage to a misconfigured bot-traffic query in its systems.

The outage disrupted numerous widely used platforms, including X, ChatGPT, Spotify, Canva, Uber, DownDetector, and multiple cryptocurrency services.

According to CTO Dane Knecht, a hidden bug in Cloudflare’s bot mitigation system was triggered after engineers applied a routine configuration change. The failure crashed the company’s core proxy infrastructure, preventing traffic from reaching affected websites.

“This was entirely an internal failure, not a cyberattack, as some initially feared. We failed our customers today,” Knecht said.

The outage was resolved in stages, with Cloudflare deploying a fix around 5:42 pm, and its dashboard services coming back online shortly beforehand.

While the disruption lasted only a few hours, it underscored the critical role Cloudflare plays in maintaining internet access and the fragility of cloud-dependent systems.

Further details revealed that the issue stemmed from a change in the permissions system of a ClickHouse database, not generative AI technology, DNS, or a malicious DDoS attack as initially suspected.

Cloudflare CEO Matthew Prince explained the technical details of the failure in a blog post published on Tuesday. The machine learning model behind Cloudflare’s Bot Management system generates bot scores to help identify automated traffic, and it relies on a frequently updated configuration file.

“A change in our underlying ClickHouse query behaviour that generates this file caused it to have a large number of duplicate ‘feature’ rows,” Prince said.

As the configuration file rapidly grew, it exceeded preset memory limits, crashing the core proxy system responsible for processing traffic related to the bots module.
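
The failure mode described above can be illustrated with a minimal Python sketch. It is not Cloudflare's code: the names `FeatureRow`, `MAX_FEATURES` and `load_features`, and the limit value, are hypothetical. It shows one defensive pattern for a generated configuration file, which is to drop duplicate rows, check the result against the preset limit, and keep the last known-good version rather than crash when the new file is too large.

```python
from dataclasses import dataclass

# Hypothetical preset limit on how many features the consumer can hold.
MAX_FEATURES = 200


@dataclass(frozen=True)
class FeatureRow:
    """One 'feature' row in the generated file (illustrative shape only)."""
    name: str
    kind: str


def dedupe(rows):
    """Drop repeated rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def load_features(new_rows, last_good, limit=MAX_FEATURES):
    """Return the feature set to use.

    Duplicates are removed first. If the file is still over the limit,
    the previous known-good set is kept instead of raising an error.
    """
    unique = dedupe(new_rows)
    if len(unique) > limit:
        return last_good
    return unique
```

In this sketch a file made up entirely of duplicated rows shrinks back to its original size, and a file that is still too large is rejected without taking down the process that reads it.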

This caused companies using Cloudflare’s bot-scoring rules to experience false positives, blocking legitimate traffic, while customers not using bot scores in their rules remained online.
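
Why only some customers were affected can be shown with a short, hypothetical Python sketch. The threshold `BLOCK_BELOW`, the function `decide`, and the assumption that a failed scoring step falls back to the lowest score are all illustrative and are not taken from Cloudflare's systems.

```python
from typing import Optional

# Hypothetical rule: block requests whose bot score is below this value.
BLOCK_BELOW = 30


def decide(bot_score: Optional[int], uses_bot_rule: bool) -> str:
    """Return "allow" or "block" for a single request."""
    if not uses_bot_rule:
        # Customers with no bot-score rule never consult the score.
        return "allow"
    # Assumption: when scoring fails, the score falls back to 0.
    score = 0 if bot_score is None else bot_score
    return "block" if score < BLOCK_BELOW else "allow"
```

Under these assumptions, a missing score blocks a legitimate visitor on a site that uses the rule, while a site without the rule is unaffected.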

Prince emphasised the challenge of safely managing automated bot detection at scale: “The system is designed to quickly adapt and update features to accurately detect bots, but in this case, a routine change exposed an edge case that cascaded through our network. We’re committed to learning from this incident and preventing it from happening again.”

Tuesday’s outage marked the third major cloud disruption in less than a month.

Amazon Web Services (AWS) experienced a bug that prevented automatic repairs, affecting Snapchat, Pinterest, Signal, Zoom, and Slack.

Microsoft Azure also faced disruptions impacting Microsoft 365 and Xbox services.

These incidents collectively highlight the cascading risks posed by centralised cloud infrastructure.

While the outage was temporary, its impact on popular services served as a stark reminder of the vulnerabilities inherent in cloud-dependent internet infrastructure.
