On November 18, 2025, a large swath of the web lost connectivity following a catastrophic failure inside Cloudflare. The outage began at 11:20 UTC and extended beyond four hours, effectively partitioning a significant percentage of the digital economy from its user base. Critical services, ranging from major e-commerce platforms (such as Amazon and Shopify) to banking portals and essential AI infrastructure, were rendered inaccessible and returned widespread HTTP 500-series errors. While the external symptoms resembled a hyper-scale Distributed Denial of Service (DDoS) attack, a theory initially favored due to the recent "Aisuru" botnet campaigns, the precipitating event was strictly internal.

To fully comprehend the magnitude of this failure, it is crucial to recognize how the internet is currently architected. The notion of a decentralized web has largely been supplanted by a highly concentrated topology dependent on a select few "Edge Cloud" providers. Cloudflare functions as an intelligent, programmable membrane in which the "proxy" is the decisive control plane for traffic: it is responsible for complex operations such as TLS Termination, Zero Trust Identity Verification, and Threat Mitigation, including real-time Bot Management. When this membrane ruptures, the availability of the origin server is irrelevant, because the ingress path itself is compromised.
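
To make that dependency concrete, here is a minimal, purely illustrative Rust sketch of an edge ingress pipeline; the module names, types, and status codes are assumptions, not Cloudflare's implementation. It shows why a single mandatory module that cannot do its job turns every request into a 5xx, no matter how healthy the origin is.

```rust
// Toy edge-proxy pipeline: every stage must succeed before the request
// is allowed to reach the origin. Names are illustrative only.

struct Request {
    path: String,
}

struct Response {
    status: u16,
    body: String,
}

struct EdgeError {
    module: &'static str,
    reason: String,
}

fn terminate_tls(_req: &Request) -> Result<(), EdgeError> {
    Ok(())
}

fn verify_identity(_req: &Request) -> Result<(), EdgeError> {
    Ok(())
}

fn score_bot(_req: &Request, module_healthy: bool) -> Result<(), EdgeError> {
    if module_healthy {
        Ok(())
    } else {
        // A module that cannot load its configuration produces no score at all.
        Err(EdgeError {
            module: "bot_management",
            reason: "feature file failed to load".into(),
        })
    }
}

fn handle(req: &Request, bot_module_healthy: bool) -> Response {
    let outcome = terminate_tls(req)
        .and_then(|_| verify_identity(req))
        .and_then(|_| score_bot(req, bot_module_healthy));

    match outcome {
        // Only a fully successful pipeline forwards the request to the origin.
        Ok(()) => Response {
            status: 200,
            body: format!("proxied {} to origin", req.path),
        },
        // Any stage failure surfaces as a 5xx, regardless of origin health.
        Err(e) => Response {
            status: 500,
            body: format!("{} failed: {}", e.module, e.reason),
        },
    }
}

fn main() {
    let req = Request { path: "/checkout".into() };
    let ok = handle(&req, true);
    let broken = handle(&req, false);
    println!("{} {}", ok.status, ok.body);         // membrane intact
    println!("{} {}", broken.status, broken.body); // membrane ruptured
}
```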

The root of the technical failure was a database permissions change with a paradoxical effect: broadening access produced a Configuration-Data Mismatch. Cloudflare utilizes ClickHouse, an open-source column-oriented database management system, for operational intelligence and real-time analytics, and the Bot Management system relied on a live SQL query against this analytical cluster to generate its configuration files. The trigger came at 11:05 UTC, when engineering deployed a routine database permission change.

The fatal flaw was not an exotic code defect but an implicitly broad SQL query. The Bot Management system ran a scheduled ETL job every five minutes to fetch feature metadata using the query SELECT name, type FROM system.columns WHERE table = 'http_requests_features'. Crucially, this query lacked both a DISTINCT clause and a database filter (WHERE database = 'default'). Prior to the permission change, the service account had visibility only into the default database. After the 11:05 UTC update, the service gained access, whether explicit or inadvertent, to the underlying r0 schema (the physical shards). The query therefore saw two tables with the exact same schema (default.http_requests_features and r0.http_requests_features) and returned duplicate rows, inflating the feature list far beyond the roughly 60 entries expected.
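
The sketch below illustrates the failure mode and one way to harden the ETL step. The broad query string is the one quoted above; the scoped query, the constants, and the build_feature_list helper are hypothetical and do not represent Cloudflare's actual pipeline code.

```rust
// Sketch of query hardening and output sanity checks for the feature ETL job.
use std::collections::BTreeSet;

// Before: returns one row per matching column in EVERY database the service
// account can see, so a new grant (e.g. on r0) silently changes the result set.
const BROAD_QUERY: &str =
    "SELECT name, type FROM system.columns WHERE table = 'http_requests_features'";

// After: scoped to a single database and de-duplicated at the source.
const SCOPED_QUERY: &str =
    "SELECT DISTINCT name, type FROM system.columns \
     WHERE database = 'default' AND table = 'http_requests_features'";

const EXPECTED_FEATURES: usize = 60; // rough steady-state count
const HARD_LIMIT: usize = 200;       // capacity assumed by the edge proxy

/// Collapse duplicates and refuse to emit a feature list the edge could not
/// safely load; return an error instead of shipping a poisoned artifact.
fn build_feature_list(rows: Vec<(String, String)>) -> Result<Vec<(String, String)>, String> {
    let unique: BTreeSet<(String, String)> = rows.into_iter().collect();
    if unique.len() > HARD_LIMIT {
        return Err(format!(
            "feature list has {} entries, exceeding the edge limit of {}",
            unique.len(),
            HARD_LIMIT
        ));
    }
    if unique.len() > 2 * EXPECTED_FEATURES {
        return Err(format!(
            "feature list has {} entries, far above the expected ~{}",
            unique.len(),
            EXPECTED_FEATURES
        ));
    }
    Ok(unique.into_iter().collect())
}

fn main() {
    println!("before: {}", BROAD_QUERY);
    println!("after:  {}", SCOPED_QUERY);

    // Simulate the post-grant result: every feature appears once per visible
    // database (default and r0), so the raw row count doubles.
    let mut rows = Vec::new();
    for i in 0..60 {
        let row = (format!("feature_{i}"), "Float64".to_string());
        rows.push(row.clone()); // row as seen via default
        rows.push(row);         // duplicate row as seen via r0
    }
    match build_feature_list(rows) {
        Ok(features) => println!("emitting {} features", features.len()),
        Err(e) => println!("refusing to emit feature file: {e}"),
    }
}
```

Deduplicating and bounding the result before serialization means a surprise change in database visibility degrades into a rejected build rather than a poisoned global artifact.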

This inflated query result was serialized into a critical configuration artifact, the "Feature File," which acts as the instruction set for the Bot Management module at the edge. The propagation of this corrupted configuration was the vector for the global failure, and the impact diverged by infrastructure generation. The legacy architecture (based on NGINX/Lua) likely treated the configuration file more loosely, resulting in "misapplied bot scores" (degraded security). The modern FL2 proxies (likely based on Rust and designed for strict safety and performance), however, enforced rigorous memory bounds checking. When the duplicated Feature File violated the hard-coded limit of 200 features (against an expected count of roughly 60), the process suffered a panic: a "fail-closed" mechanism that resulted in a total Denial of Service for legitimate traffic.
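
The divergence between the two proxy generations can be captured in a few lines. The following toy Rust program is not FL2 source; the loader names and the truncation behavior of the lenient path are assumptions. It contrasts a loader that silently degrades with a strict loader whose error, once treated as unrecoverable, becomes a process-wide panic.

```rust
// Toy model of lenient vs. fail-closed configuration loading at the edge.

const FEATURE_LIMIT: usize = 200;

struct BotConfig {
    features: Vec<String>,
}

// Lenient ("legacy-style") loading: silently truncates the oversized list, so
// the proxy keeps serving traffic but scores requests with a bad feature set.
fn load_lenient(raw: &[String]) -> BotConfig {
    BotConfig {
        features: raw.iter().take(FEATURE_LIMIT).cloned().collect(),
    }
}

// Strict loading: refuses to construct a config that violates the bound.
fn load_strict(raw: &[String]) -> Result<BotConfig, String> {
    if raw.len() > FEATURE_LIMIT {
        return Err(format!(
            "{} features exceeds the preallocated limit of {}",
            raw.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(BotConfig {
        features: raw.to_vec(),
    })
}

fn main() {
    // A poisoned feature file: duplicates push the count past the limit.
    let poisoned: Vec<String> = (0..240).map(|i| format!("feature_{}", i % 60)).collect();

    let degraded = load_lenient(&poisoned);
    println!(
        "legacy proxy keeps serving with {} features (misapplied bot scores)",
        degraded.features.len()
    );

    // Treating the error as unrecoverable (here via expect) turns a bad config
    // push into a process-wide panic: fail-closed, total 5xx for users.
    let strict = load_strict(&poisoned).expect("bot management config must be valid");
    println!("never reached: {} features", strict.features.len());
}
```

Run as-is, the strict path panics on the oversized input, which is exactly the fail-closed behavior that translated into global 5xx responses.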

The intermittent nature of the failure ("flapping") complicated the response: the Feature File generator alternated between updated database nodes (producing poisoned files) and not-yet-updated nodes (producing valid files), which obscured the correlation between the 11:05 UTC change and the global outage. Cognitive bias compounded the problem, as the incident response team initially focused on the external attack hypothesis (Aisuru) and delayed the internal investigation. Mitigation ultimately required a high-risk manual intervention to inject a "known-good" configuration file.
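
A small simulation helps explain why the symptoms looked so erratic. In the hypothetical sketch below, the node names and the round-robin selection are invented; only the five-minute cadence and the mix of updated and not-yet-updated nodes come from the narrative. The generator's output, and with it the health of the edge, swings back and forth with no visible external cause.

```rust
// Toy illustration of "flapping": the generator's health depends on which
// database node it happens to query on each five-minute cycle.

#[derive(Clone, Copy)]
struct DbNode {
    name: &'static str,
    permission_change_applied: bool,
}

// A node that received the 11:05 UTC permission change also exposes the r0
// schema, so the metadata query returns duplicate rows and the file is bad.
fn feature_file_is_valid(node: DbNode) -> bool {
    !node.permission_change_applied
}

fn main() {
    let nodes = [
        DbNode { name: "ch-node-1", permission_change_applied: true },
        DbNode { name: "ch-node-2", permission_change_applied: false },
        DbNode { name: "ch-node-3", permission_change_applied: true },
    ];

    // Every five minutes the generator lands on a different node, so the
    // fleet oscillates between healthy and failing with no obvious trigger.
    for cycle in 0..6 {
        let node = nodes[cycle % nodes.len()];
        let state = if feature_file_is_valid(node) {
            "valid file    -> edge healthy"
        } else {
            "poisoned file -> edge panics (widespread 5xx)"
        };
        println!("t+{:>2}m via {}: {}", cycle * 5, node.name, state);
    }
}
```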

The core lesson is to treat dynamic configuration as infrastructure; the incident also validated the systemic risk of centralization. Future recommendations include strict application of the Principle of Least Privilege by constraining schema visibility (r0 must remain invisible to the service account), defensive and strictly scoped SQL queries (WHERE database = 'default'), and a mandatory Configuration "Compiler" or "Validator" stage with limit assertions (assert(feature_count <= 200)) before distribution. The incident also re-opened the debate on Availability versus Security: the "Fail-Closed" design of FL2 prioritized security (preventing memory corruption) but caused global unavailability, suggesting the need for "Graceful Degradation" or "Fail-Open" options for non-essential modules.
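
As a closing illustration, here is what such a pre-distribution validator stage might look like. This is a hedged sketch rather than a prescription: the 200-feature limit and the duplicate failure mode come from the incident described above, while the error types, function names, and checks are assumptions.

```rust
// Sketch of a "configuration compiler" stage: validate a generated feature
// file against the same limits the edge enforces BEFORE it is distributed.

use std::collections::HashSet;

const EDGE_FEATURE_LIMIT: usize = 200;

#[derive(Debug)]
enum ValidationError {
    EmptyFile,
    TooManyFeatures { count: usize, limit: usize },
    DuplicateFeature(String),
}

fn validate_feature_file(features: &[String]) -> Result<(), ValidationError> {
    if features.is_empty() {
        return Err(ValidationError::EmptyFile);
    }
    // Mirror the edge's hard limit so a bad file fails here, not in production.
    if features.len() > EDGE_FEATURE_LIMIT {
        return Err(ValidationError::TooManyFeatures {
            count: features.len(),
            limit: EDGE_FEATURE_LIMIT,
        });
    }
    // Duplicate entries indicate a corrupted generation run.
    let mut seen = HashSet::new();
    for f in features {
        if !seen.insert(f) {
            return Err(ValidationError::DuplicateFeature(f.clone()));
        }
    }
    Ok(())
}

fn main() {
    // A poisoned candidate file: each feature appears twice (default + r0).
    let poisoned: Vec<String> = (0..60)
        .flat_map(|i| vec![format!("feature_{i}"), format!("feature_{i}")])
        .collect();

    match validate_feature_file(&poisoned) {
        Ok(()) => println!("file passes validation, safe to distribute"),
        // The pipeline stops here; the last known-good file stays in place.
        Err(e) => println!("rejecting candidate feature file: {e:?}"),
    }
}
```

Because the validator enforces the same invariants the edge assumes, a poisoned candidate file fails in the build pipeline, where the blast radius is a blocked deploy, instead of in production, where it is the availability of the network itself.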