Cloudflare explains today's mega-outage

by 24britishtv, June 21, 2022, 7 p.m.
-

A large chunk of the web (including your own Vulture Central) fell off the internet this morning as content delivery network Cloudflare suffered a self-inflicted outage.

The incident began at 0627 UTC (2327 Pacific Time), and it was 0742 UTC (0042 Pacific) before the company had brought all its datacenters back online and verified they were working correctly. During this time a variety of sites and services relying on Cloudflare went dark while engineers frantically worked to undo the damage they had wrought only hours earlier.

"The outage," explained Cloudflare, "was caused by a change that was part of a long-running project to increase resilience in our busiest locations."

What had happened was a change to the company's prefix advertisement policies, which resulted in the withdrawal of a critical subset of prefixes. Cloudflare makes use of BGP (Border Gateway Protocol). As part of this protocol, operators define policies that determine which prefixes (blocks of adjacent IP addresses) are advertised to or accepted from other networks (or peers).

Changing a policy can result in IP addresses no longer being reachable on the Internet. One would therefore hope that extreme caution would be taken before doing such a thing...
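To make that concrete, here is a minimal Python sketch of how a policy edit can silently withdraw prefixes. It is an illustration only, not Cloudflare's configuration or any router vendor's syntax: the first-match rule format, the `advertised` helper and the prefixes (taken from the RFC 5737 documentation ranges) are all assumptions made for the example.

```python
from ipaddress import ip_network

# Hypothetical prefixes from the RFC 5737 documentation ranges.
PREFIXES = [ip_network(p) for p in
            ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]

def advertised(policy, prefixes):
    """Return the prefixes a first-match policy would advertise.

    `policy` is an ordered list of (covering_network, action) terms.
    A prefix that matches no term is rejected by default.
    """
    result = set()
    for prefix in prefixes:
        for network, action in policy:
            if prefix.subnet_of(network):
                if action == "accept":
                    result.add(prefix)
                break  # the first matching term decides
    return result

# Policy before the change: every prefix has its own accept term.
before = [(p, "accept") for p in PREFIXES]

# Policy after a (hypothetical) change: a broad reject term now sits
# ahead of two of the accept terms, so they are never reached.
after = [before[0], (ip_network("0.0.0.0/0"), "reject"), before[1], before[2]]

withdrawn = advertised(before, PREFIXES) - advertised(after, PREFIXES)
for prefix in sorted(withdrawn):
    print("withdrawn:", prefix)
```

In this toy model nothing was deleted from the policy, yet two of the three prefixes stop being advertised, which is why a small-looking policy change can take addresses off the Internet.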

Cloudflare's mistakes actually began at 0356 UTC (2056 Pacific), when the change was made at the first location. There was no problem - the location used an older architecture rather than Cloudflare's new "more flexible and resilient" version, known internally as MCP (Multi-Colo PoP). MCP differed from what had gone before by adding a layer of routing to create a mesh of connections. The theory went that bits and pieces of the internal network could be disabled for maintenance. Cloudflare has already rolled out MCP to 19 of its datacenters.

Fast-forward to 0617 UTC (2317 Pacific), and the change was deployed to one of the company's busiest locations, but not an MCP-enabled one. Things still seemed OK... However, by 0627 UTC (2327 Pacific), the change hit the MCP-enabled locations, rattled through the mesh layer and... took out all 19 locations.

Five minutes later the company declared a major incident. Within half an hour the root cause had been found and engineers began to revert the change. Slightly worryingly, it took until 0742 UTC (0042 Pacific) before everything was complete. "This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically."

One can imagine the panic at Cloudflare towers, although we cannot imagine a controlled process that resulted in a scenario where "network engineers walked over each other's changes."
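For readers who want to check the timeline, the arithmetic can be sketched in a few lines of Python. The assumptions here are ours: "Pacific" is taken to mean PDT (UTC-7), the date is taken from this article's dateline, the 0632 entry is derived from "five minutes later", and the `utc` helper and event labels are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Pacific" in the article means PDT (UTC-7), as in June.
PDT = timezone(timedelta(hours=-7), "PDT")

def utc(hhmm):
    """Build a UTC datetime on 21 June 2022 from an 'HHMM' string."""
    return datetime(2022, 6, 21, int(hhmm[:2]), int(hhmm[2:]),
                    tzinfo=timezone.utc)

events = [
    ("0356", "change made at the first location"),
    ("0617", "change reaches a busy non-MCP location"),
    ("0627", "change reaches MCP locations; outage begins"),
    ("0632", "major incident declared (five minutes later)"),
    ("0742", "all locations back online"),
]

for hhmm, label in events:
    local = utc(hhmm).astimezone(PDT).strftime("%H%M")
    print(f"{hhmm} UTC / {local} Pacific: {label}")

# Dividing one timedelta by another with // gives whole minutes.
outage_minutes = (utc("0742") - utc("0627")) // timedelta(minutes=1)
lead_minutes = (utc("0627") - utc("0356")) // timedelta(minutes=1)
print(f"outage lasted {outage_minutes} minutes")
print(f"first change preceded the outage by {lead_minutes} minutes")
```

On these figures the outage window was 75 minutes, and the change had been live somewhere in the network for roughly two and a half hours before it caused trouble.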

We've asked the company to clarify how this happened, and what testing was done before the configuration change was made, and will update should we receive a response.

Mark Boost, CEO of cloud-native outfit Civo (formerly of LCN.com), was scathing regarding the outage: "This morning was a wake-up call for the price we pay for over-reliance on big cloud providers. It is completely unsustainable for an outage with one provider being able to bring vast swathes of the internet offline.

"Users today rely on constant connectivity to access the online services that are part of the fabric of all our lives, making outages hugely damaging...

"We should remember that scale is no guarantee of uptime. Large cloud providers have to manage a vast degree of complexity and moving parts, significantly increasing the risk of an outage." ®
