What was the root cause of Facebook’s outage?
On Monday, at 11:40 a.m. ET, people began to notice that they could not access Facebook, Instagram, WhatsApp, or Facebook Messenger. According to a Facebook statement made on Tuesday, a configuration update to the backbone routers, which coordinate network traffic between the company’s data centers, had a cascading effect that brought all Facebook services to a grinding halt.
Suddenly, every service Facebook provides was unavailable to users. Several employees also reported being locked out of their workplaces and unable to access Facebook’s internal chat network as a result.
According to Cloudflare, which has experienced internet outage issues of its own in recent weeks, the cause of the problem involves two core systems of the internet: the Domain Name System (DNS) and the Border Gateway Protocol (BGP).
The internet is a collection of interconnected networks. A protocol like BGP is needed to direct traffic between those networks and keep things running smoothly. DNS is essentially an address system for websites that maps a domain name to an IP address, while BGP is the roadmap for determining the most direct route to the IP address assigned to that domain.
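The division of labor between DNS and BGP can be illustrated with a small Python sketch. It is a toy model rather than a description of how Facebook's network or any real router works: the domain, the addresses (taken from ranges reserved for documentation), the peer names, and the functions `resolve`, `find_route`, and `can_reach` are all hypothetical. It assumes a routing table is simply a mapping from advertised prefixes to next hops and picks the most specific matching prefix.

```python
import ipaddress

# Toy "DNS": domain name -> IP address (hypothetical, documentation-range values).
DNS_RECORDS = {"example.com": "192.0.2.10"}

# Toy "BGP" routing table: advertised prefix -> next hop (hypothetical peer names).
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): "peer-a",
    ipaddress.ip_network("192.0.2.0/28"): "peer-b",
}


def resolve(name):
    """DNS step: look up the IP address for a domain name, or None if unknown."""
    return DNS_RECORDS.get(name)


def find_route(routes, address):
    """Routing step: return the next hop for the most specific matching prefix."""
    ip = ipaddress.ip_address(address)
    matches = [network for network in routes if ip in network]
    if not matches:
        return None  # no advertised route covers this address
    best = max(matches, key=lambda network: network.prefixlen)
    return routes[best]


def can_reach(name, routes):
    """A name is reachable only if DNS resolves it AND a route to the address exists."""
    address = resolve(name)
    if address is None:
        return False
    return find_route(routes, address) is not None


if __name__ == "__main__":
    print(can_reach("example.com", ROUTES))  # routes are advertised
    print(can_reach("example.com", {}))      # all routes withdrawn
```

In this sketch, emptying the routing table makes the name unreachable even though the DNS record is unchanged, which is the general kind of dependency the DNS and BGP explanation points to.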
According to two people familiar with the event, as Facebook’s apps remained down for an extended period, its engineers rushed to fix the problem at one of its California data centers.
The outage, which began at 11:40 a.m. ET on Monday and lasted several hours, impacted hundreds of millions of users and advertisers worldwide and affected all of Facebook’s platforms, including Instagram and WhatsApp. Almost all of Facebook’s internal communication and work tools were rendered useless as well. As of 6 p.m. ET, the majority of services appeared to have been restored.
Employees were compelled to communicate through alternative services such as Apple FaceTime and Discord, because Facebook requires them to sign in to tools such as Google Docs and Zoom with their work IDs. Employees who had already authenticated to non-Facebook products such as Google Docs before the outage could still access their accounts.
When Facebook employees could not resolve the issue remotely, they had to travel to one of the company’s most important data centers in California. The New York Times was the first to report that the outage also temporarily blocked some employees from entering buildings and conference rooms.
Following the service restoration, CTO Mike Schroepfer wrote an email to all staff informing them of the resolution.
“If you are not actively working on the recovery, please exercise patience and refrain from reloading everything to avoid slowing down the network’s bring up,” the notification reads. External experts believe the outage was caused by a problem with the networking standard BGP, or Border Gateway Protocol. However, Facebook has not explained the failure in full.
According to Santosh Janardhan, Facebook’s VP of Infrastructure, the outage was caused by a “faulty configuration change,” and the firm has “no evidence” that user data was compromised as a result of the disruption. According to Janardhan, connectivity was interrupted by changes made to the settings of the backbone routers. The resulting disruption to network traffic had a cascading effect on how the company’s data centers communicate, bringing its services to a halt.
Is it feasible to prevent such outages from occurring again in the future?
Even though this is a rare occurrence, it shouldn’t be discounted. The recent Facebook outage and prior incidents such as the Cloudflare and Fastly failures in 2020 and 2021 have raised concerns about a single point of failure for a large number of the internet services that people use.
Beyond connecting with friends and family, people throughout the world use Facebook to log in to other services, such as e-commerce websites. Through applications such as WhatsApp, it has become the dominant mode of communication in many nations. A source of concern for some is that a single failure can affect so many people for several hours.