How a single typo by an Amazon employee caused a massive internet outage

The S3 service outage reportedly cost companies more than $150 million.

The servers were down for nearly four hours.

S3 is part of Amazon Web Services, which hosts hundreds of thousands of websites and apps.

It reportedly has more than a million users and accounts for 40% of the cloud services market.

The company said that changes were being implemented to prevent mistakes like this from happening in the future.

In particular, the company is addressing the fact that the tool used allowed too much capacity to be removed too quickly.


"We want to apologize for the impact this event caused for our customers," Amazon said.

The servers that were inadvertently removed supported two other S3 subsystems.

One of them, the index subsystem, is necessary to serve all GET, LIST, PUT, and DELETE requests.

The placement subsystem is used during PUT requests to allocate storage for new objects.

Removing a significant portion of the capacity caused each of these systems to require a full restart.

While these subsystems were being restarted, S3 was unable to service requests.

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact.

The index subsystem was the first of the two affected subsystems that needed to be restarted.

The S3 PUT API also required the placement subsystem.

The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST.

At this point, S3 was operating normally.

Other AWS services that were impacted by this event began recovering.

We are making several changes as a result of this operational event.

Changes to the capacity-removal tool will prevent an incorrect input from triggering a similar event in the future.

We are also auditing our other operational tools to ensure we have similar safety checks.
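To make the idea of a safety check concrete, here is a minimal Python sketch of a guard on capacity removal. It is not Amazon's tool: the function name `plan_removal`, the per-step cap `max_step`, and the capacity floor `min_required` are all hypothetical, chosen only to illustrate how a mistyped input could be limited.

```python
def plan_removal(current: int, requested: int, min_required: int, max_step: int) -> int:
    """Return how many servers may safely be removed in one step.

    Hypothetical guard: every name and threshold here is an illustrative
    assumption, not a detail of Amazon's actual tooling.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    # Cap the size of a single step so capacity is removed slowly,
    # even if the operator types a far larger number than intended.
    allowed = min(requested, max_step)
    # Never go below the (assumed) minimum capacity the subsystem needs.
    headroom = max(current - min_required, 0)
    return min(allowed, headroom)
```

With these assumed values, a request to remove 80 of 100 servers, given a floor of 60 and a cap of 10 per step, is reduced to 10; a request that would breach the floor is reduced to whatever headroom remains, which may be zero.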

We will also make changes to improve the recovery time of key S3 subsystems.

We employ multiple techniques to allow our services to recover from any failure quickly.

One of the most important involves breaking services into small partitions which we call cells.
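The idea behind cells can be sketched in a few lines of Python. This is only an illustration of partitioning in general, assuming a simple hash-based assignment; the function names `cell_for` and `affected_keys` are hypothetical, and nothing here reflects how S3 actually assigns work to cells.

```python
import hashlib


def cell_for(key: str, num_cells: int) -> int:
    """Map a key to one of num_cells cells using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_keys(keys: list[str], num_cells: int, failed_cell: int) -> list[str]:
    """Return only the keys served by the failed cell; all other keys are unaffected."""
    return [k for k in keys if cell_for(k, num_cells) == failed_cell]
```

Because each key belongs to exactly one cell, a failure or restart of a single cell touches only that cell's share of the keys, and a smaller cell has less state to recover.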

During this event, the recovery time of the index subsystem still took longer than we expected.

The S3 team had planned further partitioning of the index subsystem later this year.

We are reprioritizing that work to begin immediately.