Cloud services, particularly Amazon Web Services (AWS), are an indispensable foundation of modern digital society. Yet these massive, complex systems sometimes fail in unexpected ways, with severe consequences for business and everyday life.
On October 20, 2025, AWS suffered another large-scale outage that left many services down for an extended period. The incident once more raised the question of how to manage risk in a society whose dependence on the cloud keeps deepening.
With that recent outage in mind, this article looks back at past AWS Tokyo Region outages that had particularly significant impact and examines the lessons learned. It goes beyond summarizing incident reports to ask what these experiences mean for system design and risk management in the cloud era.
August 2019 AWS Tokyo Region Major Outage: The Day a Cooling Bug Broke Multi-AZ
The large-scale outage that occurred in the AWS Tokyo Region (ap-northeast-1) on August 23, 2019, confronted many developers and infrastructure engineers with the harsh reality that “Multi-AZ configurations are not infallible.”
Event and Cause
On the afternoon of August 23, 2019, numerous EC2 instances and their associated EBS volumes became unavailable within one Availability Zone (AZ) of the Tokyo Region.
According to AWS’s official announcement, the cause was a bug in the data center’s cooling control system. The bug reduced cooling capacity, and servers in the affected area shut down automatically to prevent overheating. The root cause was therefore a physical equipment issue confined to a single AZ.
Impact: From Digital to Physical
This outage affected a wide range of services, from social games to enterprise core systems. Notably, its impact extended beyond the digital realm.
Impact on the Physical World: The Docomo Bike Share Case
A telling example was the suspension of the Docomo Bike Share service. The shutdown of its authentication and management system, which ran on AWS, triggered the following chain reaction:
- Authentication and Control System Failure: The system managing bike rentals and returns became inoperable.
- Impeded Physical Movement: Even when users attempted to unlock bikes via the app, the system’s “unlock command” failed to reach the bikes’ smart locks.
- Paralysis of Social Infrastructure: As a result, many users could neither rent a bike nor return one already in use, leaving them physically cut off from their means of transportation.
This incident shows how much of a lifeline the cloud has become for services that integrate with the physical world, such as smart locks, IoT devices, and mobility.
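The chain reaction described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `CloudBackend`, `SmartLock`, `rent_bike`, and `return_bike` are hypothetical and do not reflect how the actual bike share system is built. It shows that when both renting and returning depend on a single backend, one backend failure blocks every physical operation.

```python
from dataclasses import dataclass


class BackendUnavailable(Exception):
    """Raised when the (hypothetical) cloud backend cannot be reached."""


@dataclass
class CloudBackend:
    """Stand-in for an authentication and management system hosted in one AZ."""
    healthy: bool = True

    def authorize_unlock(self, user_id: str, bike_id: str) -> str:
        if not self.healthy:
            raise BackendUnavailable("authentication/management system is down")
        return f"unlock:{bike_id}:{user_id}"

    def record_return(self, user_id: str, bike_id: str) -> None:
        if not self.healthy:
            raise BackendUnavailable("authentication/management system is down")


@dataclass
class SmartLock:
    """Stand-in for a bike's smart lock; it only acts on commands it receives."""
    bike_id: str
    locked: bool = True

    def apply(self, command: str) -> None:
        if command.startswith(f"unlock:{self.bike_id}:"):
            self.locked = False


def rent_bike(backend: CloudBackend, lock: SmartLock, user_id: str) -> bool:
    """Return True if the bike was unlocked, False if the request failed."""
    try:
        command = backend.authorize_unlock(user_id, lock.bike_id)
    except BackendUnavailable:
        # The unlock command is never issued, so the lock stays closed.
        return False
    lock.apply(command)
    return not lock.locked


def return_bike(backend: CloudBackend, lock: SmartLock, user_id: str) -> bool:
    """Return True if the return was recorded, False if the request failed."""
    try:
        backend.record_return(user_id, lock.bike_id)
    except BackendUnavailable:
        # The rental cannot be closed, so the user is stuck with the bike.
        return False
    lock.locked = True
    return True


if __name__ == "__main__":
    backend = CloudBackend(healthy=True)
    lock = SmartLock(bike_id="bike-1")
    print("rent while healthy:", rent_bike(backend, lock, "user-1"))

    backend.healthy = False  # simulate the backend going down
    print("return during outage:", return_bike(backend, lock, "user-1"))
    print("rent during outage:", rent_bike(backend, SmartLock("bike-2"), "user-2"))
```

In this toy model the locks themselves never fail. They simply never receive a command, which mirrors how an outage in the cloud surfaced as a physical inability to rent or return bikes.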
