The Cause Of Amazon’s Cloud Outage

Amazon Web Services (AWS) has explained the cause of their outage, which took down thousands of third-party online services for hours. Amazon say that, “the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration... As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.” 

While dozens of services were affected, AWS says the outage occurred in its Northern Virginia, US-East-1, region. It happened after a "small addition of capacity" to its front-end fleet of Kinesis servers. 

Amazon Kinesis enables real-time processing of streaming data. In addition to its direct use by customers, Kinesis is used by several other AWS services and these services also saw impact during the shutdown. Kinesis is used by developers, as well as other AWS services like CloudWatch and Cognito authentication, to capture data and video streams and run them through AWS machine-learning platforms.  

The Kinesis service's front-end handles authentication, throttling, and distributes workloads to its back-end "workhorse" cluster via a database mechanism called sharding.  

Amazon’s additions to capacity triggered the outage but wasn't the root cause of it. AWS was adding capacity for an hour after 2:44am PST, and after that all the servers in Kinesis front-end fleet began to exceed the maximum number of threads allowed by its current operating system configuration.  The first alarm was triggered at 5:15am PST and AWS engineers spent the next five hours trying to resolve the issue. Kinesis was fully restored at 10:23pm PST. 

Amazon explains how the front-end servers distribute data across its Kinesis back-end: "Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map." According to AWS, that information is obtained through calls to a micro service vending the membership information, retrieval of configuration information from DynamoDB and continuous processing of messages from other Kinesis front-end servers. For Kinesis communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants." 

As the number of threads exceeded the OS configuration, the front-end servers ended up with "useless shard-maps" and were unable to route requests to Kinesis back-end clusters. AWS had already rolled back the additional capacity that triggered the event but had reservations about boosting the thread limit in case it delayed the recovery.  

As a first step, AWS has moved to larger CPU and memory servers, as well as reduced the total number of servers and threads required by each server to communicate across the fleet.  It's also testing an increase in thread count limits in its operating system configuration and working to "radically improve the cold-start time for the front-end fleet".  

CloudWatch and other large AWS services will move to a separate, partitioned front-end fleet. AWS is also working on a broader project to isolate failures in one service from affecting other services.  

AWS has also acknowledged the delays in updating its Service Health Dashboard during the incident, but says that was because the tool its support engineers use to update the public dashboard was affected by the outage. During that time, it was updating customers via the Personal Health Dashboard.   Amazon has apologised for the impact this event caused its customers.

Amazon:        Down Detector:       ZDNet

You Might Also Read:

The Risks &  Benefits Of Cloud Security:

 

« We Live In A Transient Internet
Orca Security Wants To Streamline Cloud Computing »

CyberSecurity Jobsite
Perimeter 81

Directory of Suppliers

Cyber Security Supplier Directory

Cyber Security Supplier Directory

Our Supplier Directory lists 6,000+ specialist cyber security service providers in 128 countries worldwide. IS YOUR ORGANISATION LISTED?

Practice Labs

Practice Labs

Practice Labs is an IT competency hub, where live-lab environments give access to real equipment for hands-on practice of essential cybersecurity skills.

DigitalStakeout

DigitalStakeout

DigitalStakeout enables cyber security professionals to reduce cyber risk to their organization with proactive security solutions, providing immediate improvement in security posture and ROI.

CSI Consulting Services

CSI Consulting Services

Get Advice From The Experts: * Training * Penetration Testing * Data Governance * GDPR Compliance. Connecting you to the best in the business.

North Infosec Testing (North IT)

North Infosec Testing (North IT)

North IT (North Infosec Testing) are an award-winning provider of web, software, and application penetration testing.

DigiCert

DigiCert

DigiCert is the only provider of enterprise-grade SSL, IoT and PKI solutions. Our certificates are trusted everywhere, millions of times every day, by companies across the globe.

Janusnet

Janusnet

Janusnet develops software and solutions for organisations to enforce and manage data security.

Beyond Security

Beyond Security

Beyond Security is a leader in automated vulnerability assessment and compliance solutions - enabling customers to accurately assess and manage security weaknesses in their networks and applications.

Critical Infrastructures for Information and Cybersecurity (ICIC)

Critical Infrastructures for Information and Cybersecurity (ICIC)

ICIC addresses the demand for cybersecurity for National Public Sector organizations and civil and private sector organizations in Argentina.

Eustema

Eustema

Eustema designs and manages ICT solutions for medium and large organizations.

ClickDatos

ClickDatos

ClickDatos specializes in consulting, auditing, data protection training, accredited by ISO/IEC 27001 certification.

TrustArc

TrustArc

TrustArc provide privacy compliance and risk management with integrated technology, consulting and TRUSTe certification solutions – addressing all phases of privacy program management.

DefenseStorm

DefenseStorm

DefenseStorm is a Security Data Platform that watches everything on your network and matches it to your policies, providing cybersecurity management that is safe, compliant and cost effective.

NFIR

NFIR

NFIR is a specialist in the field of cyber security incident response and digital forensics.

CyCraft Technology Corp

CyCraft Technology Corp

CyCraft is an AI company that forges the future of cybersecurity resilience through autonomous systems and human-AI collaboration.

ColorTokens

ColorTokens

ColorTokens Xtended ZeroTrust Platform protects from the inside out with unified visibility, micro-segmentation, zero-trust network access, cloud workload and endpoint protection.

Safetech Innovations

Safetech Innovations

Safetech Innovations is a team of cyber security experts, always at your service. We use human and cyber intelligence to help your business in uncertain times.

Intrepid Solutions and Services

Intrepid Solutions and Services

Intrepid Solutions and Services provides technology solutions and professional services to key components of the intelligence and national security communities.

Across Verticals

Across Verticals

Across Verticals is a boutique cyber security consulting firm that specializes in holistic, deeply technical and end to end cyber security advisory services based on industry best practices.

Persistent Systems

Persistent Systems

Persistent Systems are a trusted Digital Engineering and Enterprise Modernization partner, combining deep technical expertise and industry experience to help our clients.

RAH Infotech

RAH Infotech

RAH Infotech is India’s leading value added distributor and solutions provider in the Network and Security domain. We are specialists in Enterprise and App Security and Application Delivery.