AWS infrastructure failures and Kafka recovery problems temporarily halted trading across Coinbase.
Coinbase suffered a major service outage on May 7 that disrupted trading, exchange access, and customer balance updates across multiple platforms. The problems affected spot markets, derivatives, Prime services, and international trading operations for several hours. Engineers later traced the problem to a cooling system failure inside an AWS data center in the United States. Coinbase said customer funds remained safe and no data was lost during the incident.
Kafka Recovery Problems Deepen Coinbase Outage
Coinbase disclosed that monitoring systems first detected cascading quote failures at around 23:50 UTC. Several Sev1 incidents followed shortly after, prompting emergency response procedures across engineering teams. Internal systems tied to the exchange’s core infrastructure began failing as temperatures rose inside a subset of racks hosted in AWS us-east-1.
Yesterday @coinbase experienced a multi-hour service disruption affecting trading, exchange access, and balance updates. Here is our preliminary read from Coinbase engineering on what happened, how we recovered, and what we’re addressing.
At roughly 23:50 UTC on 2026-05-07, our…
- rob (@rwitoff) May 8, 2026
According to Coinbase engineers, hardware failures struck systems connected to the exchange’s matching engine. That engine processes orders and maintains order books across Coinbase markets. Infrastructure problems inside the affected facility left only a portion of the nodes operational. As a result, the cluster failed to reach quorum, temporarily blocking trading for retail and institutional users.
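To illustrate why losing part of a cluster can block trading outright, here is a minimal sketch of a majority quorum check. It assumes a simple "strictly more than half" rule; Coinbase has not disclosed the size of the matching engine cluster or the consensus protocol it uses, so the function name and the node counts are hypothetical.

```python
def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """Return True if a strict majority of the configured nodes is healthy.

    Assumption: simple majority quorum. The real cluster's rule is not public.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= healthy_nodes <= total_nodes:
        raise ValueError("healthy_nodes must be between 0 and total_nodes")
    # A majority means strictly more than half of all configured nodes.
    return healthy_nodes > total_nodes // 2


# Hypothetical five-node cluster: three healthy nodes can still agree,
# two cannot, so the cluster stops accepting writes.
print(has_quorum(5, 3))
print(has_quorum(5, 2))
```

Under this rule a cluster refuses to make progress rather than risk two halves diverging, which is consistent with trading being blocked instead of continuing on the surviving nodes.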
Engineers also faced problems involving distributed Kafka clusters used for internal messaging. Coinbase said these clusters process several terabytes of data each day and were designed to remain operational during a data center outage. Those recovery guarantees did not hold during the incident, forcing teams to manually restore partitions onto replacement hardware brokers.
Dedicated Hardware Failure Slows Recovery Process
Customers experienced delayed balance updates while Kafka replication recovered. Coinbase said balances would be automatically synchronized once systems caught up. Company representatives added that no customer or transaction data was lost during the outage.
Automated recovery tools drained workloads from roughly 10 Kubernetes clusters tied to the affected zone. Most internal services returned within about 30 minutes after engineers isolated the problem.
Recovery took longer for systems tied directly to the exchange matching engine and Kafka infrastructure because both relied on dedicated hardware and storage configurations.
After stabilizing the environment, Coinbase reopened markets in phases. Trading first moved into cancel-only mode before teams audited product states. Markets then entered auction mode before full trading resumed across the exchange.
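The phased reopening described above can be pictured as a small state machine. The sketch below is illustrative only: the mode names, the `HALTED` starting state, and the rule that the audit gates the move out of cancel-only mode are assumptions drawn from the sequence in this article, not from Coinbase's actual tooling.

```python
from enum import Enum


class MarketMode(Enum):
    HALTED = "halted"            # assumed starting state during the outage
    CANCEL_ONLY = "cancel_only"  # users may cancel resting orders only
    AUCTION = "auction"          # orders collect ahead of reopening
    FULL_TRADING = "full_trading"


# Order of phases as described in the article.
REOPEN_SEQUENCE = [
    MarketMode.HALTED,
    MarketMode.CANCEL_ONLY,
    MarketMode.AUCTION,
    MarketMode.FULL_TRADING,
]


def next_mode(current: MarketMode, audit_passed: bool = False) -> MarketMode:
    """Advance one phase; stay in cancel-only until the audit has passed."""
    if current is MarketMode.FULL_TRADING:
        return current  # already fully reopened
    if current is MarketMode.CANCEL_ONLY and not audit_passed:
        return current  # hypothetical gate: product states not yet audited
    return REOPEN_SEQUENCE[REOPEN_SEQUENCE.index(current) + 1]


mode = MarketMode.HALTED
mode = next_mode(mode)                     # moves to cancel-only
mode = next_mode(mode, audit_passed=True)  # moves to auction
mode = next_mode(mode)                     # moves to full trading
print(mode.value)
```

Stepping through one phase at a time means each market can be held at an earlier phase if a check fails, without affecting markets that have already advanced.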
Coinbase Says No Data Was Lost During Multi-Hour Platform Outage
Coinbase acknowledged that parts of its architecture concentrated critical exchange infrastructure inside a single availability zone. Engineers said that standby systems were in place for failover scenarios, though the isolation measures failed during the event. That extended the duration and spread of the outage beyond intended limits.
Company executives praised internal coordination during the recovery process. Engineering and on-call teams reportedly followed established disaster recovery procedures while testing and validating fixes under constrained infrastructure conditions.
Coinbase apologized to customers who temporarily lost access to their accounts and trading services. Executives said a full root cause analysis would be released in the coming weeks, alongside planned reliability improvements aimed at preventing similar failures.
