10/18/23 Aptos Mainnet Incident Report
Overview
On October 18, 2023, at approximately 4:15 PM PDT, the Aptos network began delaying transactions. Transaction load was not a factor in this incident; no committed transactions were lost and no fork occurred. The cause was non-deterministic code, and a fix was deployed. The issue was resolved at approximately 9:30 PM PDT.
What Happened
On August 22, 2023, a performance-focused code change was committed to the Aptos-core code base. On October 16, 2023, the FeeStatement event, which breaks down the charge and refund for a transaction, went live. The August code change introduced a non-determinism that only became visible once the FeeStatement event was emitted. Specifically, validators agreed that a transaction's gas budget was insufficient to complete its execution, but because of the non-determinism introduced in August they could not agree on the amount of gas used up to that point.
How It Was Resolved
The team identified the issue and reverted the August code change, which resolved it. Validator operators quickly deployed the corrected software.
What’s Next
This is the first time since Mainnet launch that the Aptos network has experienced any significant delay on the blockchain, and we take moments like this, while extremely rare, very seriously. This non-deterministic scenario was not reached in any of the testing scenarios (including testnet). Moving forward, ecosystem developers should commit to more strenuous testing of atypical cases. When we or any other ecosystem team propose changes to the protocol in the future, we should push for test cases that exercise unwanted input and unexpected user behavior.
Detailed Incident Description
The Aptos blockchain deploys new technology at the pace of market demand and in line with cloud infrastructure. There were 7 major releases and more than 40 AIPs in 2023, with the last release of 2023 expected in the coming months. Every code change must be accepted by at least 2 code reviewers. All code must pass a series of tests before being committed: unit tests, regression tests, integration tests, compatibility tests, performance tests, security tests, and simulated real-world testing in multi-region cloud environments. Security tests include fuzzing, adversarial testing, and failure testing. Additionally, nightly tests exercise more expansive cluster scenarios with broad coverage of different transaction types, including replaying transactions from mainnet. All releases go through the same release pipeline: first devnet, then testnet, then previewnet for major features, and finally mainnet. Although this extensive testing infrastructure is expensive and lengthens the time needed to deploy code, our experience has shown that it significantly reduces the number of bugs that make their way to mainnet.
On August 22, 2023, a change landed that improved the performance of the data structures that hold the output of VM execution. The change replaced a deterministic map with a non-deterministic map. In combination with the FeeStatement event, which breaks down the charge and refund for a transaction, this made the summation of I/O-based gas charges non-deterministic in certain cases where a transaction's gas costs exceeded its gas budget. The problem only occurs in the rare scenario where the transaction gas limit is hit: execution is then incomplete and, combined with the non-deterministic data structure, this can lead to non-deterministic I/O gas costs in the FeeStatement event. This non-deterministic scenario was not reached in any of the testing scenarios (including testnet), and its first occurrence led to this incident on mainnet.
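The mechanism can be illustrated with a minimal Python sketch. This is not Aptos-core code. It assumes, purely for illustration, that I/O gas is charged one write at a time while iterating over the output map, and that charging stops as soon as the next charge would exceed the gas limit. The names `charge_io_gas`, `WRITE_COSTS`, and the cost values are hypothetical. Enumerating every iteration order stands in for the arbitrary order a non-deterministic map may produce.

```python
from itertools import permutations

# Hypothetical per-write I/O gas costs, keyed by storage key (assumption).
WRITE_COSTS = {"a": 50, "b": 30, "c": 40}


def charge_io_gas(write_costs, order, gas_limit):
    """Charge I/O gas for each write in `order`; abort when the budget is exceeded.

    Returns (status, io_gas_charged_before_the_abort_or_completion).
    """
    charged = 0
    for key in order:
        cost = write_costs[key]
        if charged + cost > gas_limit:
            # Execution is incomplete: the partial sum depends on the order.
            return ("OUT_OF_GAS", charged)
        charged += cost
    return ("SUCCESS", charged)


def outcomes_over_all_orders(write_costs, gas_limit):
    """Every outcome reachable if the map may iterate in any order."""
    return {
        charge_io_gas(write_costs, order, gas_limit)
        for order in permutations(write_costs)
    }


def outcome_sorted(write_costs, gas_limit):
    """Outcome when iteration order is fixed by sorting the keys."""
    return charge_io_gas(write_costs, sorted(write_costs), gas_limit)


# Budget too small: every order agrees on the status, but not on the gas used.
print(sorted(outcomes_over_all_orders(WRITE_COSTS, 100)))
# Budget large enough: the full sum is the same in every order.
print(sorted(outcomes_over_all_orders(WRITE_COSTS, 200)))
# An ordered map yields one answer even when the budget is too small.
print(outcome_sorted(WRITE_COSTS, 100))
```

Under these assumptions, all orders report the same out-of-gas status while the partial I/O gas total varies, which mirrors validators agreeing that the budget was insufficient but disagreeing on the gas used. When the budget is sufficient, the full sum does not depend on order, which is consistent with the issue staying hidden in ordinary transactions.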
October 18, 2023 Detailed Response
4:15 PM PDT - An automated alert notifies developers and node operators that transactions are paused. Developers start a VC war room and immediately begin triaging the source of the issue.
5:15 PM PDT - From several validator logs, developers identify 4 different execution results for the same transaction (non-determinism). Developers also rule out parallel execution as the source by disabling it on several machines and observing that the different execution results still occur on the same machine with the same software. At this point, the non-determinism is highly likely to come from a sequential source. Developers begin to investigate recent code changes that could lead to this issue, and several parallel investigations kick off to find the source of the non-determinism.
7:30 PM PDT - After identifying the actual differences in event output between the non-deterministic executions of the transaction, developers trace the issue back in code to the FeeStatement event and the August code change. Two efforts start in parallel. One developer begins running a transaction simulation with a code change that reverts the map change, repeating execution to confirm that the result is consistent. At the same time, the revert of the code commit lands and Docker builds for validator operators begin.
8:45 PM PDT - The transaction simulation confirms the fix is correct. The new build is ready and available both as a Docker image and as source. Validator operators quickly upgrade their software.
9:18 PM PDT - As the network reaches consensus on the transaction execution, the blocking transaction processes successfully and all pending transactions complete as well. No transactions enqueued for execution are lost. No committed transactions are lost. The network resumes normal operation.
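Two checks from the timeline above, comparing execution results across validators and repeating execution to confirm a consistent result, can be sketched in Python as follows. This is an illustrative sketch, not the tooling the team used. The function names, the validator names, and the `io_gas_units` payload field are all hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import cycle


def output_digest(events):
    """Hash an ordered list of (event_name, payload) pairs into a hex digest."""
    hasher = hashlib.sha256()
    for name, payload in events:
        hasher.update(repr((name, payload)).encode("utf-8"))
    return hasher.hexdigest()


def group_by_result(reports):
    """Group validators by the digest of the output each one reported."""
    groups = defaultdict(list)
    for validator, events in reports.items():
        groups[output_digest(events)].append(validator)
    return {digest: sorted(names) for digest, names in groups.items()}


def is_deterministic(execute, runs=100):
    """Re-run `execute` and check that every run produces the same digest."""
    digests = {output_digest(execute()) for _ in range(runs)}
    return len(digests) == 1


# Hypothetical reports: same transaction, different I/O gas in the fee event.
REPORTS = {
    "v1": [("FeeStatement", ("io_gas_units", 70))],
    "v2": [("FeeStatement", ("io_gas_units", 80))],
    "v3": [("FeeStatement", ("io_gas_units", 80))],
    "v4": [("FeeStatement", ("io_gas_units", 90))],
}

# More than one group means validators disagree on the execution result.
print(sorted(group_by_result(REPORTS).values()))

# A stand-in for a flaky executor: alternates between two gas values.
flaky_values = cycle([70, 80])
print(is_deterministic(lambda: [("FeeStatement", ("io_gas_units", next(flaky_values)))]))

# A stand-in for the fixed executor: always the same value.
print(is_deterministic(lambda: [("FeeStatement", ("io_gas_units", 80))]))
```

Running the repeated-execution check on a single machine is what separates a sequential source of non-determinism from a parallel one: if one machine running the same software sequentially produces more than one digest, parallel execution cannot be the only cause.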