The incident was triggered by our S3 streaming pipeline being unable to keep up with bursts in write event volume, which created a backlog of events waiting to be processed and uploaded to S3.
The problem was exacerbated by the fact that once the system is in a degraded state, some events may be uploaded to S3 but not recorded in our bookkeeping system. When a backlog exists, the system therefore checks for duplicate events before uploading new data. This duplicate check is expensive: it requires reading data back from S3 every time new events are written, which placed additional load on an already strained system.
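The interaction between the bookkeeping gap and the expensive duplicate check can be sketched as follows. This is a minimal illustration, not our actual pipeline code: `store` is an in-memory stand-in for S3, `ledger` for the bookkeeping system, and all names are hypothetical.

```python
class BacklogUploader:
    """Sketch of the dedup-before-upload path described above."""

    def __init__(self):
        self.store = {}     # stand-in for S3: event id -> payload
        self.ledger = set() # stand-in for bookkeeping: ids believed uploaded

    def upload(self, event_id, payload, degraded=False):
        # In a degraded state the S3 upload can succeed while the
        # bookkeeping write is lost -- the failure mode from the incident.
        self.store[event_id] = payload
        if not degraded:
            self.ledger.add(event_id)

    def drain_backlog(self, events):
        """Upload backlogged (event_id, payload) pairs, skipping duplicates."""
        uploaded = []
        for event_id, payload in events:
            if event_id in self.ledger:
                continue  # cheap path: bookkeeping already confirms it
            if event_id in self.store:
                # Expensive path: a read back from "S3" is needed to
                # detect the duplicate, since bookkeeping missed it.
                self.ledger.add(event_id)
                continue
            self.upload(event_id, payload)
            uploaded.append(event_id)
        return uploaded
```

For example, an event uploaded during degradation (`degraded=True`) is absent from the ledger, so draining the backlog must fall back to the expensive store lookup to avoid re-uploading it.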
To resolve the incident we added capacity by increasing the number of processes, and changed the system to process a smaller batch of events at a time to prevent resource starvation.
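The smaller-batch change amounts to capping how many events a worker takes in one pass so that a single burst cannot monopolize memory or workers. A minimal sketch, assuming a hypothetical `batch_size` tuning knob:

```python
from itertools import islice

def batched(events, batch_size):
    """Yield fixed-size batches from an event iterable.

    Capping batch_size bounds per-pass resource usage, which is the
    resource-starvation fix described above.
    """
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each worker then processes one batch at a time, yielding back to the scheduler between batches instead of attempting to drain the whole backlog in a single pass.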