Delay in streaming to S3
Incident Report for Keen
Postmortem

At approximately 18:10 UTC on Monday, April 1, our monitoring system alerted us to a growing backlog (or "delta") in our S3 processing pipeline. Our on-call team immediately began investigating and identified a large spike in incoming event traffic from a single customer. That traffic was for a project with S3 streaming enabled, and the extra volume exceeded the capacity of our pipeline.
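
For readers curious what we mean by the "delta": it is essentially how far behind real time the streaming pipeline is running. A minimal, illustrative sketch of that kind of check is below; the names and thresholds are hypothetical and not our production monitoring.

    # Illustrative sketch (not our production code): how a streaming "delta"
    # alert like the one that fired here might be computed.
    import time

    DELTA_ALERT_THRESHOLD_SECONDS = 15 * 60  # e.g. alert if streaming falls >15 min behind

    def compute_streaming_delta(oldest_unstreamed_event_ts: float) -> float:
        """Return how far behind real time the S3 streaming pipeline is, in seconds."""
        return time.time() - oldest_unstreamed_event_ts

    def check_delta(oldest_unstreamed_event_ts: float) -> None:
        delta = compute_streaming_delta(oldest_unstreamed_event_ts)
        if delta > DELTA_ALERT_THRESHOLD_SECONDS:
            page_oncall(f"S3 streaming delta is {delta / 60:.0f} minutes and growing")

    def page_oncall(message: str) -> None:
        # Placeholder for the actual alerting integration.
        print("ALERT:", message)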

While working on this investigation, we also learned from customer reports and our own monitoring that some event submissions (even for non-S3-streaming projects) were failing outright. We identified the root cause as a slowdown in response times from our Kafka cluster, very likely caused by the increased load from the S3 streaming pipeline (which uses Kafka as an intermediate work queue).
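
For context on the architecture: incoming events are written to Kafka, and separate workers consume them and write batches to S3. The sketch below is a highly simplified illustration of that shape, with hypothetical topic, bucket, and batching parameters; it is not a description of our actual implementation.

    # Simplified sketch: consume events from a Kafka topic, batch them,
    # and write the batches to S3. All names here are illustrative.
    import json
    import time
    import uuid

    import boto3
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        "s3-streaming-events",                  # hypothetical topic name
        bootstrap_servers=["kafka:9092"],
        group_id="s3-streaming-workers",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    s3 = boto3.client("s3")

    batch, last_flush = [], time.time()
    for message in consumer:
        batch.append(message.value)
        # Flush on size or age so a traffic spike produces more S3 objects,
        # not unbounded memory growth in the worker.
        if len(batch) >= 5000 or time.time() - last_flush > 60:
            key = f"events/{int(time.time())}-{uuid.uuid4().hex}.json"
            s3.put_object(
                Bucket="example-streaming-bucket",  # hypothetical bucket
                Key=key,
                Body="\n".join(json.dumps(e) for e in batch).encode("utf-8"),
            )
            batch, last_flush = [], time.time()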

We reached out to the customer in question and worked with them to identify and roll back a recent change that contained the bug responsible for the event volume spike. Once submissions subsided, the API resumed normal operation as pressure on Kafka was relieved, but we were left with a large backlog of S3 data to work through.

The operations team at Keen spent the next ~24 hours tuning, tweaking, and massaging the pipeline to get it caught back up. Eventually we were able to get all delayed events processed and return to normal real-time streaming.

In the short term, we will be updating our runbooks with the lessons learned while tuning the pipeline, so that if/when we face a similar scenario in the future we can clear the backlog more rapidly.

Longer term, this incident provides valuable input for prioritizing some bigger projects, such as (a) more sophisticated rate limiting and load shedding at the API tier and (b) a more efficient and robust S3 streaming architecture that is resilient to traffic surges and backlog scenarios.
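
As a rough illustration of what (a) could look like, here is a sketch of per-project token-bucket rate limiting at the API tier. The rates, structure, and helper names are assumptions for illustration only, not a description of what we will ship.

    # Illustrative per-project token bucket: shed a single project's spike
    # with HTTP 429s instead of letting it slow Kafka (and everyone else) down.
    import time
    from collections import defaultdict

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: float):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    buckets = defaultdict(lambda: TokenBucket(rate_per_sec=1000, burst=5000))

    def handle_write(project_id: str, event: dict) -> int:
        if not buckets[project_id].allow():
            return 429  # ask the client to back off and retry
        enqueue_to_kafka(project_id, event)  # hypothetical downstream call
        return 201

    def enqueue_to_kafka(project_id: str, event: dict) -> None:
        pass  # placeholder for the real Kafka producer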

If you have any questions, comments, or concerns please feel free to reach out to us at team@keen.io. Thanks!

Posted Apr 05, 2019 - 12:12 PDT

Resolved
S3 Streaming continues to function normally. We are resolving this incident. Apologies again for the hiccup.
Posted Apr 02, 2019 - 16:51 PDT
Update
The S3 streaming pipeline is now fully caught up. We will be monitoring to ensure continued normal operation.
Posted Apr 02, 2019 - 13:42 PDT
Update
We've implemented changes to accelerate clearing the backlog. The delay for most data should be down to 6-12 hours or less, and shrinking. We hope to be caught up to normal operation within the next few hours.
Posted Apr 02, 2019 - 12:16 PDT
Update
We are still working on catching up S3 streaming to real-time. Currently streaming is ~16-20 hours behind. We are experimenting with adjustments to the pipeline to close that gap. We apologize for the extended delay.
Posted Apr 02, 2019 - 10:11 PDT
Update
We are still working to clear the S3 streaming backlog. We do not have a precise timeline within which it will be caught up, but hope it will be within the next ~12 hours. Thanks for your patience.
Posted Apr 01, 2019 - 22:23 PDT
Monitoring
We've implemented fixes in our S3 streaming pipeline and it is beginning to catch up, but may take 12-24 hours at the current rate. We will continue to monitor and see if we can expedite the process. Thanks again for your patience.
Posted Apr 01, 2019 - 14:49 PDT
Update
The Stream API should be back to normal. We are continuing to work on the S3 streaming delay.
Posted Apr 01, 2019 - 13:42 PDT
Update
We are continuing to work towards a resolution. Thanks for your patience.
Posted Apr 01, 2019 - 12:57 PDT
Update
Some writes to the streaming API are slow or failing, even for non-S3-streaming projects. We are continuing to investigate. Thanks for your patience.
Posted Apr 01, 2019 - 11:58 PDT
Identified
We have identified an issue with streaming events to S3 and are working towards a resolution. Customers who do not use S3 Streaming are not affected.
Posted Apr 01, 2019 - 11:29 PDT
This incident affected: S3 Streaming.