At approximately 18:10 UTC on Monday, April 1, our monitoring system alerted us to a growing backlog (or "delta") in our S3 processing pipeline. Our on-call team immediately began investigating and identified a large spike in incoming event traffic from a single customer. The traffic belonged to a project with S3 streaming enabled, and the extra volume overwhelmed the capacity of our pipeline.
During this investigation we also learned, from customer reports and our own monitoring, that some event submissions (even for projects without S3 streaming) were failing outright. We identified the root cause as slowed response times from our Kafka cluster, very likely caused by the increased load from the S3 streaming pipeline, which uses Kafka as an intermediate work queue.
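Because the S3 pipeline drains events from Kafka, a backlog of this kind shows up as consumer lag: the gap between the newest offset written to a topic and the offset the pipeline's consumers have committed. The sketch below shows one generic way to measure that lag with the confluent-kafka Python client; the broker address, topic, consumer group, and partition count are illustrative placeholders, not our actual configuration.

```python
# Minimal consumer-lag check (illustrative only; names are placeholders).
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "s3-streaming-workers",      # hypothetical consumer group
    "enable.auto.commit": False,
})

def total_lag(topic: str, num_partitions: int) -> int:
    """Sum (latest offset - committed offset) across all partitions."""
    partitions = [TopicPartition(topic, p) for p in range(num_partitions)]
    committed = consumer.committed(partitions, timeout=10)
    lag = 0
    for tp in committed:
        # High watermark = offset of the next message to be written.
        _, high = consumer.get_watermark_offsets(tp, timeout=10)
        committed_offset = tp.offset if tp.offset >= 0 else 0
        lag += max(high - committed_offset, 0)
    return lag

print("backlog (messages):", total_lag("s3-stream-events", num_partitions=8))
```

A lag metric like this, tracked over time, is what lets a monitoring system distinguish a momentary blip from a backlog that is growing faster than the pipeline can drain it.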
We reached out to the customer in question and worked with them to identify and roll back a change containing the bug that generated the event volume spike. Once the submissions subsided and pressure on Kafka was relieved, the API resumed normal operation, but we were left with a large backlog of S3 data to work through.
The operations team at Keen spent the next ~24 hours tuning, tweaking, and massaging the pipeline to get it caught back up. Eventually we were able to get all delayed events processed and return to normal real-time streaming.
In the short term we will be updating our runbooks with lessons learned while tuning the pipeline, so that if/when we face a similar scenario in the future we can hopefully clear the backlog more rapidly.
Longer term, this incident provides valuable input into the prioritization of larger projects such as (a) more sophisticated rate limiting and load shedding at the API tier and (b) a more efficient and robust S3 streaming architecture that is more resilient to traffic surges and backlog scenarios.
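To illustrate the first of those ideas, here is a minimal token-bucket sketch of per-project rate limiting at an API tier. This is a generic example, not our planned design, and the numbers are placeholders: each project gets a bucket that refills at a steady rate, short bursts are absorbed, and sustained excess traffic is shed before it can reach the downstream queue.

```python
import time

class TokenBucket:
    """Simple per-project token bucket: absorbs short bursts but caps
    sustained throughput, shedding excess requests instead of queuing them."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # tokens refilled per second
        self.capacity = burst         # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over limit: shed the request

# One bucket per project or API key; the limits here are illustrative only.
limiter = TokenBucket(rate_per_sec=1000, burst=5000)
if not limiter.allow():
    pass  # reject with HTTP 429 rather than letting the surge reach Kafka
```

The design point is that a surge from one project is rejected cheaply at the edge, instead of being accepted and allowed to overwhelm shared infrastructure like the Kafka cluster behind the API.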
If you have any questions, comments, or concerns please feel free to reach out to us at firstname.lastname@example.org. Thanks!