Florian, Kevin, Aleksander
RCA Complete, Action Items In Progress
Keen experienced intermittent availability and high latency for 172 minutes due to two large spikes in the rate of inbound events.
The first spike began at about 23:00UTC and subsided at about 23:30UTC, just as the source of the traffic was identified. The nature of the event and how traffic was balanced across our platform meant that our external monitoring did not reflect the true severity of the spike in traffic. Because the event was believed to be over, no remediation was taken.
A second, larger spike occurred 30 minutes later at 00:00UTC (6/29). This spike grew large enough to trigger internal monitoring systems more consistently. At 00:33UTC the traffic was blocked and the platform began to recover.
Additionally, uncertainty around the severity of the issue and the lack of a clear internal (and external) escalation process led to a delay in opening an official incident.
Widespread high latency and high rates of request timeouts.
Two large spikes in the number of inbound events overwhelmed the available capacity of the platform to process events.
Blocked inbound event stream.
Mixed. External monitoring systems supplied only partial indicators due to the nature of the event. Operators identified customer impact.