Event Ingestion and Query Timeouts or High Latency
Incident Report for Keen
Postmortem

Event Ingest and Query Latency

Date: 2018-06-28
Authors:

Florian, Kevin, Aleksander

Status

RCA Complete, Action Items In Progress

Summary

Keen experienced intermittent availability and high latency for 172 minutes due to two large spikes in the rate of inbound events.

The first spike began at about 23:00 UTC and subsided at about 23:30 UTC, just as the source of the traffic was identified. Because of the nature of the event and how traffic was balanced across our platform, our external monitoring did not reflect the true severity of the spike in traffic. Believing the event was over, we took no remediation.

A second, larger spike occurred 30 minutes later at 00:00 UTC (6/29). The second spike grew large enough to trigger internal monitoring systems more consistently. At 00:33 UTC the traffic was blocked and the platform began to recover.

Additionally, uncertainty around the severity of the issue and the lack of a clear internal (and external) escalation process delayed the opening of an official incident.

Impact

Widespread high latency and high rates of request timeouts.

Root Causes & Trigger

Two large spikes in the number of inbound events overwhelmed the available capacity of the platform to process events.

Resolution

Blocked inbound event stream.

Detection

Mixed. External monitoring systems supplied only partial indicators due to the nature of the event. Operators identified the customer impact.

Action Items
  • Define a clear escalation path for Keen Pro customers
  • Conduct Incident Handling/Response planning session with Engineering and CS orgs
    • Update incident handling runbook to provide a detailed order of procedures (e.g., when to open an incident, how to notify customers)
  • Adjust external monitoring to better detect and alert on periods of high latency
  • Complete Engineer Note for customers on timeout handling
  • Support exponential backoff or a similar retry strategy in the JavaScript SDK
  • Engineer org conducts load shedding/throttling review
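The exponential backoff action item above could take a shape like the following sketch. This is a hypothetical illustration, not the actual Keen JavaScript SDK API: the `sendWithBackoff` helper and its parameters are invented for this example.

```typescript
// Hypothetical retry helper; names and signatures are illustrative only,
// not the real Keen SDK. Retries on overload responses (5xx / 429) with
// exponential backoff plus full jitter to avoid synchronized retry storms.
async function sendWithBackoff(
  send: () => Promise<{ status: number }>,
  maxRetries = 5,
  baseDelayMs = 200,
): Promise<{ status: number }> {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await send();
      // Succeed (or fail permanently) on anything that isn't an overload signal.
      if (res.status < 500 && res.status !== 429) return res;
      if (attempt >= maxRetries) return res; // retries exhausted
    } catch (err) {
      if (attempt >= maxRetries) throw err; // network error, retries exhausted
    }
    // Full jitter: wait a random duration in [0, base * 2^attempt).
    const delayMs = Math.random() * baseDelayMs * Math.pow(2, attempt);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

Spreading retries out with jitter matters here: without it, all clients that saw the same spike would retry in lockstep and re-create the overload.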
Posted Jul 06, 2018 - 14:36 PDT

Resolved
Request times and platform performance have stabilized and returned to normal levels. We apologize for the delay in getting status updates out and we'll work on getting a Post Mortem/RCA for everyone in the coming days.
Posted Jun 28, 2018 - 19:44 PDT
Monitoring
A fix has been deployed and appears to be working, we'll continue monitoring the situation but request times should be returning to normal.
Posted Jun 28, 2018 - 19:13 PDT
Investigating
We're aware that a large subset of our users are currently experiencing timeouts or high latency while attempting to write (and in some cases read) events from Keen. We're currently testing a remediation and will provide an additional update in 30 minutes.
Posted Jun 28, 2018 - 18:58 PDT
This incident affected: Stream API and Compute API.