Event processing is slow
Incident Report for Keen
Postmortem

Earlier this week we made a code change to more accurately measure the size of incoming events. The change measured the number of bytes rather than the number of characters, more accurately enforcing our limits on event size.
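To see why this distinction matters, here is a small sketch (illustrative only, not Keen's actual code) showing how the same string can be much larger in UTF-8 bytes than in characters:

```python
# Illustrative sketch: the same payload measured in characters vs bytes.
# Multi-byte characters (like emoji) make the byte count larger.
event_body = '{"emoji": "🎉🎉🎉"}'

char_count = len(event_body)                  # counts Unicode characters
byte_count = len(event_body.encode("utf-8"))  # counts encoded bytes

# Each emoji is 1 character but 4 bytes in UTF-8,
# so a character-based limit under-measures the true payload size.
print(char_count, byte_count)
```

A limit enforced on `char_count` would admit payloads whose encoded size exceeds what downstream systems can handle, which is what the byte-based measurement corrected.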

The patch contained a mistake that set the single-event size limit to the same value as the limit for a batch of events. This allowed a single event to enter our write path that was too large for some of our internal code paths. The problem manifested much like failures we've seen before with different root causes, so we were slow to diagnose it. After examining the write queue we realized the event was too large and quickly found the aforementioned error.

A config change was deployed to our API to correct the error and the bad event was skipped. In addition, the following remediation items will be instituted:

  • Our runbooks will be updated to include instructions for on-call engineers to examine items in the queue and skip bad events.
  • Our write path code will be modified to drop (with errors) any event over a certain size.
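As a rough sketch of the second remediation item, the write path can enforce a per-event byte limit that is distinct from (and smaller than) the per-batch limit, rejecting oversized events with an error before they reach the queue. The limit values and names below are hypothetical, chosen only for illustration:

```python
# Hypothetical sketch of distinct per-event and per-batch size limits.
# Values are made up; Keen's actual limits and code are not shown here.
MAX_EVENT_BYTES = 1_000_000    # single-event limit (assumed value)
MAX_BATCH_BYTES = 10_000_000   # whole-batch limit (assumed value)

class EventTooLargeError(ValueError):
    """Raised when an event or batch exceeds its byte limit."""

def validate_batch(events):
    """Reject (with an error) any event over the single-event limit,
    then check the batch total against the separate batch limit."""
    total = 0
    for event in events:
        size = len(event.encode("utf-8"))  # measure bytes, not characters
        if size > MAX_EVENT_BYTES:
            raise EventTooLargeError(
                f"event is {size} bytes; limit is {MAX_EVENT_BYTES}")
        total += size
    if total > MAX_BATCH_BYTES:
        raise EventTooLargeError(
            f"batch is {total} bytes; limit is {MAX_BATCH_BYTES}")
```

Rejecting oversized events at the API boundary, rather than letting them into the write queue, keeps a single bad event from stalling processing for everyone else.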

We apologize for the inconvenience. Indexing your data and making it available for query quickly is important to us, and we thank you for your patience.

Posted Mar 04, 2015 - 18:52 PST

Resolved
The backlog is clear, we are back to normal levels. We've identified the root cause and have a fix in the pipeline to prevent this from happening again.
Posted Mar 04, 2015 - 18:43 PST
Update
Event backlog is still growing. We have a fix in flight to process events faster. We are continuing to investigate the root cause that led to the problem.
Posted Mar 04, 2015 - 18:13 PST
Investigating
Some events are making it into the system but are not available for querying. This is affecting a small percentage of total events. We are working on clearing up the backlog.
Posted Mar 04, 2015 - 17:33 PST