Delay in writing events.

Incident Report for Keen

Postmortem

A disk failure in one of our coordination nodes caused some of our write processing to pause. A simple restart solved the problem, but we were slow to detect it because of an unrelated failure, dating from Friday, in our write-queue depth monitoring software. Once we rolled back the monitoring software, we were able to quickly detect the write delay and solve the problem.

We'll fix the monitoring software on Monday and add an inverse check that verifies our monitoring is actually receiving proper data from this subsystem.
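As a rough illustration of what such an inverse check could look like, here is a minimal sketch, not our actual monitoring code. The `Sample` type, the `check_queue` function, and the threshold values are all hypothetical. The idea is that missing, stale, or nonsensical queue-depth data is treated as its own alert condition instead of being read as "healthy".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One queue-depth measurement (hypothetical shape)."""
    timestamp: float        # Unix seconds when the depth was measured
    depth: Optional[int]    # reported depth; None if the probe returned nothing


def check_queue(latest: Optional[Sample], now: float,
                max_depth: int = 10_000, max_age_s: float = 120.0) -> str:
    """Return 'ok', 'backlog', or 'no_data' for the latest sample."""
    # Inverse check: no sample, an empty reading, or a negative depth
    # means the monitor itself is not getting proper data.
    if latest is None or latest.depth is None or latest.depth < 0:
        return "no_data"
    # A sample older than max_age_s is treated the same as no sample.
    if now - latest.timestamp > max_age_s:
        return "no_data"
    # Regular check: the data is fresh and valid, so compare to the threshold.
    if latest.depth > max_depth:
        return "backlog"
    return "ok"
```

With a check shaped like this, a broken monitor surfaces as `no_data` and can page someone, so a silent monitoring failure does not hide a real write delay.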

Sorry for the inconvenience!

Posted Apr 12, 2015 - 11:09 PDT

Resolved

The backlog of delayed events has been fully processed. Events are now flowing in a timely fashion.

Posted Apr 12, 2015 - 11:07 PDT

Monitoring

Our fix has been deployed and we're processing the events that had fallen behind. We should complete the backlog in about 30 minutes. Note that this only affects some events and is not limited to specific customers.

Posted Apr 12, 2015 - 10:58 PDT

Identified

We've identified a slowdown in writing some events. We're deploying a fix now, but users can expect some events to be slow to become available for query.

Posted Apr 12, 2015 - 10:47 PDT

This incident affected: Stream API.