Delayed Writes and Slow Query Performance
Incident Report for Keen
Postmortem

Today our users experienced longer than normal write times for events. Typically events are written and queryable within 10 seconds of posting. Today we fell behind by up to an hour. No data was lost; writes were just delayed. Here's what happened.

We use redundant storage across multiple data centers at Keen. We have several dozen storage hosts spread across these data centers to ensure that losing data is really damned hard. Today we had a handful of nodes fail nearly simultaneously.

We're not sure why these hosts failed at once. Individual hosts do fail from time to time, but it's very odd for several to fail nearly simultaneously. Maybe there's a common thread there. We'll investigate that.

Diagnosing this problem was slower than we'd like. We noticed very quickly that a single host had failed, but we didn't notice the others for a few minutes. This delay in diagnosis slowed down finding the root cause. We missed the forest for the tree(s).

This is the first time that we've had a failure of this many nodes and the repercussions, while obvious in hindsight, took longer than we'd like to recognize. Writes were delayed, but continued to be accepted. Deletes were temporarily unavailable, so we'll need to rethink how we handle that.

We'll also look at changing our storage monitoring to be clearer about the width of failures, so that we can respond more effectively when multiple nodes fail.
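As a rough illustration of what "width of failures" could mean in an alert, here is a minimal Python sketch that counts the distinct hosts that failed within a recent time window and says so in one line. It is not our actual monitoring code: the `HostFailure` record, the host names, the five-minute window, and the severity labels are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostFailure:
    host: str         # hypothetical storage host name
    failed_at: float  # failure time, in seconds since the epoch


def failure_width(failures, now, window_seconds=300.0):
    """Return the sorted names of distinct hosts that failed in the window ending at `now`."""
    cutoff = now - window_seconds
    return sorted({f.host for f in failures if cutoff <= f.failed_at <= now})


def summarize(failures, now, total_hosts, window_seconds=300.0):
    """Build a one-line alert that states how many hosts are affected, not just that one is."""
    hosts = failure_width(failures, now, window_seconds)
    if not hosts:
        return "OK: no storage hosts failed in the window"
    # Assumption for this sketch: more than one failed host is treated as critical.
    severity = "CRITICAL" if len(hosts) > 1 else "WARNING"
    return f"{severity}: {len(hosts)} of {total_hosts} storage hosts failed: {', '.join(hosts)}"
```

An alert phrased this way puts the number of affected hosts in front of whoever is responding, rather than a single host name that can hide a wider failure.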

We apologize for the inconvenience and hope that this postmortem helps to reinforce our dedication to keeping your event data safe and available.

Posted Dec 07, 2014 - 20:19 PST

Resolved
We've caught up with the backlog and all events have been written. We're sorry for the trouble and are working on a postmortem.
Posted Dec 04, 2014 - 11:34 PST
Update
We are continuing to work through the backlog and estimate we'll catch up with old events by noon, Pacific time.
Posted Dec 04, 2014 - 11:19 PST
Monitoring
Writes are catching up and progressing normally. We still have a backlog to work through and will update this incident once it has been cleared.
Posted Dec 04, 2014 - 11:08 PST
Update
Event writes have resumed and we are now working through the backlog of events. We are continuing to work on storage problems and will report when things are all back to normal.
Posted Dec 04, 2014 - 11:03 PST
Identified
We have identified a problem: multiple storage systems have failed and need to be restarted. We are working on it now. No data has been lost; writes are merely delayed.
Posted Dec 04, 2014 - 10:54 PST
Investigating
We are investigating high latency in our storage layer, which is resulting in delayed writes and slower than normal query performance. We will update soon!
Posted Dec 04, 2014 - 10:30 PST