Today our users experienced longer-than-normal write times for events. Typically events are written and queryable within 10 seconds of posting. Today we were briefly behind, by up to an hour. No data was lost; writes were just delayed. Here's what happened.
We use redundant storage across multiple data centers at Keen. We have several dozen storage hosts spread across these data centers to ensure that losing data is really damned hard. Today we had a handful of nodes fail nearly simultaneously.
We're not sure yet why these hosts failed at once. Individual host failures happen from time to time, but several at nearly the same moment is very odd. There may be a common cause, and we'll investigate that.
Diagnosing this problem was slower than we'd like. We noticed very quickly that a single host had failed, but we didn't notice the others for a few minutes. This delay slowed down finding the root cause. We missed the forest for the tree(s).
This is the first time we've had this many nodes fail at once, and the repercussions, while obvious in hindsight, took longer than we'd like to recognize. Writes were delayed but continued to be accepted. Deletes were temporarily unavailable, so we'll need to rethink how we handle that.
We'll also look at changing our storage monitoring to be clearer about the breadth of a failure, so that we can respond more effectively when multiple nodes fail.
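To make the idea concrete, a check along these lines might aggregate host health and distinguish a routine single-host failure from a correlated multi-host one. This is a minimal, illustrative sketch; the names (`HostStatus`, `failure_breadth`, `alert_level`) and the threshold are assumptions, not our actual monitoring code.

```python
# Illustrative sketch: instead of paging once per failed host, aggregate
# host health and escalate when several hosts fail in the same window.
# All names and thresholds here are hypothetical.

from dataclasses import dataclass

@dataclass
class HostStatus:
    name: str
    healthy: bool

def failure_breadth(statuses):
    """Return how many hosts are currently unhealthy."""
    return sum(1 for s in statuses if not s.healthy)

def alert_level(statuses, multi_failure_threshold=3):
    """Classify an incident by how many hosts are down at once."""
    down = failure_breadth(statuses)
    if down == 0:
        return "ok"
    if down < multi_failure_threshold:
        return "single-host"   # routine: replace the host, no urgency
    return "multi-host"        # escalate: correlated failure likely
```

The point of the aggregation step is that a dashboard showing "multi-host" tells the on-call engineer immediately that this is not an ordinary single-node loss.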
We apologize for the inconvenience and hope that this postmortem reinforces our dedication to keeping your event data safe and available.