Query Instability for Some Customers
Incident Report for Keen
Postmortem

From 10:40am-12:50pm PDT yesterday, a percentage of queries processed by our Dallas region experienced high latencies and timeouts. Here's a graph that shows success vs. failure for queries during that time (and before/after):

[Graph: Query API Error Rate]

200s are successful queries and 400s are failures.

Root Cause

The root cause of the instability was traced to an ongoing series of delete API calls made to a single collection. These delete calls caused memory pressure and increased latency within one of our topologies. That in turn caused queries from customers served by that topology to take unusually long or time out altogether.
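To make that failure mode concrete, here's a minimal sketch of the kind of recurring delete-by-filter traffic that can keep one collection's topology under sustained load. Everything here is a placeholder: the endpoint shape, project id, collection name, and key are illustrative, not our documented API or the customer's actual code.

```python
# Illustrative only: a batch job issuing a steady stream of delete-by-filter
# calls against a single collection. Endpoint, ids, and key are placeholders.
import json
import time

import requests

API_BASE = "https://api.example.com/3.0/projects/PROJECT_ID/events"  # placeholder
COLLECTION = "purchases"                                              # placeholder


def delete_events_older_than(api_key, cutoff_timestamp):
    """Issue one delete-by-filter call; matching events must be found before they can be removed."""
    filters = [{
        "property_name": "timestamp",
        "operator": "lt",
        "property_value": cutoff_timestamp,
    }]
    resp = requests.delete(
        "{0}/{1}".format(API_BASE, COLLECTION),
        params={"api_key": api_key, "filters": json.dumps(filters)},
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # Run on a tight schedule, calls like this keep the topology serving this
    # collection busy with scan-heavy work, creating the kind of sustained
    # pressure described above.
    for cutoff in range(1394000000, 1394100000, 3600):
        delete_events_older_than("KEY_PLACEHOLDER", cutoff)
        time.sleep(1)
```

Deletes like these have to locate every matching event before removing it, which is why they load a topology so differently from ordinary reads and writes.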

Ultimately we were able to isolate the offending delete operation and move it to its own topology. Once we did that, performance returned to normal levels for all other operations.
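Conceptually, the mitigation looked something like the routing sketch below. The names ("queries-default", "deletes-isolated", route_operation) are hypothetical stand-ins for our internal routing layer, not identifiers from our codebase.

```python
# A minimal sketch of the isolation step, assuming a routing layer that maps
# each operation to a topology. All names here are hypothetical.

DEFAULT_TOPOLOGY = "queries-default"
ISOLATED_TOPOLOGY = "deletes-isolated"

# (project_id, collection, operation) keys pinned away from the shared pool.
PINNED_ROUTES = {
    ("project-123", "purchases", "delete"): ISOLATED_TOPOLOGY,  # the offending workload
}


def route_operation(project_id, collection, operation):
    """Return the topology that should execute this operation."""
    return PINNED_ROUTES.get((project_id, collection, operation), DEFAULT_TOPOLOGY)


# The heavy deletes land on their own hardware; everyone else stays on the shared pool.
assert route_operation("project-123", "purchases", "delete") == ISOLATED_TOPOLOGY
assert route_operation("project-456", "pageviews", "count") == DEFAULT_TOPOLOGY
```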

Finding the Perfect Balance

As providers of a multi-tenant service, we understand how important it is that unforeseen activity by one customer does not jeopardize others. In fact, we're always evaluating trade-offs between isolation and performance. Isolation makes things more predictable, but maximizing performance means running each query with all the hardware we can throw at it.

What we really want is the best of both worlds, and every day that passes we get closer to it. Days like yesterday are bittersweet. We learned a lot, but it came at the expense of a service disruption to the very people we're doing this for. Not acceptable.

Going Forward

Here are the steps we're taking to improve our service in light of this incident:

  • We're beefing up our internal monitoring stack. This work is already in progress, and had it been in place yesterday, we could have isolated the bad operation much more quickly.
  • We're isolating deletes from other operations. Deletes have a different resource utilization pattern than other operations, and by putting them in their own topology we can better cater to that while avoiding unintended consequences for other workloads.
  • We're building tools that let us shuffle workloads around in real time in response to internal and external conditions. Today, moving a workload to a different topology requires deploying code, which means minutes can pass before a change takes effect. In the future, changes (and resolutions) will be near-instant; there's a rough sketch of the idea just after this list.
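As a rough illustration of that last item, here's a sketch of rule-driven routing that picks up changes at runtime. The file name, refresh interval, and class are hypothetical; a real implementation would read from a shared configuration store rather than a local JSON file. The point is that re-pinning a workload becomes a data change rather than a code deploy.

```python
# Sketch: routing rules live in an external store (a local JSON file here,
# purely for illustration) and are re-read on an interval, so routing changes
# take effect in seconds instead of waiting on a deploy.
import json
import threading
import time

RULES_PATH = "routing_rules.json"    # stand-in for a shared configuration store
DEFAULT_TOPOLOGY = "queries-default"


class DynamicRouter:
    def __init__(self, refresh_seconds=5.0):
        self._rules = {}
        self._lock = threading.Lock()
        self._refresh_seconds = refresh_seconds
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        while True:
            try:
                with open(RULES_PATH) as f:
                    # e.g. {"project-123|purchases|delete": "deletes-isolated"}
                    new_rules = json.load(f)
                with self._lock:
                    self._rules = new_rules
            except (OSError, ValueError):
                pass  # keep the last good rules if the store is briefly unreadable
            time.sleep(self._refresh_seconds)

    def route(self, project_id, collection, operation):
        key = "|".join([project_id, collection, operation])
        with self._lock:
            return self._rules.get(key, DEFAULT_TOPOLOGY)
```

With something like this in place, the mitigation we applied by hand yesterday becomes a one-line rule change that takes effect within seconds.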

Of course the end goal of these changes is that you, the customer, never have to worry about any of them! All we want you to do is enjoy a performant, highly available service.

But inevitably, creating technology and pushing boundaries means a few things will break here and there. When that happens, know that we'll be transparent, responsive, and eager to improve.

Thanks

I apologize for yesterday's disruption, thank you for your patience, and look forward to updating you as the improvements we've planned go into effect.

-Josh

Posted Mar 12, 2014 - 14:47 PDT

Resolved
Everything is back to normal. Queries are functioning normally across all data centers. Apologies for any inconvenience.
Posted Mar 11, 2014 - 15:07 PDT
Monitoring
We have isolated the problem and applied a temporary fix. Some users may still see increased latency for queries. We're monitoring while we apply a complete fix.
Posted Mar 11, 2014 - 13:03 PDT
Investigating
One of our data centers is experiencing instability in queries. We're investigating the issue -- data recording is unaffected.
Posted Mar 11, 2014 - 11:02 PDT