From 10:40am to 12:50pm PST yesterday, a percentage of queries processed by our Dallas region experienced high latencies and timeouts. Here's a graph that shows success vs. failure for queries during that time (and before/after):
200s are successful queries and 400s are failures.
Root Cause
The root cause of the instability was traced to an ongoing series of delete API calls made to a single collection. These delete calls caused memory pressure and increased latency within one of our topologies. That, in turn, caused queries from customers served by that topology to take unusually long or time out altogether.
Ultimately, we were able to isolate the offending delete operation and move it to its own topology. Once we did that, performance returned to normal levels for all other operations.
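For readers curious what "moving an operation to its own topology" can look like in principle, here is a minimal sketch. It is purely illustrative and is not our production routing code: the class `TopologyRouter`, the topology names, and the collection names are all hypothetical. The idea is simply that every workload goes to a shared topology by default, and a specific (collection, operation) pair can be pinned elsewhere so it no longer competes with other customers' queries.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyRouter:
    """Maps (collection, operation) pairs to a named topology.

    Hypothetical example: names and structure are assumptions made
    for illustration, not a description of a real system.
    """
    shared_topology: str = "shared"
    # Explicit overrides for workloads pinned to a dedicated topology.
    overrides: dict = field(default_factory=dict)

    def isolate(self, collection: str, operation: str, topology: str) -> None:
        """Pin one operation on one collection to a dedicated topology."""
        self.overrides[(collection, operation)] = topology

    def route(self, collection: str, operation: str) -> str:
        """Return the topology that should serve this request."""
        return self.overrides.get((collection, operation), self.shared_topology)


router = TopologyRouter()
# Move only the heavy delete workload; everything else stays shared.
router.isolate("collection-a", "delete", "isolated-deletes")
```

With this in place, `router.route("collection-a", "delete")` returns the dedicated topology, while queries against the same collection and deletes against other collections continue to use the shared one.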
Finding the Perfect Balance
As providers of a multi-tenant service, we understand how important it is that unforeseen activity by one customer does not jeopardize others. In fact, we're always evaluating trade-offs between isolation and performance. Isolation makes things more predictable, but maximizing performance means running each query with all the hardware we can throw at it.
What we really want is the best of both worlds, and every day that passes we get closer to it. Days like yesterday are bittersweet. We learned a lot, but it came at the expense of a service disruption to the very people we're doing this for. Not acceptable.
Going Forward
Here are the steps we're taking to improve our service in light of this incident:
Of course the end goal of these changes is that you, the customer, never have to worry about any of them! All we want you to do is enjoy a performant, highly available service.
But inevitably, creating technology and pushing boundaries means a few things will break here and there. When they do, know that we'll be transparent, responsive, and eager to improve.
Thanks
I apologize for yesterday's disruption, thank you for your patience, and look forward to updating you as the improvements we've planned go into effect.
-Josh