Query durations are up
Incident Report for Keen
Postmortem

At 3:34PM PST our query durations began to spike. Multiple issues seemed to coincide:
* A number of our application servers began to run out of memory
* A large number of queries began to be queued, raising the average duration

We initially reacted to the memory problems, as we'd seen this pattern recently and mistakenly associated the memory failures with the rising query durations. At this point all of our on-call staff and many other team members were involved in the response.

After watching query durations improve slightly, we realized this wasn't the whole issue and began investigating the remaining problems. We initially suspected a new query dispatcher and spent some time rolling back to an older mechanism, but that turned out not to be the cause.

Finally, at 6:00PM PST, we isolated a query pattern that was causing an unhealthy number of queue backups and identified the customer responsible. After manually flushing the affected query queues, we returned to normal operation at 6:50PM PST.

We will be deploying more capacity tomorrow so that we can better absorb this type of query pattern in the future. We will also begin a project to improve the responsiveness of our rate limiting.
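
As a rough illustration of the rate-limiting direction, a per-customer token bucket can shed excess queries before they pile up in a shared queue. This is only a minimal sketch under assumed parameters; the names, numbers, and structure below are illustrative and are not Keen's actual implementation.

    import time

    class TokenBucket:
        """Per-customer token bucket: allows `rate` queries/sec with bursts up to `burst`."""
        def __init__(self, rate, burst):
            self.rate = rate          # tokens refilled per second
            self.capacity = burst     # maximum number of stored tokens
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Refill tokens in proportion to the time elapsed since the last check.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True   # admit the query
            return False      # reject (or deprioritize) rather than back up the shared queue

    # Hypothetical usage: one bucket per customer, checked before dispatching a query.
    buckets = {}
    def admit_query(customer_id, rate=10, burst=20):
        bucket = buckets.setdefault(customer_id, TokenBucket(rate, burst))
        return bucket.allow()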

We're very sorry for the duration and depth of this issue. We do not take the recent number of incidents lightly. We will be posting additional postmortem information, as well as periodic explanations of the work we are deploying to mitigate future issues and to continue to earn your trust.

Posted Feb 16, 2015 - 18:58 PST

Resolved
Query durations have returned to normal.
Posted Feb 16, 2015 - 18:50 PST
Monitoring
We are seeing a positive change in query durations and are continuing to monitor.
Posted Feb 16, 2015 - 18:42 PST
Update
We have taken some steps to protect our query backend from an unusual query pattern, rolled back to an older query-dispatch mechanism, and are now monitoring the effect of these changes. We are aggressively timing out queries until we can verify the new query durations. This is still only affecting customers in the Midwest and on the East Coast.
Posted Feb 16, 2015 - 18:24 PST
Update
We are now working to add capacity to our query analysis path in an attempt to compensate for the query duration slowdown.
Posted Feb 16, 2015 - 17:53 PST
Update
We are continuing to investigate the cause of the slowdown. We've ruled out some parts of our stack and are investigating timeout behavior for long-running queries. Our primary and secondary on-calls, as well as a few other team members, are all actively investigating.
Posted Feb 16, 2015 - 17:23 PST
Investigating
We've been experiencing high query durations since approximately 3:43PM PST. The impact appears limited to Midwest and East Coast customers at present.
Posted Feb 16, 2015 - 16:51 PST