Keen runs a background compaction process that optimizes the storage of event data. On Saturday 5/30, a Cassandra node in one of our data centers had a disk failure; disk failures are routine, and the redundancy we have in place protected us from any downtime or data loss. However, due to a recently introduced code change, the failure caused a portion of events to become effectively un-compactable. Over the course of 2+ days, the lack of compaction began to noticeably affect query performance and caused some queries to time out.
Once we identified the source of the problem, we reverted the code change, which allowed the compaction process to make progress on the affected data. Within a few hours the process caught up. A small amount of data remains stuck in an intermediate state (still queryable, but not compactable); we are investigating options for manually compacting that data, but the impact on query performance should be minimal going forward.
Beyond fixing the code in question, this incident revealed a gap in our monitoring: we didn't notice that the compaction process was struggling until it began affecting queries. We are looking into the best way to detect this kind of situation earlier in the future.
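As a rough illustration of the kind of check this implies (a hypothetical sketch, not our actual monitoring code): Cassandra's `nodetool compactionstats` reports a "pending tasks" count, and a growing backlog there would have flagged the stall before queries slowed down. The threshold below is an assumed placeholder that would need tuning per cluster.

```python
import re

# Hypothetical threshold: a healthy node's backlog is usually small,
# so a sustained count above this suggests compaction has stalled.
PENDING_TASKS_THRESHOLD = 100

def pending_compaction_tasks(compactionstats_output: str) -> int:
    """Parse the 'pending tasks' count from `nodetool compactionstats` output."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    if match is None:
        raise ValueError("could not find pending task count in output")
    return int(match.group(1))

def compaction_backlog_alert(compactionstats_output: str,
                             threshold: int = PENDING_TASKS_THRESHOLD) -> bool:
    """Return True if the compaction backlog exceeds the threshold."""
    return pending_compaction_tasks(compactionstats_output) > threshold

# Sample output in the shape `nodetool compactionstats` produces on a quiet node.
sample = "pending tasks: 3\nActive compaction remaining time :        n/a\n"
```

A periodic job running this check against each node (and alerting when it returns `True` for more than a few consecutive intervals) would catch an un-compactable backlog hours before it degrades query latency.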