Keen runs a background compaction process that optimizes the storage of event data. On Saturday 5/30, a Cassandra node in one of our data centers had a disk failure; disk failures are routine, and the redundancy we have in place protected us from any downtime or data loss. However, due to a recently introduced code change, the failure caused a portion of events to become effectively un-compactable. Over the course of 2+ days, the lack of compaction began to noticeably affect query performance and caused some queries to time out.
Once we identified the source of the problem, we reverted the code change, which allowed the compaction process to make progress on the affected data. Within a few hours the process caught up. A small amount of data remains stuck in an intermediate state (still queryable, but not compactable); we are investigating options for manually compacting that data, but the impact on query performance should be minimal going forward.
Beyond fixing the code in question, this incident revealed a gap in our monitoring: we didn't notice that the compaction process was struggling until it began affecting queries. We are looking into the best way to detect this kind of situation earlier in the future.
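As a rough illustration of the kind of check this implies (a hypothetical sketch, not our actual monitoring code): Cassandra's `nodetool compactionstats` reports a "pending tasks" count, and a growing backlog there would have flagged the stall before queries slowed down. The threshold below is an assumed placeholder that would need tuning per cluster.

```python
import re

# Hypothetical threshold: a healthy node's backlog is usually small,
# so a sustained count above this suggests compaction has stalled.
PENDING_TASKS_THRESHOLD = 100

def pending_compaction_tasks(compactionstats_output: str) -> int:
    """Parse the 'pending tasks' count from `nodetool compactionstats` output."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    if match is None:
        raise ValueError("could not find pending task count in output")
    return int(match.group(1))

def compaction_backlog_alert(compactionstats_output: str,
                             threshold: int = PENDING_TASKS_THRESHOLD) -> bool:
    """Return True if the compaction backlog exceeds the threshold."""
    return pending_compaction_tasks(compactionstats_output) > threshold

# Sample output in the shape `nodetool compactionstats` produces on a quiet node.
sample = "pending tasks: 3\nActive compaction remaining time :        n/a\n"
```

A periodic job running this check against each node (and alerting when it returns `True` for more than a few consecutive intervals) would catch an un-compactable backlog hours before it degrades query latency.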