Inconsistent query responses

Incident Report for Keen

Postmortem

Last week a hard disk failed on one of our Cassandra nodes. This is a relatively common occurrence for which we have a standard operational runbook. We removed the node normally, replaced the disk, and prepared it to rejoin the cluster. On the morning (PST) of Tuesday, November 15, we began adding the node back into the cluster and fairly quickly noticed a number of signs that it was not working correctly. These included reports from several customers of inconsistent query results (thanks for letting us know!), corroborated by automated tests.

We immediately began investigating why this node might have behaved differently from other nodes on which we had used the same procedure in the past. It turned out that this node was a Cassandra seed, which requires special treatment; by chance, we had never had to replace one of our seed nodes before. We were aware that seeds were special, but we had developed a blind spot in our operational runbook and overlooked this detail.

Once we understood the issue, we removed the node from the cluster again and followed the special instructions for replacing seed nodes, which include updating the cluster-wide seeds list (a somewhat involved process). Once that was complete, we re-added the node.

We have updated our runbooks to call out this special case more clearly so that we can avoid similar incidents in the future.
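
As an illustration only (this is not our actual runbook tooling), a pre-flight check for the seed special case might look like the Python sketch below. It assumes the seeds list is configured in cassandra.yaml through the stock SimpleSeedProvider, where seeds appear as a single comma-separated string. The sample addresses and the function names parse_seeds and replacement_procedure are hypothetical.

```python
import re

# Hypothetical cassandra.yaml excerpt; the addresses are made up for the example.
SAMPLE_CONFIG = """
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"
"""


def parse_seeds(config_text):
    """Return the set of seed addresses found in cassandra.yaml text.

    Assumes a single '- seeds: "a,b,c"' line (quotes optional).
    """
    match = re.search(
        r'^\s*-\s*seeds:\s*"?([^"\n]*)"?\s*$', config_text, re.MULTILINE
    )
    if match is None:
        return set()
    return {addr.strip() for addr in match.group(1).split(",") if addr.strip()}


def replacement_procedure(node_address, config_text):
    """Say which procedure applies: 'seed' if the node is a seed, else 'standard'."""
    if node_address in parse_seeds(config_text):
        return "seed"
    return "standard"


if __name__ == "__main__":
    for address in ("10.0.0.1", "10.0.0.9"):
        print(address, "->", replacement_procedure(address, SAMPLE_CONFIG))
```

Running a check like this before re-adding a node would flag a seed up front, rather than leaving the distinction to the operator's memory.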

Posted Nov 16, 2016 - 10:33 PST

Resolved

Queries should be returning consistent results now. Again, no data was lost. Please let us know if you continue to see inconsistent results.

Posted Nov 15, 2016 - 18:29 PST

Update

We took steps to mitigate the query inconsistency; most queries should be returning consistent results now. We are still working on the underlying issue and may have some additional brief periods of inconsistency as we perform this maintenance. We will keep this incident open while we perform that work, just in case. We still do not have any reason to believe that any data has been lost. Thanks for your patience.

Posted Nov 15, 2016 - 14:12 PST

Identified

We have discovered an issue which is causing some queries to return inconsistent results. We believe we've identified the root cause and are taking steps to remedy the situation. We currently believe that no data has been lost, but we will follow up to confirm.

Posted Nov 15, 2016 - 11:01 PST

This incident affected: Compute API.