Last week we had a hard disk fail on one of our Cassandra nodes. This is a relatively common occurrence, and we have a standard operational runbook for it. We removed the node normally, replaced the disk, and prepared it to re-join the cluster. On the morning (PST) of Tuesday 11-15 we began adding the node back into the cluster, and fairly quickly noticed signs that it was not working correctly, including reports from several customers of inconsistent query results (thanks for letting us know!), which were corroborated by our automated tests.
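For readers unfamiliar with the mechanics, the steps below are a rough sketch of a typical Cassandra node replacement after a disk failure, not a verbatim copy of our runbook; the host ID, data paths, and service name are placeholders or defaults.

    # On any live node: note the dead node's Host ID, then remove it from the ring
    nodetool status
    nodetool removenode <host-id-of-dead-node>

    # On the repaired node: clear any stale state left on the old volumes, then
    # start Cassandra so the node bootstraps and streams its token ranges back
    # from the surviving replicas (default data directories shown)
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
    sudo service cassandra start
    nodetool netstats    # watch bootstrap streaming progress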
We immediately began to investigate why this node might have behaved differently from other nodes we had replaced with the same procedure in the past. It turned out that this node was a Cassandra seed, which requires special treatment; as luck would have it, we had never had to replace a seed node before. We were aware that seeds were special, but had developed a blind spot in our operational runbook and overlooked this detail.
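For context: a cluster's seeds are the initial gossip contact points listed in every node's cassandra.yaml, and a node that finds its own address in that list skips the normal bootstrap step when it joins the ring, rather than streaming data from the other replicas first. That is why seed nodes need a different replacement procedure. The excerpt below is a generic example of the relevant setting, with placeholder addresses rather than our actual configuration.

    # cassandra.yaml (placeholder addresses)
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.0.0.1,10.0.0.2,10.0.0.3"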
Once we understood the issue, we removed the node from the cluster again and followed the special instructions for replacing seed nodes, which include updating the cluster-wide seeds list (a somewhat involved process). Once that was complete, we re-added the node.
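The seeds-list update is the involved part: the setting is generally only picked up at startup, so changing it means editing cassandra.yaml on every node and rolling the change through the cluster before re-adding the repaired node. The sketch below shows the general shape of that procedure under a few assumptions (the cluster-hosts.txt inventory file and service name are placeholders, and the real runbook has more checks along the way).

    # On every node: update the seeds list in cassandra.yaml so it reflects the
    # new set of seed addresses, then apply the change with a rolling restart,
    # one node at a time
    for host in $(cat cluster-hosts.txt); do
        ssh "$host" 'sudo service cassandra restart && nodetool statusgossip'
    done

    # With the seeds list settled, the repaired node can join and bootstrap
    # normally; a repair of its primary ranges closes any remaining gaps
    nodetool repair -pr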
We have updated our runbooks to more clearly call out this special case so that we avoid similar incidents in the future.