Starting around 3:35 PDT on April 18th, we woke up to find that events written to Keen were being held in our Kafka queue far longer than normal, indicating that our database (Cassandra) was not accepting writes quickly enough and a backlog was forming. (Note: no events were lost.)
Upon inspection, we discovered our Cassandra nodes were exhibiting exceptionally high write latency. This lasted for about 20 minutes, during which very few events were persisted to Cassandra. As suddenly as it started, the write latency ceased, and our stack began working its way through the backlog. It eventually caught up.
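The backlog we were watching is essentially Kafka consumer lag: the gap between the newest offset in each partition and the offset our writers have committed. A minimal sketch of that calculation (partition names and offsets below are illustrative, not from the incident):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: newest offset minus last committed offset.

    A healthy pipeline keeps this small; during an incident like this one
    it grows steadily because writes are not being drained fast enough.
    """
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

# Hypothetical snapshot of two partitions of an events topic.
log_end = {"events-0": 10_500, "events-1": 9_800}
committed = {"events-0": 10_450, "events-1": 7_200}

lag = consumer_lag(log_end, committed)
print(lag)  # {'events-0': 50, 'events-1': 2600}
```

Partition `events-1` here is the kind of signal that started the day: a lag that keeps climbing instead of hovering near zero.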
Then it happened again. Our Cassandra nodes were again showing excessive write latency, for another 20 minutes or so. Then it stopped.
Over the course of the day these spikes came closer and closer together, eventually reaching the point where the backlog was no longer completely cleared between spikes. Additionally, the sudden burst of events processed each time a spike died down placed extra load on these nodes, somewhat impacting the query path as well.
Up until things became dire, we hoped to find the root cause during the incident, fix it, and have life return to normal. However, the incident had gone on all morning and was creeping into the afternoon. We decided to take more drastic actions to release pressure from our Cassandra hosts and give everything some breathing room -- ideally we would be able to figure out the root cause in retrospect.
We took a number of actions in a short period of time. First, we disabled a maintenance process called "compactions" that continuously runs in the background to improve the speed at which future queries can read data. (For the technically savvy, these are not Cassandra compactions, although we do those as well.) Disabling this process decreased the amount of work our Cassandra nodes had to do. Keeping it disabled for longer periods of time degrades query performance, which is why we did not disable it immediately.
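Our compaction process is internal, but the shape of this mitigation -- a background maintenance task with a pause switch, so load can be shed immediately without tearing anything down -- can be sketched as follows (all names are hypothetical):

```python
class MaintenanceWorker:
    """Hypothetical background maintenance task with a pause switch,
    similar in spirit to temporarily disabling our compaction process."""

    def __init__(self):
        self.enabled = True
        self.passes_completed = 0

    def pause(self):
        self.enabled = False

    def resume(self):
        self.enabled = True

    def tick(self):
        """Called periodically by a scheduler. Does one maintenance pass
        only while enabled, so pausing sheds the load immediately."""
        if self.enabled:
            self.passes_completed += 1

worker = MaintenanceWorker()
worker.tick(); worker.tick()   # two passes while enabled
worker.pause()
worker.tick()                  # no work while paused
worker.resume()
worker.tick()                  # resumes where it left off
print(worker.passes_completed)  # 3
```

The important property is that pausing skips work rather than destroying state, which is why the process could be turned back on later without losing progress.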
Secondly, we decreased the speed at which we were attempting to write events to Cassandra. We accomplished this by reducing by 25% the number of executors in our Storm cluster that handle writing events. We are not certain how much of an impact this change made on its own.
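For context, Storm can change a component's parallelism on a live topology via its `rebalance` command. A sketch of the calculation and the kind of command involved (the topology and component names here are hypothetical, and the executor count is illustrative):

```python
def reduced_executors(current, fraction=0.25):
    """Cut an executor count by the given fraction, keeping at least one."""
    return max(1, current - int(current * fraction))

# Hypothetical: a writer component running 80 executors, reduced by 25%.
new_count = reduced_executors(80)

# Storm's CLI supports changing a component's parallelism in place, e.g.:
command = f"storm rebalance events-topology -e cassandra-writer={new_count}"

print(new_count)  # 60
print(command)
```

Fewer executors means fewer concurrent writes hitting Cassandra, at the cost of draining the Kafka backlog more slowly.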
Thirdly, we temporarily lowered our Cassandra read repair chance from 100% to 10% to reduce additional load on our cluster. Performing a read repair on 100% of reads adds a very small delay to our queries and a bit more work per read for our Cassandra nodes, which we typically consider an acceptable price for catching anything that goes amiss.
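Read repair chance is a per-table Cassandra setting, so the effect of lowering it is that only a fraction of reads trigger the extra replica-comparison work. A rough sketch of the expected savings (the request rate and table name are illustrative):

```python
def expected_repair_reads(reads_per_second, read_repair_chance):
    """Expected reads per second that incur extra read-repair work,
    given the table's read_repair_chance (0.0 to 1.0)."""
    return reads_per_second * read_repair_chance

# Illustrative load: 20,000 reads/sec against the events table.
before = expected_repair_reads(20_000, 1.0)  # every read does repair work
after = expected_repair_reads(20_000, 0.1)   # only ~10% of reads do

# The setting itself is changed with CQL along these lines
# (keyspace/table name hypothetical):
#   ALTER TABLE keen.events WITH read_repair_chance = 0.1;

print(before, after)  # 20000.0 2000.0
```

A 10x reduction in per-read repair work, in exchange for stale replicas being corrected somewhat less eagerly -- a reasonable trade during an incident.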
Finally, we halted some other repair work being done in the background on our Cassandra nodes. This work is very intensive for our nodes, but once completed it will address performance, consistency, and some operational concerns. Halting these processes paused our progress toward those goals for a time. We are putting them back into place and expect this to be completed within the next two weeks or so.
After our final change, Cassandra traffic was reduced enough that whatever underlying issue was causing everyone grief subsided.
We have some strong candidates for preventing this problem in the future. One is introducing an exponential backoff pattern for our writes to Cassandra. That way, even if something strange is happening with our database’s performance, our events should get recorded without contributing to the problem. The other is a substantial reworking of our write path in Storm: we are upgrading an important library to a newer version with many advantages over the one we use today. We have high hopes that these changes will let us avoid similar incidents.
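The backoff idea can be sketched as follows: each failed or slow write waits exponentially longer before retrying, with a cap so waits stay bounded and optional jitter so writers don't all retry in lockstep (the parameters below are illustrative, not our production values):

```python
import random

def backoff_delays(base=0.1, cap=30.0, factor=2.0, attempts=8, jitter=0.0):
    """Yield the wait in seconds before each retry: base * factor**n,
    capped so a long incident doesn't produce unbounded sleeps.
    `jitter` adds up to that fraction of random extra delay, which
    spreads retries out instead of hammering the database in waves.
    """
    for attempt in range(attempts):
        delay = min(cap, base * (factor ** attempt))
        yield delay + random.uniform(0, jitter * delay)

# With jitter disabled the schedule is deterministic:
print(list(backoff_delays(base=1, cap=10, attempts=5)))
# [1.0, 2.0, 4.0, 8.0, 10.0]
```

Under this scheme a struggling database sees retry pressure back off instead of pile up, which is exactly the feedback loop we were fighting during the incident.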