Reduced Performance on Event Writes and Queries
Incident Report for Keen
Postmortem

Starting around 3:35 PDT on April 18th, we woke up to events being written to Keen but sitting in our Kafka queue far longer than normal, indicating that our database (Cassandra) was not accepting writes quickly enough and a backlog was building. (Note: No events were lost.)

Upon inspection, we discovered our Cassandra nodes were experiencing exceptionally high write latency. This lasted for about 20 minutes, during which very few events were being persisted to Cassandra. As suddenly as it had started, the high write latency subsided and our stack began working through the backlog, eventually catching up.

Then it happened again: our Cassandra nodes showed excessive write latency for another 20 minutes or so. Then it stopped.

Over the course of the day these spikes came closer and closer together, eventually reaching the point where the backlog was no longer being completely cleared between spikes. The burst of events processed each time a spike subsided also put extra load on these nodes, somewhat impacting the query path as well.

Up until things became dire, we hoped to find the root cause during the incident, fix it, and have life return to normal. However, the incident had gone on all morning and was creeping into the afternoon. We decided to take some more drastic actions to release pressure on our Cassandra hosts and give everything some breathing room; ideally we would be able to figure out the root cause in retrospect.

We took a number of actions in a short period of time. First, we disabled a maintenance process called "compactions" that continuously runs in the background to improve the speed at which future queries can read data. (For the technically savvy: these are not Cassandra compactions, although we run those as well.) Disabling this process reduced the amount of work our Cassandra nodes had to do. Keeping it disabled for long periods of time degrades query performance, which is why we did not disable it immediately.

Second, we decreased the speed at which we were attempting to write events to Cassandra. We accomplished this by cutting the number of executors in the Storm cluster that handles event writes by 25%. We are not certain how much of an impact this change made by itself.
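
As an illustration only (the topology and component names below are hypothetical, not Keen's), here is a minimal sketch of how the write-path parallelism of a running Storm topology can be dialed down with the "storm rebalance" CLI instead of a redeploy:

import java.io.IOException;

public class ReduceWriterParallelism {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical names: "event-writer-topology" and "cassandra-writer-bolt"
        // stand in for the real (internal) topology and bolt. Assumes the "storm"
        // CLI is on the PATH and configured to talk to the cluster's Nimbus node.
        ProcessBuilder rebalance = new ProcessBuilder(
                "storm", "rebalance", "event-writer-topology",
                "-w", "60",                        // give in-flight tuples 60s to drain
                "-e", "cassandra-writer-bolt=24"); // e.g. 32 -> 24 executors (~25% fewer)
        rebalance.inheritIO();                     // show the CLI's own output
        int exit = rebalance.start().waitFor();
        System.out.println("storm rebalance exited with code " + exit);
    }
}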

Third, we temporarily lowered our Cassandra read repair chance from 100% to 10% to reduce extra load on the cluster. Running a read repair on every read adds a very small delay to queries and a bit more work per read for our Cassandra nodes, which we normally consider an acceptable price to catch anything that goes amiss.
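
For the curious, this setting is a per-table property in Cassandra. The sketch below, using the DataStax Java driver with a placeholder contact point, keyspace, and table rather than our real ones, shows roughly what lowering it looks like:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class LowerReadRepairChance {
    public static void main(String[] args) {
        // "127.0.0.1", "events", and "event_data" are placeholders, not our real
        // contact point, keyspace, or table names.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Repair in the read path on roughly 10% of reads instead of every read.
            session.execute("ALTER TABLE events.event_data WITH read_repair_chance = 0.1");
        }
    }
}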

Finally, we halted some other repair work being done in the background on our Cassandra nodes. This work is very intensive for the nodes but will address performance, consistency, and some operational concerns once it is completed. Halting it paused our progress toward those goals for a time. We are in the process of putting it back into place and expect this to be completed in the next two or so weeks.

After we made the final change, Cassandra traffic was reduced enough that whatever underlying issue was causing everyone grief was resolved.

We have some strong candidates for how to prevent this problem in the future. One is introducing an exponential backoff pattern for our writes to Cassandra. That way, even if something strange is happening with our database's performance, our events should still get recorded without contributing to the problem. The other is a substantial reworking of our write path in Storm: we are upgrading an important library to a newer version that has many advantages over the version we use today. We have high hopes that these changes will help us avoid similar incidents.
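
To give a sense of the first idea, here is a minimal sketch of exponential backoff with jitter around a write. It illustrates the pattern rather than our actual implementation; the attempt limit and delays are placeholder values.

import java.util.concurrent.ThreadLocalRandom;

public final class BackoffWriter {
    private static final int  MAX_ATTEMPTS  = 6;      // placeholder retry limit
    private static final long BASE_DELAY_MS = 100;    // first retry delay
    private static final long MAX_DELAY_MS  = 10_000; // cap on any single delay

    /** Runs the given write, retrying with exponentially growing, jittered delays. */
    public static void writeWithBackoff(Runnable write) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                write.run();
                return;                              // success, stop retrying
            } catch (RuntimeException e) {           // e.g. a write timeout from the driver
                if (attempt == MAX_ATTEMPTS) {
                    throw e;                         // give up; the event still waits in Kafka
                }
                // Double the delay each attempt (100ms, 200ms, 400ms, ...) up to the cap,
                // and add jitter so retries from many workers don't line up.
                long ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS << (attempt - 1));
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling / 2, ceiling + 1));
            }
        }
    }
}

Because incoming events continue to queue safely in Kafka, backing off like this only delays persistence during a rough patch instead of piling more write pressure onto Cassandra.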

Posted Apr 26, 2016 - 15:37 PDT

Resolved
Things have been looking stable for the last couple of hours. We are continuing to investigate the root cause on the engineering side. We will post an update once we have more information.
Posted Apr 18, 2016 - 15:31 PDT
Monitoring
We have made several changes to address the slower event writing, and have seen events processed at healthy rates again. We are taking a closer look at some of our infrastructure to establish a root cause and are monitoring the situation overall until we discover one.
Posted Apr 18, 2016 - 14:23 PDT
Investigating
We are still investigating these performance issues. Some customers may experience degraded performance more than others.
Posted Apr 18, 2016 - 11:27 PDT
Monitoring
Events are being written at normal rates now. We have not finished our investigation yet, however.
Posted Apr 18, 2016 - 10:43 PDT
Update
Query speeds have been normalized. We are looking into the causes of the slowdown in recording events. Thank you for your patience.
Posted Apr 18, 2016 - 09:53 PDT
Investigating
We have encountered degraded performance with both queries and the speed at which events show up in Keen. We are looking into these issues.
Posted Apr 18, 2016 - 09:35 PDT
This incident affected: Stream API and Compute API.