Events are delayed
Incident Report for Keen
Postmortem

Last Monday, the Keen.io API had significant performance problems. There was no data loss or corruption as a result of this event. At 6:40 am PDT on Monday morning, our Cassandra cluster started having read performance problems. We saw increased latency for processing queries as well as a significant delay in writing events to our storage. Typically, when a host is having read performance issues, traffic is automatically balanced so that the request for data is handled by the fastest node. In this case, several nodes degraded at the same time, which caused reads from our database to take longer than is healthy. Going into this incident, we already had additional pressure on our storage layer from removing an underperforming host. When that host left the cluster, its load shifted to the remaining nodes, adding to the performance issues.
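For illustration, here is a rough client-side analogue of that balancing behavior, sketched with the DataStax Python driver's latency-aware load-balancing policy. The balancing described above happens inside Cassandra itself (the coordinator prefers the best-performing replica), so this is not our configuration; the contact point and thresholds are placeholders.

```python
"""Client-side sketch: route reads away from slow nodes.

Illustrative only; the contact point and thresholds are placeholders.
"""
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import (
    DCAwareRoundRobinPolicy,
    LatencyAwarePolicy,
    TokenAwarePolicy,
)

profile = ExecutionProfile(
    load_balancing_policy=LatencyAwarePolicy(
        TokenAwarePolicy(DCAwareRoundRobinPolicy()),
        exclusion_threshold=2.0,  # skip hosts more than 2x slower than the best host
        retry_period=30,          # give an excluded host another chance after 30 s
    )
)

cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
```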

Based on our data and observations, we believe the slowdown was caused by our database layer needing to read from a much larger number of files than we typically see in order to answer queries. As the number of files read went up, the pressure on our storage layer increased. When the node left the cluster, other hosts that had been handling their load fine took on more load and began taking longer to service reads. GC times and CPU utilization both increased. By 9:27 am we had redistributed traffic and made changes to our cluster that allowed reads to complete much more quickly than they had earlier in the morning. We continued to look at the database performance issues and started making changes to reduce the number of files that needed to be read. Cassandra has an internal process called compaction that is responsible for consolidating files and improving read performance. Unfortunately, this process had fallen behind on a number of our hosts. Thankfully, the changes we made improved the performance of this process.
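As an illustration (not our production tooling), a minimal sketch of watching compaction backlog per node might look like the following. It assumes nodetool is installed on the machine running the script and that each node's JMX port is reachable, so `nodetool -h <host> compactionstats` works; the host list and alert threshold are placeholders.

```python
"""Poll pending compaction tasks across a Cassandra cluster (sketch)."""
import re
import subprocess

HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # placeholder node addresses
PENDING_THRESHOLD = 100                        # placeholder backlog threshold

def pending_compactions(host: str) -> int:
    """Return the 'pending tasks' count reported by nodetool compactionstats."""
    out = subprocess.run(
        ["nodetool", "-h", host, "compactionstats"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"pending tasks:\s*(\d+)", out)
    return int(match.group(1)) if match else 0

if __name__ == "__main__":
    for host in HOSTS:
        pending = pending_compactions(host)
        flag = "  <-- falling behind" if pending > PENDING_THRESHOLD else ""
        print(f"{host}: {pending} pending compactions{flag}")
```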

At this point, we had changes in place that would take a few hours to complete and would improve read response times. The team decided to focus on making sure that data was being written to our database so it would be available for querying. We looked at Cassandra write performance but didn’t see any issues there. We added more instrumentation to the “write path” (the code responsible for committing events to our database) and confirmed that writes were performing normally; however, the write path also performs some reads to make sure we aren’t inserting duplicates, among other things. Based on this data, we increased the number of workers writing data. We also came up with a strategy to minimize reads, keep consistency, and improve throughput on our write path. Throughout the afternoon we increased parallelism and improved our configuration to allow for faster throughput. This didn’t make a big impact, so at 7:31 pm we deployed a significant change to our write path to improve throughput. We immediately saw an improvement in our ability to write events. By 9:20 pm PDT we had cleared our entire event backlog, and all operations were back to normal.
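To make the idea of parallel, read-light writes concrete, here is a hypothetical sketch using the DataStax Python driver; it is not the change we actually shipped. The contact point, keyspace, table, and the deterministic-key approach to deduplication are assumptions: because a Cassandra INSERT with the same primary key is an overwrite, deriving the row key from the event’s content makes retries idempotent without a read-before-write check, and the inserts can then be issued asynchronously in parallel.

```python
"""Hypothetical write-path sketch: async, idempotent event inserts.

Not the actual write path; keyspace 'events', table 'raw_events', and
the contact point are placeholders for illustration.
"""
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])        # placeholder contact point
session = cluster.connect("events")    # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO raw_events (event_id, collection, payload) VALUES (?, ?, ?)"
)

def write_batch(events):
    """Issue one asynchronous insert per event, then wait for all of them."""
    futures = []
    for ev in events:
        # Derive the row key from the event contents so a retried write hits
        # the same primary key: in Cassandra that is a harmless overwrite,
        # which removes the need for a read-before-write duplicate check.
        event_id = uuid.uuid5(uuid.NAMESPACE_URL, ev["collection"] + ev["payload"])
        futures.append(
            session.execute_async(insert, (event_id, ev["collection"], ev["payload"]))
        )
    for future in futures:
        future.result()  # block until the write completes; raises on error
```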

There are several actions we have taken, and intend to take, to protect against a recurrence.

To prevent this class of read performance issue, we have added instrumentation to more reads in our system so that we can isolate read performance problems well before they become visible to customers. If even a single host is showing slow reads, we will be able to take immediate action to address it.
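As a sketch of what per-host read instrumentation of this kind can look like (illustrative, not our monitoring code; the window size, threshold, and read_from_host callable are placeholders):

```python
"""Illustrative per-host read-latency tracker (not production monitoring)."""
import time
from collections import defaultdict, deque

WINDOW = 200      # recent samples kept per host (placeholder)
SLOW_MS = 50.0    # alert threshold in milliseconds (placeholder)

latencies = defaultdict(lambda: deque(maxlen=WINDOW))

def timed_read(host, read_from_host, *args, **kwargs):
    """Run one read against a host, record its latency, and flag slow hosts."""
    start = time.monotonic()
    result = read_from_host(*args, **kwargs)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    latencies[host].append(elapsed_ms)

    samples = sorted(latencies[host])
    p99 = samples[int(0.99 * (len(samples) - 1))]
    if p99 > SLOW_MS:
        # In a real deployment this would feed an alerting system.
        print(f"WARN {host}: p99 read latency {p99:.1f} ms exceeds {SLOW_MS} ms")
    return result
```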

We are upgrading our Cassandra version to move past a number of internal performance bugs and reporting issues that made it more difficult to observe what was happening in our cluster.

The entire engineering team is currently doing a full sprint of work, with follow-on projects, to address any open risk in our infrastructure. The goal of this work is to clean up any significant loose ends in our code and infrastructure that could impact our customers. Out of this work we will have some longer-running projects around storage that will further improve the stability and predictability of our response times.

Moving forward, we are allocating a significant amount of engineering time to engineering health and excellence. This body of work will be driven by the engineering team and will focus on making sure our service performs well and predictably for our customers.

Lastly, we want to apologize. We understand that delays in processing data, as well as slowdowns in query processing, have a significant impact on our customers. There have been a few incidents over the last month that have made our platform challenging for you. We will do everything we can to learn from this issue and drive improvement in how we serve you, our customers.

Sincerely,
Brad Henrickson
VP of Engineering, Keen IO

Posted Jun 27, 2016 - 11:57 PDT

Resolved
All services are up and running normally. Thanks for your patience as we have worked through the issues. The team is going to put together a postmortem and make sure it gets to our status page.
Posted Jun 21, 2016 - 08:04 PDT
Update
We were able to work through the incident overnight and recover response times. Our service is currently stable with all reads and writes being processed. We are continuing to monitor as we finish wrapping up some minor internal issues. Thanks!
Posted Jun 21, 2016 - 07:41 PDT
Monitoring
We have cleared the event write backlog and we are now serving all queries. We are monitoring the stability of the service.
Posted Jun 20, 2016 - 23:10 PDT
Identified
We have identified some write performance issues and have applied some fixes to significantly improve our ability to write events. We expect our backlog of events to be cleared within 2.5 hours. We are also continuing to work to improve our query response times.
Posted Jun 20, 2016 - 20:35 PDT
Update
We are still having query performance issues, and writes are currently delayed. The team is investigating its options to improve service response times. Thank you for your patience.
Posted Jun 20, 2016 - 17:23 PDT
Update
We are continuing both to investigate the root cause of today's disturbances and to take action to resolve the issue as it impacts customers. No data has been lost, but many events written today will not be included in queries until we resolve the incident. Query speeds are still far from ideal, but queries should not be failing.
Posted Jun 20, 2016 - 14:53 PDT
Update
We have significantly lowered query durations and have shifted our focus to our event backlog. We will update you as we make additional progress. Thank you.
Posted Jun 20, 2016 - 13:08 PDT
Update
We are working on improving the performance of our storage layer. Events written within the last few hours will continue to be mostly unavailable for querying but have not been lost. Again, thank you for your patience.
Posted Jun 20, 2016 - 11:30 PDT
Update
We are still investigating the query problems. We have re-enabled the persistence of events to our database, but there is a significant backlog being processed. The status page will likely show extremely high write delays as a consequence.
Posted Jun 20, 2016 - 09:48 PDT
Update
We have paused event writes to our database to temporarily reduce load on our systems while we continue investigating. No data is being lost. Events written during this incident will become visible to queries once we resolve our performance concerns. Thank you for your patience.
Posted Jun 20, 2016 - 08:41 PDT
Investigating
We're investigating an issue where many queries are slow and a few are failing. Writes appear to be somewhat delayed, but no data has been lost. Queries on very recent data may be inaccurate during this incident.
Posted Jun 20, 2016 - 07:47 PDT
This incident affected: Stream API, Compute API, and S3 Streaming.