On Jan 20th at 4:18pm PST keen.io's API started to show serious response time degradation. The issue lasted on and off (mostly on) from 4:18pm until 12:16am. The whole Keen team, and I in particular, take these sorts of issues very seriously. We understand that you count on keen.io for storage and analysis of your data, and not having access to the data you entrust us with has a serious business impact for you. I apologize for the impact to your business and want to be transparent about what happened and what we plan to do to help mitigate the problem in the future.
At 4:18pm we began seeing increased response times on the API. Our on-call team was immediately notified and we started investigating. Shortly afterwards we saw our query infrastructure backing up on requests within our Dallas data center. Around 4:35pm PST response times recovered even though we hadn't found the cause of the issue. We continued to investigate our query infrastructure and noticed that, starting around 4:30pm PST, we had received a large number of queries as well as a large number of delete operations against our API.
Our API had easily handled much larger query volumes than we were receiving, and we noticed a corresponding increase in Cassandra read latency that correlated with the increase in delete volume. We chose to focus our investigation on why Cassandra was slowing down so much. In parallel, we deployed more query capacity to make doubly sure we had enough workers to handle the query volume.
At 5:15pm PST query times suddenly rose again, this time to a higher level than before. We continued to believe it was a Cassandra issue. We also saw a large number of errors in our query dispatcher. More people got involved at this point, and we worked to resolve the query dispatcher errors as well as reached out to specific customers to understand why their queries to keen.io had changed considerably around the start of this outage. Response times recovered somewhat thanks to numerous improvements we made, but the service wasn't fully healthy.
At 6:15pm PST we saw another spike in query times, and at this point we became convinced that the spike in inbound queries was the trigger for the degraded query performance, even though we continued to see smaller performance issues in Cassandra.
At 7pm PST we limited how long we would let queries execute in order to reduce the number of queued but stale requests in our system. We continued to see a large number of inbound requests and attempted to scale our service by deploying more query infrastructure and tuning our configs to service queries more efficiently.
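To illustrate the kind of guard this describes, here is a minimal sketch of a query worker that sheds stale queued requests and caps execution time. The names, limits, and structure below are hypothetical, not our actual implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Hypothetical limits; not our actual configuration values.
MAX_QUEUE_AGE_SECONDS = 60      # drop requests that sat in the queue too long
MAX_EXECUTION_SECONDS = 300     # cap how long a single query may run

executor = ThreadPoolExecutor(max_workers=8)

def handle_query(request, run_query):
    """Skip stale requests and bound query execution time.

    `request` is assumed to carry an `enqueued_at` timestamp and a
    `params` payload; `run_query` is whatever function actually
    executes the query against the datastore.
    """
    # Shed requests that have been waiting longer than the client
    # is plausibly still listening for.
    if time.time() - request["enqueued_at"] > MAX_QUEUE_AGE_SECONDS:
        return {"error": "request expired in queue"}

    future = executor.submit(run_query, request["params"])
    try:
        # Give up on queries that exceed the execution cap so the
        # worker can move on to fresh requests.
        return future.result(timeout=MAX_EXECUTION_SECONDS)
    except TimeoutError:
        future.cancel()  # best effort; an already-running query may not stop
        return {"error": "query exceeded maximum execution time"}
```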
At 8pm PST we continued shipping optimization changes and reached out again to our customers to see how we could reduce load on the query side while working in parallel to support the increased load. Numerous code changes were shipped to continue improving our API configuration. Heavily backed-up query queues were flushed to allow new queries to be serviced.
At 9pm PST Cassandra was fully ruled out as the cause of the outage. We continued working to stabilize the platform and saw periods of recovered response times.
At 10pm PST the team decided to start rate limiting very high-volume queries from a single customer in order to improve service for the rest of our customers. Response times recovered considerably, and the team continued working to get them fully corrected.
At 12:18am PST response times for all customers had fully recovered due to the rate limiting, and full service availability was restored.
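For readers curious what per-customer rate limiting can look like, here is a minimal token-bucket sketch. The class, limits, and keying below are illustrative assumptions, not the mechanism we actually deployed.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter, keyed per customer.

    `rate` is how many queries per second a customer may sustain;
    `capacity` is the allowed burst size. Values are illustrative.
    """
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per customer, so a very high-volume customer gets throttled
# without affecting anyone else's allowance.
buckets = {}

def admit_query(customer_id, rate=10, capacity=50):
    bucket = buckets.setdefault(customer_id, TokenBucket(rate, capacity))
    return bucket.allow()
```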
Once again I want to apologize for the serious impact this outage had on each of our customers. We take these things very seriously, and you trust us to do our best to provide a top-notch service to you all. Shortly after we restored service, the team got together and held a full post-mortem to analyze what happened, how we responded to the issue, and what actions we are going to take moving forward. I want to share what we are doing to improve our responsiveness in the future.
We are overhauling how we make decisions that impact a wide range of customers. When an issue like this comes up, we now have clearer roles so that we can decide what action to take sooner. Our API deployment times are longer than we would like, and we are scheduling time to decrease them so we can be more responsive. We have already added finer-grained metrics for organization-level query volumes and made them more apparent, so that when we see a rise in query volume we can quickly identify where it is coming from. The response time graph on our status page doesn't always appear to be accurate; we are going to fix it so it reflects what response times actually look like.
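As a rough sketch of organization-level query-volume tracking (the names and window below are hypothetical, not our production metrics pipeline), counting inbound queries per organization over a sliding window makes it easy to spot where a sudden spike is coming from:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # look at the last five minutes; illustrative value

# org_id -> deque of query arrival timestamps within the window
query_log = defaultdict(deque)

def record_query(org_id):
    """Record one inbound query for an organization."""
    now = time.monotonic()
    timestamps = query_log[org_id]
    timestamps.append(now)
    # Drop timestamps that have aged out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()

def top_organizations(n=5):
    """Return the n organizations sending the most queries right now."""
    counts = {org: len(ts) for org, ts in query_log.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```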
We did a lot of research into the underlying cause on our side as to why this spike in queries led to unresponsive queries. Our frontend API servers had ulimits that were set too low, which caused customers to be throttled at the very top of our request handlers. We have increased these limits, which should prevent this specific issue from happening again.
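For context on that fix, a process can inspect and raise its own open-file limit, one of the resources governed by ulimits, using Python's standard library. The target value below is illustrative; the right number depends on the expected concurrent connections per frontend API server.

```python
import resource

# Hypothetical target; not our actual production setting.
TARGET_NOFILE = 65536

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"current open-file limit: soft={soft}, hard={hard}")

if hard == resource.RLIM_INFINITY:
    new_soft = TARGET_NOFILE
else:
    # A process may raise its soft limit only up to the hard limit;
    # raising the hard limit itself requires root, e.g. via
    # /etc/security/limits.conf or the service manager's unit file.
    new_soft = min(TARGET_NOFILE, hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print(f"soft open-file limit raised to {new_soft}")
```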