Event writes slow and query durations high

Incident Report for Keen

Resolved

S3 streaming is now caught up as well. All systems operating normally.

Posted May 12, 2016 - 03:20 PDT

Update

We are continuing to clear the S3 streaming integration backlog. Other services are operating normally.

Posted May 12, 2016 - 02:51 PDT

Update

Query performance and write delay are back to normal. We are continuing to work through a backlog for customers using the S3 streaming integration.

Posted May 11, 2016 - 20:10 PDT

Update

We are continuing to investigate and are monitoring both the read (via query durations) and write (via event backlog size) status closely.

Posted May 11, 2016 - 16:41 PDT

Monitoring

We are starting to catch up on writes and are monitoring closely. We expect this process to take some time, as there is now a backlog of events to process. We are also continuing to monitor query durations.

Posted May 11, 2016 - 15:02 PDT

Update

We are actively working on resolving write delays, and we're monitoring query durations.

Posted May 11, 2016 - 13:54 PDT

Investigating

We're continuing to investigate read and write delays. Query times are stabilizing, and we're working through write performance issues.

Posted May 11, 2016 - 12:57 PDT

Monitoring

We've made some additional configuration changes and continue to see improvement in both read and write delays, but both are still slower than normal.

Posted May 11, 2016 - 11:20 PDT

Update

We made a configuration change to the read traffic to alleviate some of the query API latency, but queries are still temporarily slower than normal.

Posted May 11, 2016 - 09:57 PDT

Identified

We have identified a strong candidate cause for some of the issues we've seen this morning. We've decided on a course of action and will update you with the results. (For those who are wondering, the component we are primarily examining is Apache ZooKeeper.)

Posted May 11, 2016 - 09:08 PDT

Update

We are still looking into this. We apologize for the inconvenience. To be honest, we're not too thrilled with it either. We will try to keep you posted as we learn anything new. Thank you for your patience.

Posted May 11, 2016 - 08:08 PDT

Update

Our initial investigation indicates increased latency inside one of our hosting provider's data centers. We are working with our provider to debug the issue and get speeds back to normal as quickly as possible. Both our Data Collection APIs and Data Analysis APIs continue to have poor response times, but so far nothing indicates any data loss.

Posted May 11, 2016 - 05:36 PDT

Investigating

We're currently investigating slow event writes and high query durations. Data for events sent in the last hour may not be immediately available, and queries may take longer than usual.

Posted May 11, 2016 - 03:58 PDT

This incident affected: Stream API.