Database Issues
Incident Report for Keen
Resolved
All services have been fully restored.

Impact:
Clients experienced elevated (single-digit) error rates from 05:00 PST to 06:40 PST for both data collection and analysis. During a 9-minute window from 05:35 PST to 05:44 PST, error rates spiked while a patch was being applied to one of our database clusters. After the patch was applied, error rates dropped to effectively zero and service was restored.

Cause:
The errors originated from a database cluster that we now use mainly to store metadata. This cluster became stuck in a loop of electing and then re-electing a primary replica node. We traced this behavior to a known bug that had already been fixed in a later version.

We sincerely apologize for the inconvenience this incident has caused. We understand the reliability you expect from our service and the trust you place in us. We will do better.

A more detailed RCA (root cause analysis) will follow once that analysis is complete.

-Josh
Posted Jan 08, 2014 - 09:09 PST
Monitoring
Data collection and data querying have been restored. We were able to isolate the incident and are now rolling out patches to bring all database nodes in sync.
-Josh
Posted Jan 08, 2014 - 07:04 PST
Investigating
One of our databases is currently experiencing issues. This is causing errors in both data collection and data querying for some users. Will update when we know more.

-Ryan
Posted Jan 08, 2014 - 05:14 PST
This incident affected: Stream API and Compute API.