All services are back online and operating normally.
We are very sorry for the inconvenience that this outage has caused. This is completely unacceptable to us, and we're going to be working hard to fully restore your confidence in our service. Part of earning your trust is being transparent about service interruptions like these. What follows is a technical explanation of what happened.
At around 12:00am PST we were alerted to a massive spike in open connections to our load balancer: over 100x the normal amount. We believed we were under attack, and began trying to identify the source of the traffic and mitigate it. We were unable to identify a culpable traffic source. Traffic patterns looked typical, yet the flood of open connections persisted.
We then turned our attention internally. We performed a rolling restart of our application pools in an attempt to reset connections, but as they came back up they were again instantly saturated with open connections. We inspected the connections and determined they were not malicious. They were authentic requests, but they were not being fulfilled and released properly.
It was at this time we made the very difficult decision to take the API offline. Not difficult from a methodology perspective; the API was effectively not doing work even though it was "up". It was difficult because taking the API offline is an acknowledgement that we're temporarily not capturing events. If a client doesn't have queuing or retry logic built in, those events may be unrecoverable.
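For readers wondering what client-side queuing and retry logic might look like, here is a minimal sketch in Python. It is not taken from any Keen client library: the `RetryQueue` class and the `send` callable (which posts one event and returns the HTTP status code) are hypothetical names introduced only for illustration. The idea is simply that an event stays in a local buffer until the server accepts it.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class RetryQueue:
    """Buffers events locally and removes them only once a send succeeds."""

    def __init__(self, send: Callable[[Event], int]) -> None:
        # `send` is a hypothetical transport function: it posts one event
        # and returns the HTTP status code of the response.
        self._send = send
        self._pending: Deque[Event] = deque()

    def record(self, event: Event) -> None:
        """Queue an event for delivery."""
        self._pending.append(event)

    def flush(self) -> int:
        """Send pending events in order; stop at the first server failure.

        Returns the number of events delivered during this call.
        """
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                status = self._send(event)
            except OSError:
                # Network-level failure: keep the event and try again later.
                break
            if status >= 500:
                # Server-side failure: the event was not stored, so keep it.
                break
            # Any other status is treated as final in this sketch; a real
            # client would probably log 4xx responses before dropping them.
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

A caller would invoke `flush()` periodically (for example on a timer with backoff). Events recorded during an outage stay in the buffer and go out once the API responds normally again. This sketch keeps the buffer in memory only, so a process restart would still lose events; persisting the queue to disk is the obvious next step.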
I want to be very clear that this is the worst-case, last-resort situation for us. We go to great lengths to preserve the uptime and integrity of the entire API, and particularly the ingestion side. This includes running hot-hot in multiple data centers and adding redundancy at every layer.
Sadly, even these measures did not forestall the need to take the API offline around 1am PST. Immediately, three of our engineers began running diagnostics and searching for the root cause.
The root cause was identified as a database deadlock triggered by an atomic update operation. This deadlock was particularly destructive: it locked up not just one database, but all write operations for the entire cluster. Based on our understanding of the DBMS, MongoDB, we didn't believe this scenario was possible, and as a consequence it took us longer to track it down. We will be reaching out to MongoDB to figure out what happened and get a bug filed if applicable.
Starting at 5:30am, we removed the code that triggered the destructive operation and cautiously brought the API back online. The API was fully operational as of 7am.
No data that had already been captured before the outage was lost. However, any 500-level response codes that HTTP clients received during the outage indicate that our API did not store the event. If you had the event in a queue, or can regenerate it, you can resend it at this time (remember to set the keen.timestamp property to the time the event actually happened). If not, the events will ultimately be missing from your collections, and you may need to use filters to exclude the outage period from certain queries.
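As a rough illustration of the resend step, the sketch below stamps a queued event with the time it originally occurred before it is sent again. The helper name `with_original_timestamp` is hypothetical, and the sketch assumes that keen.timestamp is a `timestamp` key nested under a top-level `keen` object and that it accepts an ISO-8601 string; check the API documentation for the exact format.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict


def with_original_timestamp(event: Dict[str, Any],
                            occurred_at: datetime) -> Dict[str, Any]:
    """Return a copy of `event` with keen.timestamp set to `occurred_at`.

    Assumption: the timestamp is sent as an ISO-8601 string in UTC.
    """
    if occurred_at.tzinfo is None:
        # Refuse naive datetimes so the event is not silently shifted
        # by the local time zone of whichever machine resends it.
        raise ValueError("occurred_at must be timezone-aware")
    stamped = copy.deepcopy(event)  # leave the queued original untouched
    keen_props = stamped.setdefault("keen", {})
    keen_props["timestamp"] = occurred_at.astimezone(timezone.utc).isoformat()
    return stamped
```

Without the override, a resent event would presumably be recorded with the time of the resend rather than the time it happened, which would skew any time-based queries over the outage window.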
We sincerely apologize for the inconvenience this has caused to our customers. We take this very seriously and are very committed to meeting your expectations in the future. Here are a few things we are doing right now to make things better:
- We are replacing our current data architecture with one that's far better suited to writing huge streams of events (lock-free) and running large, parallelizable queries. This new architecture, based on Cassandra & Storm, is already serving our largest customers and will be serving all customers soon.
- We will post more frequent updates during outages. We understand that these updates help you decide how to adapt and respond, and we need to do better than we did this time.
If you have any questions about the outage, or our future plans for robustness, please don't hesitate to get in touch. josh at keen io.
Nov 21, 00:08-11:18 PST
We traced the issue to the way we handle certain event deletions by customers. It uncovered a number of issues, which we've patched and deployed and are currently monitoring.
Our apologies for the inconvenience.
UPDATE: We've posted a post mortem on our blog here: https://keen.io/blog/66171746436/were-sorry-heres-what-happened
Nov 4, 12:34-15:56 PST