Keen.io Website is Unavailable (API is fine)
Incident Report for Keen
Postmortem

Today, an error during a routine deploy to our website left us serving a not-so-pretty 500 error page for about 20 minutes. Here's how it happened, what we did to fix it, and how we'll prevent it from happening again.

What Happened

We're currently making some fairly large changes to our internal staging environment. Staging had fallen out of use for a few months, and some changes we added last week for the new staging environment caused our deploy to fail in its template compilation step. Through an oversight on our part, no deploys had been made to production since that change. Bad news! Because no deploys had happened, we were unaware that we had a time bomb.

At no time was our API service disrupted. Events continued to be written with no problems!

How We Fixed It

After a routine deploy at 3pm today we were notified very quickly that something was amiss. Since we had been testing our staging environment, we assumed that recent changes to the deploy scripts were the culprit. Our first step was to revert those changes and redeploy. This didn't fix things, so we continued investigating. This is the part of the diagnosis that took the most time. The problem ultimately turned out to be the change from last week that was preventing staging from working. Because so much time had passed between the original change and the failure, remediation took longer than it should have.

How We'll Prevent It

Changes to deploy scripts are infrequent, but we'll make it a rule that any change to a deploy script must be vetted in each environment soon after it is pushed. We also learned about a small improvement we could make to our environment names so that deploys are less error-prone.
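
As a rough illustration of what vetting each environment after a deploy-script change might look like, here is a minimal sketch in Python. It is not our actual tooling: the environment names, the example.com URLs, and the function names are all hypothetical, and it only checks that each site's front page answers with a 200 rather than an error page.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List

# Hypothetical environment names and URLs, for illustration only.
ENVIRONMENTS: Dict[str, str] = {
    "staging": "https://staging.example.com/",
    "production": "https://www.example.com/",
}


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, including error codes such as 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we want to inspect.
        return err.code


def failing_environments(
    environments: Dict[str, str],
    get_status: Callable[[str], int] = fetch_status,
) -> List[str]:
    """Return the names of environments whose page does not answer with 200."""
    failures = []
    for name, url in environments.items():
        try:
            status = get_status(url)
        except OSError:
            # Connection refused, DNS failure, timeout: count as a failure.
            failures.append(name)
            continue
        if status != 200:
            failures.append(name)
    return failures
```

Run after every change to the deploy scripts, a non-empty result from failing_environments(ENVIRONMENTS) would flag a broken environment right away instead of at the next production deploy.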

We're also investigating changes that would leave our deployment process with considerably fewer moving parts, so that problems like this are less likely.

Summary

I apologize for today's outage. We appreciate your patience and hope that any disruption to our users was minor. We look forward to updating you in the future as the improvements we're planning go into effect!

Posted Apr 29, 2014 - 15:46 PDT

Resolved
This incident has been resolved.
Posted Apr 29, 2014 - 15:23 PDT
Investigating
Our website is currently unavailable and we are working on it. It will be back shortly. The API is not affected.
Posted Apr 29, 2014 - 15:17 PDT