Amazon S3 and what we can do to prevent and recover from business-critical application errors

Various reports claim that Amazon’s S3 outage on February 28, 2017 “took out 1/3 of the internet.” Kind of a big deal. Naturally, many in the tech world watched Amazon’s recovery and error-proofing with great interest. To its great credit, Amazon was transparent about the root causes of the outage in its published summary (https://aws.amazon.com/message/41926/):

  • Operator error ostensibly caused the outage: “one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
  • But the operator could only make that mistake because workflow safeguards were insufficient, allowing “too much capacity to be removed too quickly”.

But the bigger lesson is this:

Even the most reputable companies, employing the most technically savvy staff to handle the most business-critical applications, can (and do) make mistakes.

So how does Amazon prevent such errors in the future? It’s clear from Amazon’s response that they have combined technology and process changes: “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
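
To make the two safeguards in that quote concrete, here is a minimal Python sketch of a capacity-removal check. It is not Amazon's tooling: the names Subsystem, plan_removal, min_required and max_per_step are hypothetical, chosen only to illustrate "remove capacity more slowly" and "never go below the minimum required capacity".

```python
from dataclasses import dataclass
from typing import List


class CapacityError(ValueError):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum capacity for this subsystem


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for a removal, or raise if it breaches the minimum."""
    if requested <= 0:
        raise CapacityError("requested removal must be a positive number of servers")

    # Safeguard 1: refuse any request that drops the subsystem below its minimum.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name}: removing {requested} would leave {remaining} servers, "
            f"below the minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in batches of at most max_per_step.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With ten active servers and a minimum of six, plan_removal(Subsystem("index", 10, 6), 3) returns [2, 1], while a mistyped request for 5 is rejected before anything is removed.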

If you’re not Amazon, how can you error-proof your application environment from outages?

Prevent the error at its source: 

Unlike scripting or other open-ended approaches, configuration management tools and application release automation tools like Orca have built-in guardrails on the types and ranges of allowable inputs. For instance, Boolean values are shown as check boxes, flags are presented as lists, numbers stay numbers, and compliance rules can be applied to input values to control their impact.
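
As a rough illustration of what such guardrails do behind the form fields, the Python sketch below checks a value's type, its allowed choices, and its numeric range before a change is accepted. It does not reflect Orca's actual implementation or API; InputSpec and validate are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class InputSpec:
    """Hypothetical description of one allowed input."""
    name: str
    kind: type                                # bool, int, or str
    choices: Optional[Tuple[str, ...]] = None  # allowed values for a flag
    minimum: Optional[int] = None              # lower bound for a number
    maximum: Optional[int] = None              # upper bound for a number


def validate(spec: InputSpec, value: Any) -> Any:
    """Return the value unchanged if it satisfies the spec, otherwise raise."""
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    if spec.kind is int and isinstance(value, bool):
        raise TypeError(f"{spec.name}: expected a number, got a Boolean")
    if not isinstance(value, spec.kind):
        raise TypeError(f"{spec.name}: expected {spec.kind.__name__}")
    if spec.choices is not None and value not in spec.choices:
        raise ValueError(f"{spec.name}: must be one of {spec.choices}")
    if spec.minimum is not None and value < spec.minimum:
        raise ValueError(f"{spec.name}: must be at least {spec.minimum}")
    if spec.maximum is not None and value > spec.maximum:
        raise ValueError(f"{spec.name}: must be at most {spec.maximum}")
    return value
```

For example, with InputSpec("servers_to_remove", int, minimum=1, maximum=5), a value of 3 passes, a value of 50 raises ValueError, and the string "3" raises TypeError, so a mistyped input fails before it reaches production.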

Approvals: 

Configuration changes and app release workflows can be further controlled with formal group or individual approval steps. Orca has built-in role-based access control (RBAC) with Microsoft Active Directory integration for this reason. One approval, many approvals, or “one-of” approvals (i.e., “anyone in this group”) can be applied to ensure governance is as strict or as flexible as needed.
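
The approval policies described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Orca's RBAC model: ApprovalRule, is_approved and the group names are hypothetical, and in practice group membership would come from a directory service rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ApprovalRule:
    """Hypothetical rule: `required` approvals from members of `group`."""
    group: str
    required: int  # 1 means "one-of": anyone in the group may approve


def is_approved(
    rules: Iterable[ApprovalRule],
    group_members: Dict[str, Set[str]],
    approvals: Set[str],
) -> bool:
    """Return True only if every rule has enough approvals from its group."""
    for rule in rules:
        members = group_members.get(rule.group, set())
        # Count only approvals that come from actual members of the group.
        if len(approvals & members) < rule.required:
            return False
    return True
```

A rule with required=1 models a “one-of” approval, while required=2 on another group models a stricter step that needs two sign-offs before the change can proceed.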

Focus on MTTD/MTTR:

Preventing mistakes is great, but whether they cause poor application performance, security vulnerabilities or outages, some configuration changes will simply need to be undone…quickly. Mean-Time-to-Detect (MTTD) and Mean-Time-to-Recover (MTTR) are essential. That’s why Orca offers the ability to automatically roll back any previous change or to restore previous configuration settings from a config backup.
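
The idea behind fast rollback is simple: keep a snapshot of every known configuration so that the previous one can be restored in a single step. The Python sketch below shows that idea only; ConfigHistory is a hypothetical class, not part of Orca, and a real tool would persist its snapshots rather than hold them in memory.

```python
import copy
from typing import Any, Dict, List


class ConfigHistory:
    """Hypothetical in-memory history of configuration snapshots."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        # Deep copies keep stored snapshots safe from later mutation.
        self._snapshots: List[Dict[str, Any]] = [copy.deepcopy(initial)]

    @property
    def current(self) -> Dict[str, Any]:
        return copy.deepcopy(self._snapshots[-1])

    def apply(self, changes: Dict[str, Any]) -> Dict[str, Any]:
        """Record a new snapshot with the given settings changed."""
        new_config = copy.deepcopy(self._snapshots[-1])
        new_config.update(changes)
        self._snapshots.append(new_config)
        return self.current

    def rollback(self) -> Dict[str, Any]:
        """Discard the latest change and return the restored configuration."""
        if len(self._snapshots) == 1:
            raise RuntimeError("no earlier configuration to roll back to")
        self._snapshots.pop()
        return self.current
```

If a change such as apply({"pool_size": 2}) turns out to be a mistake, a single rollback() call restores the settings that were in place before it, which is what keeps recovery time short.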

Start your free trial by accessing an online installation of Orca.