
AWS S3, Blameless Failure and DevOps

Posted by Mourad Trabelsi on Mar 14, 2017 9:48:12 AM


 

[Image: Blameless Culture]

A lot of ink has already been spilled over last month’s AWS S3 outage, and many conclusions may have been drawn too early. Let’s review what we know, discuss it and go over a few potential solutions to the challenges faced.

 

What happened: AWS S3 Failure

Amazon Web Services released a statement following an outage that happened a couple of weeks ago (28/02/17) and had a rather large impact on the internet. The Simple Storage Service (S3) in the us-east-1 (Northern Virginia) region went down for about three hours, leaving every service that depends on this storage backend in a delicate position. A few examples: Imgur, GitHub, SoundCloud and many more.

Note: not all of them were completely offline; some just had issues loading content.

We already covered how this impacted us, when the Docker Hub registry service went down while we were hosting a Rancher-themed DevOps Playground meetup. In this article, we take a look at the bigger picture, beyond Docker Hub alone.

Outages happen all the time, so why is this one different? First of all, its scale and impact are not often seen, but more importantly it is an opportunity to see how a top-tier industry leader handles failure on a top-tier scale.

 

The AWS Response

Although no proper post-mortem of the outage has been released, an AWS statement gives us enough information to understand what happened; we learned that a typo was part of the issue and that more machines were deprovisioned than originally planned.

The overall feeling I get when I read this response is that the only actor to blame is the system itself.

In several instances in the second and third paragraphs, AWS explains that it is common practice to try to anticipate these kinds of failures, and that the systems are designed to minimise any kind of impact. The statement goes on to mention two levels of failure: firstly at the capacity level, when a machine fails; and secondly at the process level, when something that shouldn’t have happened happens.

 

“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks.”

AWS response statement.

 

Here, we understand that AWS is forward-looking and focused on improving not only the process gap that allowed the breakdown to happen, but also the maintenance tasks that had been overlooked as their services kept growing.

As previously said, the initial typo is not seen as the issue itself, but as an indicator that the process was lacking safety checks. At no point was the person behind the typo blamed for the outage or its cascading effect.

 

Being Blameless is Key to Continuous Innovation

One of the key points attached to the DevOps way of thinking is ‘taking risk’. Taking risk might mean breaking everything, or it might mean revolutionising everything. Most of the time it is neither, but just a simple step toward a better infrastructure, a better process or simply a better understanding of the business requirements.

Here at ECS Digital, we preach Continuous Innovation: an iterative way of improving the current approach, driven by risk-taking and failure acceptance.

In the context of AWS, their blameless handling of the matter enabled them to advance the quality of their automation one step further. The question asked wasn’t “who did this?” but rather “why did this happen?”. Nobody was fired, and the safeguards that were added mean the issue should not happen again.

Unfortunately, this time the outage wasn’t without consequences for customers. However, below we’ve provided a few examples of how you can decentralise an architecture and absorb such an outage better next time.

 

Things to Consider:

Many things can be done, at all levels, to avoid being affected by that sort of outage. The suggestions below focus on a subset of potential solutions and alternatives.

 

Cross Region Replication

CRR is a native S3 feature that enables data (buckets) to be replicated from one region to another. An example would be to replicate my ECS-DIGITAL-WEBPAGES bucket from us-east-1 to eu-west-1. Having this in place would allow my application to keep pulling the data from the second region if the first became unavailable.

The two subsystems that went down were crucial in handling, amongst other things, GET and PUT requests.

Cross Region Replication would allow GET requests to still succeed, as the data is available in the second region. However, PUT requests would still fail, as they are aimed at the original region.
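
As a rough sketch of how this replication could be switched on programmatically (the bucket names and IAM role ARN below are placeholders, both buckets are assumed to already exist, and boto3 is just one possible tool for the job):

```python
import boto3

# Sketch: enable Cross Region Replication from a us-east-1 bucket to a
# copy in eu-west-1. Bucket names and the IAM role ARN are placeholders;
# both buckets must already exist, and only objects written after the
# rule is in place get replicated.
s3 = boto3.client("s3")

SOURCE_BUCKET = "ecs-digital-webpages"             # lives in us-east-1
REPLICA_BUCKET = "ecs-digital-webpages-replica"    # lives in eu-west-1
REPLICATION_ROLE = "arn:aws:iam::123456789012:role/s3-crr-role"  # placeholder

# CRR requires versioning on both the source and the destination bucket.
for bucket in (SOURCE_BUCKET, REPLICA_BUCKET):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

# Replicate every new object written to the source bucket.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",        # empty prefix = all objects
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{REPLICA_BUCKET}"},
            }
        ],
    },
)
```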

Note: for the above to work, some logic is needed at the DNS level, for example with Amazon Route 53, to automatically fail over to the replicated/backup S3 bucket.
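
To make that DNS failover idea concrete, here is a minimal Route 53 sketch (the hosted zone ID, health check ID and endpoint names are placeholders; a real setup would typically use alias records and bucket names that match the domain, but the failover mechanics are the same):

```python
import boto3

# Sketch: Route 53 failover routing between a primary and a backup endpoint.
# Hosted zone ID, health check ID and endpoints are placeholders.
route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"


def failover_record(identifier, failover_role, target, health_check_id=None):
    """Build an UPSERT change for one half of a failover record pair."""
    record = {
        "Name": "assets.example.com.",
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": failover_role,   # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over from us-east-1 to the eu-west-1 replica",
        "Changes": [
            failover_record(
                "primary", "PRIMARY",
                "ecs-digital-webpages.s3-website-us-east-1.amazonaws.com",
                PRIMARY_HEALTH_CHECK_ID,
            ),
            failover_record(
                "secondary", "SECONDARY",
                "ecs-digital-webpages-replica.s3-website-eu-west-1.amazonaws.com",
            ),
        ],
    },
)
```

When the health check on the primary endpoint fails, Route 53 starts answering queries with the secondary record, pointing traffic at the replicated bucket.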

Nginx Caching

A great solution, documented by Sentry (Sentry.io), is using Nginx as an S3 cache. In other words, data frequently queried by Sentry’s application is pulled down from S3 and stored in cache in the same datacentre, cutting down latency and direct S3 dependencies.

An interesting feature of this setup is that S3 itself was defined as the failover for the Nginx cache, with HAProxy load-balancing between the two.

Paraphrasing them: they managed to live through the outage while the Nginx cache alone served all the data to their servers. A good resiliency design!
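
As a rough idea of what such a cache can look like, here is a minimal Nginx sketch (not Sentry’s actual configuration; the cache path, bucket endpoint and timings are illustrative placeholders):

```nginx
# Sketch: cache S3 GET responses locally and keep serving stale content
# if S3 becomes unreachable. Paths, names and timings are placeholders.
proxy_cache_path /var/cache/nginx/s3 levels=1:2 keys_zone=s3_cache:10m
                 max_size=10g inactive=7d use_temp_path=off;

server {
    listen 80;

    location / {
        proxy_pass https://ecs-digital-webpages.s3.amazonaws.com;

        proxy_cache            s3_cache;
        proxy_cache_valid      200 24h;   # keep successful responses for a day
        proxy_cache_use_stale  error timeout http_500 http_502 http_503 http_504;
        proxy_cache_lock       on;        # avoid stampedes on cache misses

        add_header X-Cache-Status $upstream_cache_status;
    }
}
```

The proxy_cache_use_stale line is what rides out an outage: if S3 starts timing out or returning errors, Nginx keeps serving whatever it already holds in its cache. Sentry went a step further and put HAProxy in front, with S3 itself as the fallback backend.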

For guidance on how to implement this yourself, try reading this Nginx blog post. And for a well-documented guide to load balancing with HAProxy, try here.

 

Chaos Engineering

According to the Chaos Engineering principles, for a system to be resilient it must have been designed to endure and survive most possible failures (both external and internal, expected and random).

It is often a good idea to try to challenge the system in unpredictable ways, to confirm its solidity and to hopefully point out gaps in the design.

Netflix has developed a suite of tools (the Simian Army) to implement these principles and make their own infrastructure as resilient as possible. Within the Simian Army, Chaos Monkey is probably the best known, as it introduces random infrastructure failures by terminating instances.
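
To make the idea concrete, here is a minimal chaos-monkey-style sketch (not the Simian Army itself): it assumes instances opt in via a hypothetical chaos-opt-in tag and defaults to a dry run, so nothing is actually terminated.

```python
import random

import boto3
from botocore.exceptions import ClientError

# Minimal chaos-monkey-style sketch (NOT Netflix's Simian Army).
# Instances opt in via a hypothetical "chaos-opt-in=true" tag.
ec2 = boto3.client("ec2", region_name="eu-west-1")


def pick_random_victim():
    """Return the ID of one running, opted-in instance, or None."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    candidates = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    return random.choice(candidates) if candidates else None


def unleash(dry_run=True):
    victim = pick_random_victim()
    if victim is None:
        print("No opted-in instances found; nothing to break today.")
        return
    print(f"Terminating {victim} (dry_run={dry_run})")
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS raises DryRunOperation if the call would have succeeded.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
        print("Dry run OK: permissions are in place, no instance was harmed.")


if __name__ == "__main__":
    unleash(dry_run=True)
```

Run something like this on a schedule against a non-production environment first; the point is to surface weak spots in the design before a real outage does.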

This goes back to the risk and failure acceptance necessary for the ecosystem to mature enough to be resilient and truly trustworthy.

If you’re ready to challenge the resilience of your system, start here :-)

 

Sources:

AWS outage response: https://aws.amazon.com/message/41926/

DevOps Playground #10: http://www.ecs-digital.co.uk/blog

Sentry’s blog post on Nginx caching: https://blog.sentry.io/2017/03/01/dodging-s3-downtime-with-nginx-and-haproxy.html

Netflix’s Simian Army: https://github.com/Netflix/SimianArmy

Principles of Chaos: http://principlesofchaos.org/

NGINX proxy_cache docs: https://www.nginx.com/blog/nginx-caching-guide/

HAProxy load balancing example: https://serversforhackers.com/load-balancing-with-haproxy

Topics: DevOps