Amazon today finally offered up a detailed failure report itemizing what went wrong last week when its Elastic Compute Cloud went down and took a chunk of the internet with it. It's quite a read -- that is, if you can get through all the definitions and jargon. Data Center Knowledge does a pretty good job of simplifying it, but let's make it simpler still.
Amazon's EC2 is a complicated beast made up of many automated processes that all interact with each other in predictable, well-engineered ways. What brought it all crashing down was process complexity. A mistake during a capacity upgrade sent rerouted traffic down the wrong path. That caused problems for other components of the cloud, which responded automatically, as they were designed to. Simultaneously. That overloaded more stuff, which triggered more automatic responses, which overloaded more stuff. Simultaneously. Which...
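To see how that kind of feedback loop feeds on itself, here's a toy sketch in Python. It is not Amazon's architecture -- the node counts, capacities, and even-redistribution rule are all made up for illustration -- but it captures the shape of the problem: each automated "fix" dumps more load on the survivors, until everything tips over at once.

```python
# Toy model of a cascading failure. Assumption: a simple cluster where
# each failed node sheds its load evenly onto the survivors -- a stand-in
# for automatic rerouting, not a description of EC2's actual design.

def cascade(loads, capacity):
    """Fail any node over capacity, redistribute its load, and repeat
    until the cluster is stable (or empty). Returns survivors and the
    number of redistribution rounds it took."""
    alive = dict(enumerate(loads))
    rounds = 0
    while True:
        failed = [n for n, load in alive.items() if load > capacity]
        if not failed:
            break
        rounds += 1
        shed = sum(alive.pop(n) for n in failed)
        if not alive:
            break
        extra = shed / len(alive)  # everyone reacts at once
        for n in alive:
            alive[n] += extra

    return alive, rounds

# Ten nodes humming along at 90% of capacity; one bad reroute doubles
# the load on node 0, and the automated response takes down the rest.
loads = [90.0] * 10
loads[0] = 180.0
survivors, rounds = cascade(loads, capacity=100.0)
print(len(survivors), rounds)  # → 0 2
```

Note how fast it goes: one node fails in round one, and its shed load pushes all nine survivors over the line together in round two. A system running with more headroom (say, 80% utilization) absorbs the same fault with zero cascading rounds -- which is why the cure for this class of failure is slack and backoff, not faster automation.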
Well, you get the idea now -- it was a cascading failure. If this were one server or something, the obvious fix would be the ubiquitous reboot -- just wipe the slate clean and start fresh. But for EC2, or probably any large distributed cloud, it isn't nearly so easy, and it took days to get back to normal.
So what did Amazon do wrong? Well, nothing we won't see happen again to them and other similarly large, complicated systems. Nothing, and I mean nothing, will work out all such issues except time and usage. The more complex things get, the harder it becomes to predict how they will respond to a wide variety of stimuli. Each time this sort of thing happens, their engineers will adjust the design so that their cloud becomes that much more stable. Just call it growing pains.