Amazon and the Mis-Use of Power

On Dec. 9, Amazon’s EC2 North Virginia site experienced a power failure, which impacted “a minority of instances in one of our four Availability Zones in the US-EAST-1 Region,” according to information Amazon published on its performance dashboard for the service.

Amazon told its customers:

“Dec 9 12:08 PM PST (revision of Dec 10 post to clarify timing) We would like to provide further information on the issue experienced on December 9 in one of our east coast availability zones (this event affected a minority of instances in one of our four Availability Zones in the US-EAST-1 Region). A single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power. Impacted customers experienced a loss of connectivity to their instances. As soon as the defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter. Over 25% of these instances recovered within 30 minutes, and over 90% recovered within an hour. A small number of instances took up to a few hours to recover, and we worked with those customers during the morning.”

What I found most interesting was:

“A single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power.”

While I have no firsthand knowledge of the event, it sounds like a classic case of cascade failure. One of the basic tenets of redundant power may have been violated. 

In a “perfect” scenario, such as a Tier IV Data Center, there are two completely independent power paths. Each path and all the items in the path must be capable of supporting 100 percent of the entire data center load by itself. This represents true 2N redundancy, which means that no single point of failure will interrupt the operation of the data center equipment.

In other words, to ensure that redundant power is really redundant and not a trap door, either side of the power path must be able to carry the total load on its own. While this may seem obvious at face value, I have seen this rule violated many times, sometimes even in well-run data centers. It usually happens when there is no rack or branch circuit monitoring to track actual current draw continuously. However, even where there is active monitoring, the rule may still be broken inadvertently.

In a typical scenario, while the redundant power paths are both available and feeding the loads (i.e., IT equipment with dual power supplies), each half of the power path carries approximately 50 percent of the total load. This creates a “sense” of redundancy for most administrators. In reality, this is where the hidden exposure to power problems starts.

IT equipment, such as servers, is normally installed, started up and operated with both rack-level PDUs available. Typically, each power supply (PS) draws only 50 percent of the server’s power requirement. Normally, the total PDU load is (hopefully) less than the trip value of the circuit breaker that protects it. In fact, even if the PDU has a current meter, most administrators would think they were safe if they were “only” at a 50 percent power level on each PDU. However, UL and NEMA mandated codes require that you draw no more than 80 percent of the rated branch circuit breaker value. If one PDU fails, the surviving PDU picks up the entire load, which at 50 percent per PDU amounts to 100 percent of its breaker rating, well over the 80 percent limit. Therefore, at 50 percent of the PDU breaker rating, the power system is no longer redundant and no one even realizes it!

The only way to safely implement dual server power supplies and dual rack PDUs is to never exceed 40 percent of the face-rated circuit breaker value on each rack PDU or power path.

For example: You cannot draw more than 16A from a 20A PDU. This means that in a dual-PDU rack, the entire equipment load should not exceed 16A for the rack. Therefore, each PDU should normally carry only an 8A load, in order to avoid a potential cascade overload and a resultant complete rack-level power failure.
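The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the 40%/80% rule as described above; the function names and the `CONTINUOUS_DERATING` constant are made up for this sketch and do not come from any vendor tool.

```python
# Minimal sketch of the 40%/80% rule for a dual-PDU rack.
# Assumption: the 80 percent continuous-load limit described in the article.
CONTINUOUS_DERATING = 0.80


def max_continuous_amps(breaker_rating_amps: float) -> float:
    """Largest continuous current the branch circuit should carry."""
    return breaker_rating_amps * CONTINUOUS_DERATING


def is_truly_redundant(pdu_a_amps: float, pdu_b_amps: float,
                       breaker_rating_amps: float) -> bool:
    """True if one surviving PDU could carry both loads within the limit."""
    failover_load = pdu_a_amps + pdu_b_amps  # everything lands on one PDU
    return failover_load <= max_continuous_amps(breaker_rating_amps)


if __name__ == "__main__":
    print(max_continuous_amps(20))         # 16.0
    print(is_truly_redundant(8, 8, 20))    # True: 16A on failover
    print(is_truly_redundant(10, 10, 20))  # False: 20A on failover
```

Note that the second case is the "50 percent on each PDU" situation: both meters look comfortable, yet the rack would not survive the loss of either PDU.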

In a multi-phase PDU, this is even more important, since it has become very common to use a 3-phase 208/120V PDU populated with three groups of single-phase 120V outlets, being fed from a single 3-phase breaker. In this scenario, if any phase exceeds the rated current, the breaker will trip, and all three phases will be dropped, potentially resulting in a loss of power to the entire rack.
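The per-phase check could be sketched as follows. It assumes, as a simplification, that on failover the load on each phase of the lost PDU moves to the same phase of the surviving PDU; the function name and the `L1`/`L2`/`L3` labels are hypothetical.

```python
# Sketch: which phases would exceed the limit if one 3-phase PDU were lost?
# Assumption: the same 80 percent continuous-load limit used in the article.
CONTINUOUS_DERATING = 0.80


def phases_over_limit(pdu_a_phase_amps, pdu_b_phase_amps, breaker_rating_amps):
    """Return the phases whose combined (failover) current exceeds the limit.

    Each PDU argument is a dict such as {"L1": 7.0, "L2": 8.0, "L3": 9.0}.
    """
    limit = breaker_rating_amps * CONTINUOUS_DERATING
    over = []
    for phase in sorted(set(pdu_a_phase_amps) | set(pdu_b_phase_amps)):
        failover = (pdu_a_phase_amps.get(phase, 0.0)
                    + pdu_b_phase_amps.get(phase, 0.0))
        if failover > limit:
            over.append(phase)
    return over


if __name__ == "__main__":
    pdu_a = {"L1": 7.0, "L2": 8.0, "L3": 9.0}
    pdu_b = {"L1": 7.0, "L2": 8.0, "L3": 9.0}
    print(phases_over_limit(pdu_a, pdu_b, 20))  # ['L3']
```

In this made-up example each PDU averages 8A per phase, which is 40 percent of a 20A rating, yet L3 alone would reach 18A on failover. Because the three phases share one breaker, that single overloaded phase is enough to drop the whole rack.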

As mentioned earlier, even those administrators who do have metered PDUs do not realize that once they go past the 40 percent power level, they are in danger of a cascade power failure. Moreover, as servers are upgraded and added all the time, it is easy to see how the exposure continues to increase with no warning, until a problem occurs and then everyone involved is baffled about why power was lost, because everyone thought they had “redundant” power.

I would suggest that if you are fortunate enough to not have had this happen to you already, you review your rack-level current draw at each PDU. If you do not have metered PDUs, you should consider upgrading to a metered PDU in the near future, or consider adding branch circuit monitoring to the floor-level PDUs. If you have many racks, I recommend that you consider a metered PDU with remote monitoring (via SNMP and/or Web) that can send SNMP traps to your management software, since it would lower the burden of manually monitoring dozens or hundreds of PDUs. In addition, thresholds should be set in monitoring software to send automatic alerts to administrators to warn them of potential power problems before the circuit rating is exceeded.
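As a rough illustration of the kind of threshold check a monitoring system might apply, here is a small Python sketch. The readings are hypothetical and how they are collected (SNMP polling, a web API, manual entry) is deliberately left out; the `PduReading` type, the function names and the 35 percent early-warning level are assumptions made for this example, while the 40 percent critical level is the limit discussed above.

```python
from dataclasses import dataclass

WARN_FRACTION = 0.35      # hypothetical early-warning level
CRITICAL_FRACTION = 0.40  # the 40 percent limit for dual-PDU racks


@dataclass
class PduReading:
    name: str                    # hypothetical PDU label, e.g. "rack12-A"
    amps: float                  # measured current draw
    breaker_rating_amps: float   # face rating of the breaker


def classify(reading: PduReading) -> str:
    """Classify one reading as 'ok', 'warning' or 'critical'."""
    fraction = reading.amps / reading.breaker_rating_amps
    if fraction > CRITICAL_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warning"
    return "ok"


def alerts(readings):
    """Return (name, level) for every PDU that is not 'ok'."""
    return [(r.name, classify(r)) for r in readings if classify(r) != "ok"]


if __name__ == "__main__":
    sample = [
        PduReading("rack12-A", 6.0, 20.0),
        PduReading("rack12-B", 7.5, 20.0),
        PduReading("rack13-A", 10.0, 20.0),
    ]
    for name, level in alerts(sample):
        print(name, level)
```

Setting the warning level a little below the hard limit gives administrators time to rebalance or move equipment before redundancy is actually lost.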

Bottom line: Make sure that if you are implementing redundancy, each path can sustain 100 percent of the load if the other path fails. Review and document your existing load structure, and continue to proactively monitor and manage the load levels on all PDUs, as well as all the other elements of your power path. Changing out PDUs can involve some downtime; as with any power path work, some downtime may be required if there is no true 2N power path.

Take your choice – some planned limited downtime or an unplanned surprise shutdown.

Comments

It is interesting how everyone jumps on a topic that may or may not be relevant to the incident. I have been building and operating data centers for over thirty years, including three years of doing post-mortems on failures, and I can tell you that things are seldom what they seem. Further, every failure event I have ever studied, evaluated or consulted on had multiple contributors to the failure. In essence, had any one of these contributors been absent, the event would have been avoided. The outage only happened on December 9th. It is now Christmas Eve. Let's all give Amazon a little time to dig into the outage; then hopefully they will be industry leaders and come forth and share their "Lessons Learned" so that we all may learn in the process.
Since the sensitive load is the power supply, does Amazon know what power window their server power supplies can work in/out of? Did Amazon design its power infrastructure to match the server power supplies' CBEMA curves?
Hi Scot, Regarding your comment that my use of a 20% guard band "as a must for redundancy" is overly cautious: you may be misunderstanding my point. Yes, certain upstream items such as transformers and UPSs can operate at 100% of rating. However, if you check the codes for branch circuit protection and look at any brand of rack PDU, you will find that it is only rated to deliver 80% of the circuit rating, i.e. a "20A unit" is only rated to deliver 16A. The unit's breaker and/or branch breaker will open at 20A. Upstream on the power chain there are 2 types of larger breakers: 80% rated and 100% rated. It is critical to understand how and where each type is used in the power chain. So while overall you should be able to operate a power system at up to 100%, it is critical to understand what the actual specified rating is for each part of the power chain. The classic cascade failure I described at the rack PDU and floor level is real and does happen. I hope this clarifies the issue of a false sense of redundancy caused by the belief that 50% is a safe value for redundant PDUs. It is not; the 40%/80% rule is quite real and that is what I was addressing. Julius
In my experience I see a lack of C-level understanding and support for the facilities staff that supports this infrastructure. Whether or not you build a 2N or other configuration, it is the staff who take care of it who are the critical part of the equation. Having 2N of UPSs doesn't matter if you don't properly monitor batteries and replace them when needed. Just a couple of bad cells in the right locations eliminate your fail-safe design. Too many times I have seen companies grow and move into their own data centers just to fail. They hire people without the required skills and knowledge to monitor the facilities. I have actually seen security guards charged with this responsibility at major data centers! Other times they outsource this responsibility to some vendor that really doesn't have the capability or skills to maintain the equipment. Response times from outsourced vendors are rarely adequate to prevent outages, and in some cases have made things worse. Unqualified vendors call for assistance from home offices... sometimes help is over 8 hours away... Until these companies see real monetary penalties in the form of customer credits or lost market share, there is no real driver to change. Comments welcome... Terry
I would just like to second, third and fourth Terry's comments. It's a message no one is comfortable telling people, and it is often taken personally by those it's aimed at, but essentially it's still a struggle to get DC managers and the like to understand that what may seem straightforward to manage to an IT team actually isn't, and needs people with the right competencies and the right tools to do it with any degree of success. I spent a lot of time auditing power systems and processes on critical sites a few years back for a global service provider. Eventually I got it down to a couple of key questions at each site, one of which was: How would you know that you had a capacity issue in your infrastructure? Only twice out of about 70 instances did I get a response that acknowledged the question. Unfortunately, the problem with waiting for the big failure that gets everyone's attention is that they happen already, but the frequency and root causes aren't shared around the industry enough to enable a consensus to build on the topic. Everyone thinks they have been the victim of a rare unmanageable event. I've also seen facilities service providers do a very good job of convincing their client that what happened was completely unforeseen or that the root cause was elsewhere. Even with in-house teams, it's a rare person who will say we don't have the tools or the skills to do this properly. Part of the problem is they often don't know what it is they aren't doing. Infrastructure management is a discipline that requires competency in engineering design and the risks associated with it. Until DC managers, or whoever wields the power in an organisation, insist on seeing evidence of this in their setups, I don't see much to suggest things will change.
Hi Julius, I won't speculate on whether this was a cascade failure or not, but I think you may be a bit strong with your use of the 20% guard band as a must for redundancy. There's no reason to expect that either side of the power chain won't operate satisfactorily right up to 100% rating; the danger is that there are variable loads which could momentarily exceed the rating of some part of the chain. It's a matter of probability whether or not events will align to cause a failure. A good statistical analysis of both expected failure rates of individual pieces of equipment and current configuration will show expected up-time; it's never 100%, and as expected up-time approaches that limit, the cost to improve grows exponentially. By understanding the cost/benefit, good choices can be made regarding improving expected availability. We can't tell from the article how costly this outage was for Amazon or their customers, so we don't know if their choice of power schemes was appropriate or not. -Scot
