Amazon told its customers:
“Dec 9 12:08 PM PST (revision of Dec 10 post to clarify timing) We would like to provide further information on the issue experienced on December 9 in one of our east coast availability zones (this event affected a minority of instances in one of our four Availability Zones in the US-EAST-1 Region). A single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power. Impacted customers experienced a loss of connectivity to their instances. As soon as the defective power distribution units were bypassed, servers restarted and instances began to come online shortly thereafter. Over 25% of these instances recovered within 30 minutes, and over 90% recovered within an hour. A small number of instances took up to a few hours to recover, and we worked with those customers during the morning.”
What I found most interesting was:
“A single component of the redundant power distribution system failed in this zone. Prior to completing the repair of this unit, a second component, used to assure redundant power paths, failed as well, resulting in a portion of the servers in that availability zone losing power.”
While I have no firsthand knowledge of the event, it sounds like a classic case of cascade failure. One of the basic tenets of redundant power may have been violated.
In a “perfect” scenario, such as a Tier IV Data Center, there are two completely independent power paths. Each path and all the items in the path must be capable of supporting 100 percent of the entire data center load by itself. This represents true 2N redundancy, which means that no single point of failure will interrupt the operation of the data center equipment.
In other words, to assure that redundant power is really redundant and not a trap door, it is imperative to make sure that the total load can be carried by either side of the entire power path. While this may seem obvious at face value, I have seen this rule violated many times, sometimes even in well-run data centers. It usually happens when there is no rack or branch circuit monitoring to track actual current draw continuously. However, even with active monitoring in place, this rule may be inadvertently broken.
In a typical scenario, while the redundant power paths are both available and feeding the loads (i.e., IT equipment with dual power supplies), each half of the power path carries approximately 50 percent of the total load. This creates a “sense” of redundancy for most administrators. In reality, this is where the hidden exposure to power problems starts.
IT equipment, such as servers, is normally installed, started up and operated with both rack-level PDUs available. Typically, each power supply (PS) draws only 50 percent of the server’s power requirement. Normally, the total load on each PDU is (hopefully) less than the trip value of the circuit breaker that protects it. In fact, even if the PDU has a current meter, most administrators would think they were safe if they were “only” at a 50 percent power level on each PDU. However, UL and NEC (National Electrical Code) requirements mean that you can safely draw only 80 percent of the rated branch circuit breaker value on a continuous basis. If one PDU fails, the surviving PDU picks up the entire load, which at 50 percent per side amounts to 100 percent of its breaker rating, well above the 80 percent limit. Therefore, at 50 percent of the PDU breaker rating, the power system is no longer redundant and no one even realizes it!
The only way to safely implement dual server power supplies and dual rack PDUs is to never exceed 40 percent of the face-rated circuit breaker value on each rack PDU or power path.
For example: You cannot draw more than 16A (80 percent of the breaker rating) from a 20A PDU. This means that in a dual PDU rack, the entire equipment load should not exceed 16A for the rack. Therefore, each PDU should normally have only an 8A load on it, in order to avoid a potential cascade overload and resultant complete rack-level power failure.
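The arithmetic behind the 40 percent rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 0.8 factor is simply the 80 percent limit described above expressed as a parameter.

```python
# Usable fraction of the breaker rating (the 80 percent rule described above).
CONTINUOUS_LOAD_FACTOR = 0.8


def max_safe_load_per_pdu(breaker_amps, factor=CONTINUOUS_LOAD_FACTOR):
    """Highest normal per-PDU load that still lets one PDU carry the whole rack."""
    usable_amps = breaker_amps * factor  # e.g. 16A usable on a 20A breaker
    return usable_amps / 2               # the load is normally shared by two PDUs


def survives_failover(pdu_a_amps, pdu_b_amps, breaker_amps,
                      factor=CONTINUOUS_LOAD_FACTOR):
    """True if the surviving PDU can carry both loads within the usable limit."""
    return (pdu_a_amps + pdu_b_amps) <= breaker_amps * factor


if __name__ == "__main__":
    print(max_safe_load_per_pdu(20))      # per-PDU ceiling for a 20A breaker
    print(survives_failover(8, 8, 20))    # 8A per side: still redundant
    print(survives_failover(10, 10, 20))  # 10A per side ("only" 50 percent): not redundant
```

Note that the second check fails at exactly the 50 percent level that most administrators would consider comfortable.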
In a multi-phase PDU, this is even more important, since it has become very common to use a 3-phase 208/120V PDU populated with three groups of single-phase 120V outlets, being fed from a single 3-phase breaker. In this scenario, if any phase exceeds the rated current, the breaker will trip, and all three phases will be dropped, potentially resulting in a loss of power to the entire rack.
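The same failover check can be applied per phase. The sketch below assumes line-to-neutral 120V loads, so that the current on each phase is simply the sum of the loads plugged into that phase's outlet group; the function name, the phase labels and the sample readings are hypothetical.

```python
PHASE_LABELS = ("L1", "L2", "L3")


def overloaded_phases_after_failover(pdu_a_phase_amps, pdu_b_phase_amps,
                                     breaker_amps, factor=0.8):
    """Return the phases that would exceed the usable limit if one PDU failed.

    Each argument is a sequence of three per-phase currents in amps.
    Because a single 3-phase breaker feeds all outlet groups, one
    overloaded phase is enough to drop the entire PDU.
    """
    limit = breaker_amps * factor
    overloaded = []
    for label, a, b in zip(PHASE_LABELS, pdu_a_phase_amps, pdu_b_phase_amps):
        # After a failure the surviving PDU carries both shares of this phase.
        if a + b > limit:
            overloaded.append(label)
    return overloaded


if __name__ == "__main__":
    # Unbalanced rack: L2 is the heavily loaded phase on both PDUs.
    print(overloaded_phases_after_failover([6, 9, 5], [6, 9, 5], 20))
```

In this example the average load looks safe, but the one heavily loaded phase would trip the breaker and take the other two phases with it.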
As mentioned earlier, even those administrators who do have metered PDUs do not realize that once they go past the 40 percent power level, they are in danger of a cascade power failure. Moreover, as servers are upgraded and added all the time, it is easy to see how the exposure continues to increase with no warning, until a problem occurs and then everyone involved is baffled about why power was lost, because everyone thought they had “redundant” power.
I would suggest that if you are fortunate enough to not have had this happen to you already, you review your rack-level current draw at each PDU. If you do not have metered PDUs, you should consider upgrading to a metered PDU in the near future, or consider adding branch circuit monitoring to the floor-level PDUs. If you have many racks, I recommend that you consider a metered PDU with remote monitoring (via SNMP and/or Web) that can send SNMP traps to your management software, since it would lower the burden of manually monitoring dozens or hundreds of PDUs. In addition, thresholds should be set in monitoring software to send automatic alerts to administrators to warn them of potential power problems before the circuit rating is exceeded.
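As a rough sketch of what such thresholds might look like, the Python below classifies a set of PDU readings against the 40 percent and 80 percent levels discussed in this article. It does not talk to any real PDU or SNMP library; the readings dictionary, the PDU names and the function names are assumptions for illustration, and in practice the values would come from your metered PDUs or monitoring software.

```python
REDUNDANCY_LIMIT = 0.4  # above this, a failover would overload the surviving PDU
CONTINUOUS_LIMIT = 0.8  # above this, the PDU itself exceeds the usable rating


def classify_pdu_load(measured_amps, breaker_amps):
    """Classify one PDU reading as 'ok', 'warning' or 'critical'."""
    fraction = measured_amps / breaker_amps
    if fraction > CONTINUOUS_LIMIT:
        return "critical"
    if fraction > REDUNDANCY_LIMIT:
        return "warning"
    return "ok"


def pdus_needing_attention(readings, breaker_amps):
    """Return {pdu_name: status} for every PDU that is not 'ok'.

    `readings` maps a PDU name to its measured current in amps.
    """
    statuses = {}
    for name, amps in readings.items():
        status = classify_pdu_load(amps, breaker_amps)
        if status != "ok":
            statuses[name] = status
    return statuses


if __name__ == "__main__":
    sample = {"rack12-A": 7.0, "rack12-B": 9.5, "rack13-A": 17.0}
    print(pdus_needing_attention(sample, breaker_amps=20))
```

The point of the lower threshold is that it fires while both power paths are still healthy, which is the only time the warning is useful.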
Bottom line: Make sure that if you are implementing redundancy, each path can sustain 100 percent of the load if the other path fails. Review and document your existing load structure, and continue to proactively monitor and manage the load levels on all PDUs, as well as all the other elements of your power path. Changing out PDUs can involve some downtime; as with any power path work, an outage may be required if there is no true 2N power path.
Take your choice – some planned limited downtime or an unplanned surprise shutdown.