Author’s Note: The information below is based upon reports which were handed to me during the incident through various sources. I am well aware that at this point there are inaccuracies. I intend to maintain this post as-is to serve as a record of what occurs when information is not disseminated in an organized fashion. For a full report on what happened, there are several sources, but it may be best to go with Rackspace’s finished report on the incident. Once again I appreciate all the feedback, and I am glad that I was able to provide some insights to at least a few people during this business crisis. Thank you.
Just heard from my head IT guy at my office:
Rackspace had a major power outage at its Dallas location after a truck crashed into a utility pole, which in turn knocked into a transformer and caused it to explode.
The backup generators kicked in immediately to prevent serious data loss and server damage, but they weren’t enough to power the heating and cooling.
When temperatures in the data center started climbing, they realized the generator infrastructure couldn’t handle the full load, so they shut servers back down.
So what did they do? They trucked in several large 100-kilowatt generators (around 6 to 10) to power the HVAC.
These are being hooked up presently and service should be restored soon. They’re flipping servers back online one at a time to ensure maximum hardware integrity.
It’s estimated that it will take around 12 hours to restore the transformer. This means that Rackspace will be running on generators for a while. Could get interesting.
Key update from my techie!
Unfortunately, the report I was given at 9:19pm stating that the site would be back up within 5 to 10 minutes was not accurate. According to Rackspace, they are still working to restore power, and as of right now they do not have a time estimate for restored service. The tech I spoke to felt that it would be at least two more hours until service was restored.
Also noteworthy: Popular site Laughing Squid is suffering the wrath of this outage as well. They have established a temporary WordPress blog with some detailed Rackspace status updates here. I also heard via their Twitter stream that 37signals is down due to this outage as well.
And we’re back! 10:55pm EST
Laughing Squid and the company I work for are back up and running. Saw another green light over at the Digg page for this article. However, it appears that 37signals is still down for the count at this time. I’ll provide more insight from my IT connection if Rackspace has provided any.
11:30pm EST - Well, 37signals is definitely back to life. Still no word from my IT connection. Knowing him, he’s probably locked himself in the server room to avoid a bombardment of salespeople asking questions about the situation. Translation: I probably won’t hear a lick of new info until tomorrow. Keeping my fingers crossed. In the meantime, Valleywag’s numerous sources will undoubtedly have one among them with their finger on the pulse of the situation.
11:45pm - Heard from my IT dude again. Got clarification on the original report (see above), but nothing super-new.
12:51am EST - Got an update from IT. He went and checked my.rackspace.com. Valleywag has a pretty comprehensive report from them as well, but here are the latest two blurbs from the Rackspace team.
Nov. 12th 9:30PM CST — As of 8:45 p.m. CST, temperatures are stabilizing in the DFW data center. In cases of servers that were proactively shut down to avoid overheating, we are starting the process of bringing the affected machines back online in a phased, gradual way. We are sorry for service disruptions caused by these events and understand how critical this is for your business. Throughout this process, we are making every effort to minimize impact on customer environments and return affected machines back to service as quickly and smoothly as possible.
We continue to work with vendors to re-establish utility power to the facility and will keep providing updates here in the portal.
Nov. 12th 11:30PM CST — As of 10:50 p.m. CST, all DFW servers that were proactively powered down earlier this evening, to avoid overheating, have now been powered back up. The Data Center Engineering team has been working to resolve the power issues caused by tonight’s traffic incident. The team is preparing to transfer machines affected by tonight’s power outage from generator power back to utility power. The servers and devices that were affected by the unrelated event over this past weekend will remain on generator power. We anticipate transferring the machines affected this evening back to utility power within the hour and expect the transfer to be non-disruptive to customer environments. We apologize again for the inconvenience these events have caused and have all hands on deck working fanatically to minimize the impact on your business.
Final Update: With everything that’s gone on in the past day regarding this Rackspace outage, I just couldn’t keep up with all the info, and I certainly didn’t want to harass the company’s RS rep into coughing up more info. I provided what knowledge I had at the time, some of which has been rightfully refuted by further, better information (as pointed out in some of the comments below). At this point, news on the accident which kicked off this chain of events is available at Valleywag.
The best news is that the person who had the accident is okay. Everything after that is debate over guarantees, serious reputation management challenges for Rackspace, and damage assessment for those who lost time and money.