Rackspace Outage 07
By Giania • Nov 12th, 2007 at 10:33 pm • Category: Fnord
EDIT 06-29-2009: Please go here http://www.randomkitty.net/blog/2009/06/29/rackspace-outage-of-2009-06-29-2009/ if you arrived while searching for the most recent outage issue at rackspace. Thanks!
Author’s Note: The information below is based upon reports which were handed to me during the incident through various sources. I am well aware that at this point there are inaccuracies. I intend to maintain this post as-is to serve as a record of what occurs when information is not disseminated in an organized fashion. For a full report on what happened, there are several sources, but it may be best to go with Rackspace’s finished report on the incident. Once again I appreciate all the feedback, and I am glad that I was able to provide some insights to at least a few people during this business crisis. Thank you.
Just heard from my head IT guy at my office:
Rackspace had a major power outage at its Dallas location due to a truck which crashed into a utility pole, causing it to crash into a transformer, which then blew up in a huge ruckus.
There are backup generators which kicked in immediately to prevent serious data loss and server damage, but it wasn’t enough to power the heating and cooling.
They realized that the infrastructure of generators wasn’t enough to handle it all when the temps were going up in the data center so they shut back down.
So what did they do? They trucked in several (around 6-10) 100 kilowatt large generators to power the HVAC.
These are being hooked up presently and service should be restored soon. They’re flipping servers back online one at a time to ensure maximum hardware integrity.
It’s estimated that it will take around 12 hours to restore the transformer. This means that Rackspace will be running on generators for a while. Could get interesting.
Key update from my techie!
Unfortunately, the report I was given at 9:19pm stating that the site would be back up within 5 to 10 minutes was not accurate. According to Rackspace, they are still working to restore power, and as of right now they do not have a time estimate for restored service. The tech I spoke to felt that it would be at least two more hours until service was restored.
Also noteworthy: Popular site LaughingSquid is suffering the wrath of this outage as well. They have established a temporary WordPress blog with some detailed Rackspace status updates here. I also heard via their Twitter stream that 37signals is down due to this outage as well.
And we’re back! 10:55pm EST
LaughingSquid and the company I work for are back up and running. Saw another green light over at the Digg page for this article. However, it appears at this time that 37signals is still down for the count. I’ll provide more insight from my IT connection if Rackspace has provided any.
11:30pm EST – Well, 37signals is definitely back to life. Still no word from my IT connection. Knowing him, he’s probably locked himself in the server room to avoid a bombardment of sales people asking questions about the situation. Translation: I probably won’t hear a lick of new info until tomorrow. Keeping my fingers crossed. In the meantime, Valleywag’s numerous sources will undoubtedly have one among them with their finger on the pulse of the situation.
11:45pm – Heard from my IT dude again. Got clarification on the original report (see above), but nothing super-new.
12:51am EST – Got an update from IT. He went and checked my.rackspace.com. Valleywag has a pretty comprehensive report from them as well, but here are the latest two blurbs from the Rackspace team.
Nov. 12th 9:30PM CST — As of 8:45 p.m. CST, temperatures are stabilizing in the DFW data center. In cases of servers that were proactively shut down to avoid overheating, we are starting the process of bringing the affected machines back online in a phased, gradual way. We are sorry for service disruptions caused by these events and understand how critical this is for your business. Throughout this process, we are making every effort to minimize impact on customer environments and return affected machines back to service as quickly and smoothly as possible.
We continue to work with vendors to re-establish utility power to the facility and will keep providing updates here in the portal.
Nov. 12th 11:30PM CST — As of 10:50 p.m. CST, all DFW servers that were proactively powered down earlier this evening, to avoid overheating, have now been powered back up. The Data Center Engineering team has been working to resolve the power issues caused by tonight’s traffic incident. The team is preparing to transfer machines affected by tonight’s power outage from generator power back to utility power. The servers and devices that were affected by the unrelated event over this past weekend will remain on generator power. We anticipate transferring the machines affected this evening back to utility power within the hour and expect the transfer to be non-disruptive to customer environments. We apologize again for the inconvenience these events have caused and have all hands on deck working fanatically to minimize the impact on your business.
Final Update: With everything that’s gone on in the past day regarding this Rackspace outage, I just couldn’t keep up with all the info, and I certainly didn’t want to harass the company’s RS rep into coughing up more info. I provided what knowledge I had at the time, some of which has been rightfully refuted by further, better information (as pointed out in some of the comments below). At this point, news on the accident which kicked off this chain of events is available at Valleywag.
The best news is that the person who had the accident is okay. Everything after that is debate over guarantees, serious reputation management challenges for Rackspace, and damage assessment for those who lost time and money.



thanks for these updates– I’m waiting for several Rackspace-hosted machines to come back online. v disappointing.
Appreciated your updates – we’re back up now.
We too suffered from this outage — I guess we’ll see how well they stand by their “100% uptime or your money back” guarantee.
There is no excuse for an outage like this. Rackspace should have easily been able to determine that their generators were not capable of powering both the servers and the HVAC system, so the only thing I can figure is that they never planned on losing “wall-power” to the entire building at one time. Simply inexcusable contingency planning for a company like Rackspace.
[...] portal an hour into the outage, but up until that point I had to rely on Twitter updates and blog posts for information on the [...]
Chis… I’d like to see you design a datacenter.
Chis.. Let me apologize and rephrase. Perhaps I read your comments to be more venomous than you had intended, or perhaps not. Let me state that I am not an employee of or in any way a spokesperson for Rackspace. However, I have many friends who at this very moment are receiving the rage and bile of many disappointed customers, as you too must be. I have been in their/your position a few times myself and I do not envy it.
Agreed, there may be mistakes in the design of this datacenter, which Rackspace will pay dearly for. I have no doubt the SLA terms will be honored. Please, all the details of the outage are not yet disclosed. Your temperance and patience are most appreciated while this is being worked through.
It is important to note that this is Rackspace’s SECOND outage in the past 48 hours! They had a huge outage early Sunday morning that left my company with no server for 5 hours!
And then tonight again…almost 5 hours (and still no server) because of another power outage. They must be kidding. We all pay a major premium for the “zero downtime guarantee”….uhh…what’s going on!?
No fault to the IT and sales people over there…they are wonderful. But it sounds like management has cut corners with the back-up and emergency systems. This is really unacceptable (sorry, just had to vent and was searching for another blog talking about this!).
The Laughing Squid support blog isn’t temporary… it’s been up for almost two years as a backup source of squidly status and info. Just another symptom of excellent customer-oriented service!
Thanks for the updates Giania! Please keep em coming!
@Jeff – I realized after the fact and neglected to correct the situation. All I’d looked at was the one post related to this particular downtime event, without checking out the rest of the blog. And I was too hyperfocused on the unfolding events to even take notice of the fact that it had archives. Thank you for the correction, because I doubt I would have gone back to check.
While I am not hosted in Rackspace’s DFW datacenter, I am hosted at Rackspace and have been following their detailed updates. Unfortunately, the information posted here is incorrect, and it can cause (and is causing) some misconceptions about events or abilities.
> “They realized that the infrastructure of generators wasn’t enough to handle it all when the temps were going up in the data center so they shut back down.
So what did they do? They trucked in several (around 6-10) 100 kilowatt large generators to power the HVAC. ”
This was not the case. If you are a customer of Rackspace, you can call them and verify that they have 100% EXCESS generator capacity. However, as they state in their public updates, the chillers failed to start when switching from utility to generator power (not because of a lack of power), so they had to be started manually.
Please understand Thomas, that at the time I was going on reports which had been handed down to me from Rackspace to my company’s IT. At that time, he was told their generators were not appropriately handling the HVAC and that more generators were being brought in, so that is what I reported.
I have no way of knowing whether my direct source or his Rackspace representative confused these events, other than comparing them to reports from my.rackspace.com. (There is also a possibility that they did some reputation management when constructing these customer base reports.)
As a Rackspace client for four years, I have seen a slow, but steady, drop in the quality of support they provide. It is common to have to go through two or three techs before you get someone with knowledge. This, to me, shows growth without planning for qualified support. Now, massive outages and a datacenter that does not have sufficient backup power clearly show even more cost cutting. We have seen routers, load balancers and servers all misconfigured by these clowns. Seriously, how long can they live on a reputation that no longer matches their true capabilities?
Our Rackspace representatives, by and large, have all been incredibly helpful and, from what I can tell, totally capable. Then again, my measurement of bad service is the company’s previous experience with INetU, who were of little-to-no help for serious ongoing issues we had.
In my experience Rackspace is very helpful unless there is a problem. And when there is you won’t find anyone to blame for it or get answers (besides the usual fluff). They normally want to spend 1/2 day on the phone with you, have you speak with 8 different managers and in the end, when your head is spinning, you still don’t have any useful information. It’s happened time and time again. I’m about fed up with Rackspace and I’m looking into other options.
Giania,
You do have bad information concerning the chillers and generators. For data centers, it’s standard for the chillers to shut down when going from utility power to generator power. It takes about 10 seconds for the generators to generate enough power for the data center. During that time, the servers and networking equipment run off the UPSes, which definitely cannot support the chillers. Once the generators spin up, the chillers turn back on. This is unfortunately what did not happen. The power source for the servers was uninterrupted. As a Rackspace customer, I am disappointed, but I know that they will honor their promises and provide the service that we demand. They have been extremely forthcoming with information and I appreciate their dedication… Even the BEST engineered systems sometimes fail.
Respectfully,
Sam
All I have to say is that you’ve got your story ALL wrong about what happened – why don’t you truck over to http://www.rackspace.com and read about what REALLY happened and stop spreading misinformation!
Sam and Anonymous:
As previously stated, I was originally going on information that was given to me by my company’s IT department FROM a Rackspace representative. I cannot vouch for the accuracy of either of those two people.
However you will note that I provided other sources of information on the situation, up to and including postings from my.rackspace.com.
At no time did I claim to be the authority on the situation.
Thank you for the further info & clarifications, Sam.
It’s perfectly understandable that Giania got a few facts wrong, since Rackspace themselves were silent for what seems like far too long. I appreciate her posting the info when she did!
[...] jealous. Who wouldn’t want to get paid money in buckets for coming up with the hottest dish? I had a flash in the pan when I got that inside tip on the Rackspace situation, even if it turned ou… (The article should be accurate overall, or at least provide sources to the final, real [...]