About that data loss at Microsoft/Danger

Yesterday the usual news outlets picked up the rumors circulating about the reasons for the data loss at Danger. For the uninitiated: Danger is the company behind the Sidekick. TG Daily started by reporting information provided by a reader in the article “Oracle, Linux, Sun to blame for Sidekick Danger data loss”:

TG Daily reader Tommy T informed us earlier today that: “Microsoft was running on Sun Solaris/Linux/Oracle-based Danger servers at the Verizon Business Center in San Jose as part of a contractual obligation to T-Mobile.”

That isn’t really a surprise. Despite all the press buzz, many, many sites still deploy and use Solaris for their mission-critical stuff, while using Linux for stuff that scales well on clusters. Of course Microsoft was asked for the reasons and just reported:

Sidekick runs on Danger’s proprietary service that Microsoft inherited when it acquired Danger in 2008. The Danger service is built on a mix of Danger created technologies and 3rd party technologies.

I think that’s a good answer: we don’t blame anyone, but we tell the world that the application wasn’t developed at Microsoft and that the infrastructure wasn’t Microsoft either. Especially as Microsoft doesn’t need additional doubts about the viability of their operating and infrastructure environment after losing the .net implementation at the London Stock Exchange in favor of a Solaris/Linux implementation. I don’t expect Microsoft to shed more light on the situation. And it isn’t as if they were the only shop in the world using RAC on Sun. The Inquirer expresses the situation in a nice way - “Oracle and Sun fingered for Sidekick fiasco”:

That would all be well and good but for the fact that the Sun-Oracle combo is not exactly rocket science and is popular among those who manage very large, mission critical corporate databases.

I think the truth about this outage is much more complex than the various media outlets state. IT systems are called systems for a reason. They are large, complex interrelationships of components. There is a database, there is a server, there is storage, there is a SAN, there is a network. And there are many devils in the details of any system. Whoever says something different simply lies. But in the end it’s pretty irrelevant. It’s like with aircraft crashes: there isn’t a single reason why an aircraft crash happens. When a service has a major CNN moment, there are several reasons why it happened. It may start with a small hiccup, and other effects lead to a disaster. In the end, a large heap of rumors surrounds this situation. Daniel Eran Dilger writes in “Microsoft’s Pink/Danger backup problem blamed on Roz Ho”:

According to the source, the real problem was that a Microsoft manager directed the technicians performing scheduled maintenance to work without a safety net in order to save time and money. The insider reported:
In preparation for this [SAN] upgrade, they were performing a backup, but it was 2 days into a 6 day backup procedure (it’s a lot of data). Someone from Microsoft (Roz Ho) told them to stop the backup procedure and proceed with the upgrade after assurances from Hitachi that a backup wasn’t necessary. This was done against the objections of Danger engineers.

As far as I understand the rumor, they started a backup, deleted the only other backup in the course of this procedure, then stopped the backup and did a firmware update. Sorry … I don’t even do an update of my notebook without starting a backup before doing so, and my blog replicates its data every six hours to a system in my office at home. A backup is a safety net. You don’t do any task that could put your data at risk without having a backup. And you’d better test the restore, too. Of course this takes time: at my blog a few minutes … at Danger a few days. But you can be sure that you won’t appear on CNN. (Okay, I wouldn’t appear on CNN anyway, but I would get some questions in our coffee kitchen at work.) Sh.t happens, and customers understand this, but they don’t accept that you don’t value their data. Perhaps this is the real story of this mishap.
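
Just to make that point a bit more tangible: the “no backup, no risky change” rule can even be enforced in tooling instead of being left to a manager’s judgment under time pressure. The following is only a minimal sketch - the paths, the 24-hour freshness window and the upgrade command are purely my own assumptions, not anything from the Danger environment - of a guard that refuses to start a maintenance job unless a recent, restore-tested backup exists:

```python
#!/usr/bin/env python
# Minimal sketch of a "no verified backup, no upgrade" guard.
# Everything here (paths, freshness window, upgrade command) is a
# hypothetical assumption for illustration, not the real Danger setup.

import os
import subprocess
import sys
import time

# Marker file that the backup job touches only after a successful test restore.
BACKUP_MARKER = "/var/backups/last_verified_backup"
MAX_BACKUP_AGE = 24 * 3600  # accept only backups younger than one day
UPGRADE_CMD = ["/usr/local/bin/run_san_firmware_upgrade"]  # placeholder command


def backup_is_fresh(marker, max_age):
    """Return True if the marker exists and is younger than max_age seconds."""
    try:
        age = time.time() - os.stat(marker).st_mtime
    except OSError:
        return False
    return age < max_age


if __name__ == "__main__":
    if not backup_is_fresh(BACKUP_MARKER, MAX_BACKUP_AGE):
        sys.exit("Refusing to start the upgrade: no verified backup younger than 24 hours.")
    subprocess.check_call(UPGRADE_CMD)
```

The script itself is trivial; the point is that “do we have a tested backup?” becomes a gate the maintenance window has to pass, not a question somebody can wave away after assurances from a vendor.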

I really can’t imagine a single reason why an HDS/HP/EMC/Sun storage support engineer would say: “Hey, you don’t need a backup. We will do the firmware upgrade without it.” I can just imagine that the engineer said: “Well … I have done several upgrades so far, and there wasn’t a problem that sent me to the tapes.” By the way, there is one thing that keeps me really puzzled. It’s a piece of information at The Register:

It involves 20 or so CentOS Linux servers [...]

Sorry, CentOS is a nice variant of Linux, but I don’t really think it’s a good idea to use a Linux distribution with, to say the least, questionable commercial support (the “Commercial Support” menu item at centos.org delivers a page stating that there is no content) in a mission-critical environment. I know that an enterprise Red Hat or SUSE distribution/support agreement costs you some bucks, but in such an environment I would opt for the professionally supported variant. But again: this is an example of an attitude. And this attitude is not one of valuing the data of your users. A hiccup in a service most often has its reason in the realm of technology, but a real disaster has more reasons, deeper reasons. Just 10% of them are technological. Or to put it more precisely: a disaster can have a technical root cause, but the step from a hiccup to a disaster always needs reasons outside the realm of technology. And if the rumors surrounding the Danger outage are true, this outage is one of the best examples to show it.