Saturday, February 16, 2008

Web 2.0 Meets Error 33

Apparently Amazon's Elastic Cloud snapped yesterday and havoc rained down on a number of Web 2.0 sites. This is unfortunate because the same kind of technology was deployed very rapidly (elastically?), exactly one year ago, to help search for the missing computer scientist and yachtsman Jim Gray.

When I was at Xerox PARC, we had a term for this kind of failure mode: Error 33. Error 33 states that it is not a good idea to make the success of your research project depend on someone else's research project, which may itself fail. The term was coined by the first Director of Xerox PARC, Dr. George Pake, and the nomenclature is reminiscent of Catch-22.

Error 33 is an all-too-appropriate reminder that a lot of Web 2.0 technology, which is hyped as ready for prime time, is really still in the R&D phase. It's probably no worse than very annoying when SmugMug is off the air for several hours, but mission-critical services like banks and hospitals should approach with caution. Higher reliability is likely to come only at a higher premium.

2 comments:

Unknown said...

Not exactly a comment on the AWS outage but an interesting link to EC2 performance oddities - http://teddziuba.com/2008/02/the-amazon-ec2-swindle.html

Neil Gunther said...

An interesting link indeed. The "time stolen" metric is discussed in the February blog entry entitled,
"Paravirtualization in VMWare Server"
(http://perfdynamics.blogspot.com/2008/02/transparent-paravirtualization-in.html)
This is one of the prices paid when using hypervisors. If his code was competing with another VM guest O/S then, under certain circumstances, he's only going to get the CPU resources 50% of the time. At least VMware now reports that fact, which is what I've been arguing for.