Monday, May 7, 2007

Backing Up Web 2.0

So Alice is happily managing her administrative duties: filing a few purchase orders, answering some emails, updating the conference room schedule, watching a few YouTube videos, checking out the links her friends send her on del.icio.us, just the usual. She's typing away, getting things done, tapping her fingers to the music in her headphones as the pages load, and suddenly she realizes they aren't. Something has gone down in the network. It could be the connection, it could be the server, it could be the service; the ultimate result is that she is no longer working.

I was just bitten by this when del.icio.us stopped responding. Suddenly I couldn't look up the how-to I had been working from last week.

Networked applications are great. They let us all communicate and collaborate better, they centralize things, and we're all more productive. The problem is that when networked applications fail, they tend to fail catastrophically. If the network is down, no one can work.

The problem really began when we got always-on internet connections with good bandwidth. Before our networks were reliable and fast, things were designed for intermittent communication. Tools like FTP and CVS are designed to create a local copy for you to work on and then upload to the server when it's convenient. Now the bandwidth is high enough that we can just keep our documents on the server all the time, be they file shares or Google Spreadsheets. In the old model, if the network went down everyone could keep working on their local copies. Now when the network goes down no one can work.
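As a rough illustration of the old model, here is a minimal Python sketch that keeps a local copy of whatever it last pulled from the server and falls back to that copy when the network is down. The function name `fetch_with_local_copy`, the `fetch` callable, and the cache directory are all hypothetical stand-ins for whatever your own tools provide; this is not how FTP or CVS work internally.

```python
from pathlib import Path
from typing import Callable, Tuple


def fetch_with_local_copy(
    name: str,
    fetch: Callable[[str], str],
    cache_dir: str,
) -> Tuple[str, str]:
    """Return (text, source), where source is "server" or "cache".

    `fetch` is a hypothetical callable that retrieves the named document
    from the server and raises OSError (for example ConnectionError)
    when the network or the service is unavailable.
    """
    local_copy = Path(cache_dir) / name
    try:
        text = fetch(name)
    except OSError:
        # Server unreachable: keep working from the last copy we saved.
        if local_copy.exists():
            return local_copy.read_text(encoding="utf-8"), "cache"
        raise  # No local copy either, so there is nothing to fall back to.
    # Server reachable: refresh the local copy for the next outage.
    local_copy.parent.mkdir(parents=True, exist_ok=True)
    local_copy.write_text(text, encoding="utf-8")
    return text, "server"
```

The point of the design is only that every successful fetch leaves something behind on the local disk, so an outage costs you freshness rather than access.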

For this reason it makes a lot of sense to have backups, both of data and procedures, for when your web applications fail and servers fail. Who do you talk to to submit that purchase order? Who will keep track of changes to a collaborative document? How will you get your files from one system to another? Some suggestions:
  • Designate an individual to be in charge of each process, such as documents or purchasing or code check-ins, so that there is a central point of contact when the system stops.
  • For web-forms and submissions make sure you have paper backups and that people know where they are and have a supply on hand to last at least a day.
  • Use technologies where users keep a local cache such as AFS and SVN instead of having all files reside on the server.
  • Never entirely replace something with a networked counterpart. Leave the previous version around so people can roll back to something familiar when the new system fails.
  • Make sure you have out-of-band communication with your systems. This is usually a keyboard and monitor or a serial connection to the system. If you get in the habit of SSHing into everything, you'll have a lot of trouble fixing a machine after you accidentally assign it an IP address of "192.168.1". Yes, that's three octets, and yes, I've done that while trying to remotely roll over a firewall when I was out traveling.
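On that last point, a small sanity check before pushing an address to a remote box would have saved me. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `check_ipv4` is made up for illustration, and it only checks that the string is a well-formed IPv4 address, not that it is the right one for your network.

```python
import ipaddress


def check_ipv4(candidate: str) -> ipaddress.IPv4Address:
    """Parse candidate as a dotted-quad IPv4 address or raise ValueError.

    Hypothetical helper: run it on an address before applying it to a
    machine you can only reach over the network.
    """
    try:
        return ipaddress.IPv4Address(candidate)
    except ipaddress.AddressValueError as exc:
        # A three-octet typo such as "192.168.1" ends up here.
        raise ValueError(f"refusing to apply {candidate!r}: {exc}") from exc
```

Calling `check_ipv4("192.168.1")` raises a `ValueError` instead of letting the typo reach the firewall, while `check_ipv4("192.168.1.1")` returns a parsed address. It is no substitute for a console cable, but it catches the dumbest version of the mistake.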
Fortunately, most of these outages are brief, but it is important to remember that nothing has 100% uptime. To misquote Fight Club: on a long enough timeline, the uptime of any system approaches 0%.
