Horror Stories - The Oncer

Over the years you hear some great IT horror stories. Hopefully by sharing them, others can learn and have a good laugh. Names have been changed to protect the innocent.

Part I - A New Hope

The team had created a bootable CD that could re-image an ESXi host within 10 minutes. Late one night, rather than troubleshoot an issue, Brent mounted the ISO and re-imaged the ESXi host, which fixed the problem.

He soon realised, though, that he had missed one of the important steps: disconnecting the fiber from the ESXi host. The automated re-image picked the first disk it saw, which in this case was LUN 0. All VMs on that LUN were wiped out. There were no VM-level backups, just in-guest file-level backups. It took days to restore most applications.

It became clear that the current backup solution didn't cut it, and funding was provided for a full VM backup solution. The boss said to Brent, "This is a 'oncer'. Something you'll only ever do once in your life."

Part II - Attack of the Clowns

Six months later, Brent re-imaged an ESXi host to fix a problem, and went home.

The next day his coworkers were seeing strange events in the monitoring solution, but the VMs were still accessible. Browsing one of the SAN datastores showed no VMs! Just then Brent came in and was asked, "What did you do last night?" "I just re-imaged a host… oh no, not again," he said, as his face showed the realisation of what had happened.

The weird thing was that the VMs on LUN 0 were still running, although they couldn't write to disk, so application data could still be exported / backed up to an external filesystem.

Brent went to see the boss. "You know that 'oncer'? I've done it twice".

Using the new VM backup solution, the VMs were restored before lunch. Impressive.

The boot CD was modified to filter out SAN LUNs and to provide a menu, with a warning and confirmation before proceeding.
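For illustration, here is roughly what that kind of safeguard could look like. This is a minimal Python sketch, not the team's actual boot CD logic: the `Disk` record, the transport labels in `SAN_TRANSPORTS` and both function names are hypothetical, and a real installer would get its disk inventory from the hardware. The idea is simply that SAN-attached disks never appear as candidates, and nothing is selected unless the operator types the target's name back.

```python
from dataclasses import dataclass

# Hypothetical transport labels that mark a disk as SAN-attached.
SAN_TRANSPORTS = {"fc", "fcoe", "iscsi"}


@dataclass(frozen=True)
class Disk:
    name: str        # hypothetical identifier shown in the menu
    transport: str   # how the host reaches the disk, e.g. "sas" or "fc"
    size_gb: int


def local_disks(disks):
    """Return only the disks that are not reached over a SAN transport."""
    return [d for d in disks if d.transport.lower() not in SAN_TRANSPORTS]


def confirm_target(disks, chosen_name, typed_confirmation):
    """Return the disk to re-image, or raise if any safety check fails."""
    candidates = {d.name: d for d in local_disks(disks)}
    if not candidates:
        # Never fall back to "first disk seen" when no local disk exists.
        raise RuntimeError("no local disk found; refusing to pick a SAN LUN")
    if chosen_name not in candidates:
        raise ValueError(f"{chosen_name!r} is not a local disk")
    if typed_confirmation != chosen_name:
        # The operator must type the disk name back exactly.
        raise ValueError("confirmation did not match; nothing was selected")
    return candidates[chosen_name]
```

With an inventory containing a fiber-attached `lun0` and a local `local0`, `confirm_target(disks, "local0", "local0")` returns the local disk, while asking for `lun0` raises an error no matter what is typed as confirmation.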

Lessons learnt:

  • Don't rely on humans to always follow a process
  • Don't automate bad processes
  • When automating, validate and include error checking
  • Ensure backups are good

Mistakes will be made.