Sunday, July 10, 2011

When the Safety Net Fails

It's been quite some time since my sysadmin/DBA skills were put to the test. Usually, when they are, it means something bad has happened.

Recently, a server I'm responsible for suffered a catastrophe in the form of a power outage. Usually, when such things happen and the UPS (if one exists) fails, you're still relatively safe, since corruption of the data itself is handled by the RAID. This means that if one HDD fails, you can pull it out, push in another, and everyone's happy.

Not my case. In the last power outage something far worse happened: the RAID controller died. As you may know, with many RAID controllers this means long nights of data recovery, since rebuilding the RAID initializes the disks, which means total data loss.

This is the kind of event no one plans for in terms of redundancy. The only real redundancy possible is a second server in a cluster configuration, but that doesn't solve this particular problem, and it has its costs. Backup is another form of "redundancy".

So the component that was supposed to save us in case of failure ended up failing us.
Luckily, with some witchcraft, we managed to reconstruct the RAID, losing only one of the logical drives. That drive had to be restored from backup. Yet that turned out to be the easy part, as another drive, the one containing the DB, appeared to be fine while actually hiding a far worse problem: corruption in the DB. Did I already mention it's good to have backups?

I applied my conclusions to my home equipment as well: keep a detached (as in "doesn't require power 24/7") backup, and get yourself a good UPS.
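
In case it's useful to anyone, here's a rough sketch of the kind of detached backup I mean, as a small Python script. Everything in it (the source path, the mount point, the rsync flags) is just an illustration of the idea, not my exact setup:

#!/usr/bin/env python3
"""Minimal detached-backup sketch: sync a directory to an external drive
that is only plugged in (and powered) for the duration of the backup.
All paths below are hypothetical; adjust them to your own setup."""

import subprocess
import sys
from datetime import date
from pathlib import Path

SOURCE = Path("/home/me/important")    # what to back up (hypothetical path)
MOUNT_POINT = Path("/mnt/backup")      # where the external drive gets mounted
DEST = MOUNT_POINT / f"snapshot-{date.today():%Y-%m-%d}"

def main() -> int:
    # Refuse to run if the drive isn't mounted; otherwise rsync would
    # quietly fill up the root filesystem instead of the backup disk.
    if not MOUNT_POINT.is_mount():
        print(f"{MOUNT_POINT} is not mounted; plug in the backup drive first.",
              file=sys.stderr)
        return 1

    DEST.mkdir(parents=True, exist_ok=True)

    # rsync does the actual copying; --archive preserves permissions and
    # timestamps, --delete keeps the snapshot an exact mirror of the source.
    result = subprocess.run(
        ["rsync", "--archive", "--delete", f"{SOURCE}/", str(DEST)],
        check=False,
    )
    if result.returncode != 0:
        print("rsync reported errors; do not trust this snapshot.", file=sys.stderr)
        return result.returncode

    # Flush filesystem buffers before the drive is unplugged.
    subprocess.run(["sync"], check=True)
    print(f"Backup written to {DEST}; you can unmount and unplug the drive.")
    return 0

if __name__ == "__main__":
    sys.exit(main())

The point is that the backup disk spends most of its life powered off and disconnected, so a power surge or a dying controller can't take it down together with the live data.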
