Ways to Spend Memorial Weekend
A week ago today was Memorial Day here in the US, which means the weekend leading up to it was Memorial Weekend. Apparently some people want to restore Memorial Day (née Decoration Day) to a fixed May 30th, but until that happens we enjoy an automatic long weekend. It is a time officially set aside to honor those who died in military service to our country, although for the most part it seems to be a barbecue day for family and friends. I’m not writing about that.
On the preceding Friday afternoon, the main server for work had a disk failure. This is a server with plenty of redundancy goodness, like a RAID-5 disk array. That should mean we can lose a whole disk and everything will be just fine. So I went home after work and mowed the lawn, knowing that the good folks at the data center were working to bring it back to life. By late Friday it was clear things were not going to be just fine: they had replaced the bad disk and were now getting strange error messages from the disk controller. They were going to try a few things overnight, and I went to bed. Saturday morning came with no success. I called the boss to let him know what was up, and we decided that a server house call might be in order.
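For anyone wondering why RAID-5 is supposed to survive a lost disk, here is a minimal sketch of the idea in Python. It is a simplification, not how our controller actually lays out data: it assumes three hypothetical data blocks and a single parity block, where the parity is the XOR of the data blocks, so any one missing block can be rebuilt by XOR-ing the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three made-up data blocks (one per disk) plus a parity block on a fourth disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

The rebuild only works if the survivors and the controller's view of the array are both intact, which is exactly what was not true in our case.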
We have two of these servers, normally set up to mirror each other, one in California somewhere and one in Tampa. I say normally set up to mirror each other because the backup server was in the slow process of being rebuilt after it had problems with the (identical) RAID card earlier. So, as Saturday evening approached I booked a ticket to Tampa for that night. I arrived in Tampa at midnight armed with a replacement RAID card and three bottles of water. No one from the data center could pick me up, so I grabbed a taxi and found my way there. The data center is in a strip mall, inside a huge converted furniture store. There are still signs for different departments, although doors and rooms have been added. Many, many doors, all with electronic card locks.
I went to work on the server there, but by 6 am I had had no luck. I pulled all the drives from the server, put in a new one with a bare operating system on it so we could at least do something with the server later, and headed back to the airport. I took off about 8:30 Sunday morning and got back to town around 11 or so. I slept for a few hours and then went to get help working with the drives I had brought back. With excellent technical help from a friend who runs a great hosting business, we spent about 6 hours poking and prodding the drives. The RAID card thought everything was OK, but it would fail whenever it tried to regenerate the array. Finally, somewhere around midnight Sunday, we started a raw copy of the last 120 GB or so of the disks to another (non-RAID) drive. Some sleep was had after that.
On Monday morning, Memorial Day itself, I started looking through what we had managed to copy. Essentially we had a single 120 GB file, and buried within it was the data I needed to get to. Through some magic and very good luck, I located a smaller, 6 GB section of that data which mostly corresponded to a backup of what I was looking for. Working through it was still very slow, so in order to be operational by Tuesday morning I switched to restoring just enough on the bare box back in Tampa to run the server software.
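I won't pretend the "magic" was this tidy, but one generic way to hunt for a known file inside a huge raw image is to scan it for a byte signature. The sketch below is an illustration under assumptions, not a record of what I actually ran: it assumes the backup begins with a recognizable signature (the gzip header bytes are used purely as an example), and the function name and chunk size are made up for the example.

```python
import io


def find_signature(stream, signature, chunk_size=1024 * 1024):
    """Yield the byte offsets at which `signature` occurs in a binary stream.

    The stream is read in fixed-size chunks. The last len(signature) - 1 bytes
    of each buffer are carried into the next one so that a match straddling a
    chunk boundary is not missed.
    """
    if not signature:
        raise ValueError("signature must not be empty")
    overlap = len(signature) - 1
    tail = b""
    base = 0  # absolute offset of the first byte of `tail + chunk`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(signature)
        while pos != -1:
            yield base + pos
            pos = buf.find(signature, pos + 1)
        tail = buf[-overlap:] if overlap else b""
        base += len(buf) - len(tail)


# Example signature: the first three bytes of a gzip stream (assumed format).
GZIP_MAGIC = b"\x1f\x8b\x08"

# A tiny in-memory stand-in for the raw image, with a deliberately small
# chunk size so the boundary handling is exercised.
image = io.BytesIO(b"\x00" * 10 + GZIP_MAGIC + b"\x00" * 5 + GZIP_MAGIC)
offsets = list(find_signature(image, GZIP_MAGIC, chunk_size=4))
```

On a real image you would pass a file opened in binary mode instead of the `BytesIO` stand-in, and each offset is only a candidate: a short signature will also match random data, so every hit still has to be checked by hand.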
At some point on Monday I had dinner at my parents’. It was a barbecue-style dinner.
Then, from Monday night until about 6:30 am on Tuesday, I rebuilt the server software. From there I started processing that smaller chunk of data, and by Tuesday night I had extracted a large and important piece. I pretty much had to sleep for a bit at that point. After that, things started to go fairly well. By last Friday nearly everything was restored, and I was once again able to sleep more often than every other night.
From Friday night to Thursday I felt sick. It was a great combination of stress, disappointment, lack of sleep, lack of food, etc. This was about as close to killing our company as you could come and manage to survive. Thousands of elementary, middle, and high school kids have their work stored on our server, and it’s my responsibility to make sure this doesn’t happen. Yikes. We’re back to having a working backup server, and I’m adding a third off-site nightly backup, too.