Oops

Posted by Tom Hallman at Nov 19, 2010 03:00 PM

Around 3:30pm yesterday, something bad happened.

But let's back up.

For the past couple of months we've been migrating our server virtualization infrastructure from Xen to KVM.  It's something akin to lifting up a house to replace the foundation - but without anyone in the house noticing that anything has happened.

We had a good plan.  In fact, the plan has gotten even better since we began using some agile techniques, including pairing.  By around 3pm yesterday, we were actually ahead of schedule and really excited about it.  We'd migrated a good number of our virtual servers from Xen to KVM and had planned out the remaining migrations so as to minimize downtime.

So all we needed to do was make a little change in the logical volume manager.  To use our house analogy, we were going to move the floor supports from one place to another.  While not a small task, it's easy in concept. Thus, we typed in the command.

It was about 3:30pm.

Suddenly, very strange things started happening.  Our house analogy breaks down a bit here, but let's just say that some very important, low-level data was being reported correctly on our new KVM server and not correctly on the old Xen server.  To our surprise, nothing actually stopped immediately.  But we were no longer sure what magic was holding it all together.  As our concerns about data corruption began rising, we decided that the best thing to do would be to reboot the Xen server (along with all its virtual servers).  If all went well, the fresh start would clean up any leftover junk and all the virtual machines would come up correctly.  So we rebooted it.

We never heard from those virtual machines again.

To make a long story slightly shorter, we realized that our best hope of fixing everything was to just follow the plan we'd already made for migrating the remaining Xen virtual machines over to the new KVM server.  In other words, we'd do the next two weeks of work... but in one night.

After we prayed together, Brian and Adam went to work migrating the remaining machines.  Jason pulled the "Beta" label off of our new KVM-based terminal server (which only a select few had been testing at that point) and moved over the remaining users still left on the now-dead Xen-based terminal server.  I got on the phone (email was down) and called through our Staff list to let them know what happened and to ask them to pray.

The highlight of the evening (after Jason and I had gone home) was when Brian's wife Carin and Adam's wife Jen showed up with food to encourage their husbands!  Both couples had scheduled date nights anyway - this just looked different than they'd expected.

By 8pm that evening, almost every service was migrated and running correctly on KVM.  I sent an email out telling the Staff what had happened and we quickly got back many encouraging responses!

Only in the LORD's grace can we experience a server catastrophe and yet actually arrive further down the line in our project.

Only in the LORD's grace can a canceled date night turn into an opportunity for two ladies to show their love and devotion to their husbands.

Only in the LORD's grace can we find rest, peace and joy even in the midst of a crisis.  No doubt this will not be our last one.  But our God knows the very hairs on our heads.  He can certainly handle corrupted disks - and whatever else comes our way.
