Over the last couple of days I had a sad opportunity to use Poul-Henning Kamp's recoverdisk(1) utility.

Since it turned out to be a life- (well, disk-) saving device, and it is covered by the beer-ware license, I definitely owe phk a beer.

Yesterday morning I discovered that my server was down. After traveling to the server room (there is no remote console) and pretending that my index finger is square in its cross-section, my investigation showed that there was something fishy with the system disk (which is backed up but not mirrored).

Namely, there were bad blocks, and the periodic scripts that run at night touched some of them, which led to bad things.

Now, the first thing I did was to order a replacement disk. The problem is that it won't arrive until the beginning of next week, and I want the box up and running now. Even worse, during the weekend I am going to be in Stockholm for the Nordic Perl Workshop (oh, and by the way, I still have a presentation to prepare), and thus won't be able to fix things that require my presence on-site.

After asking around, I got pointed in the direction of recoverdisk(1) by Phil. Thankfully, recoverdisk /dev/ad4 told me exactly how many bad blocks there were (one) and at what offset.

The next step was to make the on-disk controller remap the block to one of the good reserve sectors on the disk. While I am sure that there are programs that will do just that, I am not aware of any that run on FreeBSD.

Besides, having the offsets, it was a trivial task to quickly create a simple one-shot program that writes something to the bad block, so that the disk will have an opportunity to remap the sector all by itself.

The only problem I had with this was that I could not open(2) the raw disk for writing while any partitions on it were mounted. I was ready to move the disk to another box to run the program there, but Flemming helpfully told me that doing sysctl kern.geom.debugflags=16 would do the trick. And it did. Thanks, Flemming!
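
For the curious, a one-shot program of this kind could look roughly like the sketch below. It is written in Python for illustration and is not the original program: the function name overwrite_sector and the 512-byte sector size are assumptions, and the offset has to be the byte offset of the bad block. It overwrites data, so it should only ever be pointed at a sector that is already unreadable.

```python
import os

SECTOR_SIZE = 512  # assumption: check the real sector size of the disk


def overwrite_sector(path, offset, sector_size=SECTOR_SIZE):
    """Write one sector of zeros at a sector-aligned byte offset.

    Writing to a bad sector gives the drive a chance to remap it.
    Returns the number of bytes written.
    """
    if offset % sector_size != 0:
        raise ValueError("offset must be a multiple of the sector size")

    # Open for writing only; reading the bad sector is what fails.
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, bytes(sector_size), offset)
        if written != sector_size:
            raise OSError("short write: %d of %d bytes" % (written, sector_size))
        os.fsync(fd)  # push the write out to the device
    finally:
        os.close(fd)
    return written


# Hypothetical usage, with bad_offset taken from the recoverdisk output:
#     overwrite_sector("/dev/ad4", bad_offset)
```

With partitions on the disk still mounted, the open call is expected to fail unless the kern.geom.debugflags sysctl has been set as described above.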

After this there were four more steps: run recoverdisk again to make sure everything is fine, run fsck on all partitions, put the box online, and move the system to the new and shiny disk when it finally arrives.

While this last step will have to wait a bit more, the fact that you are reading this shows that everything else worked.

I do realize that the primary purpose of recoverdisk(1) is to salvage data from media that has gone hopelessly bad. Nevertheless, I think that my example shows that it is pretty darn useful in other cases as well.

Several people helped me along the way. I have not yet mentioned Lars, who did some heavy lifting, and Kristoffer, who did the network magic with subnet routing when the box was moved to another, closer location.

Lessons learned:

  • a remote console is useful;

  • mirroring the system disk is essential.

Open question: what does phk do with all the beers he is getting when I know for a fact that he does not drink much?