October 11, 2005

Hard Drive




It started with a simple little power failure, sometime after midnight, just after I had published my EAA Chapter's newsletter. When the power came back on, Linux would not boot, and I got a nasty surprise: "Kernel Panic!" was the last thing on the monitor.

It took from the wee hours of last Thursday morning until about 3 o'clock yesterday afternoon to get the drive put back together and running again. For those who may run into this little disaster, here's a rundown of what seems to have happened and how it got fixed:

Apparently the power failure corrupted the superblock on the drive, and accordingly Linux would not mount it. But it gets a little more complicated. The system under Fedora Core 3 was set up with the Logical Volume Management (LVM2) disk management system. Under this approach the boot partition is an ext3 partition, and the root "/" and swap partitions are logical volumes inside an LVM2 partition. It took a while for that to sink in, particularly after repeated attempts to run dumpe2fs against the hardware partition devices to find the backup superblocks failed.

After I could not get a rescue disk to mount the drive, I took a new 120gb drive, built a Fedora Core 3 system from scratch, and dug out the source code for the file system utilities. After some snooping around I started modifying a hack routine called "findsuper.c" to raw read the device and see what was on it. The boot partition looked fine. But when I dumped some of the second partition to disk and looked at it with a binary editor, I got a nasty surprise: where the superblock on an ext3 system should be, I found the text reference for the LVM partition. The real meaning of my fdisk checks on the disk now dawned on me!
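The same check can be sketched in a few lines of Python. This is only an illustration of the idea, not the modified findsuper.c: it assumes the standard on-disk layout, where an ext2/ext3 primary superblock starts 1024 bytes into the partition with the magic number 0xEF53 stored 56 bytes into it, and where an LVM2 physical volume carries the text "LABELONE" at the start of one of its first four 512-byte sectors. The function names are made up for the example.

```python
import struct

SECTOR = 512
EXT_SB_OFFSET = 1024      # ext2/ext3 primary superblock starts 1024 bytes in
EXT_MAGIC_OFFSET = 56     # offset of the magic field inside the superblock
EXT_MAGIC = 0xEF53
LVM_LABEL = b"LABELONE"   # LVM2 label marker, found in one of the first 4 sectors


def identify(head: bytes) -> str:
    """Classify the first bytes of a partition as 'lvm2', 'ext' or 'unknown'."""
    # Look for the LVM2 label first; an LVM partition has no ext superblock
    # of its own at the usual place.
    for sector in range(4):
        start = sector * SECTOR
        if head[start:start + len(LVM_LABEL)] == LVM_LABEL:
            return "lvm2"
    pos = EXT_SB_OFFSET + EXT_MAGIC_OFFSET
    if len(head) >= pos + 2:
        (magic,) = struct.unpack_from("<H", head, pos)  # little-endian 16-bit
        if magic == EXT_MAGIC:
            return "ext"
    return "unknown"


def identify_device(path: str) -> str:
    """Read the first 4096 bytes of a device or image file and classify it."""
    with open(path, "rb") as dev:
        return identify(dev.read(4096))
```

Run as root against the second partition of a drive set up like this one, identify_device would be expected to report "lvm2" rather than "ext", which is the surprise the binary editor delivered.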

I didn't have very good tools for LVM under Core 3, so I bought the book on Core 4 and built a Core 4 system from scratch on the 120gb drive. I got a little smarter (this is key!) when I built the Core 4 system: I explicitly named the LVM components differently than the Core 4 defaults. This was necessary because otherwise Core 4 would never see the defective drive's LVM structure, since both would have the same identifiers. With this setup I could then use the LVM GUI to find the defective drive's root volume, which would now be under /dev/VolGroup00/LogVol00. I also installed Core 3 on a small, otherwise useless 5gb drive. With the little drive for testing I could now mount its root partition at /dev/VolGroup00/LogVol00 and run dumpe2fs to get a report on the structure of the drive and where the backup superblocks were.

Armed with that info I started dumping sections of the drive to disk for snooping with the binary editor to make sure our understanding of how ext3 was structured jibed with reality. With headway there, it was time to try the bad drive.

Dumpe2fs would not work on the bad drive, but debugfs now would if you opened the drive read-only with the catastrophic option (-c) turned on. I thought, well, here goes nothing... Let's just run the "ls" command to see if we can read a directory... Voila!... or "Balok" in Lord of the Rings parlance... We could read the drive's root directory. That 5gb drive was the key to figuring out how to run debugfs and get it to open the drive. From the little drive I could see that Linux sets the block size at 4096, and from it I also got the backup superblock address. The boot partition's block size is 1024, the root partition's is 4096, and the difference is very important.
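To see why the block size matters so much, here is a rough sketch of where the backup superblocks land. It assumes the usual mke2fs defaults, which may not match every drive: 8 x block size blocks per group, the sparse_super feature (backups only in group 1 and in groups that are powers of 3, 5 and 7), and a first data block of 1 for 1024-byte blocks and 0 otherwise. The function names are invented for the example, and the numbers from dumpe2fs on a healthy drive of the same layout should always win over this arithmetic.

```python
def backup_groups(group_count):
    """Block groups holding a backup superblock under sparse_super."""
    groups = {1} if group_count > 1 else set()
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    return sorted(groups)


def backup_superblocks(block_size, total_blocks):
    """Block numbers of the backup superblocks, assuming default geometry."""
    blocks_per_group = 8 * block_size                  # one bitmap block per group
    first_data_block = 1 if block_size == 1024 else 0
    usable = total_blocks - first_data_block
    group_count = -(-usable // blocks_per_group)       # ceiling division
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count)]


# A 1024-byte-block partition and a 4096-byte-block partition put their
# first backup in completely different places.
print(backup_superblocks(1024, 100000)[:3])   # [8193, 24577, 40961]
print(backup_superblocks(4096, 300000)[:3])   # [32768, 98304, 163840]
```

Under these assumptions, a backup address worked out for 1024-byte blocks points at the wrong spot on a 4096-byte-block partition, and the other way around.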

Anyhow, you have to run debugfs with both the block size and the backup superblock address specified. That done, I was able to start dumping directories onto the new Core 4 drive. After pulling off every key thing I could think of (including the 10+gb flightgear terrain, my NASA shuttle radar altimeter database etc.) I was now ready to take the plunge and run e2fsck on the sick drive to try to repair it. It took a couple of e2fsck runs with different settings and about five or six hours, but finally the drive passed systems tests with a clean bill of health and would boot. This email was generated from the system running with the fixed drive.
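For anyone trying to reproduce this, the command lines look roughly like the ones this little sketch builds. It is not a record of the exact commands I typed: the backup superblock number 32768 and the /mnt/rescue destination are placeholders, and the e2fsck settings shown are just one plausible combination. The options themselves are the documented ones: debugfs takes -c (catastrophic mode, which opens read-only), -b (block size), -s (superblock) and -R (a single request), and e2fsck takes -b (superblock) and -B (block size).

```python
def debugfs_cmd(device, block_size, superblock, request):
    """Argument list for a read-only debugfs run against a backup superblock."""
    return ["debugfs", "-c",
            "-b", str(block_size),
            "-s", str(superblock),
            "-R", request,
            device]


def e2fsck_cmd(device, block_size, superblock):
    """Argument list for an e2fsck repair attempt using a backup superblock."""
    return ["e2fsck", "-b", str(superblock), "-B", str(block_size), device]


device = "/dev/VolGroup00/LogVol00"
# 32768 is a placeholder; use the address reported by dumpe2fs on a good drive.
print(" ".join(debugfs_cmd(device, 4096, 32768, "ls")))
# rdump copies a directory tree out to the native file system.
# /mnt/rescue is a hypothetical destination on the new drive.
print(" ".join(debugfs_cmd(device, 4096, 32768, "rdump /home /mnt/rescue")))
print(" ".join(e2fsck_cmd(device, 4096, 32768)))
```

The lists can be handed to subprocess.run as they are. Run the debugfs ones first and only move on to e2fsck once everything worth keeping has been copied off, because e2fsck writes to the drive.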

Gee, now that... That killed a week. I can hopefully go back to looking at the uNav, but first I've got to get some priming done before winter and there is still some firewood to rustle up... The fox picture is of a young one that keeps dropping by for a visit and taking a nap out in the open in the back yard.