Finally – my server Marvin is back online! I’ve had lots of trouble with bad sectors on the operating system hard drive for over six months, but just couldn’t be bothered to fix it properly.
I started off just checking where the bad sectors were, moving the contents of that partition to the second hard drive and symlinking/mounting the new location. This was enough for a while, but then more bad sectors came along. I tried to move the whole operating system to the other hard drive and get that one to boot, but it didn’t work out: my 160 GB ‘main storage’ drive is connected to a Promise Ultra TX133 card, and my BIOS (even after the latest upgrade) just can’t boot off a hard drive connected to that card.
So I repressed it for a while until my site didn’t work anymore (= catastrophe). I realized that the hard drive just wasn’t well and that I had to take it in for surgery. Luckily my career as a kleptomaniac IT pathologist came in handy, as I found three unused hard drives stashed away in some storage box. Now I only had to perform the delicate procedure of replacing the poor bastard’s brain without memory loss!
The whole procedure didn’t seem that difficult. All I had to do was basically to copy all contents from the broken hard drive to the new one, fix lilo and fstab and physically switch it from being hdb to hda and reboot. Almost easier said than done.
First of all, I had to boot off the Debian sarge install CD and interrupt the installation after it had activated my LVM partitions on the old drive. Using another virtual console (leaving the installation procedure running on ‘tty1’) I mounted the hard drives and started playing around with the ‘cp’ command. After some trial and error I came to the conclusion that ‘cp -a -r’ is the best way to go (recursive copy, preserving virtually all attributes etc). I decided not to copy the whole filesystem at once, since it would take forever and, if something went wrong, I wouldn’t know where.
So I started transferring the filesystem to the new drive, copying one whole root-level directory at a time. When the first I/O errors started coming up, I realized that I should have run ‘e2fsck’ on the broken drive first. Back to square one.
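For anyone wanting to try the same directory-by-directory approach, here is a rough sketch of how it could be scripted. The helper name ‘copy_by_dir’ and the throwaway demo directories are made up for illustration; on the real machine the two arguments would be the mount points of the old and the new drive. Note that ‘cp -a’ already implies a recursive copy, so the extra ‘-r’ is not needed.

```shell
#!/bin/sh
# Sketch: copy a tree one top-level directory at a time and verify each copy.
# copy_by_dir is a hypothetical helper, not something from the original procedure.
# The "*/" glob skips hidden directories and plain files in the top level.
copy_by_dir() {
    from=$1
    to=$2
    for dir in "$from"/*/; do
        [ -d "$dir" ] || continue          # glob matched nothing
        name=$(basename "$dir")
        echo "copying $name"
        # -a: archive mode (recursive, keeps owners, permissions, timestamps, symlinks)
        cp -a "$from/$name" "$to/" || { echo "copy failed: $name" >&2; return 1; }
        # diff -r exits non-zero if any file differs or is missing
        diff -r "$from/$name" "$to/$name" || { echo "verify failed: $name" >&2; return 1; }
    done
}

# Harmless demo on temporary directories instead of real drives
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/etc" "$src/home/user"
echo "127.0.0.1 localhost" > "$src/etc/hosts"
echo "hello" > "$src/home/user/notes.txt"
copy_by_dir "$src" "$dst" && echo "all directories verified"
```

Stopping at the first failure is deliberate: it tells you exactly which directory the trouble is in, which is the whole point of not copying everything in one go.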
This time it worked out fine. I ‘cp -a -r’ed every root-level directory, plus the partition I had moved to my ‘media’ hard drive, onto the new drive and double-checked every copied directory against its original using ‘diff -r’. When all the contents were copied (correctly), I ‘chroot’ed into the new drive, updated lilo.conf, ran ‘lilo’ and updated fstab to mount the right partitions at boot time. Then I removed the Debian install CD, shut down the computer, made the new drive master on the first IDE bus and the old drive slave, took a deeeeep breath and pressed the power button.
The first glimpse of hope was the lilo boot menu, which popped up nicely. Then the kernel loaded and the drives were found. But after a while came Linux’s answer to the blue screen of death: kernel panic. It claimed that there was no such directory as /dev/console, which is strange since there was such a directory. Well, I rebooted with the install CD once again, mounted the drives, re-copied the console from the old drive, chrooted into the new drive and looked through lilo.conf and fstab again. Everything seemed OK. When getting ready to reboot once again I ran ‘mount’ (while chrooted into the new drive) to see which partitions had to be unmounted. ‘mount’ claimed that an LVM partition on the old drive was mounted as root, so I modified /etc/mtab to say that /dev/hda1 is mounted as root and rebooted.
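The fstab part of this boils down to pointing the ‘/’ entry at the new partition. Below is a small sketch of that edit done with awk instead of by hand. The helper name ‘set_root_device’, the old device name ‘/dev/mapper/vg0-root’ and the ext3 filesystem type are all assumptions for the sake of the example; only /dev/hda1 as the new root comes from my actual setup. It works on a sample file, not on the real /etc/fstab.

```shell
#!/bin/sh
# set_root_device FILE DEVICE: print FILE with the device of the "/" entry replaced.
# Hypothetical helper; it only prints the result and never edits FILE in place.
set_root_device() {
    awk -v dev="$2" '
        /^[ \t]*#/ { print; next }   # leave comment lines untouched
        $2 == "/"  { $1 = dev }      # field 2 is the mount point
        { print }
    ' "$1"
}

# Sample fstab with made-up contents (the LVM device name is an assumption)
fstab=$(mktemp)
cat > "$fstab" <<'EOF'
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/mapper/vg0-root / ext3 defaults,errors=remount-ro 0 1
proc /proc proc defaults 0 0
EOF

set_root_device "$fstab" /dev/hda1 > "$fstab.new"
cat "$fstab.new"
```

Writing to a separate file first makes it easy to compare old and new with ‘diff’ before replacing anything.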
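In hindsight, a quick check before rebooting could have caught the console problem. /dev/console is normally a character device node rather than a directory or a regular file, so a plain ‘does it exist’ test is not enough. The sketch below uses a made-up helper called ‘has_console’ and a temporary directory standing in for the mounted new drive; I have not verified that this was the exact cause of my panic.

```shell
#!/bin/sh
# has_console ROOT: succeed only if ROOT/dev/console is a character device.
# (has_console is a hypothetical helper name.)
has_console() {
    [ -c "$1/dev/console" ]
}

# Demo on a throwaway tree standing in for the mounted new drive
newroot=$(mktemp -d)
mkdir -p "$newroot/dev"
if has_console "$newroot"; then
    echo "console node present"
else
    echo "console node missing"
    # As root, the node could be recreated with the standard numbers (major 5, minor 1):
    #   mknod -m 600 "$newroot/dev/console" c 5 1
fi
```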
Now it actually worked! It booted really nicely without any problems whatsoever. All databases worked, apache and tomcat worked and above all, my blog, forum and gallery worked again! I still don’t know if the mtab manipulation was the key to the success, but I don’t really care. Now I just have to keep the resurrected Marvin by my desk a couple of days for observation and then send him home to the closet again.