Monday, April 10, 2006

more frightening sporadic HD errors

So, there I am this morning, booting my IBM X40 laptop, when I get the infamous /dev/hda2 was not cleanly unmounted message during Linux boot. Off to a root shell. OK, what happened? All right, I remember: another one of those sporadic Input/Output errors last night... What's causing them? Nobody knows. Why does the HD report I/O errors only sometimes, when 99% of the time it works fine on those same files in directory X?

However, things get weirder. 'Tis my first time with an fsck required due to errors. Until now the Input/Output error had always just remounted the filesystem read-only (as it ought to), and after a reboot everything had worked flawlessly. Until today.

So I drop to this shell and briefly refresh my knowledge of e2fsck and badblocks. After looking around for a while I decide to run e2fsck -c -p -v on the partition. It starts to work, recreating the journal, but then tells me to run e2fsck manually (i.e., without the -a or -p option). Well, it's not the first time this has happened to me, so no reason to fear. So I go on with e2fsck -c -v and start recovering inodes, restoring filesystem consistency and so on, for hundreds, if not thousands, of inodes. Yes, I confess there were so many errors that in the end I no longer checked which inodes I was repairing; I just pressed "enter" until the job was done. I only stopped once, to consider whether repairing the /etc directory could be a problem. But that's not a question to dwell on: until the filesystem is repaired, it won't be possible to boot anyway. So I repaired that one too. But not without consequences.

After the repair, and seeing that badblocks had not detected a single bad block, I reran with e2fsck -c -c -v to see if the non-destructive read-write test would find one. But still there was no bad block to be found. So weird... I thought. Isn't there supposed to be some problem if I get continuous Input/Output errors on some inodes? Or what else does a readlink() call do? Anyway. Disk repaired. Next step: reboot.
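To pin down which paths actually trigger the Input/Output error (and whether it really is always readlink()), a small probe script could help. What follows is only a sketch, assuming Python is available on the machine; the names `probe` and `find_io_errors` are made up for illustration and were not part of the original session. Keep in mind that reads may be served from the page cache, so a clean run does not prove the disk is healthy.

```python
import errno
import os
import stat


def probe(path):
    """Return None if the path can be read, else the errno of the failure."""
    try:
        mode = os.lstat(path).st_mode
        if stat.S_ISLNK(mode):
            os.readlink(path)       # the call suspected in the post
        elif stat.S_ISREG(mode):
            with open(path, "rb") as fh:
                fh.read(4096)       # try to read the start of the file
        elif stat.S_ISDIR(mode):
            os.listdir(path)        # read the directory entries
    except OSError as exc:
        return exc.errno
    return None


def find_io_errors(root):
    """Walk root and return every path whose probe fails with EIO."""
    failures = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if probe(path) == errno.EIO:
                failures.append(path)
    return failures
```

Running something like `find_io_errors("/home")` right after an error shows up, and again after a reboot, would at least tell whether the same paths fail every time or whether the failures move around.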

So I select Linux at the boot prompt and start to boot. After a short while it is clear that something is not working: I get an interesting init: NO INITTAB message. So I try to enter the runlevel manually. Let's say: 3. But all this did was freeze the command line and accept no further input. So I repeat the procedure, but now it's time for Linux single-user mode.

Once dropped to the root shell I look at the root filesystem. Everything's fine, so let's look into etc. cd /etc returns /etc: not a directory. WHAT?!?! file /etc returns English ASCII text. WHAT?!?!^2. And vi etc shows an interesting XML file interleaved with some nonsense binary data. Definitely something is not quite right when /etc is an XML file and all your configuration files are stored as recovered inodes in the /lost+found directory :) I thought I might recover them from /lost+found. I entered, but: Welcome to the jungle!
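For anyone facing the same jungle, a rough first pass is to sort the recovered entries by what they look like. This is just a sketch, assuming Python is at hand; `classify` and `triage` are hypothetical helpers, and the "contains a NUL byte means binary" rule is a crude heuristic, not what the file command does.

```python
import os
import stat


def classify(path, sample_size=1024):
    """Guess what a recovered entry is from its mode and its first bytes."""
    mode = os.lstat(path).st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    if not stat.S_ISREG(mode):
        return "other"
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if not sample:
        return "empty"
    # Heuristic: text files normally contain no NUL bytes.
    return "binary" if b"\0" in sample else "text"


def triage(directory):
    """Group the entry names in a directory by their guessed kind."""
    groups = {}
    for name in sorted(os.listdir(directory)):
        kind = classify(os.path.join(directory, name))
        groups.setdefault(kind, []).append(name)
    return groups
```

With `triage("/lost+found")` the entries guessed as text are the first candidates to inspect for lost configuration files, while recovered directories may still hold whole subtrees intact.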

Fortunately, after seeing that I would not be able to get by without a reinstall, I was able to back up most of my critical data. And it was thanks to my Windows partition, as Linux was very firm in its intention of not letting me mount the USB mass storage devices, despite (apparently) allowing me to load the kernel module. Connecting the devices just did not add anything to the dmesg output, nor did it activate the device lights. All I could do was move my sensitive data to the Windows partition and reboot from there.

I'm reporting now about 12 hours after the incident. As system administration was also unable to find errors on the hard drive (what the hell!), we opted to reinstall the original setup, without the manually installed kernel that I had added from SUSE's FTP (we're using Novell GNU/Linux in our company). But I'm still wondering how far a misconfigured system with a standard kernel (the default in a more recent distro) could go in generating Input/Output errors. Maybe a race within readlink()? But did I always see this error with readlink()? I can't remember. This has happened to me about 5 times now (all of them after I installed the new kernel). If the problem is hardware, then I'll just wait a few weeks until it resurfaces, and then the diagnosis will be definitive. Any ideas, anyone?

Well, hope you liked my catastrophe story with a happy ending.
Next time I'd better ask Cthulhu for inspiration.

1 Comment:

Anonymous said...

Glad you had a happy ending! (My own laptop is still not functional at the moment.) I made it here by googling Lovecraft. Enjoyed my visit - esp. the pics from Japan.

12:58 PM  
