[KLUG Members] Laptop reiser corruption and fixing it

Jamie McCarthy members@kalamazoolinux.org
Fri, 25 Jan 2002 10:40:43 -0500
I recently had trouble with one of my filesystems on my laptop
getting corrupted.  I'm running Reiser and I don't know if that made
things better, or worse.  The problem exhibited itself while I was
doing an apt-get dist-upgrade, and the symptom was bizarre -- my
laptop would lock up completely, its backlight would go off, but I
could still read the text which was telling me that there was some
kind of an error in the USB system.

That was spurious;  it just turns out filesystem corruption is not
handled well, at least not by reiser, and basically I was getting a
kernel panic in a really bizarre way.  I think.  I'm not sure
whether things would have been better or worse under a different
filesystem than reiser.  But I'm tempted now to try ext3 at my
earliest convenience.

Just for the record, in case anyone else is running reiser or an HP
Omnibook 6000, here are the emails I sent to a 6000-specific mailing
list to get help and report my progress.



[Jan. 24]

I've been off the net for 36 hours and hadn't dist-upgraded for
longer.  I found tonight that python2.1 (required by several
other packages, I don't use it personally) wants to upgrade
itself from 2.1.1-4 to 2.1.1-8.

This has locked my machine up three times so far, always at that
exact point.  Locked it up hard.

I'm using Chris's kernel (2.4.17) and most of his special
packages.

When the machine locks up, the screen backlighting goes off, but
I am able to read the text still there by pointing a flashlight
nearly parallel to the screen.  I got this message:

    usb-uhci.c: interrupt, status 30, frame# 1519
    usb-uhci.c: host controller halted, trying to restart.

which is odd because I don't have any USB devices plugged in.
The frame number has changed at least once, to 1711.

When it boots back up, several things refuse to run because it
was interrupted -- my libcrypto is hosed so appletalk and sshd
will not start, and many X libs are hosed too so X will not
start.  The files are probably corrupt;  as I understand it,
Reiser protects my filesystem but not necessarily the contents
of the files, so I assume I have at least one corrupted lib.

For example, I get the messages:

    Starting AppleTalk Daemons (this will take a while):
    /usr/sbin/atalkd: error while loading shared libraries:
    /usr/lib/libcrypto.so.0.9.6: invalid ELF header

    Starting X display manager: xdm/usr/bin/X11/xdm: error
    while loading shared libraries: libXpm.so.4: cannot open
    shared object file: No such file or directory

I can't upgrade anything because any attempt to use apt-get
refuses to do anything until it tries to reinstall python2.1.
I cannot remove the package, either with apt-get or "dpkg -r"
because it informs me the package is in a "very bad inconsistent
state" and that I need to reinstall it.  Yeah right.  An attempt
to "dpkg -i" the older 2.1.1-4 version locked it up with the
same message.

I have yet to have any problems with any other package;  it's
locked up four times so far, every time I've tried installing a
python2.1 package, and nothing else has caused it to do that.
I have *no clue* what USB has to do with the latest python
packages, and have to assume there's some very very obscure bug
that's getting triggered somehow.

So, a warning to Debian users... be very careful with the latest
versions of python, don't blindly do an upgrade.  If anyone has
gotten python 2.1.1-8 to work, let me know.

And most importantly, if anyone knows how I can get python off
my machine, please let me know.  At this point I'm thinking I
will have to "make menuconfig" to turn USB support off in the
kernel, recompile it, install it, boot from it, and then try
the "dpkg -i" of 2.1.1-4 again and see what happens.  And hope
the damage to my libs and whatever else isn't so bad that it
can't be repaired.  Does that sound viable?



[Chris Hanson writes back, Jan. 24]

The most likely explanation I can think of is that you have some
corrupted libraries.  Since you're running ReiserFS, that's entirely
likely.  Have you checked the checksums on your packages recently? 
I do this on a regular basis to test for FS corruption, although
since discovering the problematic disk/APM interaction I haven't
seen any.

You should be able to use my rescue floppy to repair the disk.  I
think that there's an md5sum on there too, in which case you should
also be able to examine the package checksums to find the
corruption. (If md5sum isn't on the rescue floppy, you can put it on
another floppy, assuming you have access to another Debian machine,
and run it from there.)  Once the library corruption is repaired you
should be able to boot the machine and fix the python2.1 problem. 
Fixing the library corruption may require some hand-unpacking of
.deb files and transport onto the machine via floppy.
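
A minimal sketch of the checksum check described above, assuming the
package ships a checksum list in md5sum format under
/var/lib/dpkg/info/ (not every package does) and that md5sum is
available on whatever system you booted.  The helper name
check_md5_list and the /mnt/target mount point are made up for
illustration.

```shell
#!/bin/sh
# Sketch: list files whose MD5 no longer matches a recorded checksum list.
# Assumption: the list uses md5sum format ("<hash>  <relative/path>"),
# like the per-package *.md5sums files in /var/lib/dpkg/info/.
# check_md5_list is a hypothetical helper, not a Debian tool.

check_md5_list() {
    root=$1    # directory the recorded paths are relative to
    list=$2    # absolute path to the checksum list
    # md5sum -c prints "path: OK" or "path: FAILED ..."; drop the OK lines
    # so that only mismatched or unreadable files remain.
    ( cd "$root" && md5sum -c "$list" 2>/dev/null ) | grep -v ': OK$'
}

# Example, with the damaged disk mounted at a hypothetical /mnt/target:
#   check_md5_list /mnt/target /mnt/target/var/lib/dpkg/info/python2.1.md5sums
```

Any path it prints belongs to a file that needs to be replaced from
a known-good copy of the package.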

One thing I have found very helpful is having a small (~ 500 MB,
though 300 MB is probably enough) partition with a basic Debian
system on it.  I use this partition to do repair work on the primary
partitions, which unfortunately has been too common over the last
few months due to the other problem.



[Jamie, Jan. 24]

Chris, your intuition seems to be right, my /usr partition is
corrupted (and possibly others, I haven't checked, one thing at a
time).  Booting from the rescue disk/root disk doesn't seem to help
me;  there's no reiserfsck on the rescue disk and the reiserfsck on
my drive needs glibc 2.2 which the rescue doesn't have.

What I've done is boot up my system and, by shutting down almost
everything, gotten it so "lsof" doesn't report any files in /usr
being used, and then remounted /usr read-only.  At that point,
"reiserfsck --check /usr" works and reports scads of errors:
bad_leaf's, bad_indirect_item's, "object id map shrinked", free
block count mismatch, and "on-disk bitmap does not match to the
correct one" before it finally segfaults.

I'm not sure what to do at this point.  I think I'm going to try
copying as much of /usr as possible onto a backup drive over NFS,
and then asking reiserfsck to take its best shot at fixing it.
If that doesn't work, the next step depends on how well the
backup attempt went...

This has _not_ endeared me to Reiser and if I get this fixed I
may be looking much harder at ext3 or even XFS instead...



[Jamie, Jan. 25]

The result for now is that my laptop is working again.  I'm sure
some files are corrupted and I believe one filesystem is corrupted
as well.

At my earliest opportunity I'm going to wipe it and reinstall with
ext3.  Chris, do your Debian 2.2 ISOs allow for that possibility?  I
don't know if ext3 is in the kernel at installation time like ext2
normally is and, with your kernels, reiser also is.  If so, I'll
just repeat the installation process with your same 2.2 ISO I used
lo these many months ago.

Or do you just install ext2 and then after boot, do the command-line
thing to switch them to ext3?

Anyway, the recovery was quite painful.  After several failed
attempts to boot from floppy and yet be able to reiserfsck my
filesystems, I went back and booted from my hard drive.  I probably
should have just booted into single-user mode to check the
filesystems, but it worked almost as well to boot normally, shut
down everything that lsof said was using /usr, umount /usr, and then
check it.  (Turns out unix is still largely usable even with /usr
gone.)

After saving off all of /usr and most of /home over NFS, just in
case, I ran reiserfsck, stepping up from its lowest level --check
through --fix-non-critical and --fix-fixable, up to its highest
level --rebuild-tree.  Nothing did much of anything except the
highly warning-filled and beta-quality --rebuild-tree.  I think the
warnings said something like "caution, you may lose all your data
and cause the structure of reality to collapse" but it worked OK.
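
The escalation can be sketched as a short plan.  This is only an
illustration: fsck_plan is a made-up helper that prints commands
without running anything, /dev/hda6 is a placeholder device, and the
--fix-non-critical level mentioned above is left out because it may
not exist in other reiserfsck versions.  It assumes reiserfsck is
pointed at an unmounted (or read-only) filesystem that has already
been backed up.

```shell
#!/bin/sh
# Sketch: print the reiserfsck commands in order of increasing severity.
# fsck_plan is a hypothetical helper; it runs nothing, it only echoes.
# Review the output of each step by hand before moving to the next one,
# and take a backup before --rebuild-tree.

fsck_plan() {
    dev=$1    # block device holding the ReiserFS filesystem (placeholder)
    for level in --check --fix-fixable --rebuild-tree; do
        echo "reiserfsck $level $dev"
    done
}

# Example:
#   fsck_plan /dev/hda6
```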

After having done that, --check still gives me warnings of the type
"objectid map expanded" and "...shrinked" but I'm reassured by this
person that they "are harmless and should have been removed from
fsck forever ago":

   http://www.uwsg.iu.edu/hypermail/linux/kernel/0103.3/0449.html

After doing that I snooped around a few directories manually, found
goofy files in /usr/lib/python2.1 with permissions of "?---rwx---",
impossibly high file sizes and mod dates in the 1970s.  Deleted
those, grepped "ls -lR /usr" for more of the same and didn't find
any, ran --rebuild-tree again just to be sure.
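
A rough sketch of that kind of search, assuming a current GNU find
(for -newermt and the G size suffix).  The helper names and the
cutoffs (modified before 1980, larger than 2 GiB) are arbitrary
choices for illustration, not what was actually typed at the time.

```shell
#!/bin/sh
# Sketch: look for the kinds of leftovers described above.
# Both helpers are hypothetical; the cutoffs are assumptions.

# Print entries under a directory that were last modified before 1980
# or that are larger than 2 GiB.  Requires GNU find.
find_suspect_files() {
    dir=$1
    find "$dir" -xdev \( ! -newermt '1980-01-01' -o -size +2G \) -print 2>/dev/null
}

# Filter "ls -lR" output on stdin for entries whose type character is
# "?", as in the "?---rwx---" permissions mentioned above.
odd_mode_lines() {
    grep '^?'
}

# Example:
#   find_suspect_files /usr
#   ls -lR /usr | odd_mode_lines
```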

My X libraries were still corrupted at that point so with a little
help from #debian on irc.slashnet.org I ran this:

    dpkg --force-depends -r xfree86-common xlibs xlib6g xlibs-dev
    apt-get -f install

That removed the guts of my X installation including the corrupted
library files, and then reinstalled them.  Immediately after doing
that, "startx" worked and the whole system continues to work.  After
a reboot, "apt-get upgrade" finally brought my system into line with
no errors.

I can understand now the wisdom of checking reiser partitions every
so often on a laptop (especially one that apparently has issues with
APM and switching from AC to battery and back).  But I hope soon to
get off reiser and not have to worry about that quite so much
anymore.  I'm lucky this only burned one workday and that I was able
to get good backups and didn't lose any data.



[Chris Hanson, Jan. 25]

I haven't tried building an ext3 file system directly; I'm not sure
there's any way to do that.  I've just converted existing ext2 file
systems in place.

In any case, the ISO images _don't_ have ext3 support.  But the last
two or three kernel packages I've built _do_ have ext3 support, as
does the most recent set of boot floppies.  This shouldn't be a
problem; just install with ext2, upgrade to a recent kernel (>=
2.4.16) and e2fsprogs (>= 1.25) packages, then switch over to ext3.
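
As a sketch of the switch-over: adding a journal with "tune2fs -j"
is what turns an existing ext2 file system into ext3, and the type
field in /etc/fstab then needs to say ext3.  The device name
/dev/hda5 and the helper fstab_to_ext3 below are placeholders for
illustration, not part of any package.

```shell
#!/bin/sh
# Sketch: the two steps of an in-place ext2 -> ext3 conversion.
#
# Step 1 (as root; /dev/hda5 is a placeholder device):
#   tune2fs -j /dev/hda5
#
# Step 2: change the filesystem type for that mount point in fstab.
# fstab_to_ext3 is a hypothetical helper.  It prints the edited file
# to stdout and leaves the original untouched.  The changed line has
# its fields re-joined with single spaces.

fstab_to_ext3() {
    fstab=$1
    mnt=$2
    awk -v mp="$mnt" '
        $1 !~ /^#/ && $2 == mp && $3 == "ext2" { $3 = "ext3" }
        { print }
    ' "$fstab"
}

# Example:
#   fstab_to_ext3 /etc/fstab /usr > /tmp/fstab.new
```

Review the result before copying it over /etc/fstab, then remount
or reboot so the file system is mounted as ext3.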

I've been really happy with ext3.  I'm probably going to convert the
file server at work over to use it, as soon as I have time to update
the kernel on that machine.

   I can understand now the wisdom of checking reiser partitions
   every so often on a laptop (especially one that apparently has
   issues with APM and switching from AC to battery and back).

I do this with ext3 as well.  Using ext3 doesn't prevent corruption,
at least in my case, because this APM problem appears to clobber a
random block of sectors on the disk.  But with ext3, the corruption
seems to stay localized to the area that was clobbered; with
ReiserFS it quickly got out of hand to the point where the entire
machine was unusable.

Note that the debsums package is very useful for dealing with this;
I now use it on a regular basis to make sure that all the package
files are checksummed, and I test the checksums about once a week or
so. This way I can at least be confident that the operating system
is intact.
--
 Jamie McCarthy
 jamie@mccarthy.vg