18 October 2010

Recovering from a corrupted ext3 journal

Occasionally, an ext3 volume's journal becomes corrupted.  When this
occurs, the kernel remounts the affected volume read-only in an attempt
to prevent further corruption or data loss.  When the volume is
remounted read-only, lines similar to the following appear in dmesg, on
the console, and possibly in /var/log/messages:

        Aug 13 15:29:16 localhost kernel: EXT3-fs error (device hda7) in ext3_reserve_inode_write: Journal has aborted
        Aug 13 15:29:16 localhost kernel: EXT3-fs error (device hda7) in ext3_dirty_inode: Journal has aborted
        Aug 13 15:29:17 localhost kernel: ext3_abort called.
        Aug 13 15:29:17 localhost kernel: EXT3-fs error (device hda7): ext3_journal_start_sb: Detected aborted journal
        Aug 13 15:29:17 localhost kernel: Remounting filesystem read-only
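
To find out quickly which devices are affected, the kernel log can be
filtered for these messages.  The following is a minimal sketch, not a
definitive tool: the function name aborted_journal_devices is
hypothetical, and it assumes the messages follow the "EXT3-fs error
(device X)" format shown above.

```shell
# Hypothetical helper: print each device that reported an ext3 journal error.
# Reads kernel log text on stdin (for example, the output of dmesg).
# Assumes the "EXT3-fs error (device X)" message format shown above.
aborted_journal_devices() {
    grep 'EXT3-fs error' \
        | grep -i 'journal' \
        | sed -n 's/.*(device \([^)]*\)).*/\1/p' \
        | sort -u
}

# Usage (assumes dmesg is readable by the current user):
#   dmesg | aborted_journal_devices
```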

Commands that attempt to modify data on the volume may fail or behave
unexpectedly.  Furthermore, the volume may not be shown as mounted
read-only by 'mount' (with no options) or by a check of /etc/mtab.
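
Because /etc/mtab is written by the mount command rather than by the
kernel, it can lag behind the real state.  /proc/mounts reflects the
kernel's own view, so checking it is one way to confirm the remount.
The sketch below is illustrative only: the function names
kernel_mount_opts and is_readonly are hypothetical, and /var is used as
the mount point purely as an example.

```shell
# Print the mount options the kernel reports for a mount point.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts).
# If the mount point appears more than once, the last entry wins.
kernel_mount_opts() {
    awk -v mp="$1" '$2 == mp { opts = $4 } END { if (opts != "") print opts }' \
        "${2:-/proc/mounts}"
}

# Succeed if the kernel lists "ro" among the options for the mount point.
# Fails if the mount point is read-write or is not found in the table.
is_readonly() {
    case ",$(kernel_mount_opts "$1" "$2")," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage:
#   is_readonly /var && echo "/var is read-only according to the kernel"
```
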
If the volume can be taken offline, the issue can be corrected while
the host stays up.  If, however, the volume cannot be taken offline
without affecting the host's operation, the host needs to be booted
into single-user mode, or into single-user mode from the
linux-distro-disk_1 CD, so that nothing is using the volume in
question.  The following procedure can then be used to correct the
issue (the example assumes the host was booted from the CD):

        sh-3.00# umount /mnt/sysimage/var
        sh-3.00# tune2fs -O ^has_journal /dev/hda7
        sh-3.00# e2fsck -y /dev/hda7
        sh-3.00# mount -t ext2 /dev/hda7 /mnt/sysimage/var
        sh-3.00# ls -a /mnt/sysimage/var
        sh-3.00# rm -f /mnt/sysimage/var/.journal
        sh-3.00# umount /mnt/sysimage/var
        sh-3.00# tune2fs -j /dev/hda7
        sh-3.00# exit

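In the above, command 2 removes the journal from /dev/hda7.  Command 3
runs e2fsck to check and repair the volume.  Commands 4 through 6 mount
the volume as ext2 and delete a leftover .journal file, if one exists.
Once the volume is confirmed 'stable', command 8 adds the journal back
to /dev/hda7.  Command 9 exits the single-user shell and reboots the
host, at which point the volume should come online without further
issue.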
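
Before rebooting, it may be worth confirming that the journal really was
added back.  'tune2fs -l' prints the superblock, including a
"Filesystem features:" line that should list has_journal.  The following
is a sketch under that assumption; the function name
has_journal_feature is hypothetical.

```shell
# Succeed if the superblock listing on stdin shows the has_journal feature.
# Expects the output of 'tune2fs -l DEVICE' on stdin.
has_journal_feature() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}

# Usage (run as root, with the volume unmounted as in the procedure above):
#   tune2fs -l /dev/hda7 | has_journal_feature && echo "journal present"
```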