Last week, reppep.com (Dell Inspiron 530 desktop, a couple years old and running CentOS 5) stopped responding to email requests. It serves a bunch of websites and a few email accounts, but the email service is much more important. The key disks were a pair of 750gb disks mirrored with mdadm; I also have 3 1tb disks for data.

I discovered that logging in locally restored responsiveness, at least for a while. Unfortunately there was nothing I could do to bring it back from work. I was in the middle of a cluster build at work and busy with some projects at home, so I left it for a few days. I noticed some panics on the console and messages about resyncing /dev/md3 (swap) and /dev/md6 (/home). Those should clear themselves, but I always wonder: with a discrepancy between two mirrored disks, how do you decide which to trust? If one disk completely fails it's clear, but in this case, despite a heat warning, smartmontools stubbornly claimed neither disk had serious problems. I kept it staggering along for a few days, until one day, after a particularly long bout of unresponsiveness and a complaint from Amy, I gave up on waiting it out or finding a solid indication of what was wrong.
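For anyone in the same spot, these are roughly the commands I was using to watch the resync and interrogate the disks (device names are examples; use whatever mdadm and smartctl report on your system):

    # Show all md arrays and any resync progress
    cat /proc/mdstat

    # Details on one array, including which members are active or faulty
    mdadm --detail /dev/md6

    # Quick SMART health verdict, then the full attribute and error report
    smartctl -H /dev/sda
    smartctl -a /dev/sda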

I tried pulling one of the 750gb disks, hoping it would run off the good(?) submirror, but [warning: details get fuzzy at this point] it kept complaining about the /dev/md3 sync not completing (with an implication it was just stuck waiting for /dev/md6 to sync, but perhaps the system just wasn't staying up long enough to resync the 634gb of /home), and additionally I got out-of-memory crashes. I had bought the system with 1gb of RAM and configured 4gb of swap. After starting the mail system, Apache httpd, the openfire Jabber server, the CrashPlan backup service, etc., the system exceeded 1gb of RAM, and with swap offline it was killing processes and crashing. I bought and installed a couple of 1gb DIMMs (it's convenient to have Staples a couple of blocks away!). I saw USB / IRQ errors, which suggested the irqpoll kernel argument (it can apparently slow the system down, but seemed worth trying), so I added it to the kernel line, but still no stability. I tried running off the other 750gb submirror instead, but that didn't help.
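In hindsight, the cleaner way to run degraded is to fail and remove a member with mdadm rather than physically pulling the disk, and irqpoll just gets appended to the kernel line in /boot/grub/grub.conf. A rough sketch (device names and the kernel line are placeholders, not my exact configuration):

    # Drop one member out of the mirror and run degraded on the survivor
    mdadm /dev/md6 --fail /dev/sdb6
    mdadm /dev/md6 --remove /dev/sdb6

    # In /boot/grub/grub.conf, add irqpoll to the kernel arguments, e.g.:
    #   kernel /vmlinuz-2.6.18-<version>.el5 ro root=<root device> irqpoll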

I bought a new 1tb disk, figuring I'd use it to replace the 750gb disk with the heat warning. But the system kept crashing in the same way. I tried pairing the 1tb with the other 750gb, and got the same crashes. To avoid the crashing sync process, I used the --zero-superblock argument to mdadm (the syntax is a bit tricky) to remove the RAID metadata, and changed the partition types from RAID to regular Linux filesystems. Finally I installed CentOS 5.5 afresh on the 1tb disk and disconnected the rest of the disks and all USB devices except the keyboard and mouse: more panics, including the IRQ errors. At this point, it was apparent that my 2-year-old Dell was curdled.
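For the record, the tricky part of --zero-superblock is that it operates on the member partitions rather than the md device, and the array has to be stopped first. Something like this (partition names are illustrative):

    # Stop the array, then wipe the RAID superblock from each member partition
    mdadm --stop /dev/md6
    mdadm --zero-superblock /dev/sda6
    mdadm --zero-superblock /dev/sdb6

    # Then use fdisk's 't' command to change each partition type
    # from fd (Linux raid autodetect) back to 83 (Linux)
    fdisk /dev/sda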

ns2.reppep.com is a Compaq Evo 510 SFF (EOL 8 years ago). It's perfectly adequate as a BIND slave server, but I've been planning to replace it with a plug computer or netbook for a while, to stop wasting power.

The Evo has a single PATA drive bay, but I have USB cases. Unfortunately, as I began to configure it I noticed it only has 256mb of RAM! That's fine for BIND, but not for my email system. I could spend $100 on RAM for this ancient computer, but that seemed silly. Instead I bought an HP Pavilion P6610F, which so far seems fine. It has a quad-core Athlon, which may be irrelevant because its main purpose is to serve web & email up a 1.5mbps uplink (6mbps down), or might be handy for HandBrake or other stuff. It came with 4gb of RAM, so the $100 I spent on RAM for the Dell was wasted. That's one of the more purely irritating aspects of this whole misadventure.

Installing CentOS on the HP was easy. With RPMforge, installing the mail system was straightforward (much easier than building amavisd-new, clamav, and all their dependencies manually, as I did a couple of years ago for the Dell). Unfortunately, users did not see old mail until I realized that I was using the wrong reconstruct syntax for cyrus-imapd: Cyrus can use . or / as a path delimiter, and although chk_cyrus uses . as a delimiter on my system, reconstruct requires /, and doesn't complain when provided the wrong syntax -- I kept running reconstruct and wondering why it didn't recover mail! Thanks to the helpful info-cyrus@ list members!
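In case it saves someone else the head-scratching: on my system the working invocation looked roughly like the first line below, and the second form (with the . separator) was the one that silently did nothing. The mailbox name is illustrative, and the tool path is where the CentOS cyrus-imapd RPM puts it, if memory serves:

    # Run as the cyrus user; -r recurses through sub-mailboxes
    su - cyrus -c "/usr/lib/cyrus-imapd/reconstruct -r user/pepper"

    # This form was accepted without complaint but recovered nothing for me:
    #   /usr/lib/cyrus-imapd/reconstruct -r user.pepper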

openfire was trivial -- I just installed the RPM and copied /opt/openfire to the new disk. Apache was quick & easy too -- putting my configuration back was simple; then I had to install mod_ssl and a few PHP modules for Dotclear. MySQL was easy -- I just put the files back, and didn't have to test my automysqlbackup dumps.
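For the curious, those restores really were just file copies off the old disk, roughly like this (the mount point is wherever you attach the old disk, and this assumes the old and new MySQL versions match):

    # Old root filesystem mounted at /mnt/old (example mount point)
    rsync -a /mnt/old/opt/openfire/ /opt/openfire/
    rsync -a /mnt/old/var/lib/mysql/ /var/lib/mysql/
    chown -R mysql:mysql /var/lib/mysql
    service mysqld restart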

Unfortunately, the HP only has a single 10/100 Ethernet port (and WiFi, but who cares on a Linux server?). The Dell was PCI based, so the GE card I already have is PCI; for the HP I ordered a new PCIe GE card. Fortunately GE cards are cheap, so the only aggravation is waiting for it. Ironically/sadly, Staples (who sold me the HP) only has 2 GE cards in the store -- both the PCI GA311 I already have -- meaning they don't stock any GE options for the HP they sold me.

My other irritation is that this Dell died -- so badly -- after 2 years. Obviously the Evo is much more robust, as have been most of my computers.

This whole unpleasant experience reminded me (painfully) that grub is very poor at dealing with mirrored boot disks. In various scenarios it tries to boot from the wrong disk. The grub command always assumes there is a single boot disk, and simply doesn't support redundancy well. With real hardware mirroring this would all be out of grub's control or visibility, but that's rare on desktops (most 'RAID' support on desktops is just fakeraid). Fortunately the HP's BIOS lets me choose which SATA disk to boot from, and that disk becomes /dev/sda, so I was able to get grub working (with a few false starts).
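The usual belt-and-suspenders fix, for anyone stuck on this, is to install grub into the MBR of both submirrors so either disk can boot, telling grub to treat each one as hd0 in turn. From the grub shell it looks roughly like this (disk names are examples):

    # Install the boot loader on the second disk too,
    # pretending it is the first BIOS disk (hd0)
    grub> device (hd0) /dev/sdb
    grub> root (hd0,0)
    grub> setup (hd0)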

Now mail is back with all messages recovered, all websites are online, I have Jabber back, and things seem copacetic. As soon as I get the PCIe GE card and get rid of the flood abatement hardware, I can restore the high-speed connection to our home LAN and reconnect my data drives...