We have a couple of Dell R900s: 4 sockets, 24 Xeon cores, and 128 GB RAM. One of them started reporting RAM and processor errors in December, so I called Dell. The rep explained that the errors might be spurious, due to a BIOS bug. Not that there was any known issue, but Dell naturally hoped I could fix the problem with a software upgrade, so they wouldn't need to replace any hardware. I upgraded the BIOS, and it shut up for a couple of months.

Last week the front panel went amber again, and the System Event Log (SEL) started recording RAM errors on one memory board (the system has 4 boards, each with 8 DIMM slots: a total of 32 4 GB DIMMs).

Non-critical    02/17/2010 14:58:11 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board D ) was asserted
Unknown 02/17/2010 11:47:03 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:02 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Non-Recoverable 02/17/2010 11:47:02 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:02 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:02 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:01 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Non-Recoverable 02/17/2010 11:47:01 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:01 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:01 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:00 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:47:00 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:00 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:47:00 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:59 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:46:59 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
Unknown 02/17/2010 11:46:59 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
OK  02/17/2010 11:46:58 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:46:58 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
OK  02/17/2010 11:50:02 CPU3 Status: Processor sensor for CPU3, IERR was deasserted
OK  02/17/2010 11:50:02 CPU2 Status: Processor sensor for CPU2, IERR was deasserted
Critical    02/17/2010 11:49:47 CPU3 Status: Processor sensor for CPU3, IERR was asserted
Critical    02/17/2010 11:49:47 CPU2 Status: Processor sensor for CPU2, IERR was asserted
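
A dump like this is easier to read as a tally. Here is a minimal sketch, assuming each SEL line is laid out as severity, timestamp, then description, the way the entries above are pasted. The names `summarize_sel`, `SEL_LINE`, and `BOARD` are my own for illustration and not part of any Dell tool.

```python
import re
from collections import Counter

# Assumed layout of one pasted SEL line:
#   <severity> <MM/DD/YYYY HH:MM:SS> <description>
SEL_LINE = re.compile(r"^(\S+)\s+(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(.+)$")
# Memory entries name the board as "( Memory Board D )".
BOARD = re.compile(r"\( Memory Board (\w) \)")

def summarize_sel(text):
    """Count SEL entries by severity and by the memory board they name."""
    by_severity = Counter()
    by_board = Counter()
    for line in text.splitlines():
        m = SEL_LINE.match(line.strip())
        if not m:
            continue  # skip blank or unrecognized lines
        severity, _timestamp, description = m.groups()
        by_severity[severity] += 1
        board = BOARD.search(description)
        if board:
            by_board[board.group(1)] += 1
    return by_severity, by_board

# A few entries copied from the log above.
sample = """\
Non-critical    02/17/2010 14:58:11 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board D ) was asserted
Unknown 02/17/2010 11:47:03 I/O Fatal Err: Unknown sensor, unknown event
OK  02/17/2010 11:47:03 System Software event: OEM Diagnostic data event was asserted
Non-Recoverable 02/17/2010 11:47:02 CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
Critical    02/17/2010 11:49:47 CPU3 Status: Processor sensor for CPU3, IERR was asserted
"""

severity_counts, board_counts = summarize_sel(sample)
print(dict(severity_counts))
print(dict(board_counts))
```

Run against the full log, the per-board count is the useful part: it shows at a glance which memory board the CRC errors point at.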

I called Dell, and was told I'd need to run the "Dell 32 Bit Diags" to isolate the bad component. Unfortunately it's only available as a Windows self-extracting executable, which can generate a floppy .img file or a CD-ROM .iso file; Dell's tool can also copy the diagnostics to a flash drive. I hate that Dell both assumes that everybody runs Windows and helps ensure it by requiring Windows to manage Dell machines. Fortunately I have an XP VM.

So I swapped the suspect memory board from slot D into slot C and ran the diagnostics. I was told to erase the SEL and run the included mpmemory.exe. It was supposed to take half an hour, but actually took about 2.5 hours for each run. Additionally, the diagnostics showed an unclear warning that the (DOS-based) diagnostics are not compatible with console redirection (presumably because these hosts have serial consoles configured). Fortunately we bought DRAC for this machine, and that seems to work fine.

To boot into the diagnostics, I checked the "Boot Order" section of the R900 BIOS. Surprisingly, although it does show VIRTUAL FLASH, I was unable to find a USB FLASH entry. For some reason Dell configures USB flash as a virtual hard drive, so I had to change the "Hard Disk Boot Order" to prefer flash over the RAID controller -- this got me a DOS-based menu and let me run mpmemory.exe.

Disturbingly, Dell's memory diagnostic triggered the memory error but was not able to detect it. mpmemory returned a clean bill of health, but the SEL recorded errors on memory board C (the suspect card in a different slot, so the motherboard itself is fine).

Non-critical    02/23/2010 20:56:56 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted
Non-critical    02/23/2010 15:27:27 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted
Non-critical    02/23/2010 15:27:27 Mem CRC Err: Memory sensor, transition to non-critical from OK ( Memory Board C ) was asserted

Unfortunately the diagnostics failed to isolate an individual DIMM, and I don't have the time to keep reconfiguring the RAM (across all 4 memory cards, which apparently need to match each other) to do a binary search, at 150 minutes per test, to isolate the faulty DIMM or slot -- worse, I'd have to visit the server and reconfigure it after each round. Fortunately, Dell acknowledged the absurdity of running 5+ hours of tests (it could easily have taken over 20 hours to find the right DIMM). They sent a new card with 8 DIMMs (2 types, at least one refurbished). I swapped the replacement parts in and reran the test, which failed. Apparently nobody at Dell had ever seen this particular error (generated by a Dell proprietary diagnostic) -- not comforting. I ran it again and got a complete lockup -- this time apparently a common occurrence on Dell multi-processor systems. It turned out I had been given an old version of the diagnostics.
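
To put rough numbers on the binary search mentioned above, here is a back-of-the-envelope sketch. It assumes each round halves the set of candidate DIMMs and costs one or more full 150-minute passes. The function `isolation_hours` and the `passes_per_round` parameter are hypothetical, and the figure of 3 passes per round is only my guess at what an intermittent error might demand, not anything Dell specified. The model also ignores the extra work of telling a bad DIMM from a bad slot, the requirement that the cards match each other, and the trips to the server.

```python
import math

def isolation_hours(candidates, minutes_per_pass, passes_per_round=1):
    """Estimate hours of test time to binary-search down to one candidate."""
    # Each round halves the candidate set, so we need ceil(log2(n)) rounds.
    rounds = math.ceil(math.log2(candidates))
    return rounds * passes_per_round * minutes_per_pass / 60

# 8 DIMMs on the suspect card, 150 minutes per pass, one pass per round.
print(isolation_hours(8, 150))     # 7.5

# Same card, but repeating each round 3 times because the error is intermittent.
print(isolation_hours(8, 150, 3))  # 22.5
```

Even the best case is most of a working day of test time, and any need to repeat passes pushes it well past 20 hours.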

I got the new version, ran it twice, and saw no further errors in the SEL. Hopefully I won't have to think about that R900 for a while, but diagnosing it is so awkward -- it looks like the 256 GB max configuration would take 5 hours for each pass!
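
The 5-hour figure is a simple extrapolation from the one data point I have (about 150 minutes for 128 GB), assuming the test time scales linearly with the amount of RAM. I haven't measured that, so treat this as a sketch; `pass_minutes` is a name I made up for it.

```python
def pass_minutes(ram_gb, measured_gb=128, measured_minutes=150):
    """Scale the measured pass time linearly with installed RAM (an assumption)."""
    return measured_minutes * ram_gb / measured_gb

# The R900's 256 GB maximum configuration.
print(pass_minutes(256) / 60)  # 5.0 hours per pass
```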