Update 2010/07/20: We had another failure; after replacing several components the system now appears fine again. The new lesson learned is that apparently the ILOM updates the "System Information: Components" inventory during boot. If the system won't boot or hasn't booted yet, ILOM just shows old (incorrect) information. Additionally, ILOM power readings are unreliable. The old ILOM didn't show any power consumption when the system was running, and the new ILOM (with latest firmware) looks different but still doesn't show 'Actual Power'.

[Screenshot: ILOM "Power Consumption: Actual Power" display]


We have had serious hardware and service problems with a Sun server recently. Unfortunately, while the hardware problems can be chalked up to incredibly bad luck, the other problems point to serious corporate and support flaws at Oracle.

Prologue

We bought a new high-end server (X4540) a year ago, and a 2-hour onsite service contract. After installing it, we discovered the system only saw half its RAM. I called Sun, and they sent out a Field Engineer with a new motherboard. Unfortunately the replacement motherboard didn't work. After 3 days of parts replacement -- a second replacement motherboard, some RAM, and a replacement CPU -- they were unable to get either replacement motherboard to boot. They did eventually get the original motherboard to see all the RAM, though, so we resumed using it.

A little while later, the server became inaccessible. A reboot cleared the problem temporarily, and we discovered the cause was a bad patch that broke Sun's ipf firewall. After a couple of weeks of requesting a fix (as a patch), I removed the bad patch and the firewall worked again.

April

In April, this X4540 lost a disk, which should have triggered an automatic ZFS rebuild onto a hot spare; instead, the filesystem problems cascaded and disabled about 20 dependent systems. I called support Thursday night, asking why the hot spares had not been used, and was told the problem was almost certainly a bad disk coupled with a bad disk controller on the motherboard.
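For context, a ZFS hot spare is just a disk added to the pool as a spare vdev; when a pool disk faults, Solaris is supposed to pull the spare in automatically (the FMA zfs-retire agent handles the activation), and that is the step that never happened for us. A minimal sketch of the setup, using a hypothetical pool name and device names rather than our actual layout:

    # Hypothetical pool 'tank' and device names -- not our real configuration.
    zpool add tank spare c4t6d0 c4t7d0   # attach two hot spares to the pool
    zpool status tank                    # spares appear under a 'spares' section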

Friday morning, an FE brought a new motherboard; unfortunately it didn't work. He got another motherboard and CPU, but the system still wouldn't boot. The daytime phone rep didn't know what was going on, so he escalated to another phone rep who told me (condescendingly) that he knew a lot about the X4500 & X4540 hardware, but it turned out he didn't actually know the basic component configuration. This third phone rep insisted that a bad CPU was causing all our problems, including phantom DIMMs reported in empty slots, etc. He also insisted we needed new DIMMs, a new CPU, and a couple more disks. The whole process -- mostly waiting on hold -- took long enough to kill my phone's battery.

Saturday I met another FE back at the machine room to pick up the parts, which were due by 9am. We got some of them by 10am, but others didn't arrive until later. We resumed the parts replacement dance, and again spent several hours on the phone (I brought my charger this time!), fortunately with a different phone rep (#4, for those of you counting along at home). This gent noticed that the system reported 0V coming into the motherboard, and lots of other voltages were off. At the end of the day, we agreed we needed a new chassis, as the X4540 routes power through the power supplies, into the power distribution board, through the chassis, and into the motherboard (so a fault in any major component can screw up CPU power input, and thus everything). The chassis was the only component we hadn't replaced yet. The phone reps, however, explained that the chassis could not be on-site until Monday morning. So much for our 2-hour SLA -- our Regional Service Manager explained it means an FE will be on-site within 2 hours, but they make no commitment at all on parts delivery. I asked the support reps how we could replace our lemon, which they were unable to repair and which at this point refused to boot with its fifth motherboard, and was told the service organization could not authorize a replacement. So I called our sales rep, who referred the question back to a counterpart in the service arm.

Sunday nothing happened -- they were unable to provide a replacement chassis.

Monday morning, the second FE and I met at the machine room, with a Senior System Engineer (SSE) there to assist and supervise. At this point (including the earlier RAM problems), we had had a complete failure to handle RAID recovery, 4 'bad' motherboards, 3 'bad' drives, and 2 'bad' CPUs. They were escalating internally, and the chassis was due at 9am. At 10am, the FE called the distribution center to ask where the chassis was and was told it was 'almost there'. At about 11:45 the courier arrived, bringing a few small components but not the required chassis. Someone in the warehouse had sent the wrong box. The courier explained that it would take 60-90 minutes to get us the chassis, because that is how long the drive from the warehouse to the machine room takes -- meaning he had left after 10am. So not only did the warehouse send the wrong part, they sent it after the promised delivery time; when they told us at 10am that the delivery was nearby, it hadn't even left yet. More calls, and someone explained that the chassis was not available locally -- they would have to send one from Boston, and it couldn't arrive Monday at all.

Tuesday they sent back an FE with 2 SSEs and the chassis, and the system came up. This ended the outage that had disabled 20 machines since Thursday night.

May/June

A month later, we received some disk alerts, apparently because we were supposed to mark the ZFS pool as repaired after the earlier disk replacement; we were unaware of this, and Sun never told us.
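As far as I can now tell, 'marking the pool as repaired' just means clearing the pool's error counters and, if FMA flagged the disk, telling fmadm the fault is fixed. A sketch with a hypothetical pool name; the UUID placeholder comes from fmadm's own output:

    # Hypothetical pool name 'tank'.
    zpool clear tank     # clear ZFS device error counts after the repair
    fmadm faulty         # list any outstanding FMA faults
    fmadm repair <uuid>  # mark a listed fault repaired, using the UUID from 'fmadm faulty'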

On the next reboot, Solaris started logging errors claiming that both boot disks were offline (while running from those same disks). Eventually I was told this was due to a bug in the kernel patch, which I backed out.
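Backing out a Solaris 10 patch is at least simple; the patch ID below is a placeholder, not the actual kernel patch in question:

    # Placeholder patch ID -- substitute the offending kernel patch.
    patchrm 123456-78   # remove the patch
    init 6              # reboot so the previous kernel bits are active again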

After rebooting, we started seeing errors from another disk. When I asked the Sun case owner how to fail over to the hot spare until we could physically swap the disk out, he eventually sent me an unhelpful snippet from the manual page. Our SSE sent me a separate document with the correct command.
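For anyone hitting the same problem: manually activating a hot spare is done with zpool replace. A rough sketch with hypothetical pool and device names (my reconstruction, not the document Sun sent):

    # Hypothetical pool/device names.
    zpool replace tank c2t5d0 c4t6d0   # put hot spare c4t6d0 in service for failing c2t5d0
    zpool status tank                  # watch the resilver progress
    # Later, after the bad disk has been physically swapped:
    #   zpool replace tank c2t5d0      # resilver onto the new disk in that slot
    #   zpool detach tank c4t6d0       # return the spare to the spares list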

New policy: no more Solaris patching. Between this bug and the patch that broke networking, Sun clearly no longer tests patches adequately, and we cannot trust them.

After the disk replacement, the system once again sees only 32 GB of its RAM. We are ordering a replacement storage system from another company, and will avoid breathing on this X4540 until we can migrate off it. It's clearly not trustworthy, and Sun is clearly incapable of supporting it.

Recap

Over two incidents I spent about 6 days at the machine room and well over 20 hours on the phone (much of it on hold), and watched Sun replace 4 motherboards, at least 2 CPUs, several RAM sticks (although they never just sent a full set of sixteen 4 GB DIMMs), a power distribution board, and a chassis. That is every major component of an X4540. The chassis should have been replaced Friday or Saturday, but only arrived Tuesday.

Lesson

On this one system, we have seen multiple failures of many different types.

  • Undiagnosed failure (apparently in the chassis), which prevented 4 motherboards from working.
  • SATA controller failure (the first I've ever heard of).
  • Automatic ZFS hot spares didn't fail over.
  • A 'backline' phone tech was completely wrong, and obnoxious.
  • Warehouse staff failed to send the right part, failed to deliver parts on time on all 3 days, and lied about courier/delivery status.
  • Warehouse stocking is inadequate -- it took us 3 days to get a part.
  • Support escalation was a complete failure. It took about 3 weeks before I got any response from management other than "I'll get back to you."
  • In less than 18 months, this system has experienced 2 major hardware incidents, encompassing over a week of downtime. ZFS hot sparing has not yet worked, but has instead failed twice.
  • We have twice installed recommended patches with serious flaws, once making the system entirely unusable.
  • We have had entirely too many problems appear after reboots. Perhaps there is a disk scanning process that is automatically started after rebooting, but the result is that we do not trust this machine, and are afraid to reboot it.

Oracle's support is a mess. I feel like an idiot for buying this system.

Check contract SLAs carefully. I believed this support level included parts availability within 4 hours (EMC, for instance, used to make a big deal out of their 4-hour parts availability in NYC), but Sun makes no commitment about the timeliness of parts delivery.