Extra Pepperoni


Wednesday, June 26 2013

Solaris patching is broken because Oracle is dumb and irresponsible

I am setting up a Solaris 10 system from S10U10 (Solaris 10 Update 10) to match another server. Solaris includes a registration wizard that comes up automatically after installation, but it doesn't work. Oracle updated their whole online patching system when they took over from Sun and broke the old built-in patching tools. Unfortunately the procedure to update an old system is completely byzantine.

You have this problem if the graphical Solaris Registration Wizard says "Error in SCN/Cacao Update License" when you register, or the smpatch command errors out like this:

-bash-3.2# smpatch analyze
Error: Unable to download document : "xml/motd.xml"
Cannot connect to retrieve motd.xml: Authorization Required
Failure: Cannot connect to retrieve current3.zip: Authorization Required
-bash-3.2# cat /etc/release
Oracle Solaris 10 8/11 s10x_u10wos_17b X86
Copyright (c) 1983, 2011, Oracle and/or its affiliates. All rights reserved.
Assembled 23 August 2011

If you kept your Solaris system patched during the transition period you were presumably fine, as hopefully they released client updates before they took down the old backend, but old systems like my new install get stuck.

I opened a case with Oracle Support who searched their internal database and gave me an irrelevant answer. The real fix is to run sconadm manually -- which entails manually creating a (simple) RegistrationProfile.properties file per the instructions in the sconadm(1M) manual page, embedding your Oracle Support username and password in the file because Oracle cannot do security, registering, and then immediately deleting the file.
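A minimal sketch of that register-then-delete sequence follows. It rests on assumptions: the property names (userName, password, portalEnabled) and the `sconadm register -a -r` invocation are taken from memory of sconadm(1M) and should be checked against the manual page on your system, and the write_profile helper, the /tmp path, and the ORACLE_USER/ORACLE_PASS variables are hypothetical names used only for illustration.

```shell
# Sketch only: write a throwaway registration profile that only its owner can read.
# Property names are assumed from sconadm(1M); verify them before relying on this.
write_profile() {
    # $1 = profile path, $2 = support username, $3 = support password
    rm -f "$1"
    ( umask 077
      printf 'userName=%s\npassword=%s\nportalEnabled=false\n' "$2" "$3" > "$1" )
}

# Typical use as root (not run here; path and variables are hypothetical):
#   write_profile /tmp/RegistrationProfile.properties "$ORACLE_USER" "$ORACLE_PASS"
#   sconadm register -a -r /tmp/RegistrationProfile.properties
#   rm -f /tmp/RegistrationProfile.properties
```

Creating the file under a restrictive umask at least keeps the password away from other local users for the few seconds the file exists.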

Then smpatch installed a bunch of patches and choked on 147993-05 SunOS 5.10_x86: Pidgin libraries patch. The patch instructions say to install SUNWgnome-im-client-root from the installation DVD. Our X4500s don't actually have DVD drives -- why can't I just download this package from https://support.oracle.com/? It turns out SUNWgnome-im-client-root is not on sol-10-u10-ga2-x86-dvd.iso, but it is available on sol-10-u11-ga-x86-dvd.iso.

-bash-3.2# smpatch update -i 147993-05
Installing patches from /var/sadm/spool...
Failed to install patch 147993-05.

Utility used to install the update failed  with exit code 15.
Validating patches...
Loading patches installed on the system...
Done!
Loading patches requested to install.
Done!
The following requested patches have packages not installed on the system
Package SUNWgnome-im-client-root from directory SUNWgnome-im-client-root in patch 147993-05 is not installed on the system. Changes for package SUNWgnome-im-client-root will not be applied to the system.
Checking patches that you specified for installation.
Done!
Approved patches will be installed in this order:
147993-05
Checking installed patches...
Executing prepatch script...
No SUNWgnome-im-client-root package can be found. The SUNWgnome-im-client-root
package must be installed before applying this patch.
Please see the patch README NOTE 1 for information on installing SUNWgnome-im-client-root.
The prepatch script exited with return code 1.
Patchadd is terminating.

Jun 26 09:54:09 dhcp-172-21-230-215 root: [ID 702911 user.alert]  => com.sun.patchpro.util.PatchBundleInstaller@1342a67 <=Failed to install patch 147993-05.
Failed to install patch 147993-05.
ALERT: Failed to install patch 147993-05.
/var/sadm/spool/patchpro_dnld_2013.06.26@09:54:03:EDT.txt has been moved to /var/sadm/spool/patchproSequester/patchpro_dnld_2013.06.26@09:54:03:EDT.txt

Then I just had to manually install one more patch which smpatch refused to install, and I was current. Good Times(TM)!
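For anyone hitting the same wall, here is a rough sketch of pulling the missing package name out of the patch log and installing it from an ISO image rather than a physical DVD. The missing_pkgs helper is hypothetical and only matches the "No ... package can be found" wording in the output above; the ISO location and the Solaris_10/Product directory are assumptions about a standard install image, not something confirmed here.

```shell
# Print package names that a patchadd/smpatch log (on stdin) reports as missing.
# Matches only the "No <pkg> package can be found" wording seen in the log above.
missing_pkgs() {
    sed -n 's/.*No \([A-Za-z0-9_-]*\) package can be found.*/\1/p' | sort -u
}

# Installing from an ISO without a DVD drive (not run here; paths are assumptions):
#   pkginfo -q SUNWgnome-im-client-root || echo "not installed"
#   lofiadm -a /path/to/sol-10-u11-ga-x86-dvd.iso    # prints a device such as /dev/lofi/1
#   mount -F hsfs -o ro /dev/lofi/1 /mnt
#   pkgadd -d /mnt/Solaris_10/Product SUNWgnome-im-client-root
```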

Wednesday, December 23 2009

A painful lesson on Solaris patching & /var

A few weeks ago I ran updatemanager to patch a Sun X4500 'Thumper' running Solaris 10/x86. I got an error -- there was not enough space in /var for all the relevant patches.

When I started working on Solaris 8, SOP was to make /var 2gb. When I installed Solaris 10 here, I made /var 4gb, thinking this would be plenty. Unfortunately it was inadequate -- 4gb is not enough for a "Solaris 10 10/08 s10x_u6wos_07b X86" system which hasn't been patched recently. Yowch! I had 3gb in /var/sadm/pkg (patch residue) and the system wanted to fetch an additional 3gb of patches to install in /var/sadm/spool. So clearly /var needs to be larger than 6gb, but how large should it be?

I asked at least 4 different Sun reps how large /var ought to be, and none of them had an answer -- a few told me that, as far as they are able to determine, Sun does not have any guidance on how large /var should be. I asked some friends what they do, and was surprised to discover that they make /var 10gb. Solaris 10 needs 10gb for patch residue & temp space? That's insane! Not that they are wrong -- I eventually repartitioned the server to make /var 16gb (I do not want to go through this again), and /var/sadm currently occupies 5.7gb!
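A quick way to see the squeeze coming is to compare free space in /var with what the patch tools want to download. The sketch below is only an illustration: avail_kb and has_room are hypothetical helper names, and the parsing assumes the usual single-line `df -k` layout (filesystem, kbytes, used, avail, capacity, mount point).

```shell
# Available KB, parsed from `df -k` output on stdin (header line, then one data line).
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# Succeed if the available space covers the amount needed (both in KB).
has_room() {
    # $1 = available KB, $2 = needed KB
    [ "$1" -ge "$2" ]
}

# Typical use (not run here):
#   du -sk /var/sadm/pkg /var/sadm/spool
#   has_room "$(df -k /var | avail_kb)" 3145728 || echo "/var cannot hold 3gb of patches"
```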

The Solaris installer has some built-in minima for partitioning, but the /var requirement is an absurd 204mb -- presumably for systems which will never be patched. I'll add it to the S10 installer buglist.

Solaris 10 installer: /var

Under Solaris 8, when /var ran out of space, we either repartitioned the system (always a bit risky), or made /var/sadm a symlink to space on another partition with adequate free space -- which must, of course, be available in single-user mode for single-user patches to install successfully. In Solaris 10, though, patches refuse to install if /var/sadm/pkg is a symlink.

# smpatch update
141529-01 has been validated.
141559-01 has been validated.
141883-01 has been validated.
142435-01 has been validated.
141877-05 has been validated.
Installing patches from /var/sadm/spool...
Failed to install patch 141529-01.

Utility used to install the update failed  with exit code 5.
Checking installed patches...
Patch 141529-01 failed to install due to a failure produced by pkgadd.
See /var/sadm/patch/141529-01/log for details
Patchadd is terminating.
Transition old-style patching.
Directory is expected, found link - //var/sadm/pkg.
Cannot open input /
Failed to install patch 141529-01.
Dec  7 09:51:00 thumper root: [ID 702911 user.alert]  => com.sun.patchpro.util.PatchBundleInstaller@11d2572 <=Failed to install patch 141529-01.
ALERT: Failed to install patch 141529-01.
Failed to install patch 121264-01.

Utility used to install the update failed  with exit code 5.
Checking installed patches...
Patch 121264-01 failed to install due to a failure produced by pkgadd.
See /var/sadm/patch/121264-01/log for details
Patchadd is terminating.
Transition old-style patching.
Directory is expected, found link - //var/sadm/pkg.
Cannot open input /
Dec  7 09:51:01 thumper root: [ID 702911 user.alert]  => com.sun.patchpro.util.PatchBundleInstaller@11d2572 <=Failed to install patch 121264-01.
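The "Directory is expected, found link" complaint can be checked for before patching. This is a small sketch with a hypothetical check_pkg_dir helper; it tests only what the log above complains about and makes no claim about what else patchadd inspects.

```shell
# Report whether the package database path is a real directory or a symlink.
check_pkg_dir() {
    # $1 = path to test (defaults to /var/sadm/pkg)
    d=${1:-/var/sadm/pkg}
    if [ -h "$d" ]; then
        echo "symlink: $d"
        return 1
    elif [ -d "$d" ]; then
        echo "directory: $d"
    else
        echo "missing: $d"
        return 2
    fi
}

# Typical use (not run here):
#   check_pkg_dir || echo "patches will refuse to install"
```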

One rep recommended I work around the lack of space in /var by using smpatch set to move the pkg & spool directories to another filesystem with free space. Unfortunately, with these settings, both smpatch and updatemanager suddenly claimed there were no patches to install. I used:

smpatch set patchpro.backout.directory=/export/home/sadm/pkg
smpatch set patchpro.download.directory=/export/home/sadm/spool
smpatch set patchpro.baseline.directory=/export/home/sadm/spool

I have been asking about this bug in smpatch for a couple weeks, but haven't gotten any response.
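If the relocated directories make the tools go blind like this, the settings can be backed out again. The sketch below assumes the `smpatch unset` and `smpatch get` subcommands described in smpatch(1M); the run and reset_patch_dirs helpers and the DRY_RUN variable are hypothetical conveniences for previewing the commands before running them as root.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# Return the three patchpro directory parameters to their defaults,
# then list the resulting parameter values.
reset_patch_dirs() {
    for p in backout download baseline; do
        run smpatch unset "patchpro.$p.directory"
    done
    run smpatch get
}

# Typical use (not run here):
#   DRY_RUN=1 reset_patch_dirs    # preview
#   reset_patch_dirs              # apply
```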

Sun's other suggestion was to remove patch 'undo' or 'backout' information, but this forecloses future options, so I didn't pursue it. The details (which require an active contract/warranty to see) are at http://sunsolve.sun.com/search/document.do?assetkey=1-61-208057-1.

So, kiddies, if you are making /var its own filesystem in Solaris 10, it had better be at least 10gb. Unfortunately, Sun hasn't figured this out yet, and in fact doesn't appear to understand the problem.

Every Sun rep I speak to says "Just repartition the server." as if that was no big deal. I shouldn't have to repartition the server. Sun should be warning users about this. The Solaris installer should warn users if /var isn't large enough (the current minimum is a poor joke). If /var is too small, it should be easy to work around! Why is Solaris so hung up on where it reads patches from?

A Solaris developer actually wrote code to detect symlinks and refuse to run with the old workaround, so is it too much to expect the official workaround (smpatch set) to actually work?

Monday, November 23 2009

Serial console on Solaris

On our 'new' X4540 we use a serial console. This has a few advantages:

  • It's easy to access with just ssh -- no X11 or VNC tunneling required, unlike the Java-based graphical console.
  • It's much faster & more efficient (quite usable over dial-up, which shockingly is not yet dead).
  • Serial output gets logged to our console server, so we have a record of some diagnostics.

On SPARC systems at Rockefeller, we simply didn't connect a USB keyboard -- the console automatically went to the serial port. To log in to a Sun from the machine room, we used a Linux box or PC as a terminal.

Here, though, we also want access to (x86-based) Suns via the local KVM switch, which is slightly more complicated.

Supposedly, installing in serial mode enables serial console, but I couldn't confirm this. dtconfig controls the graphical login prompt, but shouldn't be needed.

Basically I just had to run eeprom console=ttya and reboot to move the console to serial port A. This didn't interfere with graphical login.

Kernel arguments in /boot/grub/menu.lst, if specified, can override eeprom, which is how 'failsafe' boot (from the GRUB menu) works.
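To see whether a GRUB entry would override the eeprom setting, look for a console= property on its kernel line. The sketch below assumes the override is written in the usual Solaris form, e.g. `kernel /platform/i86pc/multiboot -B console=ttya`; console_overrides is a hypothetical helper, and it only looks at lines that begin with the word kernel.

```shell
# Print the console= value from each kernel line of a menu.lst read on stdin.
# Entries without a console= property print nothing and so fall back to eeprom.
console_overrides() {
    sed -n '/^kernel/s/.*console=\([^ ,]*\).*/\1/p'
}

# Typical use (not run here):
#   eeprom console                               # shows the stored setting
#   console_overrides < /boot/grub/menu.lst      # shows per-entry overrides
```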

Tuesday, May 12 2009

Bad Solaris 10 documentation: boot-adm recovery

Sun's How to Manually Update the Boot Archive on a RAID-1 (Mirror) Volume procedure says to find the root slices from the console messages -- which note md devices discovered during boot -- and embed these into /etc/vfstab to fix the boot environment and temporarily disable mirroring. Unfortunately, this guidance is incomplete and incorrect.

The first problem is that Sun's documentation instructs you to boot from 'the primary submirror.' But of course it might be corrupt (something scrambled the boot archive, after all). This week, one of our submirrors for / and both submirrors for /var showed errors under fsck -n. c4t0d0s0 (the default boot device) had problems which prevented the system making it all the way to normal multiuser state. c4t0d0s3 had moderate corruption, while c4t4d0s0 had minor corruption. Fixing /var/ was tricky, because Sun does not have any documentation which I could find on how to recover good data from one submirror onto one with bad data. They assume the only failure mode is a dead disk, and disk replacement is simple. The undocumented trick is that 'Submirror 0' is authoritative for resync operations.

root@jean:/# metastat d3|head -9
d3: Mirror
    Submirror 0: d23
      State: Okay         
    Submirror 1: d13
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 8385930 blocks (4.0 GB)
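Since the submirror numbering matters, it helps to list which device sits in which slot before touching anything. This is a rough sketch, with a hypothetical submirror_states helper, that condenses metastat output of the shape shown above; it does not perform any resync itself.

```shell
# Summarise `metastat` output on stdin as "<mirror> <slot> <device> <state>".
# Only the State line that directly follows a "Submirror N:" line is reported.
submirror_states() {
    awk '
        $2 == "Mirror"    { mirror = substr($1, 1, length($1) - 1) }
        $1 == "Submirror" { slot = substr($2, 1, length($2) - 1); dev = $3; pending = 1 }
        $1 == "State:" && pending {
            s = $2
            for (i = 3; i <= NF; i++) s = s " " $i
            print mirror, slot, dev, s
            pending = 0
        }'
}

# Typical use (not run here):
#   metastat d3 | submirror_states
```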

Problem #2: fixing / is harder, because the procedure for booting from a different disk is totally obscure and non-standard. Basically, you must edit /boot/solaris/bootenv.rc, which overrides /boot/grub/menu.lst. I don't know why Sun created a brand-new findroot command for GRUB, yet apparently doesn't actually run Solaris from the disk it specifies. At a guess, it stems from Sun's dissatisfaction with the way Linux & GRUB deal with the horrible multi-stage boot procedure required on x86 PCs. bootadm(1M) says it updates the 'boot archive', but not what a boot archive actually is, and that it also updates the GRUB configuration, but our menu.lst hasn't actually been updated since I installed the system.

root@jean:/# grep bootpath /boot/solaris/bootenv.rc 
setprop bootpath /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@0,0:a
root@jean:/# grep -v \# /boot/grub/menu.lst 
default 0
timeout 10
splashimage /boot/grub/splash.xpm.gz
title Solaris 10 10/08 s10x_u6wos_07b X86
findroot (rootfs0,0,a)
kernel /platform/i86pc/multiboot
module /platform/i86pc/boot_archive
title Solaris failsafe
findroot (rootfs0,0,a)
kernel /boot/multiboot kernel/unix -s
module /boot/x86.miniroot-safe

If you find yourself in single-user mode with a root device like /pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@0,0:a, rather than something more normal like /dev/md/dsk/d0 or /dev/dsk/c4t0d0s0, it probably means Solaris is running from a device which it cannot correlate back to a valid boot device, although you can do this manually by examining the slice symlinks in /dev/dsk/:

root@jean:/# ls -l /dev/dsk/c0t4d0s0 
lrwxrwxrwx   1 root     root          62 Dec 30 14:13 /dev/dsk/c0t4d0s0 -> ../..
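The manual correlation can be scripted by scanning the symlinks for one whose target ends in the physical path. The sketch below uses a hypothetical dev_for_path helper and parses `ls -l` output, since it assumes no readlink command is available; it expects slice names without spaces.

```shell
# Print the names of symlinks in a directory whose target ends with a physical path.
dev_for_path() {
    # $1 = physical device path, e.g. the bootpath value from bootenv.rc
    # $2 = directory of slice symlinks (defaults to /dev/dsk)
    ls -l "${2:-/dev/dsk}" | while read -r line; do
        case "$line" in
            *" -> "*"$1")
                name=${line% -> *}    # drop the link target
                echo "${name##* }"    # keep the last field, the link name
                ;;
        esac
    done
}

# Typical use (not run here):
#   dev_for_path '/pci@1,0/pci1022,7458@4/pci11ab,11ab@1/disk@0,0:a'
```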

The final major problem is that disk device paths are not stable on the X4500. Sun's instructions are to find the disk path to the root submirror from console messages (in my case, they referred to /dev/dsk/c3t0d0s0 & /dev/dsk/c3t0d0s0) and use one of these in /etc/vfstab rather than the metadevice, but when I actually booted into Solaris, those slices didn't exist, because the bootable disks were at c4 rather than c3. I had to boot back into GRUB's Failsafe mode and correct the device for / in vfstab. Sun's documentation is fine for machines with consistent disk paths, but wrong for the X4500 (and presumably the X4540 as well).

Additionally, I worried that bootadm might read c4 from the vfstab file and write to c3 (a ZFS pool disk it must not modify!), or something similarly screwy, but this turned out to be a non-issue once I'd sorted out the rest.

Tip: bootadm apparently normally caches its changes and writes them to disk when rebooting; to force bootadm to write changes immediately, add the undocumented -f flag, e.g.: bootadm update-archive -fR /a

bootadm(1M), of course, doesn't provide any useful detail on what it does.

Solaris 10: SUNWlwact errors

Today I noticed a cascade of tictimed errors on the console of a Solaris 10/x86 server:

root@jean:/export/home/pepper# grep tictimed /var/adm/messages | grep May\ 12 | head
May 12 11:24:02 jean tictimed[1111]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 11:29:10 jean tictimed[1111]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 11:29:11 jean tictimed[1111]: [ID 423602 user.error] [tictimed]: stopping on SIGTERM or SIGPWR.
May 12 11:33:01 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 11:36:38 jean tictimed[1143]: [ID 423602 user.error] [tictimed]: stopping on SIGTERM or SIGPWR.
May 12 11:40:23 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 11:47:34 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 11:53:44 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 12:00:55 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!
May 12 12:07:05 jean tictimed[1143]: [ID 921880 user.error] [tictimed]: XML file corruption detected!

I eventually discovered the errors were caused by v3.2 of the Sun Services Tools Bundle, specifically Light Weight Availability Collection Tool v3.0, and that removing SUNWlwact stopped them. I upgraded to STB v5.0, but the cascading XML errors returned, so Sun hasn't fixed the issue.

Tuesday, March 17 2009

More Sun Grief

Update 2009/10: I ran into this again installing S10U8 (released last week). Sun knows that their graphical installer is broken. It includes fdisk-based code to partition the disk, but the installer has a bad sanity check which rejects the installer's partitioning, preventing installation from completing. Their workaround is to run the text-mode installer, which lacks the bad check. Alternatively, you can manually partition the disk with fdisk before installing, because if the fdisk partitioning is not modified by the installer, the graphical installer skips its bogus sanity check.

I'm reinstalling Solaris 10 on a Thumper which came (last year) with a 2-year-old version of Solaris 10. Wanting to protect myself, I removed HDD0 (primary boot), and replaced it with HDD47 (part of the ZFS pool, which I will have to recreate because ZFS does not allow removing or changing RAID levels in a pool).

Unfortunately, when I boot the system this way, it goes through the BIOS screens and just sits there -- apparently it cannot get past the lack of a boot block on HDD0. So much for redundancy or failover! Perhaps it would have worked if I'd left bay 0 empty, but that's an unlikely scenario.

I mounted the S10U6x86 ISO from a Windows VM under Parallels (sadly, Sun's Java Remote Console app only supports virtual media on Windows -- not Mac OS X, although the console control works fine on the Mac), and ran through the installer -- which takes about an hour, because it's so amazingly slow. I booted the machine 2 blocks away, walked back to my desk, and was in time to see it probing Ethernet devices...

Anyway, I accepted the default fdisk partitioning (everything in partition 1), and set up a Solaris VToC within partition 1 (as the Solaris installer defaults, and as our other Thumper is configured), but the installer failed with a confusing error:

ERROR: The '/' slice extends beyond HBA cylinder 1023
WARNING: Change the system's BIOS default boot device for hands-off rebooting.

Note that I copied the Solaris VToC from our other Thumper which is running this way right now. The fdisk configuration is simply the default. Thinking I'd done something wrong, I rebooted (there's really no alternative at that point) and reran the installer -- another hour gone. At the end of the whole process, I got the error again. Fortunately, I found a blog post that mentions that error. Apparently the Solaris installer cannot actually create the fdisk partition which it suggests!! I put the disks back as they were, ran the installer a third time, and Solaris is installing right now -- so it can install into that fdisk layout, but it cannot partition the disk that way. FAIL.


Wednesday, January 28 2009

Solaris tip: 1gb or more

I installed Solaris 10U6 (10/08) inside Parallels 4 (I had been using VirtualBox, but the MSKCC Windows build image is a Parallels virtual disk file, not VirtualBox). Interestingly, the installer becomes more of a 'staller with Parallels' default 512mb allocation -- it just sits there showing a small console window (which by default is offscreen on my system) for longer than I care to wait. With the allocation raised to 1gb, the installer is snappy and works fine.

Monday, May 21 2007

Solaris 10 / JDS Initial Login Bug

I spent a few minutes scratching my head over this one today, so here is a donation to the Googleverse.

When you first log into a Solaris 10 system via a graphical terminal, it prompts you to select the Java Desktop System or Common Desktop Environment. It stores your preference in your home directory for future reference.

The bug is that if this fails for some reason, no error is presented, but instead the login screen comes back up.

I had created ~pepper as root and not set ownership, but ssh was working fine. When I tried logging into my Solaris 10 VM, I thought it was curdled because login could not complete. Now I think creating ~/.dt/sessions/lastsession is a hard requirement, which it should not be. There are lots of reasons one's home directory might not be writable. None of these need to prevent login, and the lack of an error message or explanation aggravates the problem.

Thursday, March 15 2007

Locking ssh Access to Solaris Accounts

Today's tip is brought to you by the letters SUNW.

As we migrate from encrypted passwords to public keys for ssh access, one of the problems we have is how to make sure an account is disabled across many machines (we're not up to authenticating against a central LDAP or AD directory yet).

Given the variations in home directory location, eccentricities of escaping commands over ssh connections (dsh makes it even worse), difficulty in recognizing public key strings, and level of paranoia appropriate when closing an account, it's very difficult to be certain a person's access has been entirely cut off (especially if they had legitimate access to elevate privilege).

There are simply a lot of files in a lot of different places to check and edit, and no way to know you've found them all without spending some quality time with each host, when we want something quick and complete.

On Solaris (but not on RHEL4), if the password field in /etc/shadow is set to *LK*, the account is not accessible via OpenSSH and public keys, although it is via local su. Just to keep things interesting, OpenSSH prompts for a password, even though it knows the account is locked! If the password field is '*', '!', or a real encrypted password prefixed with '!' (a standard way of temporarily disabling a password), ssh access is permitted with a valid public key.
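A quick audit of which accounts are actually locked might look like the sketch below. It assumes the shadow(4) layout of colon-separated fields with the password second, and treats any password field that begins with *LK* as locked; locked_accounts is a hypothetical helper.

```shell
# Print the names of accounts whose password field starts with *LK*.
# Reads shadow(4)-format lines on stdin.
locked_accounts() {
    awk -F: 'substr($2, 1, 4) == "*LK*" { print $1 }'
}

# Typical use as root (not run here):
#   passwd -l someuser                 # lock the account
#   locked_accounts < /etc/shadow      # confirm it shows up
```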

That doesn't guarantee the user didn't bury their public key in someone else's authorized_keys file, but doing this to other individuals would be clearly malicious and actionable, and the system accounts are few enough to check by hand. A better policy is simply to replace all authorized_keys files from known-good copies whenever there's a change.
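Replacing key files from known-good copies can be as simple as the sketch below. The restore_keys helper and the example paths are hypothetical; where the known-good copies live and how they are protected is left open.

```shell
# Overwrite a live authorized_keys file with its known-good copy if they differ.
restore_keys() {
    # $1 = known-good copy, $2 = live authorized_keys file
    if cmp -s "$1" "$2"; then
        echo "unchanged: $2"
    else
        cp "$1" "$2" && echo "replaced: $2"
    fi
}

# Typical use (not run here; paths are hypothetical):
#   restore_keys /secure/keys/pepper /export/home/pepper/.ssh/authorized_keys
```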

Update 2007/05/11: In Mac OS X Server, you can uncheck "access account" in Workgroup Manager, but still log in with a public key.

Under OpenSSH on Solaris, authorized_keys (and the .ssh directory) can be owned by root, handy for centralized account management, but under RHEL4 they must be owned by the user.