Extra Pepperoni

Friday, April 27 2012

Isilon Notes, 2012 Edition

General

  • Isilon provides templates for Nagios, which you should use. Unfortunately Nagios cannot distinguish serious problems (failed disk) from trivia (quota violations & bogus warnings).

Hardware

  • Isilon's current units are either 2U (12-bay 200 series) or 4U (36-bay 400 series).
  • The new NL400-108 nodes are similar enough to the older 108NL nodes that they pool together. The 108NLs are dual-socket 16gb nodes based on the 72000x chassis, which is an upgrade from the 36000x chassis. This makes them much faster than the older single-core 36NLs & 72NLs.
  • As of OneFS v6.0(?), Isilon nodes no longer use the VGA keyboard & mouse console. Instead they use the serial port exclusively as console, although the VGA port does display some booting messages. In 2011, a USB connection to a KVM made a node reboot until we disconnected USB.
  • Every node is assigned a device ID when it is joined to the cluster. All alerts are tagged with the device ID of the node reporting the event. Device IDs are never reused, so if a chassis fails and is swapped out, the replacement will get a new device ID, but the old node's hostname. If this happens to you, you may want to use isi config (with advice from Isilon Support) to change the hostname to match the device ID. With a large or dynamic cluster it might just be better to ignore device IDs and let the node names run in a contiguous sequence.

Jobs

  • Isilon's job engine is problematic. Only one job runs at a time, and jobs are not efficiently parallelized.
  • MultiScan combines Collect and AutoBalance jobs.
  • During the Mark phase of Collect (or MultiScan), with snapshots enabled, delete is slow and can cause NFS timeouts.
  • It is fine for non-disruptive jobs to run in the background for long periods, and it is understandable for high-priority jobs to briefly impact the cluster, but there are too many jobs (SmartPools, AutoBalance, Collect, MultiScan) which have a substantial impact on performance for long periods.
  • There are enough long-running jobs that it's easy to get into a cycle where as soon as one finishes another resumes, meaning a job is always running and the cluster never actually catches up. It took months for us to get this all sorted out so the jobs run safely in the background and don't interfere badly.
  • When a drive does not respond quickly, Isilon logs a 'stall' in /var/log/messages. Stalls trigger "group changes", which can trigger jobs. Group changes also prevent running jobs, including MultiScan, AutoBalance, & MediaScan, from completing. The workaround is to tune /etc/mcp/override/sysctl.conf per Isilon Support.
  • The default job priorities were dysfunctional for us. We had to alter priorities for AutoBalance, SnapshotDelete, SmartPools, and QuotaScan, and frequency for at least SmartPools. This improved somewhat in v6.5.
  • To tweak a job's priority, do not redefine an existing priority level; that change cascaded to other jobs and caused problems for us. Define a new priority instead.

Batch Jobs

  • /etc/mcp/templates/crontab is a cluster-wide crontab; field #6 is username.
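
For reference, a hypothetical entry in that crontab, in standard system-crontab format with the username in field #6 (the script path is illustrative):

# minute hour day-of-month month day-of-week user command
*/10 * * * * root /ifs/x/common/cluster/status/status.sh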

Support & Diagnostics

  • By default, Isilon's main diagnostic command, isi_gather_info, builds a tarball of configuration and logs and uploads it to EMC. This took over 15 minutes on our clusters. To make this quicker, change "Gather mode" to Incremental under Help:Diagnostics:Settings.
  • Isilon does not actually maintain an HTTP upload server, so uncheck HTTP upload to avoid a wasted timeout.
  • When a node crashes it logs a core in /var/crash, which can fill up. Upload the log with 'isi_gather_info -s "isi_hw_status -i" -f /var/crash' on the affected node before deleting it.

Network & DNS

  • Isilon is "not compatible" with firewalls, so client firewalls must be configured to allow all TCP & UDP ports from Isilon nodes & pools back to NFS clients (and currently SNMP consoles); see the iptables sketch after this list.
  • Specifically, there is a bug where SNMP responses come from the node's primary IP. iptables on our Nagios console dropped responses which came from a different IP than Nagios queried.
  • To use SmartConnect you must delegate the Isilon domain names to the SmartConnect resolver on the cluster. We were unable to use DNS forwarding in BIND with this delegation active.
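
A minimal sketch of "allow everything back from the cluster" for an iptables-based client -- the source subnet is illustrative, and your policy may call for something narrower:

# Accept all TCP & UDP traffic from the Isilon nodes and pools
iptables -A INPUT -s 10.0.10.0/24 -p tcp -j ACCEPT
iptables -A INPUT -s 10.0.10.0/24 -p udp -j ACCEPT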

NFS

  • By default Isilon exports a shared large /ifs filesystem from all nodes. They suggest mounting with /etc/fstab options rw,nfsvers=3,rsize=131072,wsize=524288.
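
A sample /etc/fstab line using those options (the SmartConnect name and mount point are illustrative):

cluster:/ifs  /mnt/isilon  nfs  rw,nfsvers=3,rsize=131072,wsize=524288  0 0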

CIFS

  • Migrating an IP to another node disconnects CIFS clients of that IP.
  • CIFS clients should use their own static SmartConnect pools rather than connecting to the dynamic SmartConnect pools intended for NFS clients.

Load Balancing

  • Rather than real-time load balancing, Isilon handles load balancing through its built-in DNS server (SmartConnect: Basic or Advanced). Because this happens at connection time, the cluster cannot manage load between clients which are already connected, except via "isi networks --sc-rebalance-all", which shuffles server-side IPs to even out load. Unfortunately OneFS (as of v6.5) does not track utilization statistics for network connections, so it cannot intelligently determine how much traffic each IP represents. This means only Round Robin and Connection Count are suitable for the "IP failover policy" (rebalancing) -- "Network Throughput" & "CPU Usage" don't work.
  • High availability is handled by reassigning IPs to different nodes in case of failure. For NFS this is seamless, but for CIFS this causes client disconnection. As a result CIFS clients must connect to static pools, and "isi networks --sc-rebalance-all" should never be run on clusters with CIFS clients (there is apparently a corresponding command to rebalance a single pool, suitable for manual use on each dynamic pool).

Quotas

  • Some of the advantage of the single filesystem is lost because it is impossible to move files from one quota domain to another. This forces us to copy (rsync) and then delete, as if each quota were its own mount point; see the sketch after this list.
  • For user quota reporting, each user should have an account (perhaps via LDAP or AD) on the cluster.
  • For user quota notifications, each user must have an email mapping (we created aliases to route machine account quota notifications to the right users).
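
The copy-then-delete dance looks roughly like this -- a minimal sketch, assuming /ifs/proj/old and /ifs/proj/new live under different quota domains (paths are illustrative; consider an rsync --dry-run first):

# Copy into the destination quota domain; remove the source only if rsync succeeded
rsync -aH /ifs/proj/old/ /ifs/proj/new/ && rm -rf /ifs/proj/old

The && keeps the source intact if the copy fails partway.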

Bugs

  • Unchecking the user Enable checkbox disables all login access (but preserves UID mappings for quota reports): it blocks both ssh and CIFS/SMB access and clears the user's password.
  • You cannot create a user whose home directory already exists (even with --force). Workaround: move the directory aside before creating the user, or create the user with a bogus home directory (each bogus path can only be used once) and fix it afterward with "isi auth local user modify".
  • Don't use more than 8 SyncIQ policies (I don't know if this bug has been fixed).
  • Gateways and priorities are not clear, but if there are 2 gateways with the same priority the cluster can get confused and misbehave. The primary gateway should have the lowest priority number (1).
  • We heard one report that advisory quotas on a SyncIQ target cluster caused SyncIQ errors.
  • The Virtual Hot Spare feature appears to reserve twice as many drives as are specified in the UI, and does not work as described.

Support

  • Support is very slow. SLAs apparently only apply to parts delivery -- our 4-hour service does not prevent Isilon from saying they will answer questions in a few days.
  • Support is constantly backlogged. Callback times are rarely met, and cases are often not followed up unless we call in to prod Support.
  • My process for opening a case looks like this:
    1. Run uname -a; isi_hw_status -i; isi_gather_info.
    2. Paste output from first 2 commands and gather filename into email message.
    3. Describe problem and send email to support@.
    4. A while later we get a confirmation email with a case number.
    5. A day or two later I get tired of waiting and phone Isilon support.
    6. I punch in my case number from the acknowledgement.
    7. I get a phone rep and repeat the case number.
    8. The phone rep transfers me to a level 1 support rep, who as a rule cannot answer my question.
    9. The L1 rep tries to reach an L2 rep to address my question. They are often unable to reach anyone(!!!), and promise a callback as soon as they find an L2 rep.
    10. As a rule, I do not receive a callback.
    11. Eventually I give up on waiting and call in again.
    12. I describe my problem a third time.
    13. The L1 tech goes off to find an answer.
    14. I may have to call back in and prod L1 multiple times (there is no way for me to reach L2 directly).
    15. Eventually I get an answer. This process often takes over a week.
  • Support provides misinformation too often. Most often this is simple ignorance or confusion, but it appears to be EMC policy to deny that any problem affects multiple sites.

Commands

For manual pages, use an underscore (e.g., man isi_statistics). The command line is much more complete than the web interface but not completely documented. Isilon uses zsh with customized tab completion. When opening a new case include output from "uname -a" & "isi_hw_status -i", and run isi_gather_info.
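
That case-opening boilerplate can be collected in one pass -- a minimal sketch, run on any node (the output path is illustrative):

#!/bin/sh
# Gather the information Isilon Support wants with every new case
(uname -a; isi_hw_status -i) > /ifs/case-info.txt   # paste this into the email
isi_gather_info                                     # note the tarball filename it reports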

  • isi_for_array -s: Execute a command on all nodes, in order.
  • isi_hw_status -i: Node model & serial number -- include this with every new case.
  • isi status: Node & job status. -n# for particular node, -q to skip job status, -d for SmartPool utilization; we use isi status -qd more often.
  • isi statistics pstat --top & isi statistics protocol --protocol=nfs --nodes=all --top --long --orderby=Ops
  • isi networks
  • isi alerts list -A -w: Review all alerts.
  • isi alerts cancel all: Clear existing alerts, including the throttled critical errors message. Better than the "quiet" command, which can suppress future errors as well.
  • isi networks --sc-rebalance-all: Redistribute SmartConnect IPs to rebalance load. Not suitable for clusters with CIFS shares.
  • du -A: Size, excluding protection overhead, from an Isilon node.
  • du --apparent-size: Size, excluding protection overhead, from a Linux client.
  • isi devices: List disks with serial numbers.
  • isi snapshot list --schedule
  • isi snapshot usage | grep -v '0.0'
  • isi quota list --show-with-no-overhead / isi quota list --show-with-overhead / isi quota list --recurse-path=/ifs/nl --directory
  • isi quota modify --directory --path=/ifs/nl --reset-notify-state
  • isi job pause MultiScan / isi job resume MultiScan
  • isi job config --path jobs.types.filescan.enabled=False: Disable MultiScan.
  • isi_change_list (unsupported): List changes between snapshots.
  • sysctl -n hw.physmem: Check RAM.
  • isi devices -a smartfail -d 1:bay6 / isi devices -a stopfail -d 1:bay6 (stopfail is not normally appropriate)
  • isi devices -a add -d 12:10: Use new disk in node 12, bay 10.
  • date; i=0; while [ $i -lt 36 ]; do isi statistics query --nodes=1-4 --stats=node.disk.xfers.rate.$i; i=$[$i+1]; done # Report disk IOPS(?) for all disks in nodes 1-4 -- 85-120 is apparently normal for SATA drives.
  • isi networks modify pool --name *$NETWORK*:*$POOL* --sc-suspend-node *$NODE*: Prevent $POOL from offering $NODE for new connections, without interfering with active connections. --sc-resume-node to undo.
  • isi_lcd_d restart: Reset LEDs.
  • isi smb config global modify --access-based-share-enum=true: Restrict SMB shares to authorized users (global version); isi smb config global list | grep access-based: verify (KB #2837)
  • ifa isi devices | grep -v HEALTHY: Find problem drives.
  • isi quota create --path=$PATH --directory --snaps=yes --include-overhead --accounting
  • cd /ifs; touch LINTEST; isi get -DD LINTEST | grep LIN; rm LINTEST: Find the current maximum LIN.

Thursday, August 18 2011

Cluster job distribution & general Isilon status

Users of our Isilon clusters need basic status information, so every 10 minutes our clusters run status.sh per /etc/mcp/templates/crontab. This provides a variety of useful information to anyone with access to the Isilon shared filesystem, without giving them shell access to the cluster nodes or requiring them to remember the command syntax.

We now need to run some large/slow jobs, so I wanted a list of nodes in least-busy order. Obviously Isilon tracks this so SmartConnect can send connections to the least loaded node when using the "CPU Usage" connection policy, but it's not available to user scripts. The pipeline to provide a list of nodes sorted by lowest utilization to highest is applicable to all clusters, though -- just swap in the appropriate local cluster-wide execution command for isi_for_array.

status.sh

#!/bin/sh
# Record basic cluster health information

PREFIX=/ifs/x/common/cluster/status

isi status                   > $PREFIX/status.log
isi status -q -d             > $PREFIX/pool.log
isi job status -v            > $PREFIX/job.log
isi quota list               > $PREFIX/quota.log
isi quota list|grep -v :|grep -v default- > $PREFIX/quota-short.log
isi snapshot list -l         > $PREFIX/snapshot.log
isi snapshot usage | tail -1 > $PREFIX/snapshot-total.log
isi sync policy report | tail > $PREFIX/synciq.log
isi_for_array -s uptime      > $PREFIX/uptime.log
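# node names sorted from least to most loaded, for picking where to run big jobs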
isi_for_array uptime | tr -d :, | awk '{print $12, $1}' | sort -n | awk '{print $2}' > $PREFIX/ordered-nodes.txt

Wednesday, July 27 2011

OpenSSH is smart about cluster hostkeys

Normally, the first time you ssh to a new server, OpenSSH asks for permission to store the server's hostname (and IP) along with its unique ssh hostkey in ~/.ssh/known_hosts. Then if the hostkey ever changes, either because the machine was rebuilt or because you're connected to a different machine (as would be the case if someone intercepted your connection, for instance...), OpenSSH complains loudly that something is hinky:

pepper@teriyaki:~$ ssh cluster uname -a
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@       WARNING: POSSIBLE DNS SPOOFING DETECTED!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The DSA host key for cluster has changed,
and the key for the corresponding IP address 10.0.10.124
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the DSA host key has just been changed.
The fingerprint for the DSA key sent by the remote host is
f7:b0:d4:11:2c:6c:ec:be:96:f0:88:71:d9:26:20:0c.
Please contact your system administrator.
Add correct host key in /Users/pepper/.ssh/known_hosts to get rid of this message.
Offending key in /Users/pepper/.ssh/known_hosts:81
DSA host key for cluster has changed and you have requested strict checking.
Host key verification failed.

This is a nuisance with high-availability (HA) clusters, where multiple nodes may share a single hostname and IP. The first time you connect to a shared IP everything works and you store the hostkey for whichever node accepted your connection. Then it may continue to work for a long time, if you keep connecting to the same node. But when you get a different node at that IP, OpenSSH detects it's a different machine (hostkey), and either the connection fails (if it's non-interactive) or you get the scary warning (if it's interactive). To avoid this, the convention is to ssh directly into individual nodes for administration.

But some of our sequencers use rsync-over-ssh to export data to our Isilon storage clusters, so we had a problem. If we configured them to connect to the VIP (like NFS clients), things would break when they connected to different nodes. But if we configured them to connect to individual nodes, we'd lose failover -- if any Isilon node went down, all of 'its' clients would stop transferring data until it came back up.

I briefly considered synchronizing the ssh hostkeys between nodes to avoid the hostkey errors, but this is poor security -- if every node shares the same hostkey, any one node can eavesdrop on connections to all of its peers, and changing keys later is disruptive.

Fortunately the OpenSSH developers are way ahead of me. If the hostkey is already on file as valid for a known host -- even if there are other conflicting keys on file for the same host -- OpenSSH accepts it.

To set this up, just ssh to each node, then append the cluster hostname and IPs to their entries in ~/.ssh/known_hosts or /etc/ssh/ssh_known_hosts.

cluster-1,10.0.10.101,cluster,10.0.10.121,10.0.10.122,10.0.10.123,10.0.10.124 ssh-dss AAAAB3NzaC1kc3MAAACBA...
cluster-2,10.0.10.102,cluster,10.0.10.121,10.0.10.122,10.0.10.123,10.0.10.124 ssh-dss AAAAB3NzaC1kc3MAAACBA...
cluster-3,10.0.10.103,cluster,10.0.10.121,10.0.10.122,10.0.10.123,10.0.10.124 ssh-dss AAAAB3NzaC1kc3MAAACBA...
cluster-4,10.0.10.104,cluster,10.0.10.121,10.0.10.122,10.0.10.123,10.0.10.124 ssh-dss AAAAB3NzaC1kc3MAAACBA...
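
The entries above were assembled by hand; a minimal sketch of generating them with ssh-keyscan instead (node and cluster names are from the example above -- add each node's own IP to the host list as well if you connect by IP):

#!/bin/sh
# Append known_hosts entries listing the shared cluster name and VIPs
# next to each node's own name, so any node's hostkey is accepted.
SHARED="cluster,10.0.10.121,10.0.10.122,10.0.10.123,10.0.10.124"
for node in cluster-1 cluster-2 cluster-3 cluster-4; do
    # ssh-keyscan prints "host keytype key"; fold the shared names into the host field
    ssh-keyscan -t dsa "$node" 2>/dev/null |
        awk -v extra="$SHARED" '{ print $1 "," extra, $2, $3 }'
done >> ~/.ssh/known_hosts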

Monday, January 31 2011

Isilon Cluster

Our old bulk storage is Apple Xserve RAIDs. They are discontinued and service contracts are expiring, so we have been evaluating small-to-medium storage options for some time. Our more modern stuff is a mix of Solaris 10 (ZFS) on Sun X4500/X4540 chassis (48 * 1tb SATA; discontinued), and Nexsan SATABeasts (42 SATA drives, either 1tb or 2tb) attached to Linux hosts, with ext3 filesystems. We are not buying any more Sun hardware or switching to FreeBSD for ZFS, and ext4 does not yet support filesystems over 16tb. Breaking up a nice large array into a bunch of 16tb filesystems is annoying, but moving (large) directories between filesystems is really irritating.

We eventually decided on a 4-node cluster of Isilon IQ 32000X-SSD nodes. Each ISI36 chassis is a 4U (7" tall) server with 24 3.5" drive bays on the front and 12 on the back. In our 32000X-SSD models, bays #1-4 are filled with SSDs (apparently 100gb each, currently usable only for metadata) and the other 32 bays hold 1tb SATA drives, thus the name. Each of our nodes has 2 GE ports on the motherboard and a dual-port 10GE card.

Isilon's OneFS operating system is based on FreeBSD, with their proprietary filesystem and extra bits added. Their OneFS cluster file system is cache coherent: inter-node lookups are handled over an InfiniBand (DDR?) backend, so any node can serve any request; most RAM on the nodes is used as cache. Rather than traditional RAID 5 or 6, the Isilon cluster stripes data 'vertically' across nodes, so it can continue to operate despite loss of an entire node. This means an Isilon cluster must consist of at least 3 matching nodes, just like a RAID5 must consist of at least 3 disks. Unfortunately, this increases the initial purchase cost considerably, but cost per terabyte decreases as node count grows, and the incremental system administration burden per node is much better than linear.

Routine administration is managed through the web interface, although esoteric options require the command line. Isilon put real work into the Tab completion dictionaries. This is quite helpful when exploring the command line interface, but the (zsh based) completions are not complete -- neither are the --help messages nor the manual pages, unfortunately.

There are many good things about Isilon.

Pros

  • Single filesystem & namespace. This sounds minor but is essential for coping with large data sets. Folders can be arbitrarily large and all capacity is available to all users/shares, subject to quotas.
  • Cost per terabyte decreases with node count, as parity data becomes a smaller proportion of total disk capacity.
  • Aggregate performance increases with node count -- total cache increases, and number of clients per server is reduced.
  • Administration burden is fairly flat with cluster growth.
  • The FlexProtect system (based on classic RAID striping-with-parity and mirroring, but between nodes rather than within nodes/shelves) is flexible and protects against whole-node failure.
  • NFS and CIFS servers are included in the base price.
  • Isilon's web UI is reasonably simple, but exposes significant power.
  • The command line environment is quite capable, and Tab completion improves discoverability.
  • Quotas are well designed, and flexible enough to use without too much handholding for exceptions.
  • Snapshots are straightforward and very useful. They are comparable to ZFS snapshots -- much better than Linux LVM snapshots (ext3 does not support snapshots directly).
  • The nodes include NVRAM and battery backup for safe high-speed writes.
  • Nodes are robust under load. Performance degrades predictably as load climbs, and we don't have to worry about pushing so hard the cluster falls over.
  • Isilon generally handles multiple network segments with aplomb.
  • The storage nodes provide complete services -- they do not require Linux servers to front-end services, or additional high availability support.
  • The disks are hot swap, and an entire chassis can be removed for service without disrupting cluster services.
  • Because the front end is gigabit Ethernet (or 10GE), an Isilon storage cluster can serve an arbitrarily large number of clients without expensive fibre channel HBAs and switches.

And, of course, some things are less good.

Cons

  • Initial/minimum investment is high: 3 matching nodes, 2 InfiniBand switches, and licenses.
  • Several additional licenses are required for full functionality.
  • Isilon is not perfectionistic about the documentation -- in fact, the docs are incomplete.
  • Isilon is not as invested in the supporting command-line environment as I had hoped.
  • The round-robin load balancing works by delegating a subdomain to the Isilon cluster. Organizationally, this might be complicated.
  • CIFS integration requires AD access for accounts. This might also be logistically difficult.
  • Usable capacity is unpredictable and varies based on data composition.
  • There are always two different disk utilization numbers: actual data size, and including protection. This is confusing compared to classic RAID, where users only see unique data size.
  • There is no good way for users to identify which node they're connected to. Administrators can determine this, but it's awkward, and generally not worth going beyond the basic web charts.
  • Support can be frustrating.
    • We often get responses from many people on the same case, and rehashing the background repeatedly wastes time.
    • Some reps are very good; but some are poor, with wrong answers, pointless instructions, and a disappointing lack of knowledge about the technology and products.
    • We are frequently asked for system name & serial number, and asked to upload a status report with isi_gather_info -- even when this is all already on file.
    • Minor events trigger email asking if we need help, even when we're in the middle of scheduled testing.
  • The cluster is built of off-the-shelf parts, and the integration is not always complete. For instance, we are not alerted of problems with an InfiniBand switch, because things like a faulted PSU are not visible to the nodes and not logged.
  • Many commands truncate output to 80 columns -- even when the terminal is wider. To see full output add -w.
  • When the system is fully up, the VGA console does not show a prompt. This makes it harder to determine whether a node has booted successfully.
  • There is only one bit of administrative access control: when users log in, they either have access to the full web interface and command-line tools, or they don't. There is no read-only or 'operator' mode.
  • Running out of space (or even low on space) is apparently dangerous.
  • One suggestion was to reserve one node's worth of disks as free space, so the whole cluster can run with a dead node. In a 4-node configuration, reserving 25% of raw space for robustness (in addition to 25% for parity) would mean 50% utilization at best, which is generally not feasible. In fairness, it is rare for a storage array to even attempt to work around a whole shelf failure, but most (non-Isilon) storage shelves are simple enclosures with fewer and simpler failure modes...
  • SmartConnect is implemented as a DNS server, but it's incomplete -- it only responds to A record requests, which causes errors when programs like host attempt other queries.
  • The front panels are finicky. The controls are counterintuitive, the LED system is prone to bizarre (software) failure modes, and removing the front panel to access the disks raises an obscure but scary alert.

Notes

  • On Isilon nodes, use du -Sl to get size without protection overhead. On Linux clients, use du --apparent-size.
  • Client load balancing is normally managed via DNS round robin, with the round robin addresses automatically redistributed in case of a node failure. This is less granular and balanced than you'd get from a full load balancer, but much simpler.
  • EMC has bought Isilon. I'm not sure what the impact will be, but I am not confident this will be a good thing over the long term.
  • In BIND (named), subdomain delegation is incompatible with forwarding. Workaround: Add forwarders {}; to zone containing Isilon NS record.
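
A hypothetical named.conf fragment showing that workaround (zone name and file are illustrative):

// Parent zone containing the NS record that delegates to SmartConnect
zone "example.com" {
    type master;
    file "example.com.zone";
    forwarders {};   // disable global forwarding for this zone so the delegation is followed
};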

Future

  • All that said, we are getting more Isilon storage -- it seems like the best fit for our requirements.