General

  • Isilon provides templates for Nagios, which you should use. Unfortunately Nagios cannot distinguish serious problems (failed disk) from trivia (quota violations & bogus warnings).

Hardware

  • Isilon's current units are either 2U (12-bay 200 series) or 4U (36-bay 400 series).
  • The new NL400-108 nodes are similar enough to the older 108NL nodes that they pool together. The 108NLs are dual-socket 16 GB nodes based on the 72000x chassis, which is an upgrade from the 36000x chassis. This makes them much faster than the older single-core 36NLs & 72NLs.
  • As of OneFS v6.0(?), Isilon nodes no longer use the VGA keyboard & mouse console. Instead they use the serial port exclusively as console, although the VGA port does display some booting messages. In 2011, a USB connection to a KVM made a node reboot repeatedly until we disconnected USB.
  • Every node is assigned a device ID when it is joined to the cluster. All alerts are tagged with the device ID of the node reporting the event. Device IDs are never reused, so if a chassis fails and is swapped out, the replacement will get a new device ID, but the old node's hostname. If this happens to you, you may want to use isi config (with advice from Isilon Support) to change the hostname to match the device ID. With a large or dynamic cluster it might just be better to ignore device IDs and let the node names run in a contiguous sequence.

Jobs

  • Isilon's job engine is problematic. Only one job runs at a time, and jobs are not efficiently parallelized.
  • MultiScan combines Collect and AutoBalance jobs.
  • During the Mark phase of Collect (or MultiScan), with snapshots enabled, delete is slow and can cause NFS timeouts.
  • It is fine for non-disruptive jobs to run in the background for long periods, and it is understandable for high-priority jobs to briefly impact the cluster, but there are too many jobs (SmartPools, AutoBalance, Collect, MultiScan) which have a substantial impact on performance for long periods.
  • There are enough long-running jobs that it's easy to get into a cycle where as soon as one finishes another resumes, meaning a job is always running and the cluster never actually catches up. It took months for us to get this all sorted out so the jobs run safely in the background and don't interfere badly.
  • When a drive does not respond quickly, Isilon logs a 'stall' in /var/log/messages. Stalls trigger "group changes", which can trigger jobs. Group changes also prevent jobs including MultiScan, AutoBalance, & MediaScan from completing. The workaround is to tune /etc/mcp/override/sysctl.conf per Isilon Support.
  • The default job priorities were dysfunctional for us. We had to alter priorities for AutoBalance, SnapshotDelete, SmartPools, and QuotaScan, and frequency for at least SmartPools. This improved somewhat in v6.5.
  • To tweak job priority, do not redefine an existing priority. Doing so caused problems for us because the change cascaded to other jobs. Define a new priority instead.

Batch Jobs

  • /etc/mcp/templates/crontab is a cluster-wide crontab; field #6 is username.
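
As a rough illustration of that layout, the sketch below formats one entry with the username in field 6. The function name crontab_entry and the script path in the usage note are hypothetical, and the field order is assumed to follow the FreeBSD system crontab convention (five time fields, then user, then command).

```shell
# Print one line for a system-style crontab:
# minute hour day-of-month month day-of-week user command
crontab_entry() {
    ce_schedule=$1   # the five time fields, e.g. "0 2 * * 0"
    ce_user=$2       # field 6: the account the command runs as
    shift 2
    # Whatever remains after the shift is the command and its arguments.
    printf '%s\t%s\t%s\n' "$ce_schedule" "$ce_user" "$*"
}
```

For example, crontab_entry '0 2 * * 0' root /ifs/scripts/report.sh prints a line that would run the (hypothetical) report script as root at 02:00 every Sunday.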

Support & Diagnostics

  • By default, Isilon's main diagnostic command, isi_gather_info, builds a tarball of configuration and logs and uploads it to EMC. This took over 15 minutes on our clusters. To make this quicker, change "Gather mode" to Incremental under Help:Diagnostics:Settings.
  • Isilon does not actually maintain an HTTP upload server, so uncheck HTTP upload to avoid a wasted timeout.
  • When a node crashes it logs a core in /var/crash, which can fill up. Upload the log with 'isi_gather_info -s "isi_hw_status -i" -f /var/crash' on the affected node before deleting it.
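
Since /var/crash can fill up, it may help to check its usage before and after cleaning out cores. The following is a minimal sketch using only standard df and awk; the function names and the threshold are assumptions for illustration, not Isilon tooling.

```shell
# Print the percentage of the filesystem holding $1 that is in use.
fs_use_pct() {
    # With -P, df prints one line per filesystem; column 5 is "NN%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding $1 is at or above $2 percent full.
warn_if_full() {
    wif_pct=$(fs_use_pct "$1")
    # An empty result means df could not report on the path.
    [ -n "$wif_pct" ] || return 1
    if [ "$wif_pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${wif_pct}% full"
    fi
}
```

On a node, warn_if_full /var/crash 80 would print a warning once the filesystem holding /var/crash reaches 80% (an arbitrary example threshold).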

Network & DNS

  • Isilon is "not compatible" with firewalls, so client firewalls must be configured to allow all TCP & UDP ports from Isilon nodes & pools back to NFS clients (and currently SNMP consoles).
  • Specifically, there is a bug where SNMP responses come from the node's primary IP. iptables on our Nagios console dropped responses which came from a different IP than Nagios queried.
  • To use SmartConnect you must delegate the Isilon domain names to the SmartConnect resolver on the cluster. We were unable to use DNS forwarding in BIND with this delegation active.
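
One possible workaround for the SNMP item above is to accept SNMP replies from every node address on the monitoring host. The sketch below only prints candidate iptables rules for review; it does not apply them. The function name is hypothetical, the 192.0.2.x addresses in the usage note are placeholders, and whether appending (-A) is sufficient depends on the rules already in your INPUT chain.

```shell
# Print (do not apply) one iptables rule per Isilon address that accepts
# SNMP replies (UDP source port 161), whichever node IP they come from.
snmp_accept_rules() {
    for sar_addr in "$@"; do
        printf 'iptables -A INPUT -p udp -s %s --sport 161 -j ACCEPT\n' "$sar_addr"
    done
}
```

For example, snmp_accept_rules 192.0.2.11 192.0.2.12 prints two rules, which can be inspected and then run by hand as root on the Nagios console.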

NFS

  • By default Isilon exports a single large shared /ifs filesystem from all nodes. They suggest mounting with /etc/fstab options rw,nfsvers=3,rsize=131072,wsize=524288.
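
To show where those options go, here is a small sketch that prints a complete /etc/fstab line for a Linux client. The function name, the server name isilon.example.com, and the mount point /mnt/ifs are placeholders, not values from our clusters.

```shell
# Print an /etc/fstab line for an NFSv3 mount of /ifs using the options above.
ifs_fstab_line() {
    ifl_server=$1       # SmartConnect zone name or node address
    ifl_mountpoint=$2   # local directory to mount on
    printf '%s:/ifs\t%s\tnfs\trw,nfsvers=3,rsize=131072,wsize=524288\t0 0\n' \
        "$ifl_server" "$ifl_mountpoint"
}
```

For example, ifs_fstab_line isilon.example.com /mnt/ifs prints a line that can be reviewed and appended to /etc/fstab.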

CIFS

  • Migrating an IP to another node disconnects CIFS clients of that IP.
  • CIFS clients should connect to their own static SmartConnect pools rather than to the dynamic SmartConnect pools used by NFS clients.

Load Balancing

  • Rather than real-time load balancing, Isilon handles load-balancing through its built-in DNS server (SmartConnect: Basic or Advanced). Because this happens at connection time, the cluster cannot manage load between clients which are already connected, except via "isi networks --sc-rebalance-all", which shuffles server-side IPs to even out load. Unfortunately OneFS (as of v6.5) does not track utilization statistics for network connections, so it cannot intelligently determine how much traffic each IP represents. This means only Round Robin and Connection Count are suitable for "IP failover policy" (rebalancing) -- "Network Throughput" & "CPU Usage" don't work.
  • High availability is handled by reassigning IPs to different nodes in case of failure. For NFS this is seamless, but for CIFS this causes client disconnection. As a result CIFS clients must connect to static pools, and "isi networks --sc-rebalance-all" should never be run on clusters with CIFS clients (there is apparently a corresponding command to rebalance a single pool, suitable for manual use on each dynamic pool).

Quotas

  • Some of the advantage of the single filesystem is lost because it is impossible to move files from one quota under another. This forces us to copy (rsync) and then delete as if each quota were its own mount point.
  • For user quota reporting, each user should have an account (perhaps via LDAP or AD) on the cluster.
  • For user quota notifications, each user must have an email mapping (we created aliases to route machine account quota notifications to the right users).
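
The copy-then-delete workaround for moving data between quotas can be sketched roughly as follows. The function name is hypothetical, and the fallback to cp -pR when rsync is missing is an assumption added for portability; on real data you would likely want to compare source and destination before deleting anything.

```shell
# Move a directory tree across a quota boundary: copy it, then delete the
# source only if the copy command reported success.
move_across_quota() {
    maq_src=$1
    maq_dst=$2
    [ -d "$maq_src" ] || { echo "no such directory: $maq_src" >&2; return 1; }
    mkdir -p "$maq_dst" || return 1
    if command -v rsync >/dev/null 2>&1; then
        # -a preserves permissions, ownership, times, and symlinks.
        rsync -a "$maq_src"/ "$maq_dst"/ || return 1
    else
        cp -pR "$maq_src"/. "$maq_dst"/ || return 1
    fi
    rm -rf "$maq_src"
}
```

Usage would look like move_across_quota /ifs/old-quota/project /ifs/new-quota/project, where both paths are placeholders.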

Bugs

  • Unchecking the user Enable checkbox disables all login access (but preserves UID mappings for quota reports): it blocks both ssh and CIFS/SMB access and clears the user password.
  • You cannot create a user with a home directory that exists (even with --force). Workaround: move the directory aside before creating the user, or create the user with a bogus home directory (which can only be used once) and use "isi auth local user modify" to fix it after creation.
  • Don't use more than 8 SyncIQ policies (I don't know if this bug has been fixed).
  • Gateways and priorities are not clear, but if there are 2 gateways with the same priority the cluster can get confused and misbehave. The primary gateway should have the lowest priority number (1).
  • We heard one report that advisory quotas on a SyncIQ target cluster caused SyncIQ errors.
  • The Virtual Hot Spare feature appears to reserve twice as many drives as are specified in the UI, and they do not work as described.

Support

  • Support is very slow. SLAs apparently only apply to parts delivery -- our 4-hour service does not prevent Isilon from saying they will answer questions in a few days.
  • Support is constantly backlogged. Promised callback times are rarely met, and cases are often not followed up unless we call in to prod Support.
  • My process for opening a case looks like this:
    1. Run uname -a; isi_hw_status -i; isi_gather_info.
    2. Paste output from first 2 commands and gather filename into email message.
    3. Describe problem and send email to support@.
    4. A while later we get a confirmation email with a case number.
    5. A day or two later I get tired of waiting and phone Isilon support.
    6. I punch in my case number from the acknowledgement.
    7. I get a phone rep and repeat the case number.
    8. The phone rep transfers me to a level 1 support rep, who as a rule cannot answer my question.
    9. The L1 rep tries to reach an L2 rep to address my question. They are often unable to reach anyone(!!!), and promise a callback as soon as they find an L2 rep.
    10. As a rule, I do not receive a callback.
    11. Eventually I give up on waiting and call in again.
    12. I describe my problem a third time.
    13. The L1 tech goes off to find an answer.
    14. I may have to call back in and prod L1 multiple times (there is no way for me to reach L2 directly).
    15. Eventually I get an answer. This process often takes over a week.
  • Support provides misinformation too often. Most often this is simple ignorance or confusion, but it appears to be EMC policy to deny that any problem affects multiple sites.
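
Steps 1 and 2 of the case-opening process above can be scripted so the output is ready to paste into the email. This is only a sketch: the function name and section markers are made up, and it runs isi_hw_status only where that command exists (that is, on a cluster node).

```shell
# Collect the text to paste into a new support email.
case_header() {
    echo "== uname -a =="
    uname -a
    echo "== isi_hw_status -i =="
    if command -v isi_hw_status >/dev/null 2>&1; then
        isi_hw_status -i
    else
        echo "(isi_hw_status not found; run this on a cluster node)"
    fi
}
```

Something like case_header > /tmp/case.txt saves the output to a file; isi_gather_info is still run separately so that its gather filename can be added to the message.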

Commands

For manual pages, use an underscore (e.g., man isi_statistics). The command line is much more complete than the web interface but not completely documented. Isilon uses zsh with customized tab completion. When opening a new case include output from "uname -a" & "isi_hw_status -i", and run isi_gather_info.

  • isi_for_array -s: Execute a command on all nodes in order.
  • isi_hw_status -i: Node model & serial number -- include this with every new case.
  • isi status: Node & job status. -n# for particular node, -q to skip job status, -d for SmartPool utilization; we use isi status -qd more often.
  • isi statistics pstat --top & isi statistics protocol --protocol=nfs --nodes=all --top --long --orderby=Ops
  • isi networks
  • isi alerts list -A -w: Review all alerts.
  • isi alerts cancel all: Clear existing alerts, including the throttled critical errors message. Better than the Quiet command, which can suppress future errors as well.
  • isi networks --sc-rebalance-all: Redistribute SmartConnect IPs to rebalance load. Not suitable for clusters with CIFS shares.
  • du -A: Size, excluding protection overhead, from an Isilon node.
  • du --apparent-size: Size, excluding protection overhead, from a Linux client.
  • isi devices: List disks with serial numbers.
  • isi snapshot list --schedule
  • isi snapshot usage | grep -v '0.0'
  • isi quota list --show-with-no-overhead / isi quota list --show-with-overhead / isi quota list --recurse-path=/ifs/nl --directory
  • isi quota modify --directory --path=/ifs/nl --reset-notify-state
  • isi job pause MultiScan / isi job resume MultiScan
  • isi job config --path jobs.types.filescan.enabled=False: Disable MultiScan.
  • isi_change_list (unsupported): List changes between snapshots.
  • sysctl -n hw.physmem: Check RAM.
  • isi devices -a smartfail -d 1:bay6 / isi devices -a stopfail -d 1:bay6 (stopfail is not normally appropriate)
  • isi devices -a add -d 12:10: Use new disk in node 12, bay 10.
  • date; i=0; while [ $i -lt 36 ]; do isi statistics query --nodes=1-4 --stats=node.disk.xfers.rate.$i; i=$((i+1)); done # Report disk IOPS(?) for all disks in nodes 1-4 -- 85-120 is apparently normal for SATA drives.
  • isi networks modify pool --name $NETWORK:$POOL --sc-suspend-node $NODE: Prevent $POOL from offering $NODE for new connections, without interfering with active connections. --sc-resume-node to undo.
  • isi_lcd_d restart: Reset LEDs.
  • isi smb config global modify --access-based-share-enum=true: Restrict SMB shares to authorized users (global version); isi smb config global list | grep access-based: verify (KB #2837)
  • ifa isi devices | grep -v HEALTHY: Find problem drives.
  • isi quota create --path=$PATH --directory --snaps=yes --include-overhead --accounting
  • cd /ifs; touch LINTEST; isi get -DD LINTEST | grep LIN; rm LINTEST: Find the current maximum LIN.