Sun is doing a road show to talk about their new Sun Storage 7000 Unified Storage Systems (Fishworks). It's interesting to see what's new and what isn't.

There are 3 main products, all based on existing Sun servers. They all (like the base models) include quad-core Opterons & 4 Gigabit Ethernet ports, and accommodate 10gbps Ethernet interfaces.

  • The 7110 is based on the X4240. $11k with 1 Opteron, 8gb RAM, & 14 usable 2.5" 146gb drives; 2U.
  • The 7210 is based on the X4540 "Thor", successor to the X4500 "Thumper". $118k with 2 Opterons, 64gb RAM, 44tb raw, & 2 Logzillas; 4U.
  • The 7410 storage controller is apparently based on the X4440. It's designed to attach to one or more J4400 JBoDs (4U & 24 hot-swap drives). Single controller: $79k with 2 Opterons, 64gb RAM, 11tb raw, 1 Readzilla, & 1 Logzilla; 2U. HA cluster (active/active or active/standby): $193k for 2 controllers (each 4 Opterons, 128gb, & 1 Readzilla) and a single shared J4400 (22tb raw & 2 Logzillas; 4U). Pricing is complicated.

Sun has done several things to differentiate the new 7000 series from their existing server models.

  • They all use the Fishworks custom build of OpenSolaris. This is obviously not the very latest release, due to Sun's testing and qualification cycle. I was surprised it's not Solaris 10. It's easier to get patches into OpenSolaris than Solaris proper; for example, the kernel CIFS server has been in OpenSolaris for a while, but will not make it into Solaris proper until Solaris 11.
  • Sun has integrated ZFS patches (and presumably non-ZFS patches as well); this is easier to do in or on top of OpenSolaris proper. These patches are all intended to reach Solaris eventually.
  • Sun adds the Fishworks web-based GUI. It handles all admin tasks, including installation (hopefully better than the Solaris 10U3 installer, which was too stupid to set up hot spares on a Thumper), patching, and configuration (including networking, ZFS, and sharing). The GUI is pretty extensive -- it handles link aggregation, LDAP/AD integration, DTrace analysis, fault isolation, etc.
  • All models reserve a pair of mirrored disks for the OS, configuration, and logs.
  • Although the X4540 supports chaining J4500 JBoDs for increased capacity, the 7210 does not. This is unfortunate, as the 7210 & J4500 are twice as dense (48 top-accessible 3.5" bays) as the J4400 (24 front-accessible 3.5" bays).
  • They can "phone home" to Sun for diagnostics; Sun can proactively send replacement components (drives), and can also detect a crashed host if it doesn't make the daily call.
  • The 7210 & 7410 offer Logzilla, and the 7410 includes Readzilla, which are not otherwise available.

The J4500 JBoD is basically a lobotomized Thor -- CPUs removed in favor of SAS ports. The price savings are small, but ZFS makes it easy to present all the disks as a single larger pool. If a single pool isn't a hard requirement, multiple X4540s provide better performance.

Readzilla & Logzilla are quite interesting. Readzilla is a 100gb 2.5" SSD (flash drive). It's intended to serve as cache in a 7410 controller, which has 6 bays available for Readzillas. Sun doesn't support normal hard drives in these bays, because that would interfere with failover; instead the 7410's internal drive bays are reserved for the OS and read cache.

Logzilla is a more exotic SSD. It's a 3.5" 18gb low-latency store for filesystem logs (journals): the ZIL (ZFS Intent Log). Logzilla combines DRAM (the working cache), flash memory (to store the data from DRAM in case of a failure), and a supercapacitor with enough juice to copy the data from DRAM to flash in an emergency.

Basically, when an application (particularly a database) writes data and needs to ensure it has been recorded, it instructs the operating system to flush the data to stable storage, to ensure that even in the event of a crash or power outage the data won't be lost. File systems do this too, to ensure that the metadata (directory structure) is valid -- it's not safe to create a file if its parent directory might not have been created/recorded, for instance. The problem is that disks are the main type of stable storage, and writing to disk takes significant time -- the data must be transferred from the CPU to the disks, and then the disks need to spin around and write the data in the right places. This is aggravated by RAID levels 2-6, which require extra disk reads and parity calculations. The application (user) ends up wasting time waiting for data to be stored safely on disk.
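This flush-to-stable-storage request is what a synchronous write looks like at the application level. A minimal sketch in Python (the filename is hypothetical): `os.fsync()` is the call that blocks until the OS confirms the data has reached stable storage, and it's exactly this wait that Logzilla is designed to shorten.

```python
import os

# Write a record and force it to stable storage before continuing.
# Without fsync(), the data could sit in OS buffers and be lost in
# a crash or power outage; with it, the application waits for the
# (slow) disks -- or a fast log device -- to acknowledge the write.
fd = os.open("journal.log", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"transaction record\n")
os.fsync(fd)  # blocks until the data is durably recorded
os.close(fd)
```

On a system with a Logzilla-style log device, that `fsync()` returns as soon as the data lands in the log's DRAM, rather than after the spinning disks have seeked and written.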

Storing data in a DRAM cache is much faster, but if the system crashes or power fails, data in DRAM is lost. So when an application requests a flush, Sun copies the data from DRAM in the 7410 controller to DRAM in the Logzilla, and the application continues. Even if the OS then crashes or power fails, Logzilla itself has enough intelligence to copy the data to flash. Since flash doesn't require power to retain data (just drain your iPod to confirm this), the data is available when the system is ready to read it and flush it to disk.

Our Sun presenter, Art, talked about wear leveling and a 5-year lifespan for Logzilla's flash, but I don't understand why this is a factor -- it seems like the flash should only be written to in case of emergency. Clearly I'm missing something.

So the architecture ends up slightly odd -- Readzilla cache is inside the 7410 controllers, while Logzilla cache is outside the controllers in the JBoDs. This is because all the data needs to be available to both controllers in a redundant configuration. If controller A gets data from a client, writes it to Logzilla, and then crashes, controller B can access the Logzilla and its data via the shared SAS fabric, so no data is lost -- just as it can access the 1tb disks. Internally, this is a zfs import operation, and Logzillas are just part of the pool. Readzilla doesn't have this constraint, though -- if controller A fails with data on its Readzillas, controller B can just fetch the data from the SATA disks. There's a performance hit as the cache refills, but no data loss. The design assumes that much more data is read from Readzillas via private SAS connections than is written to Logzillas via shared SAS connections -- a safe bet.

Right now, the X4540 looks more attractive to me. The 7210 price is considerably higher, I don't think we really need Logzillas/Readzillas, and 7210s do not support J4500s for extending the zpool. The 7410 is impressively engineered, but we don't need HA clustering and it takes up much more rack space than the considerably cheaper X4540. As you add J4400s, the density gradually approaches 50% of the X4540's. Sun's list price is $116k for a 7410 with 2 Logzillas, 1 Readzilla, & 34 disks in 8U -- compared to $62k for a 4U X4540 with 48 disks. No, I don't know why the single-controller 7410 comes with a 12-bay J4200, rather than the 24-bay J4400, but Sun doesn't sell J4200s for the 7000 series.

Through all of this, don't be taken in by Sun's (or any other vendor's) capacity numbers. A 1,000,000,000,000 byte "1tb" disk provides about 931gb of usable space, because operating systems use base 2: 10^12 / 2^30 = 931gb (10^12 / 2^40 = .909tb). Even worse, some of those disks are needed for parity and hot spares, so the realistic capacity of a Thor with RAIDZ is in the 30-35tb range -- less under RAIDZ2 -- and each Logzilla or hot spare subtracts from the usable space.

Sun has a handy table of usable space for the 7110 & 7210, but note that it ignores the base 10 vs. base 2 differential, so remember that those "1tb SATA" drives are really .9tb. Unfortunately, to calculate sizes for a 7410 you need a program that's part of the Fishworks installation (details on that page).

Fishworks is a very cool project, though, and it's clearly driving ZFS, (Open)Solaris, and the industry to advance. There is a cool video of Fishworks development.

Tangential irony: Sun offers VirtualBox as a free virtualization system, but the Sun Unified Storage Simulator is a VMware VM. It provides the full software stack, so you can run through the installation procedure and set up shares (and run the 7410 capacity planner), but the storage is VMware virtual volumes rather than real disks. Clever, but why isn't this available as a VirtualBox image too?? Perhaps because VirtualBox only supports 3 disk devices -- fix it, guys!

Update: As of 2009/02/02, Sun offers a VirtualBox image, but for some reason it's 1,136mb instead of 418mb. Now it's a VB issue to make their images more efficient, rather than the Fishworks team's task to provide a VB image. I just found a nice overview from the launch.

Update: As of 2009/02/04, Sun's VirtualBox image is format v1.5, which requires conversion to format v1.6 to run under VB 2.1.2, released a couple weeks ago. The included 'install' script wasn't executable, and when I ran it, it complained 16 times while creating the VM image. The conversion didn't work right either: I had to manually reattach the 16 virtual SATA disks. On the other hand, this demonstrates that VB can indeed use more than 3 virtual hard disks -- well done, VB team.