Update 2010/05/06: Apparently I was wrong. ext3 uses 32-bit block numbers from 0..4,294,967,295. With 4kbyte blocks (the maximum on i386 & x86_64 systems) this gives a maximum ext3 filesystem of (2^32-1) * 4096 = 17,592,186,040,320 bytes. Using LVM with 4096kbyte physical extents, this means ext3 filesystems must be under 4,194,304 PEs. So use lvcreate --extents 4194303. 4,194,303 4096kbyte physical extents = 4,294,966,272 4kbyte blocks = 17,592,181,850,112 bytes in the resulting filesystem.
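The update's arithmetic can be double-checked with plain shell arithmetic (nothing here touches a disk; the 4096kbyte PE size is LVM's default 4 MiB):

```shell
# ext3 block numbers are 32-bit: blocks 0..2^32-1, here with 4 KiB blocks.
max_blocks=4294967295
max_bytes=$(( max_blocks * 4096 ))            # 17,592,186,040,320 bytes
echo "max ext3 bytes: $max_bytes"

# Each 4096 KiB physical extent holds 1024 4 KiB blocks, so the block count
# hits 2^32 at exactly 4,194,304 PEs -- the LV must stay below that.
blocks_per_pe=$(( 4096 / 4 ))                 # 1024
safe_pe=$(( (max_blocks + 1) / blocks_per_pe - 1 ))   # 4194303
echo "lvcreate --extents $safe_pe"
echo "blocks: $(( safe_pe * blocks_per_pe ))"         # 4,294,966,272
echo "bytes:  $(( safe_pe * blocks_per_pe * 4096 ))"  # 17,592,181,850,112
```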

Update 2009/03/10: It looks like mke2fs is smart enough to select the 4k block size automatically, and -Tlargefile4 is not necessary (which is good, as it was interfering with our backups).


We compared performance between 10-disk and 20-disk RAID 6 sets on a SATABeast and found the difference insignificant, so we chose the most space-efficient reasonable layout: 2 20-disk RAID 6 sets, each containing a single volume of the same size. These appear to the Linux host as two 16.37tibyte LUNs. We're using device-mapper multipathing to provide fault tolerance across both FC paths (in Nexsan's recommended "All Paths All LUNs" mode, each LUN is available via both controllers). All of this (except the performance testing) is handled via the SATABeast administration interfaces.
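For reference, the host-side dm-multipath setup needs very little configuration in this mode; a minimal /etc/multipath.conf sketch (illustrative only -- the blacklist entry and any Nexsan-specific device settings will vary by host):

```
# /etc/multipath.conf -- minimal illustrative sketch, not our exact config
defaults {
        user_friendly_names yes   # present LUNs as /dev/mapper/mpathN
}
blacklist {
        devnode "^sda$"           # keep the internal system disk out of multipath
}
```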

Within Linux, from each LUN we create 2 LVM logical volumes of just under 8tibyte (the largest ext3 can handle), and a third from the leftover ~390gibyte (384gibyte after filesystem overhead).
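The split falls out of the extent counts vgdisplay reports for one LUN (4,292,125 total PEs at the default 4 MiB PE size); plain shell arithmetic:

```shell
total_pe=4292125                     # Total PE from vgdisplay for one LUN
lv_pe=2096128                        # extents per ~8tibyte logical volume
leftover=$(( total_pe - 2 * lv_pe ))
echo "leftover extents: $leftover"                 # 99869
echo "leftover: $(( leftover * 4 / 1024 )) GiB"    # ~390 GiB before fs overhead
echo "blocks per big LV: $(( lv_pe * 1024 ))"      # 2146435072 4 KiB blocks
```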

The SATABeast lets the host see and use the volumes while it's still generating parity on the underlying RAID arrays ("Online Creation"), but creating file systems is much slower during this process.

A fully configured SATABeast contains 42 1,000,137,687,040-byte ("1 terabyte") drives. They reserve 2 for spares, so we have 40 disks to work with. Nexsan suggests 4 10-disk RAID sets, but RAID 6 allocates 2 disks per RAID set to parity, so with 4 10-disk sets we would 'waste' 10 disks (8 parity + 2 spares), and our usable space would be 4 volumes of 8 usable disks each: 8 * 1,000,137,687,040 bytes = 8,001,101,496,320 bytes per set * 4 RAID sets / 1024^4 = 29tibyte usable. 29tibyte is a lot, but only 70% of the specified "42tbyte", so we'd really like to be more space efficient -- with our 2 20-disk RAID 6 sets we give up only the 2 hot spares and 4 parity disks, leaving 36 usable disks.
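The layout comparison, worked through in shell arithmetic (RAID 6 loses 2 disks per set to parity):

```shell
disk=1000137687040                          # bytes per "1 terabyte" drive
tib=$(( 1024 * 1024 * 1024 * 1024 ))
# Nexsan's suggestion: 4 x 10-disk RAID 6 -> 8 data disks per set
suggested=$(( 4 * 8 * disk ))
# Our layout: 2 x 20-disk RAID 6 -> 18 data disks per set
ours=$(( 2 * 18 * disk ))
echo "4x10-disk sets: $(( suggested / tib )) TiB usable"   # 29
echo "2x20-disk sets: $(( ours / tib )) TiB usable"        # 32
```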

I set up /etc/fstab earlier, using ext3 labels.


How to create an ~8tibyte ext3 filesystem on a large multipath raw volume.

  1. pvcreate /dev/mapper/mpath3
  2. vgcreate satabeast1vg /dev/mapper/mpath3
  3. vgdisplay # Note number of usable extents (PE).
  4. lvcreate -n satabeast1lv --extents 2096128 satabeast1vg # Use the number from above, or at most 2096128 to stay just under the largest valid ext3 volume size.
  5. mkfs.ext3 -L/satabeast1 -Tlargefile4 -b4096 /dev/satabeast1vg/satabeast1lv # If the SATABeast is still calculating parity, this takes a while. Go get some food...
  6. vi /etc/fstab ## I use something like LABEL=/satabeast1 /satabeast1 ext3 defaults 0 0
  7. mount /dev/satabeast1vg/satabeast1lv /satabeast1
  8. df -h /satabeast1

Here's my transcript of creating an LVM volume group with two not-quite-8tibyte filesystems and one 384gibyte filesystem on a 16.37tibyte LUN.

[root@norimaki device-mapper-multipath-0.4.7]# pvcreate /dev/mapper/mpath3 
  Physical volume "/dev/mapper/mpath3" successfully created
[root@norimaki device-mapper-multipath-0.4.7]# vgcreate noribeast0vg /dev/mapper/mpath3 
  Volume group "noribeast0vg" successfully created
[root@norimaki device-mapper-multipath-0.4.7]# lvcreate -n noribeast0lv --extents 2096128 noribeast0vg
  Logical volume "noribeast0lv" created
  Logical volume "noribeast0lv" successfully removed
[root@norimaki device-mapper-multipath-0.4.7]# lvcreate -n noribeast0a --extents 2096128 noribeast0vg
  Logical volume "noribeast0a" created
[root@norimaki device-mapper-multipath-0.4.7]# lvcreate -n noribeast0b --extents 2096128 noribeast0vg
  Logical volume "noribeast0b" created
[root@norimaki device-mapper-multipath-0.4.7]# vgdisplay
  --- Volume group ---
  VG Name               noribeast0vg
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  5
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               16.37 TB
  PE Size               4.00 MB
  Total PE              4292125
  Alloc PE / Size       4192256 / 15.99 TB
  Free  PE / Size       99869 / 390.11 GB
  VG UUID               QLdntg-9ccY-0HYe-DnWI-Lxxu-Huzj-S3bExH

[root@norimaki device-mapper-multipath-0.4.7]# lvcreate -n noribeast0c --extents 99869 noribeast0vg
  Logical volume "noribeast0c" created
[root@norimaki device-mapper-multipath-0.4.7]# mkfs.ext3 -L/noribeast0a -Tlargefile -b4096 /dev/noribeast0vg/noribeast0a
mke2fs 1.39 (29-May-2006)
Filesystem label=/noribeast0a
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
8384512 inodes, 2146435072 blocks
107321753 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
65504 block groups
32768 blocks per group, 32768 fragments per group
128 inodes per group
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
    102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
[root@norimaki device-mapper-multipath-0.4.7]# time mkfs.ext3 -L/noribeast0b -Tlargefile -b4096 /dev/noribeast0vg/noribeast0b
mke2fs 1.39 (29-May-2006)
Filesystem label=/noribeast0b
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
8384512 inodes, 2146435072 blocks
107321753 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
65504 block groups
32768 blocks per group, 32768 fragments per group
128 inodes per group
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
    102400000, 214990848, 512000000, 550731776, 644972544, 1934917632

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

real    9m52.451s
user    0m0.657s
sys 0m3.660s
[root@norimaki device-mapper-multipath-0.4.7]# time mkfs.ext3 -L/noribeast0c /dev/noribeast0vg/noribeast0c 
mke2fs 1.39 (29-May-2006)
Filesystem label=/noribeast0c
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
51134464 inodes, 102265856 blocks
5113292 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3121 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 27 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

real    1m45.683s
user    0m0.172s
sys 0m10.150s
[root@norimaki device-mapper-multipath-0.4.7]# df -h |grep nori
/dev/mapper/noribeast0vg-noribeast0a
                      8.0T  175M  7.6T   1% /noribeast0a
/dev/mapper/noribeast0vg-noribeast0b
                      8.0T  175M  7.6T   1% /noribeast0b
/dev/mapper/noribeast0vg-noribeast0c
                      384G  195M  365G   1% /noribeast0c
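One follow-up: the mke2fs output above schedules a forced fsck every ~25 mounts or 180 days, and on an 8tibyte ext3 volume that check can take hours at boot. tune2fs can disable the periodic checks; here's the idea demonstrated on a small scratch image so nothing real is touched (whether to disable them on production volumes is a judgment call):

```shell
# Build a small ext3 image in /tmp -- no root or real device needed.
dd if=/dev/zero of=/tmp/ext3-demo.img bs=1M count=64 2>/dev/null
mke2fs -q -F -j -b 4096 /tmp/ext3-demo.img
# -c 0 disables the mount-count check, -i 0 the time-interval check.
tune2fs -c 0 -i 0 /tmp/ext3-demo.img
tune2fs -l /tmp/ext3-demo.img | grep -E 'Maximum mount count|Check interval'
```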