You don't need zfs resize ... and a workaround when you need one ;)

Okay, the title is provocative, and of course ZFS could use a tool to resize a pool, but for other reasons than you might think. Somehow some people believe that you can't resize a ZFS filesystem at all. You can't shrink a zpool, but you can increase its size during operation without much hassle just by doing the obvious. I've heard a few comments at customer sites that this feature is somehow missing in Solaris. But whenever you ask a little more about the actual need, you will find out that only a few requests are about reducing the size of the disks; most are about increasing the size of an existing RAIDZ pool. And often there is a bit of a lack of knowledge: in the light of a missing resize command, many people think that the only way is a RAID0 of RAIDZ1/2/3 vdevs. But you have to think more Mac-like about this problem. Free your mind, think about what you are really doing, and most of the time the obvious way is the correct way with ZFS. While writing this article I even had an idea how to fake a pool shrink/restructuring feature, because that is the feature which is really missing: reducing the number of vdevs or making them smaller. It is a hack, but it looks like it works reasonably well. Nevertheless it has some caveats.

A warning

As usual, a warning when you work on the way your data is stored on disks: don't try this with your production data (or production-equivalent data, like the favorite recordings of TV shows your significant other made with the digital video recorder). Having a working backup of your data is a best practice whenever you make vast changes to the structure in which your data is stored on rotating rust.

Increasing the zpool size by doing the obvious

What do you do when you want to increase the size of a filesystem? You swap one disk after the other. Sounds obvious. Let's do this with ZFS. At first I will create some files to use as demo disks, starting with our set of small disks:

# mkdir /testfiles
# mkfile 128m /testfiles/smalldisk1
# mkfile 128m /testfiles/smalldisk2
# mkfile 128m /testfiles/smalldisk3
# mkfile 128m /testfiles/smalldisk4
# mkfile 128m /testfiles/smalldisk5

Now we create a RAIDZ from them:

# zpool create testpool_resizing \
 raidz /testfiles/smalldisk1 /testfiles/smalldisk2 /testfiles/smalldisk3 \
 /testfiles/smalldisk4 /testfiles/smalldisk5

Okay ... we use the disks for a while ... our disks are getting full (not that hard at 128 megabytes) and we want to increase the size of our pool. The doorbell rings, and the postman hands us the new set of disks:

# mkfile 256m /testfiles/bigdisk1
# mkfile 256m /testfiles/bigdisk2
# mkfile 256m /testfiles/bigdisk3
# mkfile 256m /testfiles/bigdisk4
# mkfile 256m /testfiles/bigdisk5

In preparation for our task we set a property on our pool, in case we didn't do that earlier:

# zpool set autoexpand=on testpool_resizing
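
If you aren't sure whether the property is already set on an existing pool, you can simply check it first:

# zpool get autoexpand testpool_resizing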

Now we replace all the disks in the pool with the bigger ones, one at a time. It's important that you wait until each disk has completed its resilvering before you replace the next one. You can check this via zpool status.
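
A hedged sketch: instead of checking manually after each replace, you could block in a small loop until the resilver is done (the exact wording in the zpool status output differs between ZFS versions, so treat the grep pattern as an assumption):

while zpool status testpool_resizing | grep "resilver in progress" > /dev/null
do
        sleep 30
done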

# zpool replace testpool_resizing  /testfiles/smalldisk1  /testfiles/bigdisk1
# zpool list testpool_resizing
NAME                SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool_resizing   616M   306K   616M     0%  1.00x  ONLINE  -
# zpool replace testpool_resizing  /testfiles/smalldisk2  /testfiles/bigdisk2
# zpool list testpool_resizing
NAME                SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool_resizing   616M   267K   616M     0%  1.00x  ONLINE  -
# zpool replace testpool_resizing  /testfiles/smalldisk3  /testfiles/bigdisk3
# zpool list testpool_resizing
NAME                SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool_resizing   616M   270K   616M     0%  1.00x  ONLINE  -
# zpool replace testpool_resizing  /testfiles/smalldisk4  /testfiles/bigdisk4
# zpool list testpool_resizing
NAME                SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool_resizing   616M   267K   616M     0%  1.00x  ONLINE  -
# zpool replace testpool_resizing  /testfiles/smalldisk5  /testfiles/bigdisk5

Once the last disk has finished resilvering, we check the pool size again:

# zpool list testpool_resizing
NAME                SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool_resizing  1,23G   294K  1,23G     0%  1.00x  ONLINE  -

Tada … the size jumped to the new size. It’s really that easy.

Coexistence

Of course there are use cases for a real resize, for example when you want to make the pool smaller or when you want to reduce the number of vdevs. Or you just want to get the data off the expensive disks, but aren't allowed to simply delete it because you need it again later. However, I see a number of workarounds. At first: file-based vdevs and physical vdevs can coexist in the same pool. Thus it's feasible to migrate your pool onto files located in another pool or even in a UFS filesystem. It's similar to the stuff above, so just a short demonstration without much explanation. I'm simply substituting the virtual devices on ramdisks with virtual devices in files:

# ramdiskadm -a testramdisk 128m
/dev/ramdisk/testramdisk
# ramdiskadm -a testramdisk2 128m
/dev/ramdisk/testramdisk2
# ramdiskadm -a testramdisk3 128m
/dev/ramdisk/testramdisk3
# ramdiskadm -a testramdisk4 128m
/dev/ramdisk/testramdisk4
# ramdiskadm -a testramdisk5 128m
/dev/ramdisk/testramdisk5
# zpool create migrationtest raidz /dev/ramdisk/testramdisk \
 /dev/ramdisk/testramdisk2 /dev/ramdisk/testramdisk3 \
 /dev/ramdisk/testramdisk4 /dev/ramdisk/testramdisk5
# cp /etc/squid/* /migrationtest/

# zpool status migrationtest
  pool: migrationtest
 state: ONLINE
 scrub: none requested
config:

        NAME                           STATE     READ WRITE CKSUM
        migrationtest                  ONLINE       0     0     0
          raidz1-0                     ONLINE       0     0     0
            /dev/ramdisk/testramdisk   ONLINE       0     0     0
            /dev/ramdisk/testramdisk2  ONLINE       0     0     0
            /dev/ramdisk/testramdisk3  ONLINE       0     0     0
            /dev/ramdisk/testramdisk4  ONLINE       0     0     0
            /dev/ramdisk/testramdisk5  ONLINE       0     0     0

errors: No known data errors

# mkfile 128m /testfiles/rdisk2filemigration1
# mkfile 128m /testfiles/rdisk2filemigration2
# mkfile 128m /testfiles/rdisk2filemigration3
# mkfile 128m /testfiles/rdisk2filemigration4
# mkfile 128m /testfiles/rdisk2filemigration5
# zpool replace migrationtest \
 /dev/ramdisk/testramdisk /testfiles/rdisk2filemigration1
# zpool replace migrationtest \
 /dev/ramdisk/testramdisk2 /testfiles/rdisk2filemigration2
# zpool replace migrationtest \
 /dev/ramdisk/testramdisk3 /testfiles/rdisk2filemigration3
# zpool replace migrationtest \
 /dev/ramdisk/testramdisk4 /testfiles/rdisk2filemigration4
# zpool replace migrationtest \
 /dev/ramdisk/testramdisk5 /testfiles/rdisk2filemigration5
# zpool status migrationtest   
  pool: migrationtest
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Mon Dec 21 15:42:10 2009
config:

        NAME                                 STATE     READ WRITE CKSUM
        migrationtest                        ONLINE       0     0     0
          raidz1-0                           ONLINE       0     0     0
            /testfiles/rdisk2filemigration1  ONLINE       0     0     0
            /testfiles/rdisk2filemigration2  ONLINE       0     0     0
            /testfiles/rdisk2filemigration3  ONLINE       0     0     0
            /testfiles/rdisk2filemigration4  ONLINE       0     0     0
            /testfiles/rdisk2filemigration5  ONLINE       0     0     0  165K resilvered

errors: No known data errors
# ls /migrationtest
cachemgr.conf  mime.conf  mime.conf.default  msntauth.conf  msntauth.conf.default  
squid.conf  squid.conf.default

Okay, that works too. If you just want to get your data off some expensive disks in a RAIDZ (for example six 73 GB 15,000 rpm disks), you could put the files on a filesystem residing on some cheap 1 TB disks in a RAID1 configuration and stop here. But there are cases where you want to change the structure or the size of your zpool.
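
A minimal sketch of that variant, assuming a hypothetical expensive pool expensivepool on disks c1t0d0 and following, plus two hypothetical cheap 1 TB disks c2t0d0 and c2t1d0 (all names are made up for the example):

# zpool create cheappool mirror c2t0d0 c2t1d0
# zfs create cheappool/interim
# mkfile 73g /cheappool/interim/expensive1
# zpool replace expensivepool c1t0d0 /cheappool/interim/expensive1

You would repeat the mkfile and zpool replace for the remaining disks of the RAIDZ, waiting for each resilver to finish, and afterwards the expensive disks can be pulled.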

A step further: Resizing/restructuring

While writing this article I wondered whether it's possible to drive this concept a little bit further, but still in an obvious way. At first I made the problem a little more complex by adding a filesystem to the pool:

# zfs create migrationtest/filesystem
# cp /etc/mail/* /migrationtest/filesystem
# ls
aliases     helpfile          mailx.rc  sendmail.cf  submit.cf      trusted-users
aliases.db  local-host-names  main.cf   sendmail.hf  subsidiary.cf

To have some data in it, I copied some files into it. As we freed the ramdisks a few moments ago, I will use them for something different. But I need a sixth ramdisk for that:

# ramdiskadm -a testramdisk6 128m
/dev/ramdisk/testramdisk6

Now I use the six ramdisks to create a stripe of three mirrors:

# zpool create migrationtest_target \
mirror  /dev/ramdisk/testramdisk  /dev/ramdisk/testramdisk2 \
mirror  /dev/ramdisk/testramdisk3  /dev/ramdisk/testramdisk4  \
mirror /dev/ramdisk/testramdisk5  /dev/ramdisk/testramdisk6

Okay, now I snapshot all the datasets in the pool migrationtest recursively and move them over to the pool migrationtest_target I just created:

# zfs snapshot -r migrationtest@migrationrun
# zfs send -R  migrationtest@migrationrun | zfs receive -F -d migrationtest_target
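
With a big pool on real storage you wouldn't want to do this as a single send inside a downtime window. A hedged sketch of the incremental variant (the snapshot names initial and final are made up for the example): first a full send while the application keeps running,

# zfs snapshot -r migrationtest@initial
# zfs send -R migrationtest@initial | zfs receive -F -d migrationtest_target

and later, inside the short downtime window, you stop the application, take a final snapshot and send only the delta:

# zfs snapshot -r migrationtest@final
# zfs send -R -i @initial migrationtest@final | zfs receive -F -d migrationtest_target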

Of course with real storage a single full send would take quite some time. That is why you would do it incrementally, as sketched above: take a number of snapshots, send them while the application keeps running, and only take the interruption when the final incremental send/receive needs just a few seconds. Either way it's important to know that we need a short application downtime, starting just before taking the last snapshot (or, in our example, the only snapshot) and ending with the successful renaming of the pools, to ensure that our data is consistent from the application's perspective. Okay, let's have a short look into the datasets of the pool migrationtest_target:

jmoekamp@hivemind:~/zfstest# cd /migrationtest_target/
jmoekamp@hivemind:/migrationtest_target# ls
cachemgr.conf  mime.conf          msntauth.conf          squid.conf
filesystem     mime.conf.default  msntauth.conf.default  squid.conf.default
jmoekamp@hivemind:/migrationtest_target# cd filesystem/
jmoekamp@hivemind:/migrationtest_target/filesystem# ls
aliases     helpfile          mailx.rc  sendmail.cf  submit.cf      trusted-users
aliases.db  local-host-names  main.cf   sendmail.hf  subsidiary.cf

Okay ... that looks similar to the stuff we copied over into the respective filesystems of the pool migrationtest. Of course this work isn't complete until we rename the pools. At first we export both pools:

# zpool export migrationtest
# zpool export migrationtest_target 

Now we have to reimport them. As we used some strange devices, we have to give ZFS a hint where it should search for them. At first we reimport our pool migrationtest as migrationtest_source:

jmoekamp@hivemind:/# zpool import -d /testfiles
  pool: migrationtest
    id: 10800479326844677517
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        migrationtest                        ONLINE
          raidz1-0                           ONLINE
            /testfiles/rdisk2filemigration1  ONLINE
            /testfiles/rdisk2filemigration2  ONLINE
            /testfiles/rdisk2filemigration3  ONLINE
            /testfiles/rdisk2filemigration4  ONLINE
            /testfiles/rdisk2filemigration5  ONLINE
jmoekamp@hivemind:/# zpool import -d /testfiles migrationtest migrationtest_source

Now we reimport our pool migrationtest_target as our new pool migrationtest:

jmoekamp@hivemind:/# zpool import -d /dev/ramdisk
  pool: migrationtest_target
    id: 6998046081983056495
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        migrationtest_target           ONLINE
          mirror-0                     ONLINE
            /dev/ramdisk/testramdisk   ONLINE
            /dev/ramdisk/testramdisk2  ONLINE
          mirror-1                     ONLINE
            /dev/ramdisk/testramdisk3  ONLINE
            /dev/ramdisk/testramdisk4  ONLINE
          mirror-2                     ONLINE
            /dev/ramdisk/testramdisk5  ONLINE
            /dev/ramdisk/testramdisk6  ONLINE
jmoekamp@hivemind:/# zpool import -d /dev/ramdisk migrationtest_target migrationtest
jmoekamp@hivemind:/#

We just migrated our pool migrationtest from a RAIDZ to a stripe of mirrors (RAID1+0).

jmoekamp@hivemind:/# zpool  status migrationtest
  pool: migrationtest
 state: ONLINE
 scrub: none requested
config:

        NAME                           STATE     READ WRITE CKSUM
        migrationtest                  ONLINE       0     0     0
          mirror-0                     ONLINE       0     0     0
            /dev/ramdisk/testramdisk   ONLINE       0     0     0
            /dev/ramdisk/testramdisk2  ONLINE       0     0     0
          mirror-1                     ONLINE       0     0     0
            /dev/ramdisk/testramdisk3  ONLINE       0     0     0
            /dev/ramdisk/testramdisk4  ONLINE       0     0     0
          mirror-2                     ONLINE       0     0     0
            /dev/ramdisk/testramdisk5  ONLINE       0     0     0
            /dev/ramdisk/testramdisk6  ONLINE       0     0     0

errors: No known data errors

And when we look into the filesystem we see all our files:

jmoekamp@hivemind:~$ cd /migrationtest
jmoekamp@hivemind:/migrationtest$ ls
cachemgr.conf  mime.conf	  msntauth.conf		 squid.conf
filesystem     mime.conf.default  msntauth.conf.default  squid.conf.default
jmoekamp@hivemind:/migrationtest$ cd filesystem/
# ls
aliases     local-host-names  sendmail.cf  subsidiary.cf
aliases.db  mailx.rc	      sendmail.hf  trusted-users
helpfile    main.cf	      submit.cf
# md5sum /migrationtest/filesystem/aliases.db 
48d4b1143f62a8c9a75125a717c3c824  /migrationtest/filesystem/aliases.db
# md5sum /migrationtest_source/filesystem/aliases.db 
48d4b1143f62a8c9a75125a717c3c824  /migrationtest_source/filesystem/aliases.db

Looks good. Same directory structure, same content. And due to the checksumming capabilities of ZFS, all these steps are protected against bit rot while transmitting the data.
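
If you want an additional end-to-end check after the migration, a scrub of the new pool re-reads every block and verifies it against its checksum:

# zpool scrub migrationtest
# zpool status migrationtest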

Caveats

Of course this idea has two caveats: at first you need the storage for the interim files, but you could use any filesystem that's available on the system. And additionally you need a short downtime for transmitting the last incremental snapshot and swapping the names of the pools.
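
To get a rough idea beforehand how much interim space you need, the allocated space of the source pool is a reasonable first indicator (standard zpool/zfs reporting, nothing exotic):

# zpool list migrationtest
# zfs list -r -o name,used migrationtest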

Conclusion

Resizing and restructuring a zpool is possible with minimal service interruption. Even the holy grail of reducing the number of vdevs or moving to smaller vdevs is possible with a fairly minimal downtime. However, I should point out that most customers want the stuff I described in the first part about the autoexpand feature, and not the hack I described in the second part. I don't know where the idea was born that you can't increase the size of a vdev ... :)