The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Wednesday, December 2. 2009
I could summarize this article in a short sentence: Don't use
The more interesting question at this point is "Do your scripts interpret the numbers correctly?". I know there are countless scripts at customers out there monitoring disk space by using
Magical harddisk enlargement
Let's just assume you have a ZFS system with deduplication:
I've created a ZFS pool backed by files, just to demonstrate the effect. At first we have an empty ZFS filesystem.
Now we copy a file into our newly created filesystem. I've chosen the
We've copied a single file into the pool. Our disk is 219136 KByte in size. The file takes roughly 2201 KByte. We still have 216860 KByte available. Looks consistent. Okay ... let's copy the file
Deduplication just works as designed. The second file didn't take roughly 2.2 MBytes, it took just 6 KByte. The data was deduplicated by ZFS. But look at the
But that's not unreasonable either. Let's assume we have a disk with three blocks. I write a single block to it. I have 2 blocks left. Now I write the same block to the disk again. A normal filesystem would use another block, so you would have 1 block left and two blocks on disk. A deduplicating filesystem holds two blocks of data as well, but it still has two free blocks. The only way to make sense of this situation is to increase the size of the filesystem, thus showing the filesystem with a total size of 4 blocks.
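To make the three-block example concrete, here is a minimal sketch in Python. The function and its parameter names are made up for illustration; it only models the bookkeeping described above and is not how any real tool is implemented.

```python
def df_view(pool_blocks, logical_blocks, unique_blocks):
    """Return (total, used, free) as a df-like tool would have to report them.

    pool_blocks    -- physical blocks in the pool
    logical_blocks -- blocks as the files see them (duplicates counted each time)
    unique_blocks  -- blocks actually allocated after deduplication
    All three names are hypothetical and exist only for this sketch.
    """
    free = pool_blocks - unique_blocks  # space really left in the pool
    used = logical_blocks               # what the files appear to occupy
    total = used + free                 # the only total consistent with used and free
    return total, used, free


# Three-block disk, the same block written twice, with deduplication:
print(df_view(pool_blocks=3, logical_blocks=2, unique_blocks=1))  # (4, 2, 2)

# The same writes on a normal filesystem (every block is stored):
print(df_view(pool_blocks=3, logical_blocks=2, unique_blocks=2))  # (3, 2, 1)
```

The reported total grows only in the deduplicated case, because used and free no longer add up to the physical size.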
That has a certain impact on the monitoring of the system. So using the
This tool correctly shows the size, the allocated space and the free space.
Misguiding percentages
I didn't talk about the
When you look at the capacity column you would assume that your pool is 51% filled. But we just copied the same file again and again. Was there a problem with the deduplication?
No, of course not. When you look at the
But: instead of a 219136 KByte pool we now have a 438912 KByte pool. So we neither have an almost empty disk, as the block allocation by ZFS would suggest, nor a full disk, as a short calculation of the original size minus the size of the files copied into the filesystem would suggest (or, to be exact, a disk with a negative number of available blocks, as 219136 KBytes minus 222297 KBytes is -3161 KBytes). Empty would be totally wrong from the filesystem perspective. So adding the
The 100 files just took 2.62 M after deduplication. The size of the pool is still the real one and the capacity calculation is more reasonable. Now let's push this situation to the extreme. I've restarted the script to generate 1000 copies of the
Well ... 2401486 KBytes in a pool that was 219136 KBytes initially. Due to the deduplication we've used just 11807 KBytes for this amount of data. An additional 11 MByte in a pool of 246 MBytes in size gives you a capacity usage of 93% instead of 51%, due to the way these numbers are calculated. Again
Digging in the source
But why does this system behave this way? The reason is in the source code of
When you look at line 1271 you will recognize how this
You can also query those properties manually
So the total size is 209080832 plus 2459780608 = 2668861440 bytes, or 2606310 KBytes. Now remember the last
Now look at the line 1441 of
Let's compute this manually -
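A rough sketch of that arithmetic in Python, using the two values quoted above. Two things here are my assumptions and not taken from the source code: that the smaller value is the available space and the larger one the used space, and that the percentage is used divided by used plus available, rounded up to a whole number.

```python
# Values quoted above, in bytes. Which one is "used" and which is
# "available" is an assumption made for this sketch.
avail_bytes = 209080832
used_bytes = 2459780608

# Total as reported: used + available.
total_bytes = used_bytes + avail_bytes
total_kbytes = total_bytes // 1024

# Integer percentage, rounded up (assumed rounding behaviour).
capacity = (used_bytes * 100 + total_bytes - 1) // total_bytes

print(total_bytes)   # 2668861440
print(total_kbytes)  # 2606310
print(capacity)      # 93
```

Under those assumptions the result matches the 93% capacity mentioned earlier.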
Conclusion
I hope I was able to give you an example of why you shouldn't use
I know - some people would say "Modify df!". But can you really do this?
By the way, you could say this is an absolutely unrealistic use case. Really? Let's say you store Windows desktop images on your fileserver for use with your favourite virtualisation tool. You have a thousand desktops and all desktops are fairly similar (all use Windows 7, for example). Then you have a vast amount of duplicates, which would be deduplicated by ZFS. It's pretty much the same as with the
I'm curious, what happens if you "fill" the disk with copies of wireshark ? (I.e. get some capacity percentage >100)
Hopefully there aren't any tools that depend on statvfs to decide whether to copy or not?
#1 Soren on 2009-12-02 16:51
Why would the capacity percentage exceed 100%?
The "total" column in df grows as the "used" column grows. In the above case, each additional copy of wireshark consumes only around 3K of space due to dedup, so even if you eventually fill the disk with a million copies of wireshark, the "total" column grows at the same time. The capacity percentage shown by df will therefore be stuck at 100% and won't exceed it.
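The bound can be illustrated with a small Python sketch. It assumes that df derives the percentage as used / (used + available), rounded up, and it keeps the available space fixed at the 216860 KByte from the article while ignoring the few KByte each deduplicated copy really costs; `df_capacity` is a made-up helper, not part of any tool.

```python
def df_capacity(used_kb, avail_kb):
    """Capacity in percent, rounded up, with total = used + available.

    Hypothetical helper: the formula is an assumption about how a
    df-like tool arrives at its percentage column.
    """
    total_kb = used_kb + avail_kb
    if total_kb == 0:
        return 0
    return (used_kb * 100 + total_kb - 1) // total_kb


FILE_KB = 2201     # size of one copy, taken from the article
AVAIL_KB = 216860  # available space, held constant as a simplification

for copies in (1, 100, 1000, 1_000_000):
    print(copies, df_capacity(copies * FILE_KB, AVAIL_KB))
```

Because used can never be larger than used + available, the printed percentage approaches 100 but cannot pass it.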
#1.1 kevin on 2009-12-02 17:25
Joerg, one of your cut&pastes is wrong for the Wireshark cp, copied over the same filename destination twice!
Otherwise nice followup to the discussions …
#2 Craig on 2009-12-02 17:24
So, this should be true for all deduped filesystems, not a problem of ZFS. Or am I wrong ?
#3 Gerd on 2009-12-02 20:01
I have no Data Domain or NetApp device available, so I can't check it, but it should be pretty much the same with other deduping filesystems. But as ZFS is the first dedupe-for-the-masses implementation, most users will encounter this challenge with ZFS first.
Another aspect is that the "total" column does not always go up. It can come down as well when more zfs are created in the same pool, since "df -k" calculates its numbers per individual filesystem.
The points are:
(1) in "df -k" output, total=used+available;
(2) in "zpool list" output, size=alloc+free;
(3) The "size" in "zpool list" should remain unchanged (unless zpool add...);
(4) The "used" in df and "alloc" in zpool will be different due to dedup=on; if there is only one zfs in zpool, then alloc < used;
(5) yet, if more zfs are created in the same pool, the "used" column of each zfs could eventually be less than "alloc" in the "zpool list" output (dedup works at pool level, so it is only guaranteed that "alloc" in "zpool list" is less than the combined "used" column of all the zfs in the same pool; for an individual zfs, the "used" column in df -k could be more than the "alloc" in the "zpool list" output).
As an example, here are some output on a
#### when there is only one zfs holding data, the total column for testdedup/pdfatt is 101G:
testdedup 97G 24K 97G 1% /testdedup
testdedup/pdfatt 101G 4.3G 97G 5% /testdedup/pdfatt
testdedup/vboximage 97G 21K 97G 1% /testdedup/vboximage
# zpool list testdedup
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testdedup 99.5G 1.39G 98.1G 1% 3.23x ONLINE -
#### After copying some data to another zfs, testdedup/vboximage, the total column for testdedup/pdfatt became 97G:
testdedup 93G 24K 93G 1% /testdedup
testdedup/pdfatt 97G 4.3G 93G 5% /testdedup/pdfatt
testdedup/vboximage 97G 4.7G 93G 5% /testdedup/vboximage
# zpool list testdedup
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
testdedup 99.5G 5.61G 93.9G 5% 1.61x ONLINE -
#4 kevin on 2009-12-02 22:39
"df looks at the data from the filesystem perspective."
"zpool list looks from the pool perspective..."
Maybe they should look at the SA's perspective...
Great article as always
#5 Tom de on 2009-12-03 01:38
If you perform any more experiments, I would be interested to know what you have set for dedupditto. Did ZFS automatically create a second copy when you crossed the threshold (default 100 references)?
Also, it would be instructive to see how compression=on affects the reported values. It seems like the interaction with dedup ratio could further distort the user's understanding.
#6 Craig S. Bell on 2009-12-03 19:51
So I got a chance to try the snv_128a bits, and dedup works as advertised. This is all in my tiny Virtualbox guest. Good stuff! =-)
It looks like the dedupditto pool property is not entirely public yet. At least, it's not in the manpages, or the usage output of zpool get/set. But it does appear in libzfs.so.1.
I can explicitly run "zpool get dedupditto", and get a valid result. It looks like the default value is zero. Does that mean it is turned off, perhaps? I should try tuning this.
As for compression, this works rather like I expected -- the USEDDS space is shown as the compressed amount, multiplied by the number of deduped copies. Okay then, fair enough.
Where it gets confusing for me is how meaningful the dedupratio should be. If I copy in a file ten times, the ratio is 10.00x. This is true, regardless of whether the file is tiny or huge. But my pool has a real size.
What objective value does this ratio correlate with? Without knowing the size of just the particular data subject to dedup, can I really learn much about my pool from seeing the ratio out of context?
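A small Python sketch of why the ratio says little on its own. It uses a deliberately simplified model (one physical copy of the file, no metadata overhead, no compression), and `dedup_stats` is a hypothetical helper for illustration only.

```python
def dedup_stats(file_kb, copies):
    """Return (ratio, saved_kb) for `copies` identical files of `file_kb` each.

    Simplified model: exactly one physical copy is stored and metadata
    overhead is ignored. Both are assumptions of this sketch.
    """
    logical_kb = file_kb * copies  # what the files appear to occupy
    physical_kb = file_kb          # what is stored after deduplication
    return logical_kb / physical_kb, logical_kb - physical_kb


print(dedup_stats(4, 10))          # tiny file:  (10.0, 36)
print(dedup_stats(4_000_000, 10))  # huge file:  (10.0, 36000000)
```

Both cases report 10.00x, yet the space actually saved differs by six orders of magnitude, so the ratio needs the amount of deduplicated data next to it to be meaningful.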
It looks like what I really have to watch is whether my pool available size dynamically grew, and compare real size vs. perceived size with dedup.
I think in the future it will be harder to know the answer of just how much space is available, unless the tools are further enhanced, as with zfs list -o space.
#6.1 Craig S. Bell on 2009-12-07 23:34
Update: "zpool set dedupditto=x" failed, it's a readonly property. Perhaps it will be enabled in a future release. Oh well.
#6.1.1 Craig S. Bell on 2009-12-08 21:49
Thanks. Interesting discussion.
I guess this will kill tools like JDiskReport as well. Or is there a way to find out how much space a directory uses after deduplication?
In build 129, I was able to set dedupditto. Minimum value is 100. As I crossed 100x dedupratio, indeed it dropped back to 50x indicating the second copy.
Further, as I crossed 10000x (100 x 100), copies appeared to go to 3 (3333x). I had to drop all the way back down to 99x before copies were removed, and it went from 3 directly back to 1.
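These observations fit a simple model, sketched here in Python. It is only a reading of the numbers above, not the ZFS implementation: the exact threshold comparison is an assumption, `observed_ratio` is a made-up name, and the delayed removal of extra copies described above is not modelled.

```python
def observed_ratio(references, dedupditto=100):
    """Return (copies, ratio) for a block referenced `references` times.

    Assumed model: a second physical copy is kept once the reference
    count reaches dedupditto, a third once it reaches dedupditto squared.
    """
    copies = 1
    if references >= dedupditto:
        copies = 2
    if references >= dedupditto ** 2:
        copies = 3
    return copies, references / copies


for refs in (99, 100, 10_000):
    copies, ratio = observed_ratio(refs)
    print(refs, copies, int(ratio))
```

With the minimum value of 100 this gives 50x at 100 references and 3333x at 10000 references, in line with the ratios reported above.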
Given the various issues with build 130, I think I'll wait before conducting any more testing. Hopefully they can also clear up the memory / performance issues that make importing a pool with dedup=on take so long. Thanks.
#8 Craig S. Bell on 2009-12-31 19:30
Deduplication and ZFS compression are two very useful features of ZFS. It's true that df hasn't kept up with the times. I think now is the time to extend df with some additional flags that make it better able to handle these extensions and let the sysadmin choose how he wants to see the information presented. How about flags for total file usage and actual space used on disk, for starters?
#9 Ian Ballantyne on 2010-09-22 11:31
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License