Perceived Risk

Humans have a design flaw: they don’t have a good sense for risk. Looked at from a more scientific perspective, the probability of dying in a terror attack is vastly lower than the probability of dying because you’ve committed suicide. With a realistic view of the risks, you should look in the mirror each day and observe yourself, not the other people in the aircraft, the subway or somewhere else. However, the fear-mongers managed to implant different feelings in us, and so we strip-search people who look different from us; we even strip-search everybody with technology like backscatter scanners. Why do I talk about this? Well … there is a technology in ZFS that allows you to work with probabilities to speed things up: the deduplication feature. You can use it in a mode that doesn’t verify bit for bit that a block is really identical to another one before just storing a pointer to the already stored block.
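To make the distinction concrete, here is a minimal Python sketch of that decision logic, assuming a SHA-256 checksum like ZFS uses for deduplication. The names (write_block, dedup_table, storage) are made up for illustration; this is not ZFS code, just the shape of the trade-off.

import hashlib

def write_block(data, dedup_table, storage, verify=False):
    """Sketch of a dedup write path: store a block only if no identical block exists."""
    checksum = hashlib.sha256(data).hexdigest()
    if checksum in dedup_table:
        if not verify:
            # checksum-only mode: trust the hash and save the read IOP
            return dedup_table[checksum]
        # verify mode: read the already stored block and compare it bit for bit
        existing = storage[dedup_table[checksum]]
        if existing == data:
            return dedup_table[checksum]
        # extremely unlikely hash collision: fall through and store a new copy
    block_id = len(storage)
    storage[block_id] = data
    dedup_table[checksum] = block_id
    return block_id

The only difference between the two modes is that extra read of the existing block before the pointer is written.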
In this non-verifying mode, the system decides solely on the basis of the checksums of both blocks. If the checksums of two blocks are equal, the blocks are considered equal and the system doesn’t compare them bit-wise, so you save a read IOP. Most people in IT know about hash collisions. Such collisions are in the nature of hashing: when you sort a large number of things into a smaller number of buckets, you will end up with more than one thing per bucket. In fact, a perfect hashing mechanism would distribute things evenly, with essentially the same number of things in each bucket. Put that way, it sounds totally unreasonable to rely on hashes alone to compare blocks. Many people bring up the birthday problem when you start to talk about hash-based deduplication. Wikipedia defines it as follows:

In probability theory, the birthday problem, or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of at least 23 randomly chosen people, there is more than 50% probability that some pair of them will have the same birthday. Such a result is counter-intuitive to many.
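The 50% figure for 23 people is easy to reproduce with the standard exact formula; a quick sketch:

def no_collision_probability(people, days=365):
    """Probability that `people` randomly chosen birthdays are all different."""
    p = 1.0
    for i in range(people):
        p *= (days - i) / days
    return p

print(1 - no_collision_probability(23))   # ~0.507, i.e. just over 50%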

What birthdays are to humans, hashes are to our blocks. Many people think of this high collision probability when they think about deduplication. But a birthday can only fall on 365 days, whereas a hash can fall into [tex]2^{256}[/tex] buckets. And there is an interesting table on that Wikipedia page as well: to see a hash collision with a probability of [tex]1 \times 10^{-18}[/tex], you have to hash [tex]4.8 \times 10^{29}[/tex] blocks. That is roughly [tex]2^{98}[/tex] blocks, translating to 316 octillion, 912 septillion, 650 sextillion, 57 quintillion, 57 quadrillion, 350 trillion, 374 billion, 175 million, 801 thousand and 344 blocks. Given that you are using 128 kB blocks, this gives you a storage capacity of [tex]4.056 \times 10^{10}[/tex] yottabytes. Just for comparison: that’s roughly [tex]3 \times 10^{15}[/tex] times the human knowledge (I love Wolfram Alpha for such comparisons).

Why did I choose [tex]1 \times 10^{-18}[/tex], and why is there a column for it on the Wikipedia page? Well … it’s the same probability with which you see an unrecoverable bit error from your favorite high-end hard disk. So you would have to store 316 octillion, 912 septillion, 650 sextillion, 57 quintillion, 57 quadrillion, 350 trillion, 374 billion, 175 million, 801 thousand and 344 blocks before the probability of a false-positive duplicate is roughly as high as the probability of getting an uncorrectable bit error from your hard disk. For real-world dataset sizes at customers, the probability of a false-positive deduplication doesn’t look like a problem you should worry about.

However, everyone I have talked to about deduplication still thinks that she or he needs this verify read, that it is unacceptable to do without it. But let’s face it: the probability of an undetected error in your ECC RAM is much higher, the probability of an undetected error while reading data from your disks is much higher. Even the probability of getting an undetected erroneous packet over the network is much higher. In the end, Ethernet frames are only protected by CRC32, and the checksum mechanism protecting the payload of IPv4 packets is even more prone to undetected errors than CRC32. Yet many people still rely on filesystems without checksums for their data. The same people don’t make IPsec mandatory to protect their data in transport (the sec in IPsec is not only about confidentiality, it’s about integrity, too). The same people use x86 servers without ECC everywhere. The same people work with TCP checksum offload, computing the checksum in the network card and thus leaving the packet unprotected on its way from the CPU to the NIC, and there are a lot of components between the two.

So … in the end it’s really strange that people have problems with a checksum-only deduplication, but don’t have problems with mechanisms that put their data at a much larger risk. As I wrote before: humans don’t have a natural sense for risk and aren’t trained to assess risks correctly. They fear consequences where they shouldn’t, and they take risks they shouldn’t take. In the end it’s about perception, and that is the reason why there is a verify option in ZFS deduplication.
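For completeness, the numbers above can be checked with the usual birthday-bound approximation [tex]p \approx n^2 / (2d)[/tex]. A small sketch, assuming 256-bit checksums, the probability target of [tex]1 \times 10^{-18}[/tex] from the text, and 128 kB blocks:

import math

HASH_BITS = 256
TARGET_P = 1e-18            # roughly the unrecoverable-bit-error rate of a good disk
BLOCK_SIZE = 128 * 1000     # 128 kB blocks, as in the text

d = 2 ** HASH_BITS
# For small p the birthday bound gives p ~ n^2 / (2d), so n ~ sqrt(2 * d * p)
n = math.sqrt(2 * d * TARGET_P)
print(f"blocks to hash before p reaches {TARGET_P}: {n:.1e}")   # ~4.8e29

# Storage capacity using the rounded 2^98 block count quoted above
capacity_bytes = (2 ** 98) * BLOCK_SIZE
print(f"capacity at 128 kB per block: {capacity_bytes / 1e24:.3e} yottabytes")   # ~4.056e10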