The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Wednesday, April 26. 2017
Sometimes a simple question leads you deep into basic discussions. This was something I had to solve two years ago. The hard part was not figuring it out; that was pretty obvious from the start. The problem was to explain it.
Since then I have seen it recur several times, most recently two weeks ago, which made me decide to write this down. I will just take the question of the customer: "I had a SAN and my tar -x was running fine with a short runtime. Now you gave me a ZFSSA, I'm using NFS, and the times for tar -x are horrible: 61 minutes." I'm writing about this now because I've seen this problem in different guises again and again over the last few months, and I think it's time to write an article I can just point to when I see this problem again.
To start with: this issue is not Solaris-specific ... it is more or less an NFS-specific problem. So it may be interesting to users of other operating systems as well.
The misconception that leads to this recurring problem is the idea that NFS is simple, that it is just a local filesystem shared over the network. But it isn't that easy. A filesystem that offers network-wide multiple-reader/multiple-writer access to the same data, with caches on all nodes and with locking, can't be simple. At least not as simple as a filesystem that works on block devices used by just one node.
Okay, back to the customer: I thought ... NFS and tar. This is a very well-known thing; I think I first hit this problem somewhere at the very start of this century. I looked into the ZFSSA configuration. There wasn't a single write-accelerating SSD in it. Okay: "Dear customer, buy a Writezilla and you will be fine. Best regards, Moellenkamp." But before spending a significant amount of money, the customer may demand a longer explanation, and the customer got this explanation. Even when the customer thinks that such a long runtime for tar -x may be okay, it causes some fears regarding performance. People want to know why something behaves differently than they expect, so they can set their expectations right and understand why this may not hit them in production.
The first thing you have to keep in mind is that tar is a purely sequential workload. It has to be this way: a tar archive has no centralized directory that stores the position of each file, so you have to go through the file sequentially. It was designed that way for very good reasons, because tar stands for Tape ARchive, and tapes are notoriously bad at random reads compared to disks. When you design an archiving tool for such a medium, designing for sequential access only is acceptable. It is somewhat strange that we ship software with a combination of a program that can only compress a single file and a tape archive program, but the computer industry is full of such oddities.
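The lack of a central index is visible in the format itself: a tar archive is just a sequence of 512-byte headers, each followed by the file's payload rounded up to full 512-byte blocks, so listing the archive means walking it from front to back. A small sketch in Python (standard library only; the file names and contents are made up for the demonstration):

```python
import io
import tarfile

# Build a small tar archive in memory with two files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.txt", b"hello"), ("b.txt", b"world!")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Walk the archive the only way the format allows: header block by
# header block. There is no central directory to consult.
raw = buf.getvalue()
offset, names = 0, []
while offset + 512 <= len(raw):
    header = raw[offset:offset + 512]
    if header == b"\0" * 512:                      # end-of-archive marker
        break
    names.append(header[0:100].rstrip(b"\0").decode())
    size = int(header[124:136].rstrip(b"\0 "), 8)  # size field is octal ASCII
    # skip the header plus the payload rounded up to the next 512-byte block
    offset += 512 + ((size + 511) // 512) * 512

print(names)   # ['a.txt', 'b.txt']
```

Finding the last file in the archive therefore costs as many header reads and seeks as there are files before it.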
At first I had a look into the tarball: 290,000 files. Well ... 61 minutes divided by 290,000 is 12.62 milliseconds per file. That's not that bad ... sometimes simple tasks just take longer because you don't do them in parallel and in large numbers. Doing something short 290,000 times in sequence still adds up to a lot of time.
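The back-of-the-envelope arithmetic behind that number:

```python
files = 290_000
runtime_s = 61 * 60                      # 61 minutes in seconds
per_file_ms = runtime_s / files * 1000   # milliseconds per extracted file
print(f"{per_file_ms:.2f} ms per file")  # 12.62 ms per file
```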
But why is it so much faster on local storage, and why did it take 12 milliseconds per file? The customer told me it was a matter of one or two minutes to unpack this on local disks. The answer is quite simple: a local filesystem like UFS or ZFS is not a shared filesystem. Only one server writes to it and only one server reads from it. Writing to a local disk is fairly easy: your system is the only source of truth. You can safely cache things, because when something changes you know it can only be you. You can combine the data in your cache and on disk to give a correct answer. On a networked filesystem this looks different: another node can change information and make your cached content stale, and another node doesn't have your cache available to establish the truth about the filesystem state. So things work quite differently with NFS, and you have to take this into the equation when thinking about performance.
So what happens when you do an untar locally:
I would like to point your attention to the flags of the open call: the file is opened with O_CREAT (and O_WRONLY), but without O_SYNC or O_DSYNC. So the subsequent writes are asynchronous; they land in the page cache and the application moves on.
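For illustration, this is roughly the per-file pattern a local tar -x produces, sketched in Python; the file name and payload are made up. Note that nothing in it pushes the data to stable storage: there is no O_SYNC on the open and no fsync before the close, so the writes only reach the page cache:

```python
import os
import tempfile

def extract_one(dirpath: str, name: str, payload: bytes) -> None:
    """Roughly what tar -x does per file on a local filesystem."""
    path = os.path.join(dirpath, name)
    # open with O_CREAT, but without O_SYNC/O_DSYNC: writes are asynchronous
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)      # lands in the page cache and returns
    finally:
        os.close(fd)               # close() does not imply a flush either

with tempfile.TemporaryDirectory() as d:
    extract_one(d, "demo.txt", b"payload")
    with open(os.path.join(d, "demo.txt"), "rb") as f:
        content = f.read()
print(content)   # b'payload'
```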
Okay, but local filesystems are not the problem here. When you untar the same file into a directory provided by NFS, you will see the following output from a packet sniffer. I would like to add that no special mount options were used that could have had an impact on how things work internally; this was done with a plain standard mount.
As you see, a lot is going on. You may say that the performance difference between a local filesystem and NFS simply comes from the sheer number of operations and network round trips, and you are surely partially right. But that is only half the story, and it doesn't explain the 12 ms (which, interestingly, is not that far from the worst-case rotational latency of a 5400 rpm disk). The rest of the story is that the load is transformed by using NFS.
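As a side note, the worst-case rotational latency the 12 ms are compared to is simply the time of one full platter revolution:

```python
rpm = 5400
worst_case_rotational_ms = 60 / rpm * 1000   # one full revolution, in ms
print(f"{worst_case_rotational_ms:.2f} ms")  # 11.11 ms
```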
There is an important passage in RFC 1813. It states:
Data-modifying operations in the NFS version 3 protocol are synchronous. When a procedure returns to the client, the client can assume that the operation has completed and any data associated with the request is now on stable storage.

Furthermore, it elaborates:
The following data modifying procedures are synchronous: WRITE (with stable flag set to FILE_SYNC), CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, and COMMIT.
Not on this list is the SETATTR procedure, which, at least to my knowledge and judging from the copies of nfs_srv.c out there on the internet, also forces the data to stable storage before completing.
When you look at the tcpdump, you see a significant number of the mentioned operations. Now imagine the same for a tar file of 290,000 small files.
So essentially, using NFS transforms a load that formerly consisted of a multitude of asynchronous write operations into one executing a significant number of operations that are synchronous by definition. What was an asynchronous write at the application layer becomes a sequence of synchronous operations when it hits the NFS server, and thus the filesystem underneath. A rather complete transformation of the load.
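To get a feeling for the magnitude of this transformation, here is a deliberately crude model. The number of synchronous operations per file and the per-operation latency are assumptions for illustration, not measurements:

```python
# Assumption: each small file costs about one CREATE, one FILE_SYNC WRITE
# and one SETATTR, and each of them has to reach stable storage before
# the server may answer.
sync_ops_per_file = 3            # assumed operation count per file
files = 290_000
latency_ms = 4                   # hypothetical latency per stable-storage op

total_min = files * sync_ops_per_file * latency_ms / 1000 / 60
print(f"about {total_min:.0f} minutes")   # about 58 minutes
```

With numbers in that ballpark, a runtime of 61 minutes stops looking mysterious.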
However, the situation is not that bad in reality; there are some optimizations. Let's have a look at a different tcpdump. I've created a tar file containing files of four sizes: 1k, 32k, 33k and 64k. The tcpdumps are reduced to the CREATE, WRITE and COMMIT calls.
We start with the 32k file. As the file fills a complete wsize, you see it's written as an asynchronous UNSTABLE write, followed by a COMMIT.
Let's have a look at the 1k file.
The write is directly sent as a synchronous FILE_SYNC write; there is no separate COMMIT.
Out of curiosity, I also looked at the 33k and 64k sizes. First 33k:
Now 64k:
I'm not an expert on the NFS code, but my assumption about the reasoning behind this behavior is this: if you fill up the wsize, it is expected that there will be more writes into the same file; if you don't fill it, it is expected that this may be the only or the last write.
So it's not as easy as saying that NFS always writes synchronously. That was the case in NFSv2 and was the source of a lot of NFS performance problems. NFSv3 introduced the concept of asynchronous NFS writes with a synchronous COMMIT at the end. But the "many small files" tar counters this, as you may have taken away from the tcpdumps. So: if you write 290,000 files that are mostly smaller than 32k, for example (the configuration at that customer), you will see a lot of synchronous NFSv3 writes and almost no asynchronous writes and almost no commits. If the tar contained one or a few big files with the same total size, you would write the same amount of data much faster, because then you would have a lot of asynchronous NFS writes, a final synchronous NFS write, and a closing NFS COMMIT.
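The difference between the two shapes of the same amount of data can be put into a little model. It assumes a wsize of 32k and the behavior seen in the tcpdumps (full-wsize chunks go out as UNSTABLE writes with one closing COMMIT, a partial chunk goes out as a FILE_SYNC write); the exact boundary behavior of a real client may differ:

```python
WSIZE = 32 * 1024              # assumed mount wsize of 32k

def nfs_writes(filesize: int) -> dict:
    """Model: full-wsize chunks are sent UNSTABLE (followed by one COMMIT),
    a trailing partial chunk is sent as a synchronous FILE_SYNC write."""
    full, rest = divmod(filesize, WSIZE)
    return {
        "unstable": full,
        "file_sync": 1 if rest else 0,
        "commit": 1 if full else 0,
    }

total = 290_000 * 16 * 1024    # 290,000 files of 16k each
one_big = nfs_writes(total)
many_small = {k: 290_000 * v for k, v in nfs_writes(16 * 1024).items()}
print(many_small)   # 290,000 synchronous writes, not a single commit
print(one_big)      # only asynchronous writes plus one commit
```

Same number of bytes, completely different latency profile on the server.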
By this I was able to show the customer that the execution time of the tar -x, and the difference between the local filesystem and a remote filesystem via NFS, is the expected behavior given the configuration of their storage pool.
How do you improve this situation? The problem has been around for quite some time, and there have been a multitude of solutions to it.
Just to give you a perspective on how long this problem has existed: perhaps some of you are old enough to remember things like the Prestoserve NFS accelerator, which was quite useful with NFSv2, as each and every WRITE call had to be answered synchronously (NFSv2 obviously didn't know the method of a sequence of asynchronous writes with a following COMMIT; COMMIT was only introduced in NFSv3). It solved exactly this problem.
Basically, the answer to the problem is always the same: make write latency as low as possible. It boils down to getting rid of the need to hit rotating rust before being able to answer the mentioned NFS calls. One solution, used for example in ZFS Storage Appliances, is SSDs as log devices: a write is considered non-volatile as soon as it is on the log device.
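How much a low-latency log device changes the picture can be estimated with a crude per-file model; the operation count and both latencies are hypothetical round numbers, not measurements:

```python
files, sync_ops_per_file = 290_000, 3    # assumed sync ops per small file

def tar_x_minutes(per_op_latency_ms: float) -> float:
    """Estimated tar -x runtime if every sync op costs the given latency."""
    return files * sync_ops_per_file * per_op_latency_ms / 1000 / 60

print(f"rotating disk:  {tar_x_minutes(4.0):.1f} min")   # 58.0 min
print(f"SSD log device: {tar_x_minutes(0.2):.1f} min")   # 2.9 min
```

The workload is unchanged; only the cost of each unavoidable synchronous answer shrinks.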
For these reasons I have two rules when working with customers: 1. You don't get a ZFS Storage Appliance from me without write-accelerating SSDs, as long as we can't prove that your load won't be hit by the mentioned necessities of the protocol (which is quite hard, when it hides even in such a simple thing as a tar), or unless you tell me that you understand the problem but still want it otherwise. In the end you are the customer, and the customer is king, as we say in Germany. 2. When you tell me you run NFS from a server (say, a DIY ZFS storage appliance), you should either have a storage array behind it with a non-volatile write cache (which should really be in use and not switched off; I'm just saying this out of experience), or you should opt for log devices as well. Without them you essentially have a storage array whose cache battery has failed, which therefore falls back into write-through mode, and you get the performance you can expect from such a setup.
And as a fun fact: you may think that tar -x is not that problematic because you only do it from time to time. On the contrary: I've seen quite a number of customers in the past for whom tar is production-critical, for example unpacking sensor data, development file servers where quite a number of tar unpacks run all day, or diagnostic data of machines. This little tape archive tool is used quite frequently, for a multitude of tasks where you wouldn't expect it.
What is the conclusion of all this? NFS transforms a load of many asynchronous small writes into a sequence of synchronous operations, so when sizing an NFS server you have to make sure it can answer those synchronous operations with low latency.
"The following data modifying procedures are synchronous: WRITE (with stable flag set to FILE_SYNC), CREATE,MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, and COMMIT."
"So essentially using NFS transform the load formerly consisting out of a multitude asynchronous write operations into one executing a significant number of operations that are synchronous by definition. What was a async write on application layer, is a sequence of synchronous writes when it hits the NFS server and thus hits the filesystem underneath. A rather total transformation of the load."
I think that all these operations (WRITE with stable flag set to FILE_SYNC, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME, LINK, and COMMIT) are also synchronous in UFS, ZFS or other filesystems. They can be optimized with a logging filesystem, etc., but they are synchronous.
#1 Jose on 2017-04-27 22:31
At least with ZFS this isn't correct. A rmdir for example doesn't trigger a zil_commit, as long as you don't specify SYNC_ALWAYS.