Latency matters

Latency matters, even in small quantities. Thinking about storage networking just in IOPS without thinking about latency is like thinking about disks just in terms of size but not IOPS. The impact of latency isn't just theoretical, as some observers suggest. Latency wasn't much of a problem in the past; it was only talked about when cabling over longer distances (100 km, for example) came into focus. But with the increasing popularity of SSDs this starts to be a problem. No … it already is a problem. I just want to guide you through a thought game to show you this challenge.

A thought game

We have an application that reads a block, does something with it, writes it back to disk, reads another block based on the results of processing the last block, processes that block and writes it back. And so on. All of this is done sequentially, and because data integrity is important for this application, the writes are done as synchronous writes.

Let's assume two configurations that are almost equal. Let's assume a server (and an OS and a filesystem) capable of at least 150,000 IOPS, and a storage system capable of serving at least 150,000 IOPS. Let's further assume that everything between the server and the storage limits the IOPS available to an application to 150,000 (e.g. the HBA in the server can't deliver more). We will look at two configurations: the first uses a 10 meter cable for a direct connection to your storage. The second uses a director or a switch and two five meter cables. So the only difference is the existence of this active component.

Just for the uninitiated (I know I'm vastly simplifying things): a director is a very large switch, not uncommon in large SAN deployments. A director is big, has many ports, and it's most often implemented in an extremely available way. Those components introduce a certain latency while they process the frames. They have to: when a switch wants to switch a frame, it has to know the destination. In the beginning, switches stored the complete frame, read the destination and forwarded it. This mechanism was called "store-and-forward". But as the switching decision had to wait until the complete frame had been received by the switch, it introduced rather large latencies. Today it's a little different: there is a mechanism called cut-through, which waits only until the header has been received, not for the rest. While the payload is still arriving, the switch can already look up the correct port and forward the frame to the next system. Thus latencies were reduced by quite a margin.

Okay, back to the practical side: the Brocade 48000 FC director introduces a latency of 3.2 microseconds (the data sheet latency; there are some documents talking about much higher latencies under load, but those numbers were published by competitors, so I will use the 3.2 microseconds in my calculations). The Cisco MDS9500 is said to have a real-world latency of 13-15 microseconds from port to port. Okay, that doesn't sound like much. Light travels 3.9 kilometers in that time. But it is a really long time when we talk about data in motion.
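To make the distance comparison tangible, here is a quick back-of-the-envelope conversion of those latencies into the distance light covers in that time. This is a small Python sketch of my own, using the vacuum speed of light, which is where the 3.9 km figure comes from:

```python
# Convert the quoted switch latencies into the distance light covers in a vacuum.
SPEED_OF_LIGHT_KM_PER_US = 0.2998                  # kilometres per microsecond

for name, latency_us in (("Brocade 48000, data sheet", 3.2), ("Cisco MDS9500, real world", 13.0)):
    print(f"{name}: {latency_us} us -> {latency_us * SPEED_OF_LIGHT_KM_PER_US:.2f} km")
# Brocade 48000, data sheet: 3.2 us -> 0.96 km
# Cisco MDS9500, real world: 13.0 us -> 3.90 km
```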

The impact of latency

Now let's do some math. I'm simplifying things a little bit so that the whole thing is a bit more understandable. I think we can define the number of IOPS as follows:

IOPS available to the application = 1 s / (latency introduced by the switches + time per I/O operation without the switches)

The number of IOPS available to the application is 1 second divided by the sum of the latency introduced by the switches and the time needed to fulfill the I/O operation without the switches between the components. I'm calculating the time to complete the I/O operation as follows: at 150,000 IOPS each operation takes 1/150,000 of a second, or to be exact, approx. 6.67 microseconds per I/O operation. At 1 million IOPS a single operation takes 1 microsecond. Now let's assume a 150 IOPS rotating-rust disk. That's 6.667 milliseconds per request. In 6.67 milliseconds light travels 2,000 km. A circle with a 2,000 km radius centered on Hamburg in Germany covers almost the complete European continent.

First we tare the scale and define the direct connection as a zero-latency connection. Yes … we use the new quantum tunnel fiber cables with integrated flux capacitors and tachyon/photon converters on both sides, invented in a secret Sun lab in Hamburg (almost ready for customers, we just have to get rid of the 40 cm lead shielding around the cable) ;) Or, to say it more realistically: the latency of the cabling is included in the 6.67 microseconds mentioned before. I just want to show you the impact of the latency added by other components.

Okay … what happens when you want to read a block from your device: the connection via the director has a latency of 3.2 microseconds. A command is sent from your server through the HBA and the storage network and finally reaches your storage. The storage scratches the data from the rotating rust (or out of the cache) and delivers it back to the server through the network connection. Thus you incur the switching latency twice: 6.4 microseconds. Now take into consideration that reads are synchronous by nature. The filesystem can choose when it writes data to disk (as long as they are not synchronous writes), but it can't choose when it reads data. Your application doesn't think "Oh, you don't have the data … then give me something else". It waits until the storage delivers the data it requested. Now you've added 6.4 microseconds of latency to the 6.67 microseconds of I/O operation time. Thus an operation takes 13.07 microseconds, and your application has to wait 13.07 microseconds before the data even starts to arrive. This results in:

1 s / (6.67 µs + 6.4 µs) = 1 s / 13.07 µs ≈ 76,500 IOPS

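To make the arithmetic easy to replay, here is a minimal Python sketch of the formula above, using the figures from the text (the 150,000 IOPS baseline and 3.2 microseconds per director pass); the helper name effective_iops is my own:

```python
def effective_iops(base_iops, extra_latency_us):
    """IOPS left for a strictly sequential, synchronous workload when
    every single I/O is delayed by a fixed amount of extra latency."""
    base_io_time_us = 1_000_000 / base_iops            # e.g. ~6.67 us at 150,000 IOPS
    return 1_000_000 / (base_io_time_us + extra_latency_us)

# Read via the Brocade 48000 director: the frames pass the director twice
# (command towards the storage, data back to the server) -> 2 x 3.2 us extra.
print(round(effective_iops(150_000, 2 * 3.2)))         # ~76,500 instead of 150,000
```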
You just halved the IOPS available to your application. It gets even worse when you take the real-world data for the Cisco director into account:

1 s / (6.67 µs + 2 × 13 µs) = 1 s / 32.67 µs ≈ 30,600 IOPS

You get only a fifth of the IOPS, just because of the increased latency. The situation looks slightly better when you calculate with FC switches instead of directors. A Brocade 5100 has a switching latency of 700 ns, so you would get:

1 s / (6.67 µs + 2 × 0.7 µs) = 1 s / 8.07 µs ≈ 123,900 IOPS

But there is a problem. With switches you often have a more complex SAN topology. I've quite often seen a layer of switches to connect the servers and a layer of switches to connect the storage, with both layers interconnected. Thus you have two switches to pass, and therefore four times the one-way latency:

1 s / (6.67 µs + 4 × 0.7 µs) = 1 s / 9.47 µs ≈ 105,600 IOPS

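The same hypothetical helper from the sketch above, applied to the other read scenarios just discussed:

```python
# Read path: the latency is incurred twice per switch (command out, data back).
print(round(effective_iops(150_000, 2 * 13.0)))   # Cisco director, real-world figure: ~30,600
print(round(effective_iops(150_000, 2 * 0.7)))    # single Brocade 5100 switch:        ~124,000
print(round(effective_iops(150_000, 4 * 0.7)))    # core/edge layer of two switches:   ~105,600
```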
For synchronous writes it is even worse. When you want to read data, you send the matching SCSI-over-FC command and the other system answers with data. Writing data consists of two exchanges that have to be transmitted and answered: you send an FCP_CMD_WRT request to your target ("Hey, I want to write data to you, do you allow me to do so?"), the target answers with FCP_XFR_RDY ("Yes, I have some time, send me the stuff"), now you are allowed to send your data ("Okay, here is the data") and the target has to reply to it ("Okay, it's on disk"). Thus the switch latency doubles again for writes. I think you are already calculating the impact of this in your head. For the best case of a single FC switch between the storage and the server we get the following calculation for write IOPS:

1 s / (6.67 µs + 4 × 0.7 µs) = 1 s / 9.47 µs ≈ 105,600 IOPS

For the worst case of an FC director in the path, calculated with the real-world figure of 13 microseconds per pass, we get a really low IOPS count for writes:

1 s / (6.67 µs + 4 × 13 µs) = 1 s / 58.67 µs ≈ 17,044 IOPS

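And a sketch of the write path, reusing the effective_iops helper from above with the latencies quoted earlier (13 microseconds being the real-world director figure):

```python
# Write path: FCP_CMD_WRT/FCP_XFR_RDY plus the data/status exchange,
# i.e. four switch traversals per write I/O.
print(round(effective_iops(150_000, 4 * 0.7)))    # single Brocade 5100 switch: ~105,600
print(round(effective_iops(150_000, 4 * 13.0)))   # FC director, real-world:    ~17,000
```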
You can now go to your favorite shirt vendor and order the custom print: "I've spent a fortune on 150,000 IOPS and all my application saw of it were lousy 17,044 IOPS". Of course you can still use the IOPS budget of your storage for other things, and there are mechanisms like buffers, buffer credits and end-to-end credits to ensure that a system doesn't wait for the ACK to come back before sending more data, but this doesn't help the waiting application. Additionally there are several other ways to cut down latencies, like Fibre Channel Write Acceleration (it cuts down the time for the first handshake by allowing the switch nearest to the host to answer with a "Yes, start with the transmission of your data" on behalf of the disk, and by storing the data on the switch nearest to the disk until the disk really says "Yes, start with the transmission of your data"), but as far as I can see it's more a solution for longer-distance cabling than for circumventing switch latencies. At the end of the day we still have an application waiting for the data it requested, or for the confirmation that the data was written to the disk/chip, and the switches introduce latency by doing their job.

I had a discussion with a reader about the feasibility of putting swap on an SSD: those latencies are the reason why I strongly believe that a directly connected SSD will yield much better results for SSD swap than swap nailed into the cache of a big storage array. The problem isn't the large storage box, it's the SAN in between.

Why wasn't this a problem in the past? That's simple: because rotating disks are relatively slow. Do you remember the 150 IOPS of your disk? When you use the same math, you see that this additional latency reduces the IOPS of this disk by roughly one IOPS.

You don't lose much through this additional latency, and the advantages of centralized storage vastly outshine the impact of one lost IOPS.
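For comparison, here is the same hypothetical calculation for the 150 IOPS rotating-rust disk, which shows why nobody used to notice the switch latency:

```python
# A 150 IOPS disk needs ~6,667 us per I/O, so a few extra microseconds barely register.
print(round(effective_iops(150, 2 * 3.2), 1))     # read via the director:  ~149.9 IOPS
print(round(effective_iops(150, 4 * 13.0), 1))    # write via the director: ~148.8 IOPS
```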

No way out?

Okay, in the end you could say: "Well … 30,000 IOPS is nice (or 17,000 IOPS), vastly more than a rotating-rust disk can deliver. Let's take them from the table and call it a day." Some vendors try to tell you exactly this: I remember a blog entry by an EMC employee writing "Hey, 1 ms is better than 11 ms or 30 ms." Yes, it is. And I would say "Yes, you are right" if this loss of IOPS were inevitable, without alternatives. The kicker: it isn't. Or to be exact: it is inevitable for the market players who have to protect a big-box storage business. For everybody else there is help. There are alternatives.

I know it's a bold statement, but to me it looks like FC SAN was developed for the rotating-rust age. The rotating-rust age was a time when switching latencies were an order of magnitude smaller than disk latencies. But now we have disk latencies in the same ballpark as the switch latencies, and even below. Perhaps we need something different for the solid-state age. But what should we use instead? I don't really have an idea at the moment. One of my weird ideas: perhaps a flash memory controller that doesn't speak SAS or SATA for host connectivity but InfiniBand instead. A controller capable of SRP instead of SATA ;). A decent QDR InfiniBand switch has a port-to-port latency of 300 nanoseconds. And 40 GBit/s instead of 3 or 6 GBit/s is a nice thing, too :)

But perhaps it's best to get rid of this problem by circumventing it in its complete and broad beauty. When there is no SAN between the server and the SSD, you don't have to think about faster storage networking technologies. This is the reason why Sun and other people say "Put an SSD as near as possible to your CPU". I'm not talking about the physical distance here, because calculating with the speed of light at this scale would really be splitting hairs (50 cm copper cable: 1.66782048 nanoseconds; 10 m fiber optic cable: 50.0346143 nanoseconds, as the signal propagates in a fiber at roughly 2/3 of the speed of light because of the refractive index of the glass … hey, 50 cm of copper is fscking long, what about 5 cm on the PCB of the system … hey … just 0.1 nanoseconds of latency … ;)). No, "near the CPU" means: try to get as much latency-introducing equipment as possible out of the way of the SSD. To get the most out of your expensive SSD, or your expensive storage array with large caches, you should connect them directly to your system.

Just to give you an impression: let's assume your preferred supermarket is 10 km away. Your fridge is full. There is a big dinner at your house this evening. Would you store the ingredients that didn't fit in the fridge near the supermarket, or in your cellar? Similar considerations led to the CPU cache first being implemented as a discrete chip on the mainboard, then moving onto a CPU module (like in the UltraSPARC II age, or Slot A/Slot 1 in the x86 sphere), and now it sits on the die.

Okay, I had a discussion with a reader about this. He was correct with his objection that this hurts you on another side: centralized storage helps you do centralized backups, employ disaster recovery procedures and gives you centralized management, and obviously he is somewhat right about this when your application isn't capable of doing its storage management on its own. But this is a different discussion.
And that is the point where all the approaches that use an SSD just like a faster disk fall short, especially when you hide this fast storage in a large box. This is exactly the reason why the hybrid storage pool of ZFS is a good idea. You can have both: centralized storage and the SSD near your server. You put the pool on your centralized storage and put the L2ARC on the local SSD. The L2ARC is unimportant for a disaster recovery scenario; you only have to warm the cache on the other side. The location of the separate ZIL needs more thoughtful consideration. For a cluster failover it has to be available to all nodes of the cluster, so you can integrate the outstanding changes written in the ZIL into the pool, but this depends on your application and your replication strategy. When you just do asynchronous replication between your sites, it can make sense to keep the SSDs outside the general SAN and use a directly connected SAS JBOD with some SSDs to share them between the cluster nodes. By using such a mechanism we can still use the centrally managed rotating rust for our data, but a locally provided SSD for speeding things up and reducing the load on our central infrastructure.

Conclusion

I've now written a long article about a rather small thing … latencies. But I hope I've shed some light on the challenges of a SAN between your server and your storage, especially in conjunction with ultra-fast storage systems. We need intelligent solutions to overcome those challenges. The hybrid storage pool is one of them … I'm sure the industry will show us other innovative solutions in the future.