The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Friday, September 4. 2009
Latency matters, even in small quantities. Thinking about storage networking just in IOPS without thinking about latency is like thinking about disks just in terms of size but not IOPS. The impact of latency isn't merely theoretical, as some observers suggest.
This latency thing wasn't such a problem in the past. You only talked about it when cabling over longer distances (100 km, for example) came into focus. But with the increased popularity of SSDs this starts to be a problem. No ... it's already a problem. I just want to guide you through a thought game to show you this challenge.
A thought game
We have an application that reads a block, does something with it, writes it back to disk, reads another block based on the results of processing the previous one, processes that block, and writes it back. And so on. All of this is done sequentially. As data integrity is important for this application, the writes are done as synchronous writes.
Let's assume two configurations that are almost identical. Let's assume a server (with an OS and a filesystem) that is capable of at least 150,000 IOPS, and storage capable of serving at least 150,000 IOPS. Let's further assume that everything between the server and the storage limits the IOPS available to an application to 150,000 (e.g. the HBA in the server can't deliver more). We will look at two configurations: The first uses a 10 meter cable to give you a direct connection to your storage. The second uses a director or a switch and two five meter cables. So the only difference is the existence of this active component. Just for the uninitiated (I know I'm vastly simplifying things): A director is a very large switch, not uncommon in large SAN deployments. A director is big, has many ports, and is most often implemented in a highly available way.
Those components introduce a certain latency while they process the frames. They have to: When a switch wants to switch a frame, it has to know the destination. In the beginning, switches stored the complete frame, read the destination, and forwarded it. This mechanism was called "store-and-forward". But as the switching decision had to wait until the complete frame was received by the switch, it introduced rather large latency. Today it's a little bit different: There is a mechanism called cut-through, which waits until the header is received by the switch but doesn't wait for the rest. While the payload is still being received, the switch can already look up the correct port and start forwarding the frame to the next system. Thus latencies were reduced by quite a margin.
Okay, back to the practical side: The Brocade 48000 FC director introduces a latency of 3.2 microseconds (data sheet latency; there are some documents talking about much higher latencies under load, but those numbers were published by competitors, so I will use the 3.2 microseconds in my calculations). The Cisco MDS9500 is said to have a real-world latency of 13-15 microseconds from port to port. Okay, that doesn't sound like much. Light travels 3.9 kilometers in 13 microseconds. But it is a really long time when we talk about data in motion.
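To get a feeling for the difference between the two switching modes, here is a minimal sketch. The frame size, header size and link rate are my assumptions for illustration (a maximum-size Fibre Channel frame on a 4 Gbit/s link), not figures from any data sheet, and the lookup time inside the switch is ignored.

```python
# Sketch: how long a switch has to wait before it can make its forwarding
# decision. ASSUMPTIONS (not from the article): a maximum-size Fibre Channel
# frame of 2,148 bytes, 28 bytes of start-of-frame delimiter plus header,
# and a 4 Gbit/s link that moves roughly 400 MB/s.

FRAME_BYTES = 2148              # assumed maximum frame size
HEADER_BYTES = 28               # assumed SOF delimiter (4) + frame header (24)
LINK_BYTES_PER_SECOND = 400e6   # assumed rate of a 4 Gbit/s FC link


def receive_time_us(num_bytes, bytes_per_second=LINK_BYTES_PER_SECOND):
    """Microseconds needed to receive num_bytes at the given link rate."""
    return num_bytes / bytes_per_second * 1e6


# Store-and-forward has to receive the whole frame first.
store_and_forward_us = receive_time_us(FRAME_BYTES)
# Cut-through only has to receive the header.
cut_through_us = receive_time_us(HEADER_BYTES)

print(f"store-and-forward waits {store_and_forward_us:.2f} us")
print(f"cut-through waits       {cut_through_us:.2f} us")
```

Under these assumptions the wait for the complete frame is in the range of several microseconds, while the wait for the header alone is well below one microsecond.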
The impact of latency
Now let's do some math. I'm simplifying things a little bit, so the stuff is easier to understand. I think we can define the number of IOPS as follows:
The number of IOPS available to the application is 1 second divided by the sum of the latency introduced by the switches and the time to complete the I/O operation without the switches between the components.
I'm calculating the time to complete the I/O operation as follows: At 150,000 IOPS each operation takes 1/150,000 of a second, or to be exact, approx. 6.67 microseconds per I/O operation. At 1 million IOPS a single operation takes 1 microsecond. Now let's assume a 150 IOPS rotating rust disk. That's 6.67 milliseconds per request. In 6.67 milliseconds light travels 2000 km. The 2000 km great circle centered on Hamburg in Germany covers almost the complete European continent.
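The conversions in this paragraph can be sketched in a few lines. The function names are made up for this illustration, and the light distance uses the vacuum speed of light.

```python
# Sketch: convert an IOPS figure into the time per operation and into the
# distance light travels in that time. Function names are hypothetical.

SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def service_time_us(iops):
    """Time per I/O operation in microseconds for a given IOPS rate."""
    return 1e6 / iops


def light_distance_km(time_us):
    """Distance light travels in vacuum in the given time."""
    return SPEED_OF_LIGHT_KM_PER_S * time_us / 1e6


for iops in (1_000_000, 150_000, 150):
    t = service_time_us(iops)
    print(f"{iops:>9} IOPS -> {t:9.2f} us per operation, "
          f"light travels {light_distance_km(t):8.2f} km")
```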
At first we tare the scale and define the direct connection as a zero-latency connection. Yes ... we use the new quantum tunnel fiber cables with integrated flux capacitors and tachyon/photon converters on both sides, invented in a secret Sun lab in Hamburg (almost ready for customers, we just have to get rid of the 40 cm lead shielding around the cable). Or, to put it more realistically: The latency of the cabling is included in the 6.67 microseconds mentioned before. I just want to show you the impact of the latency added by other components.
Okay ... what happens when you want to read a block from your device: The connection via the director has a latency of 3.2 microseconds. A command is sent from your server through the HBA and through the storage network connection and finally reaches your storage. The storage scratches the data from the rotating rust (or takes it out of the cache) and delivers it back to the server through the network connections. Thus you introduce this switching latency twice: 6.4 microseconds.
Now take into consideration, that reads are synchronous by nature. The filesystem can choose when it writes data to disk (as long as they are not synchronous writes), but it can't choose when it reads data. Your application doesn't think "Oh, you don't have the data ... then give me something else". It waits until your storage delivers the data the application requested.
Now you've added 6.4 microseconds of latency to the 6.67 microseconds of I/O operation time. Thus an operation takes 13.07 microseconds: your application has to wait 13.07 microseconds before the data starts to arrive. This results in:
1 s / (6.67 µs + 2 × 3.2 µs) = 1 s / 13.07 µs ≈ 76,500 IOPS
You just halved the IOPS available to your application. It gets even worse when you take the real-world data for the Cisco director into account (using the lower bound of 13 microseconds per pass):
1 s / (6.67 µs + 2 × 13 µs) = 1 s / 32.67 µs ≈ 30,600 IOPS
You get only a fifth of the IOPS available, just because of the increased latency.
The situation looks slightly better when you calculate with FC switches instead of directors. A Brocade 5100 has a switching latency of 700 ns, so you would get:
1 s / (6.67 µs + 2 × 0.7 µs) = 1 s / 8.07 µs ≈ 123,900 IOPS
But there is a problem. With switches you often have a more complex SAN topology. Quite often I've seen one layer of switches to connect the servers and another layer of switches to connect the storage, with both layers interconnected. Thus you have 2 switches to pass, and thus you have 4 times the one-way latency:
1 s / (6.67 µs + 4 × 0.7 µs) = 1 s / 9.47 µs ≈ 105,600 IOPS
For sync writes it is even worse. When you want to read data, you send the matching SCSI over FC command, and the other system answers with data. Writing data consists of two messages that have to be transmitted and answered. You send a write request (FCP_CMND) to your target ("Hey, I want to write data to you, do you allow me to do so?"), the target has to answer with FCP_XFER_RDY ("Yes, I have some time, send me the stuff"), now you are allowed to send your data ("Okay, here is the data"), and the target has to reply to it ("Okay. It's on disk"). Thus the switch latency doubles for writes - again.
I think you are already calculating the impact of this in your head. For the best case of a single FC switch in the way between the storage and the server we get to the following calculation for write IOPS:
1 s / (6.67 µs + 4 × 0.7 µs) = 1 s / 9.47 µs ≈ 105,600 IOPS
For the worst case of an FC director in the way (again with 13 microseconds per pass), we get to a really low IOPS count for writes:
1 s / (6.67 µs + 4 × 13 µs) = 1 s / 58.67 µs ≈ 17,044 IOPS
You can now go to your favorite vendor for shirts and order the custom print: "I've spent a fortune for 150,000 IOPS and all my application saw of it were lousy 17044 IOPS"
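The calculations of this section can be replayed with a minimal sketch. It assumes the simplified model from above: 6.67 microseconds per operation without switches, two passes through each switch for a read and four for a synchronous write, and 13 microseconds as the lower bound for the slower director. The function and variable names are made up for this illustration.

```python
# Sketch of the simplified model: IOPS = 1 s / (base time + switch passes).
# Names are hypothetical; all latencies are in microseconds.

BASE_US = 6.67          # rounded time per operation at 150,000 IOPS
READ_PASSES = 2         # command to the storage, data back
WRITE_PASSES = 4        # command, transfer ready, data, confirmation


def effective_iops(switch_latency_us, passes, base_us=BASE_US):
    """IOPS seen by a strictly sequential, synchronous application."""
    return 1e6 / (base_us + passes * switch_latency_us)


scenarios = [
    ("director, 3.2 us, read", 3.2, READ_PASSES),
    ("director, 13 us, read", 13.0, READ_PASSES),
    ("switch, 0.7 us, read", 0.7, READ_PASSES),
    ("two switch layers, 0.7 us, read", 0.7, 2 * READ_PASSES),
    ("switch, 0.7 us, sync write", 0.7, WRITE_PASSES),
    ("director, 13 us, sync write", 13.0, WRITE_PASSES),
]

for name, latency_us, passes in scenarios:
    print(f"{name:<32} {effective_iops(latency_us, passes):>9.0f} IOPS")
```

The last line of the output is the number for the shirt. The model only describes one application thread waiting for each operation; it says nothing about the aggregate IOPS of many parallel requests.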
Of course you can still use the IOPS budget of your storage for other things, and there are mechanisms like buffers, buffer credits and end-to-end credits to ensure that a system doesn't have to wait for the ACK to come back before it sends more data, but this doesn't help the waiting application. Additionally, there are several other ways to cut down latencies, like Fibre Channel Write Acceleration. (It cuts down the time for the first handshake by allowing the switch nearest to the host to answer with a "Yes, start with the transmission of your data" on behalf of the disk, and by storing the data on the switch nearest to the disk until the disk really says "Yes, start with the transmission of your data".) But as far as I see it, it's more a solution for long-distance cabling than for circumventing switch latencies.
But at the end we have an application waiting for the data it requested, or for the confirmation that the data was written to the disk/chip, and the switches introduce latency by doing their job. I had a discussion with a reader about the feasibility of putting the swap on an SSD: Those latencies are the reason why I strongly believe that a directly connected SSD will yield much better results for SSD swap than a swap nailed into the cache of a big storage array. The problem isn't the large storage box, it is the SAN in between.
Why wasn't that a problem in the past? That's simple: because disks are relatively slow. Do you remember the 150 IOPS of your disk? Well, when using the same math, you see that this additional latency reduces the IOPS of this disk by just 1 IOPS.
You don't lose much by this additional latency, and the advantages of centralized storage far outweigh the impact of one lost IOPS.
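The same math for the rotating rust disk, as a minimal sketch. The two added latencies are the 6.4 microseconds of a read through the faster director and the 52 microseconds of a synchronous write through the slower one; the function name is made up for this illustration.

```python
# Sketch: impact of added switch latency on a 150 IOPS disk.
# The function name is hypothetical; latencies are in microseconds.

DISK_IOPS = 150


def iops_with_added_latency(base_iops, added_latency_us):
    """IOPS left when every operation takes added_latency_us longer."""
    base_time_us = 1e6 / base_iops
    return 1e6 / (base_time_us + added_latency_us)


for added_us in (6.4, 52.0):
    remaining = iops_with_added_latency(DISK_IOPS, added_us)
    print(f"+{added_us:5.1f} us -> {remaining:.2f} IOPS "
          f"(lost {DISK_IOPS - remaining:.2f})")
```

Even with the larger value the disk loses only a little more than one operation per second.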
No way out?
Okay, at the end you could say: "Well ... 30,000 IOPS is nice (or 17,000 IOPS), vastly more than a rotating rust disk can deliver. Let's take them from the table and call it a day". Some vendors try to tell you exactly this: I remember a blog entry of an EMC employee writing "Hey, 1ms is better than 11ms or 30 ms." Yes, it is. And I would say "Yes, you are right" if this loss of IOPS were inevitable, without alternatives. The kicker: It isn't. Or to be exact: It is inevitable for the market players who have to protect a big box business in storage. For everybody else: There is help. There are alternatives.
I know it's a bold statement, but to me it looks like FC SAN was developed for the rotating rust age. The rotating rust age was a time when switching latencies were orders of magnitude lower than disk latencies. But now we have disk latencies in the same ballpark as the switch latencies, and even below.
Perhaps we need something different for the solid state age. But what should we use instead? I don't really have an idea at the moment. One of my weird ideas: Perhaps a flash memory controller that doesn't speak SAS or SATA for host connectivity, but InfiniBand instead. A controller capable of SRP instead of SATA. A decent QDR InfiniBand switch has a port-to-port latency of 300 nanoseconds. And 40 GBit/s instead of 3 or 6 GBit/s is a nice thing, too.
But perhaps it's best to get rid of this problem by circumventing the problem in its complete and broad beauty. When there is no SAN between the server and the SSD, you don't have to think about faster storage networking technologies.
This is the reason why Sun and other people say "Put an SSD as near as possible to your CPU". I'm not talking about the physical distance here, because calculating with the speed of light would really be splitting hairs (50 cm copper cable: 1.66782048 nanoseconds; 10 m fibre optic cable: 50.0346143 nanoseconds, as the information propagates in a fiber at about 2/3 of the vacuum speed of light because of the refractive index of the glass ... hey, 50 cm of copper is fscking long, what about 5 cm on the PCB of the system ... hey ... just under 0.2 nanoseconds of latency ...). No, "near the CPU" means: Try to get as much latency-introducing equipment as possible out of the way of the SSD. To get the most out of your expensive SSD or your expensive storage array with large caches, you should connect them directly to your system.
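The hair-splitting numbers above can be reproduced with a minimal sketch. It assumes, as the figures in the text do, the vacuum speed of light for the copper cable and 2/3 of it for the fiber; the function name and the velocity factor parameter are made up for this illustration.

```python
# Sketch: propagation delay of a signal over a cable of a given length.
# The velocity factor is the fraction of the vacuum speed of light at which
# the signal travels (1.0 assumed for copper here, 2/3 for fiber).

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def propagation_delay_ns(length_m, velocity_factor=1.0):
    """Nanoseconds a signal needs to travel length_m."""
    return length_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor) * 1e9


print(f"50 cm copper: {propagation_delay_ns(0.5):.8f} ns")
print(f"10 m fiber:   {propagation_delay_ns(10.0, 2 / 3):.7f} ns")
print(f"5 cm on PCB:  {propagation_delay_ns(0.05):.4f} ns")
```

Compared to the microseconds added by a switch or a director, all of these values are noise.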
Just to get an impression: Let's assume your preferred supermarket is 10 km away. Your fridge is full. There is a big dinner at your house this evening. Would you store the ingredients that didn't fit in the fridge near the supermarket or in your cellar? Similar considerations drove the evolution of the CPU cache: At first it was a discrete chip on the mainboard, then it moved onto the CPU board (as in the UltraSPARC II age or with Slot A/Slot 1 in the x86 sphere), and now it's on the die.
Okay, I had a discussion with a reader about this. He was correct with his objection that this would hurt you on another side: Centralized storage helps you to do centralized backups, to employ disaster recovery procedures, and to manage your storage centrally, and obviously he is somewhat correct about this when your application isn't capable of doing its storage management on its own. But this is a different discussion.
And that is the point where all the approaches that use an SSD just like a faster disk fall short, especially when you hide this fast storage in a large box. This is exactly the reason why the hybrid storage pool of ZFS is a good idea. You can have both: centralized storage and the SSD near your server. You put the pool on your centralized storage and put the L2ARC on the local SSD. The L2ARC is unimportant for a disaster recovery scenario. You only have to warm the cache on the other side.
The location of the separate ZIL needs more thoughtful consideration. For a cluster failover it has to be available to all nodes of the cluster, so that the outstanding changes written to the ZIL can be integrated into the pool, but this depends on your application and your replication strategy. When you just do asynchronous replication between your sites, it can make sense to keep the SSD outside the general SAN and use a directly connected SAS JBOD with some SSDs to share them between the cluster nodes.
By using such a mechanism we can still use the centrally managed rotating rust for our data, but a locally provided SSD for speeding up things and reducing the load on our central infrastructure.
Conclusion
I've now written a long article about a rather small thing ... latencies. But I hope I've shed some light on the challenges of a SAN between your server and your storage, especially in conjunction with ultra-fast storage systems. We need intelligent solutions to overcome those challenges. The hybrid storage pool is one of them ... I'm sure the industry will show us other innovative solutions in the future.
Posted by Joerg Moellenkamp in English, Oracle, Solaris, Technology, The IT Business at 21:26 | Comments (13)
Defined tags for this entry: storage
Introducing the FAWN
A Fast Array of Wimpy Nodes
My point being that FAWN uses Flash-memory locally :
"FAWN is a fast, scalable, and power-efficient cluster architecture for data-intensive computing. Our prototype FAWN cluster links together a large number of tiny nodes built using embedded processors and small amounts (2--16GB) of flash memory into an ensemble capable of handling 1300 queries per second per node, while consuming fewer than 4 watts of power per node."
I'd really love to see huge SSD disks with tons of cache and fast interfaces, but let's think about this for a few seconds:
SAS is specified for operation via electrical cables with a probable maximum distance of about 10 meters at 3 Gbps or 5 meters at 6 Gbps.
SATA is only 1 meter.
I would really love to see disks with IB connectors, too. But 12-15 W for a single QDR port is far too much for a single disk. Perhaps a complete system with tons of SSDs in 1U or 2U.
And it is still an interesting question whether FCoIB would be easier to implement than FCoE. Probably neither of them will ever make it.
Conclusion: if it's not SAS winning the race, it will be something completely different.
A 32G Fusion-io with a fiber port, a supercap, and the logic to implement guaranteed atomic writes to a connected "buddy" card. Maybe dual ports to do three cards (1 hop to each).
And there you have the latency again: because the write has to be transmitted to the other cards, written to the flash, and the successful write has to be confirmed to the other card ... it's easier to use this approach with a shared sZIL disk. Sun does this in our S7000 line ...
So in a situation where you have a zpool on SAN and your L2ARC on local SSD, is there a way to fail that zpool over to another host?
I don't think there is at the moment, since the L2ARC (or 'cache' in "zpool add" terminology) is still part of the pool. Does the HAStoragePlus agent in Sun Cluster handle this? Would be fantastic if it did, but I believe the answer is "no" at present. Fancy filing an RFE for me?
Putting the storage as close to the CPU as possible surely is a great concept.. but it gets increasingly difficult with rising space demands. If you're only doing file system math in the terabyte range, there's no way around the big boxes with tons of disks inside.
Figuring out the speed gain by using SSDs in these boxes instead of rotating rust is a wholly different equation, as now the transfer speed of the storage medium approaches the transfer speed of the cache modules.
Your whole math also (deliberately?) excludes several other factors that reduce the total available IOPS, like the latency in the HBA, processing overhead in some file system driver, overhead in the multipath manager, CPU bus wait times etc.
I'm currently deep into research on how many of the IOPS that the storage box can deliver will really end up at the application. This is especially a problem with the current many-but-weak-cores architectures like the Niagara: on an old V240 I have no problem saturating two 4G fibre links, on a T5220 I don't have the slightest chance to achieve the same unless I spread my tests over two or three dozen threads.
1. No, I didn't forget it. I've assumed that all of that is included in the 6.67 microseconds. My assumption was that the complete combination of filesystem, server, storage, cable and so on delivers 150,000 IOPS, as the article just looked into the additional latency of the switches.
2. With the multi-petabyte point you are right, but that's the idea behind L2ARC and sZIL: having the accelerating stuff in or near the server, and the data in the big box ...
3. It's obvious that you need more threads to saturate the lines. That's the point of CMT. The single thread is slower, but you have many of them ...
4. The M-Class has no problems saturating your lines.
My position is that I think SAN solves most storage needs. It is not a religion. I just ask the question. "Can this server function in a SAN?" If yes, that is good since then I don't have to make a specialized setup for that server. If not, so be it and let that server be an exception. A rule has exceptions, no problem. Same thing with virtualization where some servers should not be virtualized. Same with DR where there may be a discussion of whether this one application should take care of its own DR instead of letting the disk system do it, maybe DataGuard is the way to go for an Oracle DB.
I guess there are servers which need millions upon millions of IOPS. I have been to a few data centres and haven't seen one yet, but I reckon they exist somewhere out there. My estimate is that out of every 1000 servers, less than one would require some special solution like the one you describe. From my perspective a SAN that is beneficial for 99.9% of your server park is a huge success, and it makes SAN the rule. But you are turning this upside down. You make a rule out of a few rare exceptions. From a series of blog posts you take a borderline case (imaginary or not) and generalize. This special server (one out of a thousand) is so IO hungry that it needs a large number of SSDs a few centimetres from the CPU. By your logic this exception will make SANs disappear in a few years. That is why I think your reasoning is agenda-driven. That is also probably why I haven't seen you take the same stance when it comes to other technologies. Like LAN, for example. A core-edge LAN design will introduce even larger latencies between servers. Why don't you argue that all servers that communicate with each other should reside in the same box? And argue that data centre LANs will disappear? Why are you not a big fan of the mainframe?
I cannot say for certain if your calculations are correct, since I am not an expert on the lower levels of FC. If you have, for example, 150 kIOPS on each of three components (HBA, SAN and disk system), you add the latencies of each one and get an overall throughput of 50 kIOPS. I suspect that simultaneous requests from a highly threaded application with buffer credits and queue depth will put the real throughput somewhere in between 50k and 150k IOPS. If you go to the movies and the ticket counter can process 10 ppl/minute, the popcorn kiosk 15 ppl/minute and the entrance to the theatre 20 ppl/minute, the overall throughput will be dictated by the slowest link in the chain, that is 10 ppl/minute. Not 4.6 ppl/minute if you add all the latencies. And remember that with a SAN you can have parallel queues by bundling HBAs (up to 32, I think, for PowerPath) and just multiply SAN performance. But I will try to find out what is correct for FC latencies, or maybe an FC expert can advise us in this discussion. I'll tip a few people.
OK. So SAN has a limit today, and that limit is pushed further and further as technology evolves. With SSDs we will see increasing performance on HBAs, switches and disk systems. SSDs are certainly a wake-up call for the vendors, but they will surely adapt. So suppose I have a special server in my data centre that needs an abnormal amount of IOPS, and I would like to provide that within the confines of SAN and centralized storage and with today's technology. Then I wouldn't put it on the outside of a core-edge design with switch and director. Most of the servers are fine there, but for my big UNIX box I want more. I could plug it directly into the core director. Or I could dedicate a few ports on the USP-V (out of a maximum of 224) and skip the switches entirely. The SAN can now produce the maximum IOPS of the HBA, and you can have as many pipes as you want. My question for you: What percentage of servers do you think will fail to have their IOPS requirements met by this solution? It would be interesting if you came up with an estimate here, because then it would be a better discussion. If you agree with me that we are talking ballpark 0.1% of the servers, we could skip the debate on whether SAN is on the rise or withering away.
I'm in a training at the moment ... so just a short comment. Of course your components have their IOPS count ... but the time in between the components is longer.
Or with the cinema example: Let's assume the ticket counter has a capacity of 10 ppl/minute, the restrooms have a capacity of 10 ppl/minute, the popcorn counter has 10 ppl/minute and the entrance to the cinema has 10 ppl/minute. Of course all the stations have their capacity. Now think about the time you need from your car to the chair in the cinema when the restrooms aren't on the same floor and you have to go three floors upstairs and back downstairs again. Now reconsider the time from car to chair if the restroom is directly left of the ticket counter.
The next part doesn't match the real world: Think of the IOPS rate available to the application as the rate at which you "fill the cinema" when the next guest can only buy a ticket once the previous guest sits in his chair. (Perhaps the ticket sales, the popcorn sales, the cleaning of the restroom after each guest and the showing of the seats are all done by the same person.)
Each station is still capable of processing 10 ppl/minute, but you can't fill your cinema at this speed. Of course it's different when you have more employees.
I don't think that SAN will be obsolete, I just think it will be augmented by intelligent concepts to get around the inherent challenges of a network between the components.
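The two views in this discussion can be put side by side in a minimal sketch. It treats each station as taking 1/rate per guest, which is a simplification of both examples, and the function names are made up for this illustration.

```python
# Sketch: rate seen by a single guest who must finish all stations before the
# next one starts, versus the rate when every station is kept busy in parallel.
# Function names are hypothetical.

def one_at_a_time_rate(station_rates):
    """Rate when the next request starts only after the last one finished."""
    return 1.0 / sum(1.0 / rate for rate in station_rates)


def fully_pipelined_rate(station_rates):
    """Upper bound when enough requests are in flight: the slowest station."""
    return min(station_rates)


cinema = [10, 15, 20]                    # ppl/minute per station
san = [150_000, 150_000, 150_000]        # IOPS per component

for name, rates in (("cinema", cinema), ("SAN", san)):
    print(f"{name}: one at a time {one_at_a_time_rate(rates):.1f}, "
          f"fully pipelined {fully_pipelined_rate(rates):.1f}")
```

A strictly sequential application with synchronous writes sits at the "one at a time" end; a highly threaded workload with deep queues moves towards the other end.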
How much latency does an LSI SAS Expander chip (are there any other SAS multiplexers out there?) add?
1. For my calculations this is irrelevant, as I assumed that the SAS expanders in the storage box, for example, are included in the 6.67 microseconds.
2. There are other vendors for SAS expanders like PMC Sierra.
I understand you bundled the SAS expander into the 6.67 microseconds, but ... say you have an SSD with an expected latency of .18 ms, and the expander has a latency of .18 ms as well. Then the expander latency halves the expected IOPS, so it's really important.
I don't know what the latency is...I was googling around and found this page, probably because of Olli's comment.... oh well, I'll keep searching.
I can tell you this, in my test setup, the expander is halving the expected IOPS...but I don't know for sure if this is expected for this equipment, or not.
I assumed that SAS expanders are present in the direct-attached as well as in the SAN-attached version, so both would be hit equally by them. The calculation was set up to show the impact of SAN compared to DAS.