App benchmarks, incorrect conclusions and the Sun Storage F5100

I’ve just read through several comments regarding the F5100. Is it just me, or is common sense on vacation at some locations at the moment? I will go through some of these comments and dissect them here.

Some voices with criticism

My favorite one is a commentator at StorageMojo calling himself “KD Mann”, because he tries to make the point, based on a lot of benchmarks, that the SSDs in the F5100 don’t seem to deliver the performance. When you look at it only cursorily, you may even say “he has a point”. But when you really look into the comments and then into the benchmarks, it looks different. There are two major problems with this comment: he forgets about an important part of the picture, and he can’t even cite correctly, as he swapped results. ;) I know that this blog entry gives both commentators more space than they deserve, but both are good examples of how you can fall into a lot of pitfalls while reading benchmarks.

Application benchmarks

I should explain something first: benchmarks are always a mix of tasks. There are benchmarks with 99% CPU and 1% benchmark-related I/O (SPECint), and there are benchmarks with 99% I/O and 1% CPU not related to I/O (HDbench, for example). You could put dozens of racks of F5100 into a SPECint benchmark configuration without yielding any more performance. A benchmark consisting almost entirely of I/O will eat all the I/O capacity you give it and deliver better results, as long as the CPUs are able to move the data around. And then there are a lot of benchmarks in between.

Let’s assume a benchmark consists of 50% I/O (reading and writing data) and 50% CPU (doing something with the data). Let’s now assume you have a hyper-super-duper storage device that reduces the I/O time to almost zero. What is the maximum speedup you can expect from such a device? A reduction of the total runtime by 50%, at most. A practical example: you have a task that is CPU-bound for one minute and I/O-bound for another minute, so you need two minutes to complete it. Now reduce the I/O to one second. You still have a minute and one second to go. So what’s the maximum speedup? I assume you reached the right conclusion. With multiple threads the improvement may be better than in this example: as multiple processes contend for the same resource, reducing the average latency keeps more processes from sitting idle waiting on data while the storage subsystem is fetching data for another thread. More importantly: application benchmarks don’t throw away the data, they do computational work with it. With this knowledge in mind you should view those benchmarks. The first benchmarks he cites are I/O-intensive, but they are even more computationally intensive. One of them is a finite element analysis, the other one is MCAE.
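To make the arithmetic explicit, here is a minimal sketch of that reasoning in Python, using the made-up one-minute numbers from the example above:

```python
# Maximum speedup when only the I/O part of a run can be accelerated
# (Amdahl's law applied to storage). All numbers are illustrative.

def total_runtime(cpu_seconds, io_seconds, io_speedup):
    """Runtime when the I/O portion is accelerated by a given factor."""
    return cpu_seconds + io_seconds / io_speedup

cpu, io = 60.0, 60.0                     # 1 minute CPU-bound, 1 minute I/O-bound
baseline = total_runtime(cpu, io, 1)     # 120 s with ordinary storage
with_ssd = total_runtime(cpu, io, 60)    # I/O shrinks to 1 s -> 61 s total

print(f"baseline: {baseline:.0f} s, accelerated: {with_ssd:.0f} s")
print(f"speedup factor: {baseline / with_ssd:.2f}x")
```

No matter how fast the storage gets, the speedup factor in this example never goes beyond 2x, because the CPU minute stays untouched.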

Abaqus MCAE

For example, take a look at the comments of KD Mann at StorageMojo:

Got it. An F5100 array of 20 SSDs can outperform an array of six HDDs by 5%.

But now let’s add some perspective to it:

The test cases for the ABAQUS standard module all have a substantial I/O component where 15% to 25% of the total run times are associated with I/O activity (primarily scratch files).

When just 15% to 25% of the runtime is related to I/O, I think it’s pretty nice to accelerate the whole runtime by 5% through the use of SSDs. You have to consider here that those SSDs are only able to reduce the I/O part; you can’t do anything about the CPU part. Let’s assume an I/O share of 20% on average. This benchmark then spent roughly 412 seconds in I/O, and the usage of 20 SSDs shaved 92 seconds off those 412 seconds.
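As a small back-of-the-envelope sketch, you can translate “the whole run got 5% faster” into how much of the I/O component must have been eliminated. Only the 5% figure and the roughly 20% I/O share come from the cited material; the rest is plain arithmetic:

```python
# Back-of-the-envelope: what does "5% faster overall" mean for the I/O part?
# Assumes the ~20% I/O share from the Abaqus documentation; everything else
# is simple arithmetic, not data from the benchmark disclosure.

io_share     = 0.20   # fraction of the baseline runtime spent in I/O
overall_gain = 0.05   # whole-run improvement reported for the SSD setup

# The CPU part is untouched, so the whole gain has to come out of the I/O part.
io_part_reduction = overall_gain / io_share
max_possible_gain = io_share          # even with zero-latency storage

print(f"share of the I/O time eliminated: {io_part_reduction:.0%}")      # ~25%
print(f"best possible overall gain (I/O -> 0): {max_possible_gain:.0%}")  # 20%
```

In other words, a 5% gain on the whole run means roughly a quarter of the I/O time is gone, and even perfect storage could never buy more than about 20% here.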

NASTRAN

Pretty much the same is valid for the NASTRAN benchmark. It was accelerated by the F5100 by a factor of 2.1 compared with a 4-disk RAID 0 configuration. You have to keep in mind that this is an HPC benchmark, not a storage benchmark. While it needs fast storage, most of the time this benchmark is bound by CPU, not by I/O; it’s mostly a computational benchmark. So I don’t understand why this person uses this benchmark to argue that there is something wrong.

Peoplesoft

Finally, I want to comment on the PeopleSoft benchmark. First of all, it is a nice example of why many benchmark configurations have a large number of disks: the benchmark of the Itanium gear he cites has 58 disks with approximately 8 TB of capacity, but only 512 GB of it was actually used. I wouldn’t be surprised if they used a technique called short-stroking: you use only the outer sectors of the disks, thus reducing the distance the head has to move. You can get really nice IOPS values even out of rotating rust when you shorten the way the head has to travel; the sketch below illustrates the effect. It would be nice to see the configuration of the HP EVA to confirm this point.

To the other results: my colleague may indeed have made an error by comparing a “large” to an “extra large” model, but I don’t have enough data to confirm or dismiss this. I will simply wait until the official documentation is available and come back to it. Nevertheless, there are some interesting data points in this benchmark: the Sun configuration was able to yield better performance with just 8 instead of 16 threads, and at 25% instead of 88% CPU load, compared to the HP solution. You could say an M3000 would be sufficient to do the job ;)
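Here is that sketch of the short-stroking argument. The seek and rotational values are assumed typical figures for 15k-rpm enterprise drives, not numbers from the HP disclosure, so treat the output as an order-of-magnitude hint only:

```python
# How little of each spindle is actually used in the cited configuration,
# and why that helps random IOPS. Drive characteristics are assumed typical
# values for 15k-rpm disks, not taken from the benchmark disclosure.

disks            = 58
raw_capacity_gb  = 8000.0   # ~8 TB across the array
used_capacity_gb = 512.0    # what the benchmark actually allocated

used_fraction = used_capacity_gb / raw_capacity_gb
print(f"fraction of the platters in use: {used_fraction:.1%}")   # ~6%

# Rough service-time model for a single random I/O:
full_stroke_seek_ms   = 3.5   # assumed average seek across the whole platter
short_stroke_seek_ms  = 0.7   # assumed seek within a narrow outer band
rotational_latency_ms = 2.0   # half a revolution at 15k rpm

def iops(seek_ms, rotation_ms):
    """Random IOPS if every request pays one seek plus rotational latency."""
    return 1000.0 / (seek_ms + rotation_ms)

print(f"full-stroke IOPS per disk:  {iops(full_stroke_seek_ms, rotational_latency_ms):.0f}")
print(f"short-stroke IOPS per disk: {iops(short_stroke_seek_ms, rotational_latency_ms):.0f}")
print(f"short-stroked array total:  {disks * iops(short_stroke_seek_ms, rotational_latency_ms):.0f} IOPS")
```

Confining the heads to a narrow outer band roughly doubles the random IOPS per spindle in this model, and with 58 spindles that adds up quickly.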

Capacity is still a factor

The commentator on the BestPerf blog asked why they used 40 FMods while claiming that the F5100 wasn’t even breaking a sweat. That one is simple. A single FMod has 24 GB. Let’s just take the numbers from the HP disclosure: they stated that they used 512 GB of RAID 1 storage, which leads to 1 TB of raw capacity. That puts you in the region of 40 FMods; we actually need a little bit less, as we put some of the data onto rotating rust disks. It’s really that simple. You just have to read the available docs.
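The capacity math as a minimal sketch, using only the numbers mentioned above (24 GB per FMod, 512 GB mirrored):

```python
# Why the configuration ends up at roughly 40 FMods. Only the 24 GB FMod size
# and the 512 GB RAID 1 requirement come from the cited documents.

import math

fmod_size_gb     = 24    # capacity of a single F5100 flash module
usable_needed_gb = 512   # what the HP disclosure states was configured
mirror_factor    = 2     # RAID 1 doubles the raw capacity needed

raw_needed_gb = usable_needed_gb * mirror_factor          # 1024 GB raw
fmods_for_all = math.ceil(raw_needed_gb / fmod_size_gb)   # 43 modules

print(f"raw capacity needed for the full 512 GB mirrored: {raw_needed_gb} GB")
print(f"FMods needed for all of it: {fmods_for_all}")
print(f"raw capacity of 40 FMods: {40 * fmod_size_gb} GB")  # 960 GB
```

Forty FMods give you 960 GB of raw capacity, slightly short of the full terabyte, which fits the statement that some of the data lives on rotating disks instead.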

Rotating rust for redo logs

The second question of that reader was about the usage of magnetic rust for the redo logs. Well, that’s easy too. The F5100 is optimized for writing 4K blocks, and the log writer doesn’t write in such blocks. And then there is another tooth I have to pull from this reader: writing the redo log isn’t a tough task when it’s the only job you do on the disks. Many people forget in this SSD hype that rotating rust has nice sequential read/write performance, especially since the introduction of perpendicular recording, as more and more bits move under the head in a given span of time. It’s the random access that kills those devices. It’s similar to the relation between hard disks and tape: when a tape drive starts to stream data, you might have problems feeding it with data from your hard disks; it’s the winding that kills the performance. What does the log writer do? Well, it just writes log records of arbitrary size. You never read them, okay … almost never: you read them when your database has gone down in flames. So it can be quite reasonable to use magnetic rust for a dedicated log writer filesystem and SSDs for your datafiles, as the access patterns to those files are much more random, and head movement and rotational latency become a larger factor. If you don’t believe me, just dtrace your log writer.

BTW: There is a great article by my colleague Volker Wetter regarding I/O on Oracle in the Sun Wiki; you should look at “Getting insights with DTrace - Part 1”. Given that an SSD may be optimized for certain block sizes, it can be a reasonable choice to use rotating rust, especially when the flash device is optimized for other tasks. Regarding the nature of performance impacts on LGWR speed, you should read Kevin Closson’s “Manly Men Only Use Solid State Disk For Redo Logging. LGWR I/O is Simple, But Not LGWR Processing”, which gives some interesting insights into this topic. There are several good use cases for SSDs, like the flash-extended SGAs in the newest versions of Oracle, or the indices. But redo logs aren’t the best choice in 99% of the cases.
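To put some numbers behind the sequential-versus-random point, here is a toy model of a single rotating disk. The drive characteristics are assumed values, not measurements of any particular device:

```python
# Toy model: why a dedicated rotating disk keeps up with a log writer that
# appends sequentially, while random 4K I/O is what actually hurts.
# All drive characteristics are assumed, not measured.

seq_throughput_mb_s   = 120.0   # assumed sustained sequential write rate
avg_seek_ms           = 4.0     # assumed average seek for a random access
rotational_latency_ms = 3.0     # half a revolution at 10k rpm
block_kb              = 4.0

# Sequential: the head barely moves, the platter just streams bits under it.
seq_blocks_per_second = seq_throughput_mb_s * 1024 / block_kb

# Random: every 4K block pays a seek plus rotational latency.
random_iops = 1000.0 / (avg_seek_ms + rotational_latency_ms)
random_mb_s = random_iops * block_kb / 1024

print(f"sequential 4K blocks/s: {seq_blocks_per_second:.0f}")   # ~30000
print(f"random 4K IOPS:         {random_iops:.0f}")             # ~143
print(f"random throughput:      {random_mb_s:.2f} MB/s")        # well under 1 MB/s
```

With a dedicated spindle nothing competes for the heads, so the log writer sees the sequential side of that comparison, not the random one.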

Conclusion

In the end, even small gains in overall performance can be the expression of large improvements in one area. In any case, you should really look deeply into a benchmark before making statements about it.