On Benchmarks

Selling server systems is sometimes really hard. There are two reasons for this: wrong benchmarks on the customer side, or misunderstood architecture. You sit in the meeting room and the customer says: "You told me that your system is fscking fast for webserving, but I made my benchmark and it was dog slow." You think about it, ask about the benchmark, and a few seconds later you are in a discussion about the problem: a wrong benchmark. There are two major classes of wrong benchmarks: the ones where the benchmark doesn't measure what the customer needs, and the ones where the benchmark doesn't measure what the customer thinks it measures. And I (like every sales engineer) was confronted with both on numerous occasions when talking about servers with customers.

One recent example of measuring the wrong thing: a customer wanted to test the I/O capabilities of an UltraSPARC T2 system. He made thousands of calls to a program that concatenates two short files into a third one. The files were only a few bytes long. What's wrong with this benchmark? At first, the commands were executed sequentially. Feeeeep ... that's an error on the T2: the I/O system of the T2 is designed for throughput in a thread-rich environment, you can't load-test it with a single thread (a small sketch of the difference follows below). And with data this small, all this benchmark really measured was CPU speed. This customer tested how fast he could start a few thousand instances of cat, he tested the caches and perhaps the memory subsystem, but not the I/O subsystem of the machine. I can't remember a customer who does 10-byte file concatenation for a living ;)

Another example is older: a customer was really fond of the T1, long before we introduced the UltraSPARC T2. He made some benchmarks with great results and wanted to buy some systems. The problem: the venerable benchmark didn't use floating point. But a third-party library used by the real application did. This made no difference on other systems, as fast FPUs had long been standard on servers by then, but the T1 shares a single FPU among all of its cores. It wasn't really much floating point, but it was enough to render the benchmark results misleading. Luckily the customer called us before buying the systems, and the Coolthreads Selection Toolkit discovered the FP issue. Two T2000 fewer sold, but more importantly: the customer wasn't left dissatisfied after spending money. What's wrong with this benchmark? The benchmarking application didn't represent the real workload in one really important respect. While giving you a good perspective on the performance of the system, the results were meaningless for the application in question.

You see, it's really important in benchmarking to know some things: what do I need, and how do I measure it? And whenever you use a synthetic benchmark to assess the performance of a new system, you have to check on a regular schedule whether the synthetic benchmark still matches your computing needs.

The other point about incorrect benchmarking is the dataset used. Many people use a reduced set of data for their benchmarks. The reasoning is simple: you have to put much more effort and hardware into a benchmark with real data sizes. But working with something other than the real data can lead to misleading results. There is a rule of thumb that a single thread of an UltraSPARC T1 or T2 is slower than a single thread of an UltraSPARC III or SPARC64. Often this is correct. 1.4 GHz vs. 2.4 GHz speak for themselves. Or that an UltraSPARC T2 thread is slower than a Power6 thread at 4.7 GHz. In many cases this is correct. But not always.
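To come back to the concatenation example from above, here is a minimal sketch of how such a test looks when it is written for a thread-rich machine instead of a single thread. This is my own toy code, not the customer's actual program: the input files a.txt and b.txt, the worker count of 64 and the iteration count are just assumptions. Mind you, with files of a few bytes even the parallel variant mostly exercises the filesystem cache, but at least it puts concurrent load on the I/O path:

```c
/* concat.c -- a toy sketch, not the customer's real test.
   Concatenate two tiny files into a third one, either in a single
   sequential loop or spread over many concurrent processes.
   Build:   cc -o concat concat.c
   Prepare: echo hello > a.txt ; echo world > b.txt */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static void concat(const char *a, const char *b, const char *out)
{
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb"), *fo = fopen(out, "wb");
    int c;
    if (fa == NULL || fb == NULL || fo == NULL) { perror("fopen"); exit(1); }
    while ((c = fgetc(fa)) != EOF) fputc(c, fo);
    while ((c = fgetc(fb)) != EOF) fputc(c, fo);
    fclose(fa); fclose(fb); fclose(fo);
}

int main(int argc, char **argv)
{
    const int iters = 10000;
    const int nproc = 64;   /* assumption: one worker per T2 hardware thread */
    int i;
    char name[64];

    if (argc > 1 && argv[1][0] == 'p') {         /* parallel variant */
        for (i = 0; i < nproc; i++) {
            if (fork() == 0) {
                int j;
                snprintf(name, sizeof name, "out.%d", i);
                for (j = 0; j < iters / nproc; j++)
                    concat("a.txt", "b.txt", name);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)                    /* wait for all workers */
            ;
    } else {                                      /* sequential variant */
        for (i = 0; i < iters; i++)
            concat("a.txt", "b.txt", "out.seq");
    }
    return 0;
}
```

Compare time ./concat with time ./concat p. On a T2 the sequential run leaves 63 of the 64 hardware threads idle, so whatever it measures, it isn't the throughput the machine can deliver. And the customer's original version started a new cat process for every concatenation on top of that, which mostly benchmarks process creation.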
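And to make the dataset point concrete, a second toy sketch, again my own assumption-laden example rather than any real workload: the loop below sums array elements through a pseudo-random index, so the next access is never predictable. Run it once with a working set that fits into the cache and once with one that doesn't:

```c
/* wset.c -- a toy sketch of the working-set effect.
   Build: cc -o wset wset.c
   Run:   ./wset 1048576      (1 MB, fits in cache)
          ./wset 1073741824   (1 GB, does not)     */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    size_t size = (argc > 1) ? (size_t)strtoull(argv[1], NULL, 10) : (1u << 20);
    size_t n = size / sizeof(uint64_t);
    uint64_t *data = malloc(n * sizeof(uint64_t));
    uint64_t idx = 12345, sum = 0;
    size_t i;

    if (data == NULL) { perror("malloc"); return 1; }
    for (i = 0; i < n; i++)
        data[i] = i;

    for (i = 0; i < 100000000u; i++) {
        /* a simple LCG (Knuth's MMIX constants) makes the next index
           unpredictable, so hardware prefetching can't hide the misses */
        idx = idx * 6364136223846793005ULL + 1442695040888963407ULL;
        sum += data[idx % n];
    }
    printf("%llu\n", (unsigned long long)sum); /* keep the loop alive */
    return 0;
}
```

The code is identical in both runs, but with the 1 GB working set the core spends most of its time waiting for memory, and the clock frequency loses much of its meaning. That is exactly where the strange results below come from.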
With certain workloads you get the strange result that a 4.7 GHz thread is slower than one at 1.4 GHz. What's the reason for such counterintuitive results? The answer lies in the access to data. Computers were designed to process data, not only to run small benchmarks ;) Whenever the dataset gets bigger than the cache, how the system reacts depends on the access pattern. Whenever the data isn't in the cache, the thread stalls. You can clock your processor as fast as you want: when it waits for data, that simply doesn't matter.

The UltraSPARC Tx has the same problem, it has to wait for memory too, but these processors know a special trick: whenever a thread stalls, they switch to another thread that isn't waiting for memory. The T1 and T2 keep a full register set for every hardware thread, so this thread switching carries no penalty.

So when working with an active dataset that fits in the cache of a processor, you get the expected behaviour: 4.7 GHz is faster than 1.4 GHz. But now switch over to an active dataset as large as 1 GB with a highly randomized access pattern, reading large amounts of data while processing. Now the story may look very different. The results get even more misleading when the active dataset fits in the cache of one architecture, but not in the cache of the other.

Processors behave differently. There are many variables: the core itself, the caches, the way caches, memory and cores are connected to each other, the pipeline depth, the number of functional units in each pipeline stage. Even when they share the same ISA (x64 from AMD and Intel), the differences may lead to counterintuitive outcomes. A Core 2 Duo depends to a large part on its caches for its performance. But what happens with a workload that generates a vast number of cache misses?

Well, the fact that the results of a synthetic benchmark and real-world application performance are at best loosely correlated leads to interesting outcomes in customer benchmarking beauty contests, and to surprised customers. But that is stuff for hours of stories from the field over some beer ;)

Okay, long story, short summary: know what you need to measure, measure what you really need, benchmark with realistic data sizes, and check regularly that your benchmark still matches your real workload.