BlueToTheBone, Sun, SGI and the SPECjbb2005 benchmark

Here is a nice example for my point that BlueToTheBone tries to downplay everything coming from Sun, even when doing so makes a complete fool of him.

Blinded by huge numbers

This time he used a newly published benchmark from SGI in “SGI’s Itanium super smokes Java test”. They used SPECjbb2005 to show the capabilities of their Altix 4700. 9.6 million BOP/s is really an impressive number. BlueToTheBone tries to use this result to tell you that the system smoked the benchmark, and thus to downplay the recently announced SPECjbb2005 result of the 1.6 GHz T2+ T5440. But did the SGI system really smoke this benchmark, or was BlueToTheBone just blinded by the huge numbers?

Some basic math

What he forgets to do is to correlate the number of cores with the performance. Let’s dig into the numbers. The Altix 4700 yielded 9611262 BOP/s, the T5440 yielded 841380 BOP/s. Okay, the Altix 4700 was a 128-blade configuration with 256 processors and 512 cores. 9611262/512 is roughly 18772 BOP/s per core. The T5440 has 4 processors with 8 cores each, thus 32 cores. This results in a performance of 26293 BOP/s per core, roughly 40 percent more per core than the Altix.
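The per-core comparison can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the throughput and core counts are the ones quoted above, and the function name `bops_per_core` is made up for this illustration.

```python
# Published results as quoted in the text above.
RESULTS = {
    "Altix 4700": {"bops": 9_611_262, "cores": 512},
    "T5440": {"bops": 841_380, "cores": 32},
}


def bops_per_core(bops: int, cores: int) -> float:
    """Return the benchmark throughput divided by the number of cores."""
    return bops / cores


for name, result in RESULTS.items():
    per_core = bops_per_core(result["bops"], result["cores"])
    print(f"{name}: {per_core:.0f} BOP/s per core")

# Ratio of the T5440 per-core figure to the Altix 4700 per-core figure.
ratio = bops_per_core(841_380, 32) / bops_per_core(9_611_262, 512)
print(f"T5440 vs. Altix 4700 per core: {ratio:.2f}x")
```

Per-core throughput is a crude metric, of course: it ignores clock speed, threads per core and price. It is still enough to show that the big absolute number mostly reflects the big core count.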

Embarrassingly parallel

I’ve read in the comments section: “But it’s a nice example that the SGI System can scale to such large configurations”. First: Having a large number of processors within a single OS image and scaling over that number of processors are two separate questions. SPECjbb2005 isn’t really a good test to show the scalability of large shared memory systems. You have to take into consideration that SPECjbb2005 (configured with multiple JVMs) belongs to the “embarrassingly parallel” class of problems. You can scale it by running multiple copies. In the end they configured 128 JVMs. That’s strange, exactly the number of blades. Thus, given a reasonably well working memory affinity logic (the reason why you need the ProPack in addition to the Linux on the system), the application didn’t have to use the NUMAlink between the blades. The JVMs work independently of each other, so there is no communication between the nodes either. So the SPECjbb2005 benchmark on an Altix 4700 is not much more than throwing a cluster at a problem it’s best at. Okay, you have one advantage … only one OS image. But don’t look at the price tag for this advantage ;)
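To make the "one JVM per blade" observation concrete, here is a small Python sketch that breaks the published total down per JVM. It assumes the JVMs and cores were spread evenly across the blades, which the matching counts suggest but which is not confirmed here; the helper `per_unit` is a hypothetical name.

```python
# Figures quoted in the text: 128 JVMs on 128 blades with 512 cores in total.
TOTAL_BOPS = 9_611_262
JVMS = 128
BLADES = 128
CORES = 512


def per_unit(total: float, units: int) -> float:
    """Divide a total evenly across a number of units."""
    return total / units


# Assumption: an even distribution of JVMs and cores over the blades.
jvms_per_blade = per_unit(JVMS, BLADES)
cores_per_jvm = per_unit(CORES, JVMS)
bops_per_jvm = per_unit(TOTAL_BOPS, JVMS)

print(f"JVMs per blade: {jvms_per_blade:.0f}")
print(f"Cores per JVM:  {cores_per_jvm:.0f}")
print(f"BOP/s per JVM:  {bops_per_jvm:.0f}")
```

Under that assumption each JVM is a small four-core job that can live entirely on its own blade, and the headline number is just 128 of those added up.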

Your application needs a cluster? Use a cluster!

So: You could do pretty much the same by using a cluster of systems. To yield 9.6 million BOP/s you would need 12 T5440s. 44 RU, a little bit more than a single rack. On the other side: You can plug 40 blades into an Altix 4700 rack, thus such a system would take over 3 racks, over 120 rack units. Or let’s calculate this with the T6320 with the 1.6 GHz T2, which yields 229576 BOP/s. You would need 42 of these blades. That’s a single 6048 rack. So the really interesting number would be: How many BOP/s would an Altix 4700 yield if they used a single JVM instance? The number for the 1.6 GHz UltraSPARC T2+ T5440 is 688692 BOP/s. Now it’s your turn, SGI.
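The system counts are a simple ceiling division, sketched here in Python. The throughput figures and the 40 blades per Altix rack are taken from the text; the function names are hypothetical, and the sketch assumes throughput adds up linearly across independent systems, which is the whole point of an embarrassingly parallel workload.

```python
import math

TARGET_BOPS = 9_611_262  # Altix 4700 result quoted above


def systems_needed(target_bops: int, bops_per_system: int) -> int:
    """Smallest number of independent systems whose summed throughput
    reaches the target (assumes perfectly linear scaling)."""
    return math.ceil(target_bops / bops_per_system)


def racks_needed(blades: int, blades_per_rack: int) -> int:
    """Number of racks needed to house the given number of blades."""
    return math.ceil(blades / blades_per_rack)


t5440_count = systems_needed(TARGET_BOPS, 841_380)
t6320_count = systems_needed(TARGET_BOPS, 229_576)
altix_racks = racks_needed(128, 40)  # three full racks plus a partly filled one

print(f"T5440 systems needed: {t5440_count}")
print(f"T6320 blades needed:  {t6320_count}")
print(f"Altix 4700 racks:     {altix_racks}")
```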

That other system

By the way: That’s the same reason why I wasn’t really impressed by the 3Leaf Systems result of 5.6 million BOP/s on 128 cores on 32 chips. This 3Leaf Systems product is a combination of a hypervisor and an ASIC that connects several smaller nodes into a bigger NUMA system. BlueToTheBone used it a few days ago to tell everybody that this Sun result isn’t that good at all. 5.6 million BOP/s are really nice, but they used 32 JVMs. It’s just a rough guess, but that benchmark didn’t use the interconnect, or at least not often enough for it to become a problem. So the benchmark looks nice, but it really didn’t test the point behind this system: To connect a number of small nodes via Infiniband to create a larger system out of them. In the end the interconnect is just DDR Infiniband: 20 Gbit/s with latencies in the range of microseconds. The Opteron memory controller can suck 20 GByte/s from the memory with a latency in the low ns range. This baby screams at an embarrassingly parallel problem, of course. But I would like to see real world applications on it. Or at least a benchmark that demonstrates scalability. Don’t get me wrong, I think it’s a nice concept. It should deliver good results for HPC tasks with the need for a large shared memory. But please show me a use case that really uses this technology. An application that has to fight with the impact of using an interconnect that has 10 times the latency and roughly a tenth of the bandwidth of the native memory controller. That’s a classic example where Amdahl’s law can hit you right in the face. To repeat my question: Show me a single-JVM result. I want to see the impact of Amdahl’s law.
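For readers who don't have Amdahl’s law in their head, here is a minimal Python sketch of it. The parallel fractions below are purely hypothetical values chosen for illustration, not measurements of the 3Leaf system or of any JVM, and the plain formula doesn't model interconnect latency at all. It only shows how quickly a small non-parallel share eats into the speedup on 32 workers.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    where p is the parallelizable fraction and n the number of workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical parallel fractions; 32 workers to mirror the 32 chips.
for p in (1.0, 0.99, 0.95, 0.90):
    print(f"p = {p:.2f}: speedup on 32 workers = {amdahl_speedup(p, 32):.1f}x")
```

A multi-JVM SPECjbb2005 run behaves like the p = 1.00 case. A single JVM that has to synchronize across a slower interconnect would sit somewhere below that, and that is exactly the number the published results don't show.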

Conclusion

At the moment I have trouble seeing how either benchmark highlights advantages that would make these systems a better choice for Java. Perhaps BlueToTheBone should stop using those numbers as examples to convince the world of his opinion regarding Sun.