The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Tuesday, July 8. 2008
Every once in a while, a blogger, a competitor's sales rep or a misinformed journalist writes: "The UltraSPARC T1/T2 cores are just UltraSPARC II cores running at 300 MHz". Well, this is a rumour consisting of two misunderstood points of the architecture. Well-informed people already know the stuff in this article, but this time they are not the target group of readers. I think it's time to write a little bit about all these rumours, as I give this answer again and again and again.
Okay ... "the UltraSPARC T1/T2 cores are just UltraSPARC II cores ...": This part of the rumour is as old as the UltraSPARC T1 series. But it's really simple to see that it isn't true. Okay, a SPARC CPU is a SPARC CPU is a SPARC CPU ... you will find similar functional units on all SPARC CPUs.
Well, to be honest ... there are some features of the UltraSPARC II that aren't available in the UltraSPARC T1 processor: For example, the UltraSPARC II and up are superscalar designs, whereas the UltraSPARC T1/T2 are scalar designs. The UltraSPARC T1 has a short 6-stage pipeline, the UltraSPARC T2 is a little bit longer with 8 stages, the UltraSPARC II has 9 stages, and the UltraSPARC III has a 14-stage-deep pipeline. The T1 is a single-issue CPU (it issues one instruction at a time into the pipeline), whereas the II/III are multiple-issue CPUs. The cores are quite different. So wherever you read "The N1 core is just an UltraSPARC II core", this is simply incorrect.
Okay, you might ask, why didn't we simply use the UltraSPARC III cores and glue them together: UltraSPARC T1 and II/III/IV/IV+ were designed with different mindsets, leading to different designs. The core in the UltraSPARC T1 CPU was developed from the ground up to reach two targets: to deliver a SPARCv9-compatible core in the least possible amount of space with the least possible power consumption (okay, within a given timeframe and budget). Processor design isn't a one-way road ... it's more like a tradeoff game.
The first installment of the concept pushed this even so far that the FPU instructions of all cores were outsourced to a single shared FPU. FPUs aren't simple, they take a lot of space and they use a lot of power. We want a fscking fast database and web server. We don't need fscking FPU instructions. The tradeoff game starts. FPU ... there is the door ... please close it from the other side. We've learned that some applications used the FPU where nobody really expected it, and now the N2 has eight of them.
You want to reach a certain target, look at your given budgets (transistor budget, die size budget, time budget, people budget, money budget) and search for a solution fitting your requirements without losing too much on another front. IBM wanted the 4.7 GHz CPU and sacrificed out-of-order execution on the way there. Intel wanted the GHz crown as well and designed the ultra-long-pipeline Pentium (anybody remember Prescott?). VIA wanted the lowest-power x86 CPU and sacrificed performance for low power consumption. Sun wanted a many-core CPU and sacrificed core complexity in the first steps. You can't have it all, especially with all these walls around us: the thermal wall, the GHz wall, the budget walls, the structure size wall.
BTW: In the end the laws of nature will stop us all, and I think we will see a development like the race to reach a temperature of zero Kelvin: every step to halve the temperature needs the same energy (I hope my recollection of school physics hasn't left me there). Every step to get nearer to the walls imposed by the laws of nature on chip manufacturing will take the same amount of money. I think it's a safe bet that AMD or Intel won't build an electron accelerator the size of the Large Hadron Collider just to produce a particle stream for Higgs-boson lithography to build a Core10Quad. I think it's more probable that we will see a Core4Hekaton based on a stacked structure of multiple dies, but with a structure size in the range of reasonable economic investments.
We are getting to the point where nobody can afford to build a fab for smaller structure sizes for only one or two generations of processors. The timeline of Intel procs is really interesting in this regard: each process technology was in use for two generations, 90nm, 65nm and 45nm (at least when you look at the official roadmap). This is a ruinous game in the long run. And this is the reason why the entire industry puts its research dollars, euros or yen into manycores.
I strongly believe that all proc vendors will opt for manycore designs. The question is only how to handle legacy single-threaded code. You may end up with a few (4 or 8) heavyweight high-speed cores like Intel or AMD, or 16 heavyweight high-speed cores with scout threads like Sun's Rock. So everybody has the same problem. I'm aware of the developments at Sun to solve this problem and they look really promising, and I'm aware of other initiatives at other vendors to solve the same issue for their respective technologies.
Okay, I've lost the topic ... back to the deliberate tradeoffs:
For example: the UltraSPARC T1 has a 6-stage pipeline. For modern procs this is a really short pipeline. But it was a deliberate design decision. We could easily have used a longer pipeline to reach higher clock rates, but here we get to the tradeoff game again: you may reach higher clock frequencies, but the longer the pipeline is, the higher the penalty for a thread switch gets. You have to empty the pipeline and reload it, and every step in the pipeline takes a cycle.
Let's assume that you have a 15-stage pipeline. Let's assume that the current thread stalls in stage 10 as it waits for data from memory. You can react to this problem in two ways: wait for the thread to continue, or switch to another thread waiting for execution. When you switch to another thread, all instructions in your pipeline are no longer valid. You have to unload it and start working through the pipeline again. The penalty for the thread switch is 9 cycles, as you need nine cycles to bring the first instruction of the new thread to the stage before the one that had stalled. Okay, and here begins the gamble: do you expect that the thread will resume within 9 cycles, or do you think it will take longer? 9 cycles are in the pot. Hmmm, hard decision.
Now let's calculate this with a 6-stage pipeline. Let's assume the thread stalls in stage 3. Your switch penalty is just 2 cycles. Thus it's a safe bet that switching pays off, as you won't get your data back within 2 cycles anyway.
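The gamble above can be written down as a tiny back-of-the-envelope model. This is just an illustration of the reasoning, not the actual chip logic; the function names are made up:

```python
def switch_penalty(stall_stage: int) -> int:
    """Cycles needed to bring the first instruction of the new thread
    to the stage just before the one that stalled."""
    return stall_stage - 1

def switching_pays_off(stall_stage: int, expected_stall_cycles: int) -> bool:
    """The gamble: switch only if the stall is expected to outlast
    the refill penalty."""
    return expected_stall_cycles > switch_penalty(stall_stage)

# 15-stage pipeline, stall in stage 10: 9 cycles in the pot.
print(switch_penalty(10))           # 9
# 6-stage pipeline, stall in stage 3: only 2 cycles at risk.
print(switch_penalty(3))            # 2
# A memory access of, say, 100 cycles easily outlasts both penalties.
print(switching_pays_off(3, 100))   # True
```

With the short pipeline the bet is almost always a win, which is the whole point of the design.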
But there is a further difference between normal procs and the Niagara: it has its own register set for every thread. This has an important reason, too. When you want to switch from one thread to another, you normally have to save the register set, load a new one, and when you switch back you have to reload the old one. This isn't efficient. Such a task can easily take several hundred clock cycles.
Given the fact that Niagara has four independent register sets per core to switch between threads at no cost, you can use another trick to fill that bubble: issue instructions from every thread in a round-robin fashion. Even when a pipeline stage stalls, the next instruction in the pipeline is perfectly executable, as it's from another, non-stalled thread. As long as a thread is stalled, no new instructions of it will be issued to the decode stage. This is what's meant by thread switching without any cost.
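A toy scheduler can illustrate this round-robin issue among hardware threads. It's a simplification I made up for illustration (the real issue logic of the T1 is more involved), but it shows why a stalled thread doesn't waste cycles as long as another thread is runnable:

```python
from collections import deque

def issue_schedule(stalled_until: dict, cycles: int, threads: int = 4) -> list:
    """Round-robin over the hardware threads of one core, skipping any
    thread that is still stalled. `stalled_until[t]` is the first cycle
    in which thread t may issue again. Returns which thread issued in
    each cycle, or None for a wasted cycle."""
    ring = deque(range(threads))
    schedule = []
    for cycle in range(cycles):
        issued = None
        for _ in range(len(ring)):
            t = ring[0]
            ring.rotate(-1)  # move candidate to the back of the ring
            if cycle >= stalled_until.get(t, 0):
                issued = t
                break
        schedule.append(issued)
    return schedule

# Thread 0 stalls for the first 3 cycles; the pipeline stays busy anyway.
print(issue_schedule({0: 3}, 6))  # e.g. [1, 2, 3, 0, 1, 2] -- no None
```

Only when all four threads stall at once does the core actually idle.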
This design choice has a different, but really interesting implication as well. Let's assume we have a four-core, four-proc x86 system and a 64-thread Niagara II system. Let's further assume you run an application with 64 threads. This leads to an interesting effect: context switches are incredibly expensive operations. A context switch takes place when you want to run a different process on your CPU. You remember the cycle of storing thread A's registers, loading thread B's register set, storing thread B's registers, loading A's register set. When you have 64 processes but just 16 cores and thus 16 register sets, you have to go through this cycle four times to execute all processes. When you have 64 processes and 64 hardware threads and thus 64 register sets ... how many context switches do you have to do? Correct ... none. There are several benchmarks that suggest that an UltraSPARC T2 will keep the same performance over a larger thread count, whereas other architectures may start faster, but will break down once the thread count gets larger than the core count.
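The register-set arithmetic above can be restated as a sketch (a hypothetical helper that just formalizes the example, nothing more):

```python
def reload_rounds(processes: int, hw_register_sets: int) -> int:
    """How many rounds of register-set save/reload are needed to give
    every runnable process a turn; 0 when every process can keep its
    own hardware register set and no switching is needed at all."""
    rounds = -(-processes // hw_register_sets)  # ceiling division
    return rounds if rounds > 1 else 0

print(reload_rounds(64, 16))  # 4: the 16-core x86 box cycles four times
print(reload_rounds(64, 64))  # 0: on a 64-thread Niagara II, no switches
```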
And this led to other design decisions: do you really need out-of-order execution logic (complex, thus power hungry) when you can simply switch to another thread in case of a stall? The same goes for branch predictors. Is it necessary to have big caches when you augment the thread switching with four memory controllers and a fast crossbar? You can't take your x86 knowledge and apply it to SPARC, and you can't apply your SPARC knowledge to x86 either.
And there we get to the second part of the rumour: "... running at 300 MHz". I know several presentations of our beloved competitors stating that each hardware thread has only 300 MHz worth of computing power. I knew that this bullet would be fired by our competitors the moment I saw the first presentation about this chip. They used the following logic: the proc (one version of it) runs at 1.2 GHz, a core runs 4 threads in parallel, and 1200 MHz divided by four is 300 MHz. I assume that is the source of the UltraSPARC II rumour; the USII was clocked in that range. Well ... this logic is correct ... to a very limited point. When all threads work at full speed and all data fits into the L1 caches, this is correct. For example, when executing the incrementation of a register ad infinitum. The problem: this isn't a real workload. It isn't even near a real workload.
Real workloads are: load data from RAM or disk, work with it, save it to RAM or disk. And now a normal CPU does something quite usual for modern CPUs ... it waits. Memory is slower than the CPU. Period. And when you clock at 3 GHz, even a first-level cache miss with a second-level cache hit is a high-latency event. This is the case, to some extent, even at clock rates of 1 GHz.
As I told you before, the Tx series has some logic to skip stalled threads, and it issues instructions to the decoder round-robin. The next cycle is one with effective work, not spent doing nothing. It's all about handling the one truth: you have to wait for memory.
So: the calculation "clock frequency divided by four" is only theoretical, because the real world teaches us that cores doing real work will wait most of their time. You can measure almost every application and will find large amounts of cache misses. Did I write "You have to wait for memory" before? Well ... nevertheless I write it again: you have to wait for memory most of the time. And it's an absolute amount of time you have to wait. Given the same kind of memory, it's not relevant whether your core is clocked at 1.4 GHz or 3.6 GHz. The time until the data reaches you will stay the same.
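To make the "absolute wait time" point concrete: assuming a hypothetical 100 ns memory access (the exact number doesn't matter for the argument), the wall-clock wait stays fixed while the number of wasted core cycles grows with the clock:

```python
def wasted_cycles(clock_ghz: float, memory_latency_ns: float) -> int:
    """Core cycles burned while waiting for one memory access:
    clock_ghz cycles per nanosecond times the latency in nanoseconds."""
    return round(clock_ghz * memory_latency_ns)

# Same memory, same 100 ns wait, wildly different cycle counts:
print(wasted_cycles(1.4, 100))  # 140 cycles
print(wasted_cycles(3.6, 100))  # 360 cycles
```

The faster core doesn't get the data any sooner; it just counts more idle cycles while waiting.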
You can handle this problem in multiple ways: you can do thread multiplexing like Sun did. You can put giant caches on your die like Intel or IBM (a POWER6 CPU has more cache on die than my first 486 PC had as main memory). You can do out-of-order execution, branch prediction, clever preloading of the caches, scouting. In the end you do all this to reduce the latencies introduced by the speed difference between memory and CPU. And these latencies cost you most of your performance.
My point is: when you divide the clock frequency of the Niagara by four because of the thread multiplexing, you should reduce the clock frequency of x86 or other highly clocked processors by all the ticks wasted on latency. And how do you factor in the cryptographic units of the T1/T2? They work in parallel to the cores. The N2 cryptographic units write their results directly back to the cache and don't use the pipeline ... an interesting question, and many of the "300 MHz" people don't even think about it.
I have to admit: Niagara isn't a system you can use without thinking. In the end it's a 1.4 GHz proc with a simplified SPARC core, and the biggest advantage of the processor is the Achilles heel of the system as well. When you measure single-thread performance, you certainly measure the wrong thing. When you only load it with a single thread, it's the wrong processor. When the execution time of a single thread on a system loaded with a single thread is important to you, it's the wrong processor. I have a simple rule when I support a sales rep: when I don't know the application, I do not recommend the UltraSPARC Tx series. Period. The problem: most admins don't really know their applications that well ... but the CoolThreads Selection Toolkit solved this problem really nicely. Big kudos to the development team of this tool.
I assume that the people with a negative opinion of the UltraSPARC T1/T2 chips are exactly the people who tried to run an application with the wrong characteristics on it and were disappointed.
But: when you have several threads in parallel, when it's important to you that the performance stays at the same level even when 256 Apache threads are doing heavy lifting on data, then the UltraSPARC T series is the correct choice for you. When you want good performance at a high thread count, it's a good choice. No, it's the best choice.
You can compare it with an E10k. An UltraSPARC T2 is an E10k on a chip. I wrote in a wiki article some years ago: the T-class systems are huge SMP/near-SMP designs, and they want to be used as such. Don't let them confuse you by their size. Batoka will give you 256 threads in four rack units. Not long ago you could easily fill a mid-sized datacenter with the machines needed for the same amount of cores.
There we get to another problem of modern IT: the availability of notebooks with multi-GHz processors on developers' desks has led to a large heap of bad software. It's like with the F-4 Phantom. The F-4 Phantom was the proof of the concept that you can fly a brick if you put enough thrust behind it. And the General Electric J79 of the modern software developer is the Core2Duo in excess of 3 GHz. The software developer doesn't get punished for making his/her life easier by using the singleton design pattern instead of thinking about proper code scaling on multiple processors. On their notebook with a test load, this brick will fly. I spoke to admins full of bitter words about their colleagues after they had to install one app server per core to get some performance out of their machines. The "good" thing: all vendors go the way of many cores, so the need for good, scaling code isn't a problem of Sun ... it's an industry-wide problem ... universal punishment for bad code is near. But: in my opinion, the problem of bad code will haunt the industry for quite a while. All of us. It will cost all of us many headaches and many R&D dollars to find workarounds for this problem.
Okay, the workloads for the T-class servers are specific ones: a high thread count and many small requests, but with large datasets, and the T-class will run like hell ... This is the reason why we see such good results at customers for OLTP, for web serving, for mail, for enterprise backup and many other loads. This isn't a niche, and it's the reason why we sell that many of these boxes. Sometimes this even poses a problem for us: where we sold a big box in the past, we now sell just a T2 system. But better to steal your own lunch than to give it away unintentionally to someone else ...
I hope I gave you some insight into the differences between the USII and the UST1/T2 and what led to these design decisions. As I'm not a chip guy, I hope I've got everything right, but please correct me if I'm wrong. I hope that you got at least an understanding that the world of chip design isn't so easy as to allow simple and thus false comparisons.
Posted by Joerg Moellenkamp in English, Oracle at 21:02