Disclaimer: The individual owning this blog works for Oracle in Germany. The opinions expressed here are his own, are not necessarily reviewed in advance by anyone but the individual author, and neither Oracle nor any other party necessarily agrees with them.
Some early comments about Power7
Tuesday, February 9, 2010
Comments
Have you seen the die-shots of P7? Those CPUs are frigging huge! Respect if IBM gets decent yields out of a new design plus a new process.
I had to think about 'RFC1925 - The Twelve Networking Truths' number (3): "With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea. It is hard to be sure where they are going to land, and it could be dangerous sitting under them." WebSphere Application Server having a bit of a reputation for being a pig, it sure looks like POWER7 gives enough thrust to make WAS fly.
The p7 die is really fscking large: 567mm2 at 45nm. Just as a comparison: Power6: 341mm2 at 65nm, Intel Core i7: 231mm2 at 45nm.
I think they will do this with good yield management: I assume the highest bin will be really expensive, and it's possible that the lower-core-count versions are parts with defective cores ... business as usual.
Oh well, IBM at least delivered P7 as a continuation of P6; Sun, with all that fuss, "announced" Rock and then... canceled.
IBM >had< to deliver : http://www.cringely.com/2010/01/ibm-2010-customers-in-revolt/
(Cringely was right about the Resource Actions...) Now they can show off how smart they are and are therefore entitled to taxpayers' money to make the planet smarter (God help us).
I thought the use of eDRAM was very interesting in the Power7 - great way to save a bunch of space!!!
I am still very unsure how bad the penalties were in the design decision. I have read very little in response to the questions that I had asked concerning them. I am also looking forward to the next T processor release - need more SPARC throughput at a reasonable cost.
You have two possibilities to avoid drawbacks with SMT4:
1. Switch off SMT completely (smtctl -m off).
2. Move an LPAR to Power6+ compatibility mode. In this case you will have only 2 threads per core, but lose a new feature of Power7: Active Memory Expansion.
And prices, at least for the USA and Express Configurations, are available on IBM's own site.
1. As far as i understand IBM's implementation of SMT, it was introduced to improve the utilization of the execution units of a core. Thus you can't expect the same cumulative performance from 1 thread without SMT as you get from 4 threads with SMT4.
2. Thanks for the hint about the pricing ... i was somewhat focused on the 770 ...
Got pointed to this from theregister.co.uk, because there was some serious criticism of POWER7 here. And damn, you are desperate :)=
Let's just pick one of your arguments. "SAPS is a quite cache sensitive benchmark. So the large 32 MByte cache helped for sure. I want to see some additional benchmarks with a working set size larger than the caches." Ehhh.. when SAP changed the SAP 2-tier standard benchmark to use Unicode (+50% memory consumption) and limited the response time to 1 second, what happened?
What happened to POWER:
Before: power 570, 8 cores at 4.7GHz, 4010 users (6.0 2005)
After: power 550, 8 cores at 5.0GHz, 3752 users (6.0 Unicode EHP4)
Now that is 6% fewer users with a 6% increase in clock speed.
Let's compare to the SUHHOracle T5440.
Before: T5440, 32 cores at 1.4GHz, 7520 users (6.0 2005)
After: T5440, 32 cores at 1.6GHz, 4720 users (6.0 Unicode EHP4)
Now that is 37% fewer users with a 14% increase in clock speed.
Now which architecture is cache sensitive? Ever heard the expression: you shouldn't throw stones when you live in a glass house?
// Jesper
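The percentages in the comment above can be rechecked with a few lines of Python. This is only a sketch: the helper name `pct_change` is made up for illustration, and the inputs are the user counts and clock speeds exactly as quoted in the comment, not figures verified against the SAP benchmark listings.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Figures as quoted in the comment: (before, after) for SAP 2-tier users
# and for clock speed in GHz. The dictionary layout is illustrative only.
results = {
    "POWER (570 -> 550)": {"users": (4010, 3752), "ghz": (4.7, 5.0)},
    "T5440": {"users": (7520, 4720), "ghz": (1.4, 1.6)},
}

for name, r in results.items():
    users = pct_change(*r["users"])
    clock = pct_change(*r["ghz"])
    print(f"{name}: users {users:+.1f}%, clock {clock:+.1f}%")
```

Note that this only reproduces the arithmetic; it says nothing about whether the two POWER results are comparable core-for-core, which is exactly what the following replies argue about.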
Well ... sorry if i'm the first to tell you this, but you were fooled by a benchmarketing trick.
The p570 result was done on a Power6, the p550 on a Power6+. What looks like a minor speed bump is in fact an improved core as well. I can only speculate about the details, but i really thought everyone knew about that already, because this is a really old trick.
Yes, that tin foil hat really suits you.
You do know the rperf value of an 8-way 5.0GHz power 550 and an 8-way 4.7GHz power 570? That is 78.60 versus 74.89, which is a 5% difference. And the specint_rate2006 values? 263 versus 240, which is roughly a 10% difference. So just to make that SAP 2-tier benchmark better, something was mystically added to POWER6+ compared to POWER6? Nahh, POWER6+ compared to POWER6 was mostly more storage keys. Yes, watch out for the silent blue helicopters that are pouring stuff in your water.
And btw, you should really try studying up on multithreading. You don't really seem to understand SMT. Coarse-grained multithreading was used in the RS64 processor, back in 1997, which was used in the F/H and S series.
// Jesper
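One way to look at the POWER6 versus POWER6+ question is to divide the clock-speed ratio out of each score ratio and see what is left. The sketch below does that with the numbers quoted in the comment; the function names are hypothetical, and treating performance as linear in clock speed is a simplifying assumption, not something either commenter claims.

```python
# Figures as quoted in the comment: (power 550 at 5.0 GHz, power 570 at 4.7 GHz).
CLOCK_GHZ = (5.0, 4.7)
SCORES = {
    "rPerf": (78.60, 74.89),
    "SPECint_rate2006": (263.0, 240.0),
}

def gain(new: float, old: float) -> float:
    """Relative gain of `new` over `old` as a fraction (0.05 means 5%)."""
    return new / old - 1.0

def per_clock_gain(new: float, old: float,
                   clock_new: float, clock_old: float) -> float:
    """Gain that remains after dividing out the clock-speed ratio.

    Assumes performance scales linearly with clock, which is a rough
    approximation at best.
    """
    return (new / old) / (clock_new / clock_old) - 1.0

for name, (new, old) in SCORES.items():
    raw = gain(new, old)
    normalized = per_clock_gain(new, old, *CLOCK_GHZ)
    print(f"{name}: raw {raw:+.1%}, clock-normalized {normalized:+.1%}")
```

Under that linear-scaling assumption the two metrics point in slightly different directions, so these figures alone do not settle whether the core itself changed.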
Sorry, i'm just reading IBM documents for my information about SMT.
In 2004 an IBM document explained that SMT will give you a performance increase in the range of 40-50% vs. a non-SMT-configured CPU. In 2007 an IBM document about the Power6 microarchitecture talked about a 30% speedup due to changes in the SMT. Power7 drives this concept further .... it uses SMT4 instead of SMT2. Given this information provided by IBM, it's a safe assumption that the single-thread performance of a core is vastly lower than the cumulative performance of all 4 threads on a core. If you know better, feel free to explain where the error is ...
Joerg.
First: multithreading comes in several different forms.
Coarse-grained multithreading is where you swap in another thread when the main thread gets stuck on, e.g., an L2 cache miss; the second thread is then started up and runs. The first time I saw that was in the RS64 IV back in 2000. It was implemented by statically dividing the CPU into two logical CPUs. Here you get an increase in throughput but pay a penalty on single-threaded performance.
Fine-grained multithreading. The best implementation IMHO is Niagara, which uses a very simple and very effective type of multithreading, perfectly suited for what the box was designed for. It simply dispatches 2x4 threads in a round-robin algorithm on each core, 4 on each execution unit. This basically means that you have 8 weak threads on each core. This is great for throughput and highly efficient, but terrible for single-threaded performance. In practice you won't get only 1/8th of a whole core in throughput if you run just one thread, but you will take a serious hit. But as Niagara was designed for running A LOT of light threads, typically copies of the same binaries, like a webserver, this multithreading method is just great for Niagara.
SMT, Simultaneous Multi-Threading (on POWER), means that you execute several threads at the same time. You are not executing one at a time and then switching, but all of them at the same time. But you do have priorities, so the core will execute a high-priority thread to the max, and the rest of the threads will have to get the leftovers. Furthermore, the hypervisor will fold together unused logical CPUs, so if you have a machine with 8 physical cores, on top of that 16 virtual cores, and SMT4 turned on, you would have 64 logical cores. Now if you are running 8 threads on top of those 64 logical cores, the hypervisor will fold together the 56 others, and you won't be spending administrative overhead on those.
You can actually see it if you use a monitoring tool like nmon: all the folded processors just go to 0% CPU load. Nifty. So basically a POWER processor can give you both MAX single-threaded throughput if you only run one thread on it, or one high-priority thread, or multi-threaded throughput if you put enough threads on each core.
Sure, there is a single-threaded performance hit if you put a lot of threads on one core. I don't really know how big it is on power6 and power7. We did do some tests in a project once on POWER5, and found that it wasn't worth bothering about. The extra throughput with SMT turned on was simply too great. But there is a penalty due to the fact that memory and cache resources have to be shared.
// Jesper
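The logical-CPU bookkeeping in the comment above (16 virtual cores, SMT4, 8 runnable threads) can be written down as a tiny model. This is a deliberately simplified sketch: it assumes every runnable thread occupies exactly one logical CPU and everything else is folded, which is how the comment describes it, not how the real hypervisor or AIX scheduler decides. Both function names are hypothetical.

```python
def logical_cpus(virtual_cores: int, smt_threads: int) -> int:
    """Number of logical CPUs the OS sees: virtual cores times SMT ways."""
    return virtual_cores * smt_threads

def folded_cpus(virtual_cores: int, smt_threads: int,
                runnable_threads: int) -> int:
    """Logical CPUs left idle (folded) in this simplified model.

    Assumption: one runnable thread keeps exactly one logical CPU busy.
    """
    total = logical_cpus(virtual_cores, smt_threads)
    busy = min(runnable_threads, total)
    return total - busy

# The example from the comment: 16 virtual cores, SMT4, 8 running threads.
print(logical_cpus(16, 4))      # 64 logical CPUs
print(folded_cpus(16, 4, 8))    # 56 folded
```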
I'm aware, that a single thread can use the complete chip, but you won't get the performance out of the chip as when you have it used in smt2/4 mode.
My point was a different one: As far as i understand public papers about IBM SMT, the IPC of a Power5 processor in single thread mode is significantly lower than the IPC of a Power5 processor running in SMT with two threads running at Priority 0 (Reference: IBMs presentation Hotchips 2003 about SMT). The concept of SMT is based on the observation that a single thread can't utilize the capacity of all execution units in a CPU, thus you feed multiple independent streams of instructions to the CPU. This leads to a 40% performance advantage of SMT compared to a Power5 in ST (numbers not mine, they are from IBM documents you can easily find with Google).
The SMT of Power5 was somewhat limited, as a lot of units were available just once in the CPU. Power6 alleviated this by introducing a dedicated completion table per thread, separate decode pipes, and some other tweaks (Reference: Power6 microarchitecture whitepaper of 2007). Blue2theBone (TPM) wrote in an article about Power6: "The SMT features have been improved significantly, and apparently IBM is seeing as much as 55 percent performance improvement on two virtualized threads compared to a single thread with SMT disabled."
Now SMT4 feeds 4 instead of 2 streams of instructions to the core, distributing them to the execution units. Thus it can load the execution units even better. I think it's safe to assume that SMT4 will further increase the utilization by a significant margin. However, those advantages of higher utilization of the execution units are only available in SMT mode and not in ST mode. Even at SMT2 you would lose the 55% you got beforehand when switching to ST. I think you will lose these advantages too when you just load the core with 1 thread, regardless of whether you use ST or SMT, as you aren't exploiting the advantages of SMT.
And there we are at the point that an SMT4 core will perform much better when loaded like 4 CPUs than when loaded like 1 CPU, as the optimization potential from loading several threads is simply not available with one, as far as i understand IBM's description of the SMT implementation. I would like to see some numbers from IBM to assess the impact of SMT4 in conjunction with the additional execution units in the P7 core.. In my personal opinion Power6 cores aren't really single cores, they are two cores in a per-core-licensing friendly package, so the comparison Sparc/Itanium/x86 core vs. Power Core is severely skewed.
In the end: If you could get the same performance from ST as from SMT, it would be pretty useless to implement SMT, as i could leave everything to the OS, and the transistor budget for SMT could be better invested elsewhere. And even IBM itself doesn't say that ST performance equals cumulative SMT performance.
"I'm aware, that a single thread can use the complete chip, but you won't get the performance out of the chip as when you have it used in smt2/4 mode."
Yes, I 100% agree and have never claimed that running 4 threads on the same core won't affect the single-threaded throughput. Sure it will. BUT as I said in my last post, if you only run one thread you will get the full resources of the core. And you have built-in features in the hardware and the hypervisor that try to optimize that, as I described (and you yourself did too) with priorities, CPU folding, etc.
"My point was a different one: As far as i understand public papers about IBM SMT, the IPC of a Power5 processor in single thread mode is significantly lower than the IPC of a Power5 processor running in SMT with two threads running at Priority 0 (Reference: IBMs presentation Hotchips 2003 about SMT)."
Sure. Yes. That is also what you see if you compare the SPECINT_2006 score of a POWER6, which is 21.7, with a single-chip 2-core SPECINT_rate2006 score with 4 copies, which gives 60.9. That is doubling the number of cores and quadrupling the number of threads, and it gives you a 40%+ benefit from SMT (note that this is without the autopar cheat, so it is true single-threaded throughput).
http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071030-02416.html and http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070518-01098.html
So it is not a secret that the IPC of a core will go up when using SMT. And it also clearly shows you that a single core does NOT give you 60.9/2=30.5 in SPECINT_2006, but rather 21.7. BUT again, 21.7 is still a pretty good non-autopar result. And a single-core POWER7 will beat that number, of that I am quite sure. Just hope that they don't join the autopar cheat club.
Now can you understand why I don't think that your argument of taking a benchmark result and dividing it by the number of cores times the number of threads is a true measure of POWER's single-threaded 'performance'? And to be honest I think you know better, but ok, we sometimes get carried away.
The way I normally view SMT is that it enables threads that would otherwise be stopped and waiting to execute to do some work at perhaps 15-40% of the 'speed that they normally would'. And that is pretty useful, rather than a stop-go-stop-go situation.
CUT
"I would like to see some numbers from IBM to assess the impact of SMT4 in conjunction with the additional execution units in the P7 core.."
Well, we will see when they get their fingers out of their behinds and publish some SPECINT results.
"In my personal opinion Power6 cores aren't really single cores, they are two cores in a per-core-licensing friendly package, so the comparison Sparc/Itanium/x86 core vs. Power Core is severely skewed"
Nahh.. you are just looking for excuses. SPARC64 and Nehalem are also using SMT, so that argument does not hold up. The difference is that they (or Intel at least used to; it's been some time since I dug into that) statically divide the CPU up. So basically you get two CPUs with half the compute power. You don't do that on POWER, so your argument actually works the other way around. You could say that you get a compromise, and it works. That is why our POWER servers where I work beat the living crap out of Xeon servers using SMT on single-threaded workloads.
// Jesper
PS: your stuff about hot node replacement and repair is not true, btw. Try to look in a newer manual. It's actually much easier than it is on an M Series box. Try to have a look at this: ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/pow03023usen/POW03023USEN.PDF
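The SPEC argument in this comment boils down to two divisions, sketched here in Python with the scores as quoted (21.7 and 60.9). Comparing a speed score directly with a rate score is the comment's own approximation and is kept here as an assumption; the variable names are illustrative only.

```python
SINGLE_COPY = 21.7      # SPECint2006 speed score, one thread (as quoted)
RATE_4_COPIES = 60.9    # SPECint_rate2006, 2 cores, 4 copies (as quoted)
CORES = 2
COPIES = 4

# Throughput per core when both SMT threads of the core are busy.
per_core_with_smt = RATE_4_COPIES / CORES

# SMT benefit per core relative to one thread running alone.
smt_gain = per_core_with_smt / SINGLE_COPY - 1.0

# Dividing the rate result by the number of threads (the approach being
# criticised in the comment) gives a figure below the single-thread score.
naive_per_thread = RATE_4_COPIES / COPIES

print(f"per core with SMT: {per_core_with_smt:.2f}")
print(f"SMT gain per core: {smt_gain:.1%}")
print(f"naive per-thread share: {naive_per_thread:.2f} vs. {SINGLE_COPY} measured alone")
```

The last line is the point of the comment: a per-thread share computed from a fully loaded result understates what one thread gets when it has the core to itself.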
Regarding Hot Node add:
Sorry ... but that isn't correct ... your document states on page 26: "Clients should therefore consider taking proactive measures to insure minimal impact to their operations. Some highly recommended precautions to consider:
• Schedule concurrent upgrades or repairs during "non-peak" operational hours.
• Move business-critical applications to another server using the Live Partition Mobility feature or quiesce them.
• Back up critical application and system state information.
• Checkpoint data bases"
The 770/780 dating on 10. Feb 2009 states "To guard against any potential impact to system operation during Hot-node Add, memory upgrade, or node repair, customers must comply with the following protective measures:
- For memory upgrade and node repair, ensure the system has sufficient inactive or spare processors and memory. Critical I/O resources must be configured with redundant paths.
- Schedule upgrades or repairs during "non-peak" operational hours.
- Move business applications to another server using the PowerVM Live Partition Mobility feature or quiesce them. This means that all critical applications must be halted or moved to another system before the operation begins. Noncritical applications can be left running. The partitions may be left running at the operating system command prompt.
- Back up critical application and system state information.
- Checkpoint databases."
Note that the wording changed from "highly recommended" to "must comply" ... sounds like some bad experiences.
Well ... Sun is in the hot-repair/swap/add business pretty much since the last century and had to learn a thing or two on the way. p570 hot node add is a feature since last year when my memory serves me right ... sounds like the outcome of some teething pain.
BTW: I already knew the document you've linked in your postscriptum ... my favorite part is "For Power 570, ensure that system cable with connections for additional node is added with system powered off during a scheduled upgrade window in advance, prior to concurrent node add.". Can you explain to me what the point of hot-node add is when i have to add the cable in a downtime? I could simply add the additional node in that downtime as well, without having to go through all the "highly recommended" practices that became "must comply" practices later.
Ok, so you choose to ignore our whole talk about SMT and triumphantly jump to the one point where you think you have a point. You just dropped a notch in standing here.
Well. "Clients should therefore consider taking proactive measures" is lawyer talk. If you don't know that, you haven't read enough manuals. We have hot-node repairs here. Not often.. because the machines very very very very seldom have to be repaired. But sure, there is logic in doing things in non-peak hours, or using Live Partition Mobility to move load off the machine. Why? Because you are removing 25% of the CPU and memory resources of the machine. And if you cannot free that up easily, then the easiest thing is simply to move workload off the machine. And Live Partition Mobility lets you do that.
You are 100% correct to note that it is stupid that you have to have the cables in place in a service window. It's idiotic, but again the power 570 is a midrange server... MIDRANGE; it's not like Oracle midrange servers have this feature. So you are comparing midrange POWER servers, which have Oracle/Fujitsu high-end server functionality, and then think it is a bad thing that the POWER midrange system in one aspect isn't as fully evolved as the high-end server. And yes, the 8-socket POWER 770 is almost as fast as a fully populated 64-socket M9000 with the newest SPARCs in some ways. But that's the M9000's problem, not the POWER 770's. So get real, and get serious.
"Well ... Sun is in the hot-repair/swap/add business pretty much since the last century and had to learn a thing or two on the way. p570 hot node add is a feature since last year when my memory serves me right ... sounds like the outcome of some teething pain."
Yeah, and mainframes have been doing it since.... So, we are back to the old board-based machines discussion versus other designs. You don't want to go there, but ok.
On an M9000 you have a one-to-one mapping between physical hardware and the cores you see in the OS. It's not like on a POWER system, where the hypervisor virtualizes that layer.
On a power 570 you will normally have one VIO server per CEC, hence it owns all the IO in that CEC (that is how we do it anyway). So you just shut down the VIO server in the CEC you want to remove. No disruption with that. So when you pull out a CEC you just have to make sure that you have 1/4 of the 'desired capacity' free, and 1/4 of the memory. If you are using AMS then it might already be free. And sometimes a Live Partition Mobility migration of a virtual machine may very well be the quickest and easiest way to free up those resources.
On an M9000 you have 8 CPU boards, each with 8 sockets adding up to potential 32 cores. And if you use the machine for consolidation with 24 domains, and you have designed for redundancy, then quite a few domains might actually be on one board. Now shrinking and moving domains around to free the board you want to change can be quite a bit of work, as there is a one-to-one mapping between physical hardware and what Solaris sees. It's not always trivial. That is why, at some of the places where I have worked, the standard procedure was to do a cluster failover and then shut the machine down.
You fail to see that the abstraction layer between the physical cores and the virtual cores on POWER means that you don't have to fiddle around with trying to free up a board, and that is a clear advantage compared to, for example, the MX000 series boxes.
"The 770/780 dating on 10. Feb 2009 states .... "
You haven't been involved with POWER systems much, have you? This is from a draft redbook. This is not official IBM documentation; this is what the writers of the book think is best. Personally I think it smells a bit of CMA (cover my a**). There is nothing in the official manuals that dictates what they are writing. So, now you've learned that. Not that using Live Partition Mobility to free up resources so you can do an online repair isn't something that I'd do. It's a lot easier than starting to change the settings of a lot of virtual machines.
// Jesper
Well ... i have a day job and a life
Thus it's possible that answers to comments take a while ... the thing with the hot-node repair was just a quick comment that needed just some cut and paste.
Well, I do infrastructure architecture for an outsourcing firm, so this is something I work with every day, designing solutions and making methods to avoid down time and doing TCO calculations on how to get the most bang for the buck, while still having a solid infrastructure.
But actually right now I am on maternity leave, while having to work full time anyway, in between changing diapers and singing a lullaby. So I know the feeling. But writing stuff like this is what I do to relax and to try to keep my skills sharp. You don't get more knowledgeable about an area by just chatting with people who agree with you. But have a good weekend // Jesper
Who said, i wouldn't answer ... but as this is my place here, i answer when i think i want to answer
Sorry, i don't have the privilege of having some sparetime between diapers, albeit i wouldn't change with you, as i don't like changing diapers
Well, regarding your statement at the end first: I don't know if you know how such documents are created. But before you get such a document out of a company with a logo on it, it has gone through many stages of lawyers, reviewers, and managers. I don't think this looks any different at IBM, even when they include the real CMA statement that they haven't reviewed it. The legal implications of any publication with a logo on it are just too problematic to let anything out of a company without reviews.
I know that there isn't a statement in the current 570 manual, but do you have a 770 manual at your disposal? Given that the wording is relatively similar to the 570 concurrent repair manual, i assume at the moment that the change to "has to comply to" was made deliberately by the people writing the redpaper, and i think that this will reappear in the manual. But that's my personal opinion.
Just to make one thing clear beforehand: I'm more fluent in Sun speak, as i'm a Sun employee. But that doesn't necessarily mean that my knowledge is limited to it. And it doesn't justify the arrogance you use from time to time.
Obviously you are right, most often there is a difference between what runs, what works, and what's suggested. We will have to wait for what IBM writes in the final manual. But do you really think any customer would opt for a "should work, will work" when the manual and official IBM papers suggest otherwise? This document totally devalues this feature at the moment.
I'm pretty aware of the fact that the 570 is just a mid-range system. Could you please tell the IBM sales reps to stop selling this gear when the RfP mandates M8000/M9000 class of RAS? Whether a 750 is really in the same ballpark as the M9000 has to be proved in reality. I think it's the same as with the T5440 SAPS value: interesting ... but you have to take it with a large grain of salt. You should just think about the point that there is no significant storage in a SAPS benchmark and that the influence of the database is rather minuscule. So there is no real I/O heavy lifting, for example. Of course i could divide the SAPS value by the number of cores to get the number of SAPS each core contributes, and this value is significantly lower than the value of the Power6 gear.
After thinking a little bit without the sinusitis in my head, i finally found the misunderstanding between you and me: You think of single-thread performance as "how could i get the most performance out of a single core for a single thread". I think of single-thread performance as "Let's load the system to the max. Pedal to the metal. And let's look now, how much performance i can expect from a single thread in the system." Of course i could measure the single-thread performance of an UltraSPARC T2 CPU as well by just loading a CPU with one thread, but from my view this isn't really relevant: first, because you don't use the proc in such a mode; second, because you give away performance; and third, because all the effects exploiting, for example, memory wait times aren't used. And it's pretty much the same with the p7.
Why do i think that Power6 was a licensing-friendly 2-core proc disguised as a 1-core proc? For example: A Power6 core has two integer execution units. The SMT stuff is used to keep both execution units busy. Thus you have essentially two cores working at a job. As some resources are shared between both execution units, this hidden two-core has significantly less performance than a real two-core proc, but significantly more performance than a real one-core proc. It's okay that IBM did it this way, but everybody should be aware that the word "core" means something different in the IBM world than the same word "core" in the Sun M-world.
However, any licensing model solely based on the word "core" favored IBM, as they actually hid 1.5 to 2 cores behind something that just looked like a single core. That's okay. Essentially Sun did something similar with T2, as it contains 2 integer pipelines per core. However, the SMT in a SPARC64 core is somewhat similar to the one in T1 CMT: There is a single execution unit in the core and the threads are alternated clock cycle by clock cycle.
When you look at some facts about the Power5/6, even a cursory observer could get this impression: A budget of 790 million transistors despite having a smaller cache vs. 600 million transistors should easily explain that there are some additional functional units per core.
In the end i congratulate IBM on this neat chip, but well ... RF was on John Fowler's slide for this year in the Sun+Oracle webcast, and the cards are reshuffled. And Nehalem-EX will be an interesting contender too, as long as you don't want to go into the RAS midrange/high-end. It's the classic game of leap-frogging in CPU design. However, it would be interesting to know where IBM sees the next step after Power7 ... i haven't heard anything substantial about something like Power8 (think this system would be a hard sell at Airbus).
BTW: Did you price out the 750? I just made a quick calculation, and without the special offer that gives you half of the activations for $0, i got to a price of $400.000 .... quite a lot of money. And it sounds like a neat hack for benchmarketing with price: When you buy the procs in an initial package, you get them for a low price. But you can't get these activations for free when you buy additional procs later; that isn't relevant for benchmarks, but it is relevant for reality, given that a single p7 chip without memory costs you 1*17.700+8*9000=89700. I could decide to purchase 6 complete M3000 servers for the same price. Or you could buy a 1.4 GHz T5440.
Given that a T5440 1.6 has 25.000 SAPS, let's assume a 1.4 has 20.000 SAPS. A p7 has 85000 SAPS. Just for the price of the 4 procs in the 750 you could get 4 T5440 with 512 GB memory, getting similar performance. Don't misunderstand me, p7 is a cool proc and it's good that RF isn't far away ... but it isn't a cheap processor, and it isn't a kill-all-other-procs processor.
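The price and SAPS arithmetic in this comment can be laid out as a short Python sketch. All inputs are the figures quoted above; the constant and function names are made up for illustration, and reading "A p7 has 85000 SAPS" as the result of a 4-chip 750 is an assumption based on the comparison that follows it. No T5440 price is given in the comment, so none is invented here.

```python
CHIP_BASE = 17_700            # price of one p7 chip, as quoted
ACTIVATION_PER_CORE = 9_000   # price per core activation, as quoted
CORES_PER_CHIP = 8

def chip_price(cores: int = CORES_PER_CHIP) -> int:
    """Price of one chip with `cores` activated, without memory."""
    return CHIP_BASE + cores * ACTIVATION_PER_CORE

four_chips = 4 * chip_price()

# SAPS figures: 20,000 for a 1.4 GHz T5440 is the commenter's own guess.
T5440_SAPS_ASSUMED = 20_000
P7_750_SAPS = 85_000          # assumed to refer to the 4-chip 750
saps_ratio = 4 * T5440_SAPS_ASSUMED / P7_750_SAPS

print(f"one chip, 8 cores active: {chip_price()}")
print(f"four chips: {four_chips}")
print(f"4x T5440 vs. one 750 in SAPS: {saps_ratio:.2f}")
```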
"Sorry, i don't have the privilege of having some sparetime between diapers, albeit i wouldn't change with you, as i don't like changing diapers ;-)"
Well, they always say that if it is your own kid then you can stand the smell, but I must say I'm getting enough of that sour-milk diaper smell.
"...The legal implications of any publication with a logo on it are just too problematic to let anything out of a company without reviews."
Well, I have written a redbook myself, many years ago in a former life. And it is a how-to written by people from customers, business partners, and people who work for IBM. Redbooks are not manuals, but the personal opinion of the people who write the books.
"I know that there isn't a statement in the current 570 manual, but do you have a 770 manual at your disposal? Given that the wording is relatively similar to the 570 concurrent repair manual, i assume at the moment that the change to "has to comply to" was made deliberately by the people writing the redpaper, and i think that this will reappear in the manual. But that's my personal opinion."
I haven't seen any of the manuals online yet. But it is highly unlikely that a POWER 770, which has fully redundant service processors and system clock cards, both with hot failover and and .., would have less hotpluckability (that's not a word).
"Just to make one thing clear beforehand: I'm more fluent in Sun speak, as i'm a Sun employee. But that doesn't necessarily mean that my knowledge is limited to it. And it doesn't justify the arrogance you use from time to time."
Well, nothing wrong with Sun speak; I grew up on SPARC in the late '80s and early '90s. Well, I am sorry if I have offended you. But as you well know, there are a lot of dorks writing about technical stuff, and you are obviously not one of those.
"Obviously you are right, most often there is a difference between what runs, what works, and what's suggested. We will have to wait for what IBM writes in the final manual. But do you really think any customer would opt for a "should work, will work" when the manual and official IBM papers suggest otherwise? This document totally devalues this feature at the moment."
"I'm pretty aware of the fact that the 570 is just a mid-range system. Could you please tell the IBM sales reps to stop selling this gear when the RfP mandates M8000/M9000 class of RAS? ;-)"
Well, I have no influence whatsoever on how IBM sales persons act :)= Well, I'd agree that the power 570 is not there in RAS features, though it will match/exceed an M8000 in any benchmark, not to mention on price. But I guess the POWER 770 is; IMHO there are things where the POWER 770 is better and places where the big M's are better. The POWER 780 is most likely better than the M8000/M9000 with regard to RAS, in pretty much all aspects.
"Whether a 750 is really in the same ballpark as the M9000 has to be proved in reality. I think it's the same as with the T5440 SAPS value: interesting ... but you have to take it with a large grain of salt. You should just think about the point that there is no significant storage in a SAPS benchmark and that the influence of the database is rather minuscule. So there is no real I/O heavy lifting, for example."
Well, a 750 is not in the ballpark of the M9000, any more than a DL785 or an x3950 is. But with regard to performance the power 750 is fast. SPECJBB2005: 2.478.929 for a 32-core power 750 versus 1.757.035 for an M9000 with 256 2.52GHz cores. This might sound ridiculous, but you have to remember that a POWER7 chip has 8 decimal floating-point units, which kind of rocks on business code, and there is support for it in Java. If you on the other hand look at a benchmark that really beats on the memory subsystem, like SPEC OMPM2001, then a 64-core 2.52GHz M8000 does 104,714 and a 32-core power 750 104,175. So that is a pretty close match. On the IO side, AFAIR an M8000 has a theoretical peak bandwidth of around 60GB per second whereas a POWER 750 has 30GB per second, and the M9000 has 240GB or something.
So no, the power 750 is not a match for the Big M's in RAS and IO or amount of memory (sure, the power 750 has AME which can extend its memory, but nothing beats the real thing), but if you are talking number crunching, or app servers, well... And then there is price.. I mean you can get 3 power 750s for what an M8000 processor board costs.

"... I think of single-thread performance as "Let's load the system to the max. Pedal to the metal. And let's look now, how much performance I can expect from the single thread in the system.""

No, I understand what you mean. And I am talking about the same things. Things just work very differently on POWER compared to Xeon, SPARC64 and Niagara. In a real-life system you don't run at 100% utilization the whole time, unless you are an HPC system. Most traditional non-virtualized UNIX systems seldom go over 30% average utilization.. average as in measured over 365 days, 24 hours a day. Most systems are sized to peak usage.

So take for example a power6 system with 4 cores and SMT enabled, hence 8 logical cores. First, 4 threads want to run: it schedules all threads to run on 4 of the logical cores, but only one thread to each physical core. Now the system runs at 100% utilization. Now let's say that 8 threads want to run. Then 1 thread is scheduled to each logical processor, where two logical processors are scheduled to the same core, and they will run at the same time. So you might say that the more threads you put on a POWER system, the more it goes into 'throughput mode'. But if you then stop one thread, and one of the other 7 remaining threads has a higher priority, it will be able to eat a whole core by itself.

Looking at a POWER system with AIX/Linux and for example an SMT-enabled x86 system running Linux is just so different. The first is constantly trying to optimize and adjust to what you are running on it, the latter is just static and.. well, kind of stupid.

So I don't think you are right.
And I don't think you have understood it.

Try to have a look at this little movie: http://public.dhe.ibm.com/systems/power/community/aix/AIX_Movies/PowerBasics_Logical_CPU_SMT.wmv It's a pretty good intro into the dynamics of SMT on a POWER system. It is very different because everything is dynamic. Much like comparing containers to old board-based domains.

"When you look at some facts about the Power5/6 even a cursory observer could get this impression: A budget of 790 million transistors despite having a smaller cache vs. 600 million transistors should easily explain that there are some additional functional units per core ;-)"

Well, there is a decimal floating point unit and an AltiVec vector execution unit hidden there. But on the other hand POWER5 was a fat core and POWER6 a mostly in-order speed demon. So it's hard to compare.

"At the end I congratulate IBM for this neat chip, but well ... RF was on John Fowler's slide for this year on the Sun+Oracle webcast and the cards are reshuffled. And Nehalem-EX will be an interesting contender, too, as long as you don't want to go into the RAS midrange/highend. It's the classic game of leap-frogging in creating CPUs. However, it would be interesting where IBM sees the next step after Power7 ... I haven't heard anything substantial about something like Power8 (think this system would be a hard sell at Airbus)"

Well, POWER7 is more than a leap-frog. It is a factor 4+ in socket performance compared to the previous generation. And at a lower price. I am currently helping some of our contract people recalculate our cost models. And the hardware price of a POWER7-based power 770 with four times the cores and 3 times the memory is 25% cheaper than the price of our former power 570 standard machine. That is not leap-frogging.. that is several frog leaps at one time. (5.0GHz power 570 versus 3.1GHz power 770.) And what Nehalem EX brings to the table is still to be seen.
IMHO Xeons look really really good on SPEC, but when it comes to real-life performance, an 'on paper' much slower UNIX box will normally beat a Wintel box to a bloody pulp. If you look at this Oracle APPS benchmark, a 6-core POWER7 actually outperforms a 2-socket 8-core Nehalem box.

http://www.oracle.com/apps_benchmark/doc/E-Bus-R12-Pay_ORA_Med_IBM_p7-6-core-Single-Audited.pdf
http://www.oracle.com/apps_benchmark/doc/E-Bus-R12-Pay_ORA_Med_HP_DL380-G6-Audited.pdf

143885 records per hour for the 8-core Nehalem
165000 records per hour for the 6-core POWER7

Now if we then tried to estimate a value for the highest bin Nehalem EX and for the highest bin POWER7, we would hit something like:

143885 / (2.93 / 2.26) = 110983 records per hour per chip.
165000 / ((6 / 8) * (3.3 / 3.86)) = 257333 records per hour per chip.

Sure, this is only one benchmark and only restaurant table math, but it does hint that perhaps Nehalem EX isn't going to match POWER7 on performance. And leap-frogging an Airbus is a bit of a jump.

"....Given that the M4000 servers contain already 32 GB memory, make 2 out of it for the price of a single p7 chip plus a little bit of memory."

Eh? I don't know where you got your prices. POWER servers have never been cheap in the price you pay for the hardware. But on the other hand you get a lot of stuff for free (or rather included in the price) that you have to pay an arm and a leg for at HP. But looking on the net, the prices I can see there are nowhere near what you are saying.

"Or a T5440: you could buy a 1.4 GHz T5440. Given that a T5440 1.6 has 25,000 SAPS, let's assume a 1.4 has 20,000 SAPS. A p7 has 85,000 SAPS. Just for the price of the 4 procs in the 750 you could get 4 T5440 with 512 GB memory, getting similar performance."

Hmmm.. how did you come up with that pricing scheme?
SAP 2Tier numbers: http://www.sap.com/solutions/benchmark/sd2tier.epx?num=200

A 1.4GHz T5440 is 89KUSD (1.6GHz is 116K and 25830 SAPS) and I'd say it does 22600 SAPS. Good thing about Niagara is that it scales well in cores/threads/GHz.
http://shop.sun.com/is-bin/INTERSHOP.enfinity/WFS/Sun_NorthAmerica-Sun_Store_US-Site/en_US/-/USD/ViewStandardCatalog-Browse;pgid=lkpIezKOHTdSR0kLFbvpraZJ0000slANxLC9;sid=njtp0DI6inJpx3lAQWwh1d2lpzEYIQm0pzZ4oVOaHDBkRc7Acls=?CatalogCategoryID=ZXVIBe.d7kYAAAEZYYsJ0gWj&PriceIt=true&Country_Territory=Sun_Catalogue_US

A 3.55GHz power 750 does 85220 SAPS and costs (I don't have the price), a 3.3GHz does 79219 SAPS and costs 174KUSD, and a 3.0GHz does 72017 SAPS and costs 102KUSD.
http://www-03.ibm.com/systems/power/hardware/750/browse_aix.html

So yes, if you buy the cheapest T5440 it might be four times cheaper than a 3.55GHz power 750 with the highest bin POWER7 CPUs, but if you buy a 3.0GHz power 750 then you get a machine that is 3+ times faster for only 13KUSD more. And SAP is priced on a per-user scheme, so.. Nahh.. comparing the most price-efficient T5440 with the least price-efficient power 750 isn't fair.

"Don't misunderstand me, p7 is a cool proc and it's good that RF isn't far away ... but it isn't a cheap processor, it isn't a kill-all-other-procs processor."

Power7 is a kill-the-rest, at least for some time, IMHO. RF is gonna be 16 cores, each with the same SMT/fine-grained multithreading as NF and VF, so unless they do some serious magic T3 will only double in performance, and that by increasing the number of threads by a factor of 2. Please correct me if I'm wrong. And 512 threads for a midrange system is.. well.. POWER7 is also faster than current Nehalem, so Nehalem EX with its lower clock and worse scalability will also need an upgrade to reach POWER7, IMHO.
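For anyone who wants to recheck the back-of-the-envelope figures in this comment, here is a minimal Python sketch. It takes only numbers quoted in the comment as input. The helper names `scale_linear` and `saps_per_kusd` are made up for illustration. Reading 2.93/2.26 and 3.3/3.86 as "benchmarked clock vs. top-bin clock in GHz", and 6/8 as "benchmarked cores vs. cores per chip", is an assumption about what the formulas mean. Linear scaling with cores and clock is the same rough assumption the restaurant table math makes, and real systems rarely scale that cleanly.

```python
# Recheck of the rough numbers quoted in the comment.
# All inputs come from the comment itself; the interpretation of the
# clock and core ratios is an assumption, as is linear scaling.


def scale_linear(result, cores, ghz, target_cores, target_ghz):
    """Scale a benchmark result linearly with core count and clock speed."""
    return result * (target_cores / cores) * (target_ghz / ghz)


def saps_per_kusd(saps, price_kusd):
    """Price/performance: SAPS delivered per 1000 USD of list price."""
    return saps / price_kusd


# Oracle Apps payroll results (records per hour), scaled to the
# highest-bin chips as done in the comment.
nehalem_ex_est = scale_linear(143885, cores=8, ghz=2.93,
                              target_cores=8, target_ghz=2.26)
power7_est = scale_linear(165000, cores=6, ghz=3.3,
                          target_cores=8, target_ghz=3.86)

# (SAPS, price in KUSD) as quoted in the comment.
systems = {
    "T5440 1.4GHz": (22600, 89),      # SAPS value is the commenter's estimate
    "T5440 1.6GHz": (25830, 116),
    "Power 750 3.0GHz": (72017, 102),
    "Power 750 3.3GHz": (79219, 174),
}
price_perf = {name: saps_per_kusd(saps, price)
              for name, (saps, price) in systems.items()}

if __name__ == "__main__":
    print(f"Nehalem EX estimate: {nehalem_ex_est:.0f} records/hour")
    print(f"POWER7 estimate:     {power7_est:.0f} records/hour")
    for name, value in sorted(price_perf.items(), key=lambda kv: kv[1]):
        print(f"{name:18s} {value:6.1f} SAPS per KUSD")
```

The 3.55GHz power 750 is left out of the price/performance table because the comment gives no price for it.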
Now, as you said you work for SUN, I can only hope the best for you and the rest of your SUN colleagues, and hope that Oracle treats you well and understands what a truly great asset they have gotten much too cheap. So may the UNIX Gods protect you guys :)= Because the market needs competition. Said from one UNIX guy to another, or rather, an AInt uniX guy to a Slutlaris guy. But now it is time for a Belgian beer. // jesper
Sorry about the Power6/Power6+ statement ... I thought about something different, unrelated to SAP, at that moment. Sorry for that.
However, in my opinion this is still good proof why SAP is a cache-sensitive benchmark. Consider that UltraSPARC T2+ provides 4 MB cache (thus 16 MB with 4 of them) and Power6 provides 4 MB L2 per core as well, plus 32 MB L3 per processor. I assume that T2+ is hit harder by the duplication of the data than the Power6 due to an increase of cache misses. So obviously you would get a massive boost when you move that L3 cache that near to the core. I think the major progress of Power7 isn't the core, it's the eDRAM. It's a nice chip IBM developed there.
If you really need single-thread performance, switch off half of the cores, redistribute the cache to the remaining 4 cores and clock those cores higher.
End result: ~50% higher single-thread performance. Here from heise.de (translated from German): --------- For the ultimate performance kick, IBM offers the so-called TurboCore mode on the high-end models; so far only the 780 has been mentioned. With it the administrator can deactivate four of the eight processor cores, which doubles the L3 cache for the remaining cores. In addition, the system raises the processor clock, which is said to increase the overall performance per core by up to 50%. The process is reversible: the administrator can wake the sleeping processor cores again, but to do so he has to shut down or relocate active LPARs and cold-start the machine. ---------- http://www.heise.de/newsticker/meldung/Technische-Details-zu-IBMs-POWER7-Prozessoren-und-Servern-926007.html
I heard that to redistribute cache to fewer cores, you need to reboot the machine and change microcode programming in the CPU. Is that true? Does anyone know more about this?
The document "IBM Power 770 and 780 Technical Overview and Introduction" states on page 13: "The customer may switch into and out of TurboCore mode on their own, but switching requires a system reboot."
That sucks badly. No one will use it in practice. It is only a marketing ploy.
Well ... I wouldn't say this aloud ... you have to keep in mind that Power7 was primarily developed for HPC. There are several codes that are more cache-sensitive than core-sensitive.
Java tour for the 780 : The TPMD function is comprised of a 'risk' (sic) processor...
I hope they make this mandatory for banks!
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License