Original Link: https://www.anandtech.com/show/2919



Performance per Watt rules the datacenter, right? Wrong. Yes, you would easily be led astray by the endless "Green ICT" conferences, the many power-limited datacenters, and the flood of new technologies that all carry the "Performance/Watt" stamp. But if performance per Watt were all that counts, we would all be running Atom and ARM based servers. Some people do promote Atom based servers, but outside niche markets we don't think they will be a huge success. Why not? Think about it: what is the ultimate goal of a datacenter? The answer is of course the same as for the enterprise as a whole: serve as many (internal or external) customers as possible with the lowest response time at the lowest cost.

So what really matters? Attaining a certain level of performance. At that point you want the lowest power consumption possible, but first you want to reach the level of performance where your customers are satisfied. So it is power efficiency at a certain performance level that you are after, not the best performance/Watt ratio. Twenty times lower power for five times lower performance might seem an excellent choice from the performance/Watt point of view, but if your customers get frustrated with the high response times they will quit. Case closed. And customers are easily frustrated. "Would users prefer 10 search results in 0.4 seconds or 25 results in 0.9 seconds?" That is a question Google asked [1]. They found out to their surprise that a significant number of users got bored and moved on if they had to wait 0.9 seconds. Not everyone has an application like Google, but in these virtualized times we no longer waste massive amounts of performance as we did at the beginning of this century. Extra performance and RAM space are turned into more virtual servers per physical server, or business efficiency. So it is very important not to forget how demanding we all are as customers when we are browsing and searching.

Modern CPUs have a vast array of high-tech weapons to offer good performance at the lowest power possible: PowerNow!, SpeedStep, Cache Sizing, CoolCore, Smart Fetch, PCU, Independent Dynamic Core Technology, Deep Sleep, and even Deeper Sleep. Some of those technologies have matured and offer significant power savings with negligible performance impact. A lot of them are user configurable: you can disable/enable them in the BIOS or they get activated if you choose a certain power plan in the operating system. Those that are configurable are so for a good reason: the performance hit is significant in some applications and the power savings are not always worth it. In addition, even if such technologies are active under the hood of the CPU package, there is no guarantee that the operating system makes good use of them.

How do we strike the right balance between performance and energy consumption? That is the goal of this new series of articles. But let's not get ahead of ourselves; before we can even talk about increasing power efficiency at a certain performance point, we have to understand how it all works. This first article dives deep into power management, to understand what works and what only works on PowerPoint slides. There is more to it than enabling SpeedStep in your server. For example, Intel has been very creative with Turbo Boost and Hyper-Threading lately. Both should increase performance in a very efficient way. But does the performance boost come with an acceptable power consumption increase? What is acceptable or not depends on your own priorities and applications, but we will try to give you a number of data points that can help you decide. Which power management technologies you enable and how you configure your OS are not the only decisions you have to make as you attempt to run more efficient servers.

Both AMD and Intel have been bringing out low power versions of their CPUs that trade clock speed for lower maximum power. Are they really worth the investment? A prime example of how the new generation forces you to make a lot of decisions is the Xeon L3426: a Xeon "Lynnfield" which runs at 1.86GHz and consumes 45W in the worst case according to Intel. What makes this CPU special is that it can boost its clock to 3.2GHz if you are running only a few active threads. This should lower response times when relatively few users are using your application, but what about power consumption? AMD's latest Opteron offers six cores at pretty low power consumption points, and it can lower its clock from 2.6GHz all the way down to 800MHz. That should result in significant power savings but the performance impact might be significant too. We have lots of questions, so let's start by understanding what happens under the hood, in good old AnandTech "nuts and bolts" tradition.

Warning: This article is not suited for quick consumption. Remember, you come to AnandTech for hard hitting analysis, and that's what this article aims to provide! Please take your time… there will be a quiz at the end. ;-)



How Does Power Management Work?

The BIOS settings, the power manager of the Operating System, the hardware circuits on the CPU, monitoring hardware, sensor banks... when I first started reading about power management, it quickly became very chaotic. Let's make some sense out of it.

It all starts with ACPI, the Advanced Configuration and Power Interface. In 1996, the three most influential companies in the PC world (Intel, HP, and Microsoft) together with Toshiba and Phoenix standardized power management by presenting the ACPI Specification. ACPI defines which registers a piece of hardware should have available, and what information the BIOS/firmware should offer: these are the red pieces in the graph below.


The most important information can be found in the ACPI tables, which describe the capabilities of the different devices of the platform. Once the kernel has read and interpreted them, the role of the BIOS is over. This is in sharp contrast with the APM (Advanced Power Management) systems that we used throughout the 80s and 90s, where for example CPU power management was completely controlled by the BIOS. The basic idea behind ACPI based power management is that unused/less used devices should be put into lower power states. You can even place the entire system in a low-power state (sleeping state) when possible. The ACPI system states are probably the best known ACPI states:

  • S0: Working
  • S1: Processor idle and in a low power state but still powered; RAM powered
  • S2: Processor in a deeper sleep; RAM powered; most devices in lower power states
  • S3: CPU in a deep sleep; RAM still powered; devices in their lowest power states; also known as "Standby"
  • S4: RAM no longer powered; disk contains an image of the RAM contents; also known as "Hibernate"
  • S5: Soft power off

We translated the ACPI system states to their most popular implementations; the standards are actually a bit vague... or flexible if you like. You can find more details in the latest ACPI specification (revision 4.0, June 16, 2009). Windows 2008 R2, the operating system used in this article, uses the older ACPI 3.0 standard. ACPI 3.0 made it possible for different CPUs to enter a different power state.

The boss of ACPI based power management is the power management component of the kernel. The kernel power manager handles the devices' power policy, calculates and commands the required processor power state transitions, and so on. Of course, a kernel component does not have to know every specific detail of each different device. Focusing on the CPUs, the power manager will, for example, send the requested P-state to a specific processor driver: in the case of Windows 2008 R2, this is either intelppm.sys or amdppm.sys. The processor driver will direct the hardware to enter the P-state requested by the kernel. This mostly happens by writing to model-specific registers, the famous MSRs. So it's clear that the CPU driver contains architecture specific code.
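As an illustration of what such a driver does under the hood: on Intel CPUs the current operating point can be read from the IA32_PERF_STATUS MSR (0x198), and a new P-state is requested by writing the same encoding to IA32_PERF_CTL (0x199). Below is a minimal C sketch that reads the status register through Linux's msr interface; it is only meant to show the mechanism, as intelppm.sys obviously does the equivalent in kernel mode on Windows.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IA32_PERF_STATUS 0x198   /* current P-state encoding */
#define IA32_PERF_CTL    0x199   /* P-state request register */

/* Read one MSR of a given core via Linux's /dev/cpu/N/msr interface
   (requires root and the msr kernel module). */
static uint64_t rdmsr(int cpu, uint32_t reg)
{
    char path[64];
    uint64_t val;
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0 || pread(fd, &val, sizeof(val), reg) != (ssize_t)sizeof(val)) {
        perror("rdmsr");
        exit(1);
    }
    close(fd);
    return val;
}

int main(void)
{
    /* The low 16 bits hold the frequency/voltage ID pair the core currently
       runs at; a driver requests a transition by writing the same encoding
       to IA32_PERF_CTL. */
    uint64_t status = rdmsr(0, IA32_PERF_STATUS);
    printf("core 0 P-state status: 0x%04llx\n",
           (unsigned long long)(status & 0xFFFF));
    return 0;
}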

Processor states

There are two kinds of processor states: P-states and C-states. P-states are described as performance states; each P-state corresponds to a certain clock speed and voltage. P-states could also be called processing states: contrary to C-states, a core in a P-state is actively processing instructions.

With the exception of C0, C-states are sleep/idle states: there is no processing whatsoever. We will not go into the details as Hardware Secrets has written a very comprehensive article on C-states. We will give you a quick overview of the ACPI standard C-states, and then immediately look at the actual implementation of those C-states in modern CPUs. The ACPI standard only defines four CPU power states from C0 to C3:

  1. C0 is the state where the P-state transitions happen: the CPU is processing.
  2. C1 halts the CPU. There is no processing, but the CPU's own hardware management determines whether there will be any significant power savings. All ACPI compliant CPUs must have a C1 state.
  3. C2 is optional, also known as "stop clock". While most CPUs stop "a few" clock signals in C1, most clocks are stopped in C2.
  4. C3 is also known as "sleep": all clocks in the CPU are stopped.

The actual result of each ACPI C-state is not defined. It depends on the power management hardware that is available on the platform and the CPU. For example, all Intel Xeons of the past years support an enhanced C1 state (C1E), which is entered automatically if the CPU stays in C1 for a while. Modern CPUs will not only stop the clock in C3, but also move to "deeper" C4/C5/C6 sleeps and drop the voltage of the CPU. The C1E, C4, C5, and C6 states are only known to the hardware; the operating system sees them as ACPI C2 or C3. We will discuss this in more detail further on in this article. Before we go into more detail on how the CPUs actually handle these C- and P-states, let's see what we assembled in our labs for testing purposes.



The Hardware

One of the latest low power Xeons is the Intel Xeon L3426. It is the Xeon version of the "Lynnfield" architecture. What really makes this CPU special is the ability to boost its clock speed from 1.86GHz (default) to 3.2GHz without dissipating more than 45W. The Turbo Boosted clock speed of 3.2GHz can only be reached if just one core is under load. Combine this with the fact that the CPU can cope with eight threads at once (four cores + Hyper-Threading) and you will see why this CPU deserves special attention. It should draw very little power when the load on the server is low, as in that case the CPU steps back to a 1.2GHz clock. At the same time the CPU can scale up to 3.2GHz to provide excellent single threaded performance. And it also holds the promise that the CPU will never consume more than 45W, even under heavy load. Performance will be "midrange" in those situations as the CPU cannot clock higher than 1.86GHz, but eight threads will still be running simultaneously.


At $284 this CPU looks like the best Intel offering ever in the entry level server space... but it does have drawbacks of course. In the desktop space, the affordable Core i7 860 "Lynnfield" CPU relegated the expensive Core i7 900 series "Bloomfield" CPUs to the shrinking high-end desktop CPU market. This is not going to happen in the server world: the "Lynnfield" CPU has no QPI links, so it cannot be used in multi-socket servers. The triple channel Xeon X55xx series ("Gainestown") will continue to be Intel's dual socket Xeons until they are followed up by quad- and six-core 32nm Westmere CPUs. So the first drawback is that you will be limited to four physical cores per server (and four logical SMT cores). For those of us running non-virtualized workloads, this is probably more than enough CPU power.

Talking about virtualization, that is the second drawback: each of the memory channels supports three DIMMs. As a result the CPU does not support more than 24GB. This is another reason why the cheap "Lynnfield" Xeon is not going to threaten the Xeon 5500 series anytime soon: a dual Xeon 5500 ("Gainestown") server supports up to 144GB.


We also have the Xeon X3470 in the lab. When running close to idle, this CPU can also throttle back to 1.2GHz and save a lot of power. For the occasional single threaded task, the X3470 offers the best single threaded performance in the whole server CPU industry: one core can speed up to 3.6GHz. Yes, this is the "Xeonized" Core i7 870. Why does this CPU interest us? Well, the Xeon X3470 (2.93GHz) is only one speed bin faster than the X3460 (2.80GHz). The X3460 costs around the same price ($316) as the L3426 and can Turbo Boost up to 3.46GHz. And the X3460 brings up an interesting question: is the L3426 really the most interesting choice if you want a decent balance between performance and power? The L3426 has the advantage that you are sure you will never break a certain power limit. The X3460 however offers low power at low load too, and has more headroom to handle rare peaks. Below you can find the specs of our Intel server:

Intel SC5650UP
Single Xeon X3470 2.93GHz or Xeon L3426 1.86GHz
S3420GPLC, BIOS version August 24, 2009
Intel 3420 chipset
8GB (4 x 2GB) 1066MHz DDR-3
PSU: 400W HIPro HP-D4001E0

While we did not have a comparable AMD based server yet, this article would not be complete without a look at an AMD based system. AMD promised to send us a low power server, but after some back and forth correspondence it became clear the system would not be able to meet our deadline. Rest assured that we will update you once we get the new low power system from AMD. At the moment AMD looks a bit weak in the low cost server arena as its honor is defended by the 2.5GHz - 2.9GHz Opteron "Suzuka" CPU. That is a single CPU solution based on "Shanghai": four K10 cores and a 6MB L3 cache. This platform is almost EOL: we expect the San Marino and Adelaide platform around CeBIT 2010. Servers based on these new AMD platforms will save quite a bit of power compared to their older siblings. The six-core Lisbon that will find a home in these servers will be a slightly improved version of the current six-core AMD Opteron. Below are the specs of our AMD server:

Supermicro A+ Server 1021M-UR+V
Dual Opteron 2435 "Istanbul" 2.6GHz or Opteron 2389 2.9GHz
Supermicro H8DMU+, BIOS version June 18, 2009
8GB (4 x 2GB) 800MHz
PSU: 650W Cold Watt HE Power Solutions CWA2-0650-10-SM01-1

AMD picked the components in this server for our previous low power comparison of servers. We went with the Opteron 2435 as the six-core Opteron offers a very decent performance/watt ratio on paper. We will update the numbers once the Opteron 2419 EE arrives.

Making the comparison reasonably fair

So we had to work with two different servers. While the AMD versus Intel side of things is not our main focus, how can we make a reasonably fair comparison? The difference in power supplies is hardly a problem: both AMD and Intel feel that these power supplies are among the best available as they were chosen for their low power platforms. Both power supplies are 90% efficient over a very wide range of power load. The problem is the fans.

The fans in the AMD machine are small and fast, with speeds up to 11900 rpm! We disabled fan speed control to keep the power consumption of the fans constant. There are four fans and we measured the fan power consumption by taking out the fans that blow over the memory while keeping the two fans that cool the CPUs. This way we were sure that our CPU would not overheat and leak more power. We carefully measured the temperature of the CPU and jotted down the power measurements in all of our tests. We found out that each fan consumes about 8W. We did the same thing for the Intel machine: the power consumption of each fan was measured at the electrical outlet. The memory DIMMs were also checked: there was no significant difference between DDR2-800 and DDR3-1066, both in idle as well as under load. By taking the fans out of the equation, we can get a very reasonable comparison of both platforms. So how well do the current CPUs manage power?



AMD Power Management

Variable clock rates and CPU power management started with the Intel 386SL, but that would take us a bit too far back in history. Let's start with the introduction of the K6-2+ and the mobile Pentium III. From that moment on, both Intel and AMD have been using Dynamic Voltage and Frequency Scaling (DVFS) per CPU. DVFS has been marketed as "PowerNow!", "SpeedStep" and many other names. In a multi-core CPU this means that all the cores will run at the clock speed of the highest loaded core. A clock speed requires a corresponding core voltage, so all cores also use the same voltage.

With the introduction of the Family 10h CPUs (K10, aka "Barcelona") in 2007, AMD reduced dynamic power with three different technologies:

  1. Dynamic Frequency Scaling per Core. Each core runs at its own clock.
  2. Separate power planes for the core and "uncore" part of the CPU.
  3. Clock gating at the CPU block level.

The effect of (1) on performance/watt is not a complete success story: without a voltage drop, power only scales linearly with frequency, and some OS schedulers will always try to "load balance" across the cores to avoid having one core get hot (which increases static power). As a result the power savings due to (1) are relatively small, and the lag in transitioning from one P-state to another reduces performance, as our benchmarks will confirm. AMD Opterons typically support 4-5 P-states. The Opteron "Shanghai" 2389 in this test supports 2.9, 2.3, 1.7 and 0.8GHz. The six-core Opteron 2435 supports 2.6, 2.1, 1.7, 1.4 and 0.8GHz. [2]
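To see why frequency scaling alone saves so little, it helps to plug some numbers into the classic dynamic power approximation P ~ C * V^2 * f. The little C sketch below uses the Opteron 2435 P-state clocks from above; the voltages are placeholders we made up for the illustration, not AMD specifications.

#include <stdio.h>

/* Toy numbers for the classic dynamic power model P ~ C * V^2 * f.
   The clock speeds are the Opteron 2435 P-states from the text; the
   voltages are invented for the illustration, not AMD specifications. */
int main(void)
{
    const double mhz[]  = { 2600, 2100, 1700, 1400, 800 };
    const double volt[] = { 1.30, 1.20, 1.15, 1.10, 1.00 };
    const int n = sizeof(mhz) / sizeof(mhz[0]);

    const double base = volt[0] * volt[0] * mhz[0];
    for (int i = 0; i < n; i++) {
        double freq_only = mhz[i] / mhz[0];                    /* V stays at the top value */
        double freq_volt = volt[i] * volt[i] * mhz[i] / base;  /* V drops with the clock   */
        printf("%4.0f MHz: %3.0f%% of peak dynamic power with frequency scaling only, "
               "%3.0f%% when the voltage drops too\n",
               mhz[i], 100 * freq_only, 100 * freq_volt);
    }
    return 0;
}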

Separate power planes provide several benefits. The first benefit is that the cores can go to a sleep (C-state) while the memory controller is still working for another external device (e.g. via DMA). Another advantage is that AMD is able to run the Northbridge and L3 cache out of sync with the cores. This lowers power significantly, while performance only decreases slightly. Overall, performance/watt is clearly increased.

Clock gating reduces power by 20 to 40% according to some publications [3]. This is probably the most important technology for the server market: as server code rarely performs floating point operations, disabling the clock to the FPU with a clock gate saves quite a bit of power. As a matter of fact, the highest power numbers are measured with floating point intensive benchmarks like LINPACK; typical server benchmarks based on databases or web servers do not even come close. LINPACK needs 20-25% more power than our integer based benchmarks, despite the fact that in both situations the CPU reports 100% utilization.


AMD added "Smart Fetch" to the newer "Shanghai" Opteron, which is essentially clock gating at the core level (making it new technology number four). The main goal is to make idling cores go to a "clock disabled" sleep state (AMD's C1-state) instead of a low frequency state (P-state). The problem is that snoops from the active core(s) might wake up the sleeping core too quickly, and those snoops would get a very slow "just woke up" answer. To avoid this, the idle core will dump the contents of its L1 and L2 caches into the L3 cache before it goes to the clock gated C1 state. This could not be done on Barcelona, as the 2MB L3 cache would fill up quickly if three cores dumped their L1 and L2 data into the L3 cache. However, it is important to remark that even when three cores are clock gated, it is unlikely they will take 1.7MB away (512 KB * 3 + 64 KB * 3) as shared cachelines between the cores are always kept inside the otherwise exclusive L3 cache of all quad-core Opterons. Clock gating at the core level reduces dynamic power to zero, which allows the new Opteron to save up to 5W per core.

That is quite impressive: a quad-core "Shanghai" Opteron uses about 10W at idle, while a quad-core "Barcelona" Opteron uses around 25W. This is also confirmed by the measurements on desktop CPUs performed by LostCircuits. AMD still has some catching up to do: the six-core Opteron "Lisbon" (set to launch around March 2010) will go from C1 to a hardware controlled C1E state.

Intel Power Management

Intel moved to pretty aggressive clock gating at the CPU block level in its "Woodcrest" server CPU in 2006. Intel also introduced cache sizing: the necessary data in the L2 cache is reduced to a minimum and cache blocks are turned off. While Intel was an innovator when it came to block clock gating and cache power reductions, AMD was first with independent power planes and independent core frequencies. It shows that even in the power management race, AMD and Intel are leapfrogging each other. Intel caught up with AMD and leapfrogged AMD again when it introduced the Xeon "Nehalem" 5500 series, where core and uncore got independent power planes.

However, Intel went one step further. It not only enabled clock gating for each core, but also power gating. Clock gating only reduces the dynamic power, while power gating reduces both dynamic and static (mostly leakage) power. Thanks to the built-in Power Control Unit (PCU, hardware circuit), Intel promises us that cores can go to the lowest C6 sleep state while other cores continue to work "undisturbed".


Below you can see how the operating system sees this. We asked the Windows 2008 kernel to tell us what ACPI state the cores use when the CPU is running completely idle. Notice that the clock speed of each logical core is reduced to 1.2GHz, another sign that the CPU is not processing anything significant.
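A per-core readout like this is easy to reproduce from user mode. The sketch below leans on the documented CallNtPowerInformation(ProcessorInformation, ...) call in PowrProf.dll; it is not the exact tool we used for the article, just a way to get the same kind of information.

/* Minimal sketch: ask Windows which clock speed it currently requests for
   each logical CPU. Build with the Windows SDK and link against PowrProf.lib. */
#include <windows.h>
#include <powrprof.h>
#include <stdio.h>

/* Documented layout of one ProcessorInformation entry. */
typedef struct {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROC_POWER_INFO;

int main(void)
{
    SYSTEM_INFO si;
    PROC_POWER_INFO info[64];

    GetSystemInfo(&si);
    DWORD cpus = si.dwNumberOfProcessors > 64 ? 64 : si.dwNumberOfProcessors;

    if (CallNtPowerInformation(ProcessorInformation, NULL, 0,
                               info, cpus * sizeof(info[0])) != 0) {
        fprintf(stderr, "CallNtPowerInformation failed\n");
        return 1;
    }

    for (DWORD i = 0; i < cpus; i++) {
        /* CurrentMhz is the clock the OS is asking for on this core;
           CurrentIdleState is the idle (C-) state it last requested. */
        printf("core %lu: %lu MHz (max %lu MHz), idle state %lu\n",
               info[i].Number, info[i].CurrentMhz, info[i].MaxMhz,
               info[i].CurrentIdleState);
    }
    return 0;
}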


So while the operating system asks the CPU to go to the ACPI C2 state, the PCU overrides the orders of the operating system and should force the idle cores to go relatively quickly to C6, achieving lower power consumption. In C6, the core is not only completely clock gated, it is power gated too. That means the leakage of the idle core is reduced to almost zero. The older Xeon 5400 series was only capable of placing two cores into C6 at the same time (a single idle core could not enter C6 on its own). And the deeper the sleep, the slower the core wakes up. Intel significantly lowered the time that is necessary to go to the C6 state and back in the Nehalem architecture.


The real magic of the "Nehalem" based architecture is that the integrated power switch makes this transition extremely fast. Instead of 200µs [4] on the older Penryn processors (the Xeon 54xx is based on this architecture), the transition time is reduced to only 60µs. This should allow the Xeon 5500, 3500, and 3400 series to transition quickly to C6 with a small performance impact. We will check these claims.

The latest Intel Xeons have lots of P-states: one for each 133MHz speed bin from 1.2GHz to the maximum advertised clock speed. In other words, every 133MHz ratio between the lowest frequency P-state and the highest frequency P-state is a valid P-state.
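A quick sketch makes the P-state grid concrete: take the 133MHz base clock and enumerate every multiplier between the 1.2GHz floor (9x) and the rated speed. We use the X3470 (22 x 133MHz = 2.93GHz) as the example; the multiplier endpoints simply follow from the clock speeds mentioned above.

#include <stdio.h>

/* Enumerate the P-state grid of a "Lynnfield" Xeon: every multiplier of the
   133MHz base clock between the 1.2GHz floor (9x) and the rated clock is a
   valid P-state. The X3470 (22 x 133MHz = 2.93GHz) is used as the example. */
int main(void)
{
    const double bclk = 133.33;   /* MHz */
    const int min_mult = 9;       /* 9 x 133MHz = 1.2GHz floor */
    const int max_mult = 22;      /* 22 x 133MHz = 2.93GHz     */

    for (int m = max_mult; m >= min_mult; m--)
        printf("P%-2d: %4.0f MHz (multiplier %dx)\n", max_mult - m, m * bclk, m);
    return 0;
}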

Below you will find an overview of AMD's and Intel's techniques to reduce power while processing.


Intel supports lots of P-states but makes much less use of them than AMD. Despite the fact that the infrastructure is there (each core has its own PLL) Intel doesn't generally run cores at different clock speeds. Sanjay Sharma of Intel:

 

In the steady state, all active cores run at the same frequency, which equals the highest requested frequency of any of the active cores. When there is a frequency change request from one core that results in a change in the resolved frequency, all cores will change to that new resolved frequency. However, not all cores will necessarily change frequency at the same time, since the instruction stream on each core needs to reach an end of macro instruction boundary before it can change frequency. If a core is running a very long instruction when the frequency change request arrives, that core will change frequency later than the other cores that reached the interruptible point sooner. As a result, for very short time periods, it is possible that cores could be running at different frequencies.

 

The most likely reason why Intel does not allow cores to run at different clock speeds for prolonged periods is that the shared voltage has to stay at the level needed for the highest clock speed. AMD has some catching up to do as well, as the lowest C-state of an idle Opteron core is only C1. This situation will improve when the improved Magny-Cours and Lisbon Opterons arrive, as those CPUs will support a C1E state like their notebook siblings.



Not So Fast!

Power management, especially dynamic voltage and frequency scaling, does come with a performance cost. Since its introduction both Intel and AMD have been claiming that this performance cost is "negligible", but we all know better now. On the dual-core Athlon X2 and Phenom I, for example, it was impossible to use DVFS and still get decent HD video decoding. There are three important performance problems with dynamic power management:

  1. Transitioning from one P-state to another takes a while, especially if you scale up.
  2. Active cores will probe idle or lower P-state cores quite frequently.
  3. The OS power manager has to predict whether or not the process will need more processing power soon. As a result the OS transitions a lot more slowly than the hardware could.

Suppose that the OS decides that the CPU can clock down to a lower P-state. Just a few ms later, a running process requires a lot more performance. The result is that the voltage must be increased and this takes a while. During that time, the CPU is wasting more power than it should: processing is suspended for a small time and the clock speed cannot increase unless the higher voltage is reached and is stable enough. If this scenario is repeated a lot, the small power savings of going to a lower P-state will be overshadowed by the power losses of scaling quickly back up to a higher clock and voltage. It is important to understand that each voltage increase results in a small period where power is wasted without any processing happening. The same problem is true for entering a C-state: enter it too quickly and performance is lowered as it takes some time to wake that core up again.

The second problem is a bit more subtle: if you lower the P-state of one core, another core that sends a snoop towards this "slow" core will get a much slower answer. As a result the performance of the active core will be lower. According to some researchers [5], this performance decrease is about 5% at 800MHz on a "Barcelona" Opteron. If P-states could go as low as 400MHz, the performance impact would be 30% and more! That is the reason why lower P-states are not used: a core with P-states lower than 800MHz would wreak havoc on the performance/watt ratio of the CPU. That is also why "Smart Fetch" dumps the L1 and L2 caches into the L3 cache. This not only avoids waking the idle core up too soon, but it also avoids the performance hit associated with snooping a "napping" core. Intel's CPUs do not have this problem: the inclusive nature of the L3 cache means that if data cannot be found in the L3 cache, it will not be in any core's L1 or L2 caches either.

The bottom line is that power management is quite complex: there is no silver bullet. Go to low/idle states too quickly and you end up burning more power while delivering less performance. At the same time, if the OS keeps the clock speed too high, the CPU might never achieve decent power savings. The OS must take into account the most likely behavior of the application and the capabilities of the hardware.



Our Benchmark Choice

For this article we chose Fritzmark 4.2, the chess benchmark designed by Mathias Feist. The benchmark has the disadvantage that it is not real world for most IT professionals, but it allows us to control the number of threads very easily and precisely. It is also an integer dominated benchmark which runs completely in the CPU caches. This allows us to isolate the CPU power savings, while the performance and power measurements still bear some resemblance to typical server loads. This is in contrast to an FP intensive benchmark like LINPACK.

Software Power Management: Windows 2008 Power Plans

On our 64-bit Windows Server 2008 R2 Enterprise two power plans are available:


The interesting thing is how the power plan affects the processor power management (PPM). With Balanced, Turbo Boost never came into action. The L3426 was stuck at 1.86GHz and the X3470 never clocked higher than 2.93GHz. When running idle, both CPUs stayed at 1.2GHz (9x multiplier). The Opterons scaled back to 0.8GHz.

Once set to the Performance power plan, the CPUs never scaled below their default clock speed. According to most clock speed utilities, the Xeons always tried to achieve the highest possible Turbo Boost clock speed. The L3426 switched between 3.066 and 3.2GHz. Note that this did not increase the power consumption significantly: it only used 2W extra on the L3426. The Opterons ran at their top speed. To measure the effect of the power plans we measured the power consumption of the different servers running idle in Windows 2008 R2. This is the power consumption of the complete system, measured at the electrical outlet, minus the fans.

Idle Power

We focus on a comparison between the green and blue bars. Comparing the CPUs on each platform offers some interesting insights. Let's first check out the AMD platform. The Opteron 2389 "Shanghai" clearly needs a higher voltage to achieve 2.9GHz (1.15 - 1.325V). Despite the fact that the six-core has two more cores to power, the six-core Opteron needs 4W less than the 2.9GHz quad-core Opteron. The reason is that the 2.6GHz part never needs more than 1.3V (min: 1.075V) and is also making very good use of clock gating with cache dump (a.k.a. "Smart Fetch").

The idle power measurement of the Xeons shows us how little power is saved by scaling back the frequency: only 2W. The power savings are a result of fine-grained clock gating and core power gating.



Saving Power at Low Load

Measuring idle power is important in some applications as operating system schedulers may choose to "race to idle", i.e. perform the task as quickly as possible so the CPU can return to an idle state. This strategy is only worthwhile if the idle state consumes very little power, but lots of server applications run at relatively low but almost never "zero" load. One example is a web server that is visited from all around the globe. Thus it is equally interesting to see how the processors deal with this kind of situation. We started Fritzmark with two threads to see how the operating system and hardware cope with this. First we look at the delivered performance.

Fritzmark integer processing: 2 thread performance

In Performance mode, the Xeon L3426 is capable of pushing its clock speed up to 2.66GHz, but not always. Performance is equal to a similar Xeon at 2.5GHz. This is in contrast with the Xeon X3470, which can almost always keep its clock speed at 3.33GHz, and as such delivers performance equal to a Xeon that would always run at that speed. The reason for this difference is that the PCU of the L3426 has less headroom: it cannot dissipate more than 45W while the X3470 is allowed to dissipate up to 95W. Still, the performance boost is quite impressive: Turbo Boost offers 34% better performance on the L3426 compared to the "normal" 1.86GHz clock.

Now let's compare the performance levels with the power consumption.

integer processing: 2 threads

The six-core Opteron is clearly a better choice than its faster clocked quad-core sibling. In power saving mode it is capable of reducing the power by 8W more while offering the same level of performance. That is a small surprise: do not forget that the "Istanbul" Opteron has twice as many idle cores leaking power as the "Shanghai" CPU.

The Nehalem based core offers very high performance per thread, about 40% higher than the Opteron's architecture is capable of achieving, but it does come with a price, as we see power shoot up very quickly. Part of the reason is of course that the Nehalem is more efficient at idle. We assume - based on early component level power measurements - that the idle power of the Xeons is about 9W (power plan Balanced) and that of the Opterons about 14W (power plan Balanced). Note that the exact numbers are not really important. Since the RAM is hardly touched, we assume that power is only raised by 1W per DIMM on average. Based on our previous assumptions we can estimate CPU + VRM power, measured at the outlet.

System Power Estimates (CPU + VRM power, measured at the outlet)
Xeon X3470, Performance: 119W - 4W (4 x 1W per DIMM) - 60W idle + 13W CPU = 68W (idle power of the system was 73W: 13W CPU, 60W for the rest of the system)
Xeon L3426, Performance: 99W - 4W - 60W + 11W = 46W
Xeon L3426, Balanced: 90W - 4W - 60W + 9W = 35W
Opteron 2435, Performance: 102W - 4W - 70W idle + 18W = 42W (total idle power was 88W: 18W CPU, 70W for the rest of the system)
Opteron 2435, Balanced: 100W - 4W - 70W idle + 14W = 40W
Opteron 2389, Performance: 114W - 4W - 70W idle + 22W = 62W
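For those who want to redo the bookkeeping: the table boils down to taking the wall power under load, subtracting the extra DIMM power and the idle power of everything that is not the CPU, and adding the CPU's own idle share back in. A small sketch, reproducing the two Xeon rows:

#include <stdio.h>

/* Reproduce the bookkeeping of the table above: wall power under load,
   minus the extra DIMM power, minus the idle power of everything that is
   not the CPU, plus the CPU's own idle power. */
static double cpu_vrm_power(double load_w, double dimm_extra_w,
                            double rest_idle_w, double cpu_idle_w)
{
    return load_w - dimm_extra_w - rest_idle_w + cpu_idle_w;
}

int main(void)
{
    /* Xeon X3470, Performance plan: 119W at the outlet, 4 DIMMs at ~1W extra,
       73W system idle split into 60W "rest of the system" + 13W CPU. */
    printf("X3470 (Performance): %.0f W CPU + VRM\n", cpu_vrm_power(119, 4, 60, 13));
    /* Xeon L3426, Performance plan: same method. */
    printf("L3426 (Performance): %.0f W CPU + VRM\n", cpu_vrm_power(99, 4, 60, 11));
    return 0;
}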

First of all, you might be surprised that the Turbo Boosted L3426 needs 46W. Don't forget this is measured at the power outlet, so 46W at 90% efficiency means that the CPU + VRMs got 41W delivered. Yes, these numbers are not entirely accurate, but that is not the point. Our component level power measurements still need some work, but we have reason to assume that the numbers above are close enough to draw some conclusions.

  1. AMD's platform consumes a bit too much at idle, but...
  2. The six-core Opteron CPUs are much more efficient than the quad-core in these circumstances
  3. Intel's 95W Xeons offer stellar performance but the high IPC requires quite a bit of power
  4. The low power versions offer an excellent performance / Watt ratio

So if we take the platform out of the picture, the low power Xeon with Turbo Boost consumes about the same as the "normal" six-core Opteron, but performance is 16% better. Is this a success or a failure? Did Intel's Power Controller Unit save a considerable amount of power? Or in other words, would the power of the Xeons be much higher if they didn't have a PCU? Let's dive deeper.



Analysis: What Happened?

The measurements on the previous page are fine, but we also want to understand how well the hardware and operating system coped with the "low load" scenario. What did Windows 2008 R2 do? We asked the Windows Driver Kit "Powertest" tool to tell us more. The first thing we want to know is the clock speed the CPU was ordered to run at in "Balanced" mode. The differences are very telling. First the Xeon's clock speed changes:

Xeon L3426 Core Speeds (MHz)
Times observed Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7
10 times 1863 1863 1863 1863 1863 1863 1863 1863
20 times 1863 1863 1863 1863 1863 1863 1863 1863
1 time 1463 1463 1463 1463 1463 1463 1463 1463
10 times 1863 1863 1863 1863 1863 1863 1863 1863
1 time 1729 1729 1729 1729 1729 1729 1729 1729
Many 1863 1863 1863 1863 1863 1863 1863 1863

The Xeon L3426 almost always ran at 1.86GHz. In a period of 30 seconds, we noticed only two P-state change requests: one speed bin lower (-133MHz) and 3 speed bins lower (-400MHz). All cores were always asked to run at the same clock speed.

Next those of the Opteron:

Opteron 2435 Core Speeds (MHz)
Times observed Core 0 Core 1 Core 2 Core 3 Core 4 Core 5
1 time 800 1400 800 2600 800 800
1 time 800 800 1400 1400 800 800
1 time 800 800 800 800 800 800
1 time 800 800 800 2600 800 800
1 time 800 800 800 800 800 800
1 time 800 800 800 800 2600 1400
1 time 800 800 800 800 800 800

Whereas the Xeon hardly sees any P-state changes, the six-core Opteron 2435 frequently switches between 0.8GHz, 1.4GHz, and 2.6GHz. A lot of the time one of the cores runs at 1400MHz, another one at 2600MHz, and the rest at 800MHz. Basically, the above table is repeated over and over again. This means that the frequency scaling is far from ideal: we should see two cores at 2.6GHz most of the time, as the application spawns two threads that each require 100% of a core. This in turn explains the 15% performance hit between "Balanced" and "Performance". If the hardware and OS worked together better, the performance hit should not be more than a few percent. This makes us conclude that in this case, the 4W power savings are not worth the performance hit.

Sleeping

We have focused on the active cores so far, but the important power savings can also come from putting idle cores in sleep states. Did the CPU driver and OS scheduler work well together? Again, there are remarkable differences.

CPU Sleep State Comparison
  % Idle ACPI C1 ACPI C2 ACPI C3
Opteron 2435 86 100 0 0
Xeon L3426 81 7 93 0
Opteron 2389 72.4 100 0 0

The six-core had more idle cores than the quad-core Opteron, and as a result it experienced more idle time. All idle time with the Opterons was spent in the C1/"Halt" state.

The Xeon was quite a bit more aggressive: 93% of the idle time was spent in the C2 state, but C2 at the operating system level does not mean the hardware actually runs in C2. In theory, the hardware is capable of putting the core into a "deeper" core C-state (CC-state). Intel promised that the idle Nehalem cores would be able to reach even the deepest C6 sleep while other cores were working. Did that actually happen?

Software tools read the APIs of the OS and thus - as far as we know - always report the ACPI states. We followed the guidelines in Intel's white paper, "Intel Turbo Boost Technology in Intel Core Microarchitecture Based Processors", and did some programming (in assembly) to find the actual hardware C-states.

First we read out the Time Stamp Counter (TSC):

RDTSC
0x000086FCCA7EBD0E

Next we read out the relevant model-specific register (MSR):

RDMSR 0x3FD
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xF842A000

We wait for 1500ms and then repeat the previous procedure:

RDTSC
0x000086FD78268DC2
RDMSR 0x3FD
High 32bit(EDX) = 0x00007265, Low 32bit(EAX) = 0xFA3F0000

In some cases, the MSR did not advance a single tick, clearly indicating that the core had not entered C6 during the 1.5 second period. Both logical cores of a physical core report the same TSC and MSR values, so it is quite easy to distinguish the real cores from the logical cores that are a result of SMT (Hyper-Threading).
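For reference, the same measurement fits in a few lines of C. The sketch below reads the Nehalem core C6 residency counter (MSR 0x3FD) and the TSC twice and reports the C6 percentage; it assumes Linux's /dev/cpu/N/msr interface purely for illustration, whereas our own tests used a few lines of assembly under Windows.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtsc() */

#define CORE_C6_RESIDENCY 0x3FD   /* Nehalem core C6 residency counter */

/* Read an MSR of one core through Linux's msr driver (root required). */
static uint64_t rdmsr(int cpu, uint32_t reg)
{
    char path[64];
    uint64_t val = 0;
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        pread(fd, &val, sizeof(val), reg);
        close(fd);
    }
    return val;
}

int main(void)
{
    int cpu = 0;   /* ideally pin the process to this core before measuring */

    uint64_t tsc0 = __rdtsc();
    uint64_t c6_0 = rdmsr(cpu, CORE_C6_RESIDENCY);

    sleep(2);      /* measurement window, comparable to our 1.5 second interval */

    uint64_t tsc1 = __rdtsc();
    uint64_t c6_1 = rdmsr(cpu, CORE_C6_RESIDENCY);

    /* The residency counter ticks at the TSC rate, so the ratio of the two
       deltas is the fraction of the window spent power gated in C6. */
    printf("core %d spent %.2f%% of the window in C6\n", cpu,
           100.0 * (double)(c6_1 - c6_0) / (double)(tsc1 - tsc0));
    return 0;
}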

With the "Performance" power plan we get:

"Performance" Power Profile C6
  Clockticks Ticks spent in C6 Percentage C6
Core 1 2913456308 33316864 1.14%
Core 2 2933155470 0 0.00%
Core 3 2950461391 2809569280 95.22%
Core 4 2957802638 0 0.00%

So on average the CPU is in C6 24% of the time, which is quite impressive. However, the way we measure this is not perfect: the measurement puts an extra load (slightly less than a chess thread) on the CPU. So the load on the CPU is not two but rather three threads. This means that the CPU probably spends even more time in C6 mode with two active threads.

Next the same measurement but with the "Balanced" power plan:

"Balanced" Power Profile C6
  Clockticks Ticks spent in C6 Percentage C6
Core 1 2961019252 0 0.00%
Core 2 2991271044 2371919872 79.29%
Core 3 3012220038 74088448 2.46%
Core 4 3012878436 22192128 0.74%

This time we spend a little bit less time in C6: about 21%. Setting the power plan to Performance allows the idle cores to go into deep sleep a little more often, as the active cores are working harder. Of course total power does not decline, as the higher power consumption of the Turbo Boosted cores is much more important than the small effect of some cores spending slightly more time in deep sleep.



How Much Power?

All this hardcore testing just made us more curious. Would we be able to determine how much power the PCU of Nehalem actually saves? Let's add a little more machine code to our hardware C-state scripts. MSR 0x3FC contains the info we need. We test once again with two active chess threads.

PCU Sleep State Comparison
  Clockticks Ticks spent in C3 Ticks spent in C6 Percentage C3 Percentage C6
Core 1 2961889630 3497984 71450624 0.12% 2.41%
Core 2 2989850634 4128768 768581632 0.14% 25.71%
Core 3 3022277437 186195968 1032536064 6.16% 34.16%
Core 4 3033988899 171286528 387645440 5.65% 12.78%
Average       3.02% 18.76%

At first you may think that these measurements contradict our previous measurements even though they were measured in the same circumstances (two active threads + one measurement thread). But if you calculate how much time the cores spend on average in C6, you get 19%, in the same ballpark as our previous measurement (21%). Notice that the PCU forces the Xeon cores to move quickly from C3 to a deeper C6 sleep: only 3% (!) is spent in C3.

So this means that the ACPI C2 state consists of 13.85% C3 and 86.15% C6 (18.76 / (3.02 + 18.76)). Let's take the ACPI readings again.

ACPI C-State Comparison
  % idle C1 C2 C3
Opteron 2435 86 100 0 0
Xeon L3426 81 7 93 0
Opteron 2389 72.44 100 0 0

So now we can calculate how much time the CPU actually spent in the real hardware C-states.

% time spent in C1 = 7% of 81% idle = +/- 5.7%

The "software" ACPI C2 states are mapped by the Xeon CPU to two "hardware CPU" states:

  1. % time spent in C3 = 13.85% of the 93% ACPI C2, at 81% idle = +/- 10.3%
  2. % time spent in C6 = 86.15% of the 93% ACPI C2, at 81% idle = +/- 65%

So our two threads of Chess caused the L3426 cores to spend:

  • 19% in C0
  • 5.7% in C1
  • 10.3% in C3
  • 65% (!) in C6

…on average.
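The bookkeeping above fits in a few lines of code, which also makes it easy to redo with your own measurements:

#include <stdio.h>

/* Split the OS-reported idle time into hardware C-states: 81% idle, of
   which 7% ACPI C1 and 93% ACPI C2, plus the measured hardware split of
   the ACPI C2 bucket (3.02% C3 vs 18.76% C6 residency). */
int main(void)
{
    const double idle    = 0.81;
    const double acpi_c1 = 0.07, acpi_c2 = 0.93;
    const double hw_c3   = 3.02, hw_c6   = 18.76;   /* measured residency, % */

    const double c2_as_c6 = hw_c6 / (hw_c3 + hw_c6);   /* ~86% of ACPI C2 */
    const double c2_as_c3 = 1.0 - c2_as_c6;            /* ~14% of ACPI C2 */

    printf("C0: %4.1f%%\n", 100 * (1 - idle));
    printf("C1: %4.1f%%\n", 100 * idle * acpi_c1);
    printf("C3: %4.1f%%\n", 100 * idle * acpi_c2 * c2_as_c3);
    printf("C6: %4.1f%%\n", 100 * idle * acpi_c2 * c2_as_c6);
    return 0;
}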

What effect would this have on the power consumption of the chip? Intel gives us a good idea of what each C-state consumes with the Xeon X3400 series. In the thermal specifications and design guidelines [6] we find this table.


Intel does not give us C1 power, but let's assume it is 25W on the L3426; our industry sources tell us this should be close enough. If the complex circuitry of the PCU were not available, the CPU would be limited to the C1 state to save power. Other C-states would only be available if all cores were idle or the system was idle. We assume that C0 consumes 45W, which is not far from the truth either, as low TDP CPUs tend to come close to their power limit under load.

Total power w/o PCU
= 45W * 19% (C0) + 25W * 81% (C1)
= 28.8W
Total Power with PCU
= 45W * 19% (C0) + 25W * 5.7% (C1) + 17W * 10.3% (C3) + 4W * 65% (C6)
= 14.5W
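The same weighted average in code form (small rounding differences aside), using the per-C-state power figures assumed above:

#include <stdio.h>

/* Weighted-average package power with the per-C-state figures assumed in
   the text (45W C0, 25W C1, 17W C3, 4W C6) and the residencies derived on
   this page. */
int main(void)
{
    /* Without a PCU the idle time can only be spent in C1. */
    double without_pcu = 0.19 * 45 + 0.81 * 25;
    /* With the PCU most of the idle time ends up power gated in C6. */
    double with_pcu    = 0.19 * 45 + 0.057 * 25 + 0.103 * 17 + 0.65 * 4;

    printf("estimated CPU power without PCU: %.1f W\n", without_pcu);
    printf("estimated CPU power with PCU:    %.1f W\n", with_pcu);
    return 0;
}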

The actual absolute numbers are not that important, but our simplified calculation shows that because the PCU forces the CPU to go very quickly to C6, the "Lynnfield" Xeon morphs from a rather mediocre low power CPU into a "real" low power CPU. 14W for four complex out-of-order cores is very impressive: less than 4W per core! Intel's claims are justified: the PCU enables the "Nehalem" based cores to run in a deep sleep C6 state, even if other cores are hard at work. To end with an interesting note: even with four threads active on the Xeon L3426, we found that the cores spent 11% of the time in C6.



More Performance Please!

Now let's push the processor cores to their best performance.

Fritzmark integer processing: max performance

When the going gets tough, the tough get going. The L3426 has a very tight TDP limit with no headroom for any extra Turbo Boost action. The X3470 has a TDP limit with a large margin, and as a result it's still capable of boosting the clock speed a bit higher (3.066GHz) when running eight threads.

At the same time, this graph shows how superior the integer engine of the "Nehalem" based cores is to AMD's. A 1.9GHz quad-core Xeon offers about the same performance as a 2.9GHz quad-core Opteron. It also shows that for heavily multi-threaded integer applications, AMD's six-core is a decent countermeasure.

integer processing: full load power

Turbo Boost delivers better performance, but it comes with a power price. The impact is not really shocking at the system level - 9% higher power - but it is significant if you only look at the processor power. We are still working on our methodology to measure power at the component level, but looking at the idle power and the spec sheets we can estimate CPU power rather well. At the CPU level, Turbo Boost probably needs 15% to 17% more power at the CPU VRMs.



Overview

So let's summarize what we have seen so far. First we look at system power.


At the system level, the power savings of the Balanced power plan are pretty disappointing. Especially when benchmarking with two threads, we would have expected better power savings. Of course, we tested with only one CPU; the power savings should improve as you add more CPUs to the server. On the X3470 the power savings are better, but the reason is not so much "SpeedStep" as the fact that the Balanced plan turns off Turbo Boost. The most spectacular power savings cannot be seen in this picture: the automatic power and clock gating. As we have shown in this article, power and clock gating happen in both power plans and are responsible for some very significant savings, especially on the "Nehalem Lynnfield" Xeons. SpeedStep and PowerNow! are no longer very impressive for the following reasons:

  • Clock gating and "deep sleep" core C-states already save a lot of power
  • They are limited to frequency scaling; voltage scaling is only possible if all cores are running at a low P-state

In the case of Intel, frequency scaling is demoted to a pretty insignificant role: it is more important to power gate cores than to clock them down to a lower P-state.


For AMD, P-state changes are still important, but their effect is dubious: performance is up to 20% lower. Even the specific AMD driver in Windows 2008 (amdppm.sys) is not capable of working optimally under low load. Cores frequently stay at 800MHz too long or only get to 1400MHz before a thread is moved to another core. Ideally our two thread benchmark should get at least two CPU cores to quickly ramp up to 2.6GHz and stay there, but that's not what happened in our tests.

We'll make a simple performance/watt calculation by multiplying the performance numbers by 1000 and dividing them by our power numbers.


Look at the green bars of the AMD Opteron processors: performance/watt is clearly lower when running in Balanced mode. This is of course not the case when we run four threads on the quad-core and six threads on the six-core Opteron. In that case the operating system gets very few chances to drop the power. Still, it is important to note that PowerNow! results in a significant performance loss, a performance loss that is not justified by the meager power savings. The result is that the six-core Opteron in Performance mode offers the best performance/watt ratio when we focus on the Opteron family. It is a shame that the Windows 2008 CPU driver does not adapt better: the six-core Opteron is pretty competitive in performance/watt.

The situation is a lot more complex in the Intel Xeon family. With low thread counts, the Xeon is capable of using Turbo Boost. From a "system power" perspective, power is only a bit higher, while performance goes up by a third. But with low TDPs, there is little wiggle room, and Turbo Boost is quickly put out of action. With more than two threads, the L3426 never clocks above 1.86GHz. With "normal" Xeons, there is no significant difference between the two power plans.



Limitations

First of all, let's discuss the limitations of this review. The benchmark we used allowed us to control the number of threads very accurately, but it is not a real world benchmark for most IT professionals. The fact that it is an integer dominated benchmark means that it has some relevance, but it's still not ideal. In our next article we will be using MS SQL Server 2008 R2. That will allow us to measure power efficiency at a certain performance level, which is also much more relevant than pure performance/watt. Also, the low power six-core Opteron 2419 EE is missing. This CPU just arrived in the labs as we finished this article, so expect an update soon.

"Academic" Conclusion

The days when dynamic frequency scaling offered significant power savings are over. The reason is that you can only lower the voltage if you scale the complete package down to a lower clock. In that case the power savings are considerable (P ~ V²), but we did not encounter that situation very often. No, both AMD and Intel favor the strategy of placing the idle cores in higher C-states. The most important power savings come from fine grained clock gating, from placing cores in a completely clock gated C-state (AMD's Smart Fetch + C1), or even better, placing them in a power gated state (Intel's power gating into the deep C6 sleep).

Practical Conclusions

Windows 2008 makes you choose between Balanced and Performance power plans. If your application runs at idle most of the time and you are heavily power constrained, Balanced is always the right choice. But in all other cases, we would advise using the "Performance" plan for the Opterons. For some reason, the CPU driver does not deliver the performance that is demanded. With Balanced, when you ask for 25% of the total CPU performance, you'll get something like 15% to 20%. The result is that you get up to 25% less performance than the CPU delivers in "Performance" mode, without significant power savings. That's not good. We can already give away that we saw response time increases in MS SQL Server 2008 due to this phenomenon. It is also worth saying that our new measurements confirm that the performance/watt ratio of the six-core Opterons is significantly better than the quad-core Opterons.

The Xeons are a different story. For the normal 95W Xeons it makes sense to run in Balanced mode. The "base" performance is excellent and Turbo Boost adds a bit of performance but also quite a bit of power. Ideally, it should be possible to run in Balanced mode and use Turbo Boost when your application is performing a single threaded batch operation, but unfortunately this is not possible with the default Windows 2008 settings.

For the low power Xeons, it is different. Those CPUs run closer to their specified TDP limit and will rarely use Turbo Boost once they are loaded at 25% or more. If your application is limited by regular single threaded batch operations, it makes a lot of sense to choose the Performance plan. Turbo Boost pays off in that case: the clock speed is raised from a meager 1.86GHz to an impressive 3.2GHz. As Xeons based on the "Nehalem" architecture place idle cores in C6 very quickly, the Performance mode hardly consumes more than the Balanced mode. As we have shown, frequency scaling does not save much power, as most of the idle cores are power gated automatically. This aggressive "go to C6 sleep" policy allows the architecture with the highest IPC in the industry to morph into a high performance server CPU with modest power consumption. There is a huge difference between this CPU in a machine where it is pushed towards 100% load and in a server where it hovers between 20 and 70% load most of the time. The latter situation allows the CPU to put cores in C6 mode a significant amount of the time. As a result the power savings in a server environment are nothing short of impressive.

Now that we understand the nuts and bolts, we are able to move on to our next question: How can we get the best power efficiency at a certain performance point? We will follow up with a power efficiency case study based on SQL Server 2008.

References

[1] "Planet Google": One Company's Audacious Plan to Organize Everything, page 82, Randall Stross, Free Press New York.

[2] "AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet Publication Revision: 3.07, September 2009"

[3]"Power Reduction through RTL Clock Gating," F. Emnett and M. Biegel, SNUG (Synopsis User Group) Conference San Jose, 2000.

[4] "45nm Next Generation Intel Core™ Microarchitecture (Penryn)", Varghese George Principal Engineer Intel Corp, HOT CHIPS 2007

[5] "Analysis of Dynamic Power Management on Multi-Core Processors", W. Lloyd Bircher and Lizy K. John, The University of Texas at Austin. ICS '08 June 2008

[6] "Intel Xeon Processor 3400 series thermal/mechanical specifications and design guidelines, December 2009
