
Intel Core i7 (Nehalem) Performance Preview
November 02, 2008

Summary: Performance junkies look out: Intel's next-generation Nehalem CPU has arrived! The CPU's architecture has been designed from the ground up to deliver improved IPC, while it's also capable of dynamically OC'ing itself to further enhance performance. See how the new Core i7 CPUs stack up against CPUs ranging from the Core 2 Duo and Athlon X2 6000+ up to the Core 2 Extreme QX9770. We also managed to OC these chips as high as 4.08GHz. Read the full scoop inside!




If you recall, this was one of the chief weaknesses in Core 2’s predecessor, Pentium 4/D. Pentium 4 processors sacrificed the amount of work performed per clock in exchange for more pipeline stages, 31 in the case of later Pentium D processors. Essentially, Intel made a conscious decision to sacrifice IPC in exchange for higher clock speeds. Ultimately this decision came back to haunt them when Pentium 4/D had trouble scaling to clock speeds of 4GHz and beyond.

Core 2 never hit the clock speeds of Pentium 4, but because of its improved IPC, it didn’t have to in order to achieve breakthrough performance.

But Intel didn’t stop there. To further enhance performance, Core 2 also featured more accurate branch prediction, improved SSE/SSE2/3 performance, and a unified L2 cache with more advanced prefetchers residing in the L1 and L2 caches to reduce memory access.

Ultimately Core 2 was over two times faster than Intel’s previous Pentium processor, and it also significantly outperformed AMD’s fastest Athlon X2 and FX processors, all while drawing very little power and leaving tons of frequency headroom for overclockers. It wasn’t uncommon for Core 2 Duo E6300 and E6400 chips to push 3GHz.


Late last year Intel gave Core 2 a midlife upgrade with their Penryn architecture. Besides its smaller 45-nm manufacturing process, Penryn also featured double the divider speed over Conroe when handling math computations and a new super shuffle engine. This is a 128-bit wide, single-pass shuffle unit that improved Penryn’s performance with SSE2, SSE3, and SSE4 instructions that have shuffle-like operations.

Penryn was also the first Intel processor to support SSE4.

The final ingredients Intel added to Penryn to improve performance were faster bus speeds and a larger L2 cache. Quad-core chips shipped with up to 12MB of L2 cache while dual-core parts featured 6MB of L2.

As a result of all these improvements, Penryn generally performed around 10-15% faster than Conroe/Kentsfield clock-for-clock. In apps that took advantage of SSE4, this advantage was even greater. In comparison, AMD’s fastest Phenom CPU, the Phenom 9950, is just now approaching the performance of Intel’s older quad-core Kentsfield CPUs like the Core 2 Quad Q6600 and Q6700.

And now, just as AMD’s approaching the eve of the arrival of their first 45-nm CPUs, Intel’s back again with the “tock” of their tick-tock model that follows every process shrink (in this case Penryn) with a next-generation microarchitecture (Nehalem) each year.

As you probably know by now, Intel’s next-generation microarchitecture (previously codenamed Nehalem) was officially given a brand name by Intel in August of this year: Core i7. Over the course of the past 18 months, Intel has slowly divulged most of the tech goodies that make up Core i7, including its integrated memory controller, Intel’s QuickPath Interconnect (Intel’s equivalent of AMD HyperTransport that previously went under the codename CSI), its new L3 cache, the return of Hyper-Threading, and Nehalem’s Turbo Mode. We’re going to briefly go over these changes before we take a look at the new Core i7 platform and the processors behind it.



Nehalem Architecture



This modular design helps to reduce power consumption. Features like the memory controller and QPI all run at voltages independent of each other.

Intel has incorporated a number of improvements into Nehalem that are designed to improve IPC. For instance, the number of micro-ops (microinstructions) in flight has increased from 96 in Conroe/Penryn to 128 in Nehalem. Intel also increased the size of the load and store buffers to ensure that they wouldn’t become a limiting factor.

Intel also improved Nehalem’s branch prediction. A new second-level branch target buffer has been added to improve branch prediction in applications that have large footprints such as databases. This second predictor has a much larger history table which should allow it to predict branches more accurately than the first level predictor. Intel has also added a new renamed return stack buffer (RSB). RSBs store forward and return pointers associated with call and return instructions. The RSB should help Nehalem avoid return instruction mispredictions.

With its faster synchronization primitives, Nehalem has also been tweaked to handle threaded software better.

Speaking of threading, with Nehalem we see the resurgence of simultaneous multi-threading (Hyper-Threading). With Hyper-Threading, one processing core can run two threads at the same time. With four processing cores inside Core i7, the OS “sees” eight logical cores and can schedule eight threads on the CPU at once, effectively doubling the number of threads that Nehalem can run simultaneously compared to a conventional quad-core CPU.

Whereas Hyper-Threading (HT) never really took off on the Pentium 4, Intel feels that Nehalem has a distinctive HT advantage thanks to its larger cache and greater memory bandwidth, all of which should allow it to deliver better HT performance. Additionally, there are also more apps capable of taking advantage of HT than there were a few years ago. As you’ll see in our Lost Planet, Cinebench, and Valve benchmarks, Nehalem delivers a significant performance increase in HT-aware apps.

New cache subsystem

While Nehalem has the same 32KB instruction/32KB data L1 cache configuration as previous Core 2 CPUs, Intel has totally revamped the L2 cache and added a new L3 cache.

Nehalem’s L2 cache is much smaller than Penryn’s. Each core has its own 256KB L2 cache for handling data and instructions. While this is significantly less than previous processors offered, Nehalem’s L2 has lower latency than its predecessors’.

In addition to the L1 and L2 caches, Nehalem, like AMD’s Phenom, also features an L3 cache that is shared across all the cores. Unlike Phenom’s, however, Nehalem’s L3 is inclusive rather than exclusive. Intel feels that this inclusive architecture gives them an advantage over AMD, as an exclusive L3 doesn’t hold copies of the data in the lower-level L1 and L2 caches. As a result, if a data request misses in an exclusive L3 cache, each processor core must be snooped (searched) in case its L1 or L2 cache has the requested data. This increases latency and snoop traffic between the cores.

With Nehalem these snoops are unnecessary: because the inclusive L3 holds a copy of everything in the L1 and L2 caches, a miss in L3 tells the CPU that the data doesn’t reside in any core’s L1 or L2. This helps to reduce latency and thus improve performance, while also reducing power consumption.
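
The difference between the two policies can be sketched with a toy model. The Python below is only an illustration under simplifying assumptions: caches are plain sets of addresses, each core's L1 and L2 are merged into one private set, and the function name `snoops_on_l3_miss` is made up for this example. It is not a description of how the hardware is actually built.

```python
def snoops_on_l3_miss(address, l3, private_caches, inclusive):
    """Return how many cores must be snooped to look for `address`.

    l3: set of addresses held in the shared L3 cache.
    private_caches: one set per core, standing in for that core's L1 + L2.
    inclusive: True if L3 is guaranteed to hold everything in the private caches.
    """
    if address in l3:
        return 0  # L3 hit: this toy model does not search the cores at all
    if inclusive:
        # An inclusive L3 keeps a copy of every privately cached line,
        # so a miss here proves that no core holds the data.
        return 0
    # Exclusive L3: the line may still live in some core's private cache,
    # so every core has to be checked.
    return len(private_caches)


# Hypothetical four-core example: two cores hold one line each.
cores = [{0x10}, {0x20}, set(), set()]
inclusive_l3 = {0x10, 0x20, 0x30}  # contains copies of the private lines
exclusive_l3 = {0x30}              # holds only lines not in the private caches

print(snoops_on_l3_miss(0x40, inclusive_l3, cores, inclusive=True))
print(snoops_on_l3_miss(0x40, exclusive_l3, cores, inclusive=False))
```

In this sketch the same L3 miss costs no snoops with the inclusive cache and one snoop per core with the exclusive one, which is the latency and traffic difference Intel is pointing to.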

Like its two-level branch prediction, Nehalem features a two-level translation lookaside buffer (TLB), with 512 entries in the new second level. Nehalem is the first CPU to feature a second TLB. This is another improvement Intel has incorporated into Nehalem to improve its performance with server apps like large databases.


SSE4

Nehalem is Intel’s first CPU to offer SSE4.2 support. Seven new application-targeted accelerators have been added to the instruction set. Four of them provide improved performance in string and text processing operations; one example Intel provides is the parsing of XML files at a much higher speed. Two more instructions are focused on accelerated searching and pattern recognition of large data sets (useful for voice/handwriting recognition), and the seventh is a CRC instruction focused on new communications capabilities such as accelerated network attached storage.
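
For a sense of what the CRC instruction takes over from software, here is a bit-by-bit checksum written out by hand. We're assuming the CRC-32C (Castagnoli) polynomial, which is the one the SSE4.2 CRC32 instruction uses; the function name `crc32c` and the initial and final inversion are the usual software convention, not something taken from Intel's presentation.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial in bit-reflected form


def crc32c(data: bytes) -> int:
    """Compute CRC-32C one bit at a time (slow, for illustration only)."""
    crc = 0xFFFFFFFF  # conventional starting value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # conventional final inversion


print(hex(crc32c(b"123456789")))  # standard CRC-32C check value: 0xe3069283
```

The inner loop runs eight times per byte here, which is the kind of work a dedicated instruction can collapse into a single operation on several bytes at once.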



QPI, Turbo Mode

Intel QuickPath Interconnect

Rather than relying on the FSB for yet another processor, Intel has developed their QuickPath Interconnect (QPI) to link the CPU to the outside world.

QPI is a high-speed, point-to-point interconnect that links CPU to CPU and CPUs to the I/O hub. The QuickPath Interconnect boasts links running at up to 6.4 gigatransfers/second; one link is the equivalent of 12.8GB/sec of bandwidth in each direction, and since it’s bi-directional, QPI effectively delivers 25.6GB/sec of total bandwidth. In comparison, Conroe’s 1066MHz FSB topped out at 8.5GB/sec of peak bandwidth.
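
The bandwidth figures above follow from simple multiplication, sketched here in Python. The 2 bytes of data per transfer per direction for QPI and the 8-byte (64-bit) width for the FSB are our assumptions for the arithmetic, and the function names are made up for this example.

```python
def qpi_bandwidth_gb_s(gigatransfers_per_sec, bytes_per_transfer=2, directions=2):
    """Peak QPI link bandwidth in GB/sec.

    bytes_per_transfer=2 is an assumption (data payload per direction).
    """
    return gigatransfers_per_sec * bytes_per_transfer * directions


def fsb_bandwidth_gb_s(megatransfers_per_sec, bus_width_bytes=8):
    """Peak front-side bus bandwidth in GB/sec, assuming a 64-bit wide bus."""
    return megatransfers_per_sec * bus_width_bytes / 1000


print(qpi_bandwidth_gb_s(6.4, directions=1))  # one direction of a 6.4GT/sec link
print(qpi_bandwidth_gb_s(6.4))                # both directions combined
print(qpi_bandwidth_gb_s(4.8))                # the slower 4.8GT/sec link speed
print(fsb_bandwidth_gb_s(1066))               # Conroe's 1066MHz FSB
```

Under these assumptions the 6.4GT/sec link works out to 12.8GB/sec each way and 25.6GB/sec in total, against roughly 8.5GB/sec for the 1066MHz FSB.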

Integrated memory controller

Nehalem sports an integrated triple-channel memory controller that supports DDR3 memory exclusively. Memory clocks are limited to just two speeds: 800MHz DDR3 and 1066MHz DDR3. Nehalem can run with faster DDR3-1333 and DDR3-1600 memory, but in this case the modules would be underclocked to run at 1066MHz (unless of course you decide to OC).

By integrating the memory controller on the processor die, memory latency is dramatically reduced.

Obviously with a triple-channel memory controller you’ll have to install memory modules in groups of three rather than in pairs. Like Phenom, Nehalem’s memory controller supports NUMA (non-uniform memory access). In a multi-socket system each CPU will have its own local memory, so you’ll need six modules for, say, a 2P server to deliver optimal performance.


Turbo Mode

One lesson Intel’s learned over the years is just how slow the software industry is to adapt to the multi-core CPU world we live in today. Games, for instance, are just now being written with dual-core in mind, and there are only a handful of titles that truly take advantage of quad-core. As Intel goes from two, to four, and eventually eight processing cores in the future, there’s potential that many of these additional cores will sit idle, completely untapped by the software. With this in mind Intel has built a new power control unit (PCU) right onto the CPU die. The PCU is solely responsible for power management, actively monitoring the cores for aspects such as utilization and temperature. The PCU can then completely shut off cores that aren’t being used, helping to reduce overall CPU power consumption. This brings us to Turbo Mode.

In cases where cores aren’t being used, the PCU can shut down those cores and selectively overclock the core(s) that are being taxed. Say for instance you’re running a single-threaded game. In this case the PCU shuts off three processing cores and overclocks the one core you’re using. All this is completely invisible to the OS and the end user, providing a performance boost without any intervention on your part.

The PCU will OC the active core by up to two clock speed bumps (+266MHz) max, so your 2.66GHz CPU becomes a 2.93GHz processor. If the PCU detects that your power usage, current, or temps are too high at that level, it will automatically drop you down to just one speed bump (+133MHz), knocking you down to 2.80GHz.

If you can keep your power usage and core temps down, the PCU will potentially run all four cores at 266MHz over your CPU’s base clock frequency. Keep in mind that this also applies when OC’ing, so if you’ve overclocked your CPU 533MHz over stock in BIOS, Turbo Mode will OC the processor core(s) another 133MHz or 266MHz above that, netting a clock speed boost of up to 800MHz.
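
To make the Turbo Mode arithmetic concrete, here's a rough sketch in Python. It assumes each speed bump is one extra multiplier step on the base clock (which is how a +133MHz bump works out at stock), it uses 133.33MHz for the base clock where the text rounds to 133MHz, and the function name `turbo_clock_mhz` is ours, not Intel's.

```python
BASE_CLOCK_MHZ = 133.33  # stock base clock; the text rounds this to 133MHz


def turbo_clock_mhz(multiplier, turbo_bins=0, base_clock_mhz=BASE_CLOCK_MHZ):
    """Core clock in MHz with 0, 1, or 2 Turbo Mode speed bumps applied.

    Each bump is modeled as one extra multiplier step (an assumption).
    """
    if not 0 <= turbo_bins <= 2:
        raise ValueError("this sketch models at most two speed bumps")
    return (multiplier + turbo_bins) * base_clock_mhz


# Core i7-920 at stock: 20x multiplier with zero, one, and two speed bumps.
for bins in (0, 1, 2):
    print(bins, round(turbo_clock_mhz(20, bins)))
```

With the multiplier at 20x this works out to roughly 2667, 2800, and 2933MHz, matching the 2.66GHz, 2.80GHz, and 2.93GHz steps described above.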

Turbo Mode is a feature that can be completely adjusted in BIOS, so if you’re reluctant to OC your processor you can turn it off, or if you’re an enthusiast who loves to OC, you can tweak Turbo Mode settings in your motherboard’s BIOS to get just the right clock speed. We’ll be discussing Turbo in more depth further in this article.



The Core i7 CPUs

Intel is offering three Core i7 SKUs at launch: the flagship Core i7 965 Extreme Edition clocked at 3.2GHz, the midrange Core i7 940 running at 2.93GHz, and the entry-level Core i7 920 which runs at 2.66GHz:

Intel Nehalem SKUs

Processor                  Core i7-965 Extreme Edition  Core i7-940           Core i7-920
Clock Speed                3.20GHz                      2.93GHz               2.66GHz
QPI Speed                  6.4 Gigatransfers/sec        4.8 Gigatransfers/sec 4.8 Gigatransfers/sec
L3 Cache Size              8MB                          8MB                   8MB
Unlocked Clock Multiplier  Yes                          No                    No
Memory Speed Support       DDR3-1066                    DDR3-1066             DDR3-1066
TDP                        130W                         130W                  130W
Price                      $999                         $562                  $284



Nehalem is built on Intel’s 45-nm high-K metal gate manufacturing process, with a die size of 233 square millimeters and approximately 731 million transistors. In comparison, Penryn’s transistor count was 820M on a 214mm² die.


Memory Compatibility

As some sites have mentioned ahead of the Nehalem launch, the CPU officially supports DDR3 memory rated up to 1.6V. According to Intel, memory running at voltages higher than 1.6V can potentially damage the CPU. Most memory manufacturers have announced their own triple-channel Nehalem-ready memory kits ahead of today’s launch, and we recommend that anyone interested in building their own Nehalem system go with one of these kits. Intel will also be providing a list of certified memory modules on their developer website that you’ll want to check out before purchasing anything.

We opted to play it safe for now and stick with the stock memory voltage for all our Nehalem testing.

Motherboard Compatibility

Intel’s X58 chipset is the only platform that supports Core i7 at this time. X58 is Intel’s flagship chipset, with support for up to 36 PCIe 2.0 lanes. Supported PCI Express graphics configurations include 1x16, 2x16, and 4x8, with the chipset supporting ATI CrossFire and NVIDIA SLI (although, as we’ve reported in the past, motherboard manufacturers must submit their X58 boards to NVIDIA for proper SLI certification).

We’re going to try and do a dedicated SLI/CrossFire article around the end of this month.

The motherboard we used for our Core i7 testing is Intel’s own DX58 Smackover board. The Smackover board is a fairly nice board, with a good layout and enough features to please the mainstream user, although enthusiasts will probably want to opt for a higher-end motherboard from ASUS, EVGA, Gigabyte, or MSI with 6 DIMM slots and CrossFire/SLI support (Smackover doesn’t support SLI at this time).


The motherboard offers base clock speeds up to 240MHz in 1MHz increments. Nehalem’s stock base clock is 133MHz: the i7-920 relies on a multiplier of 20.0x (20.0x133=2660MHz), the 940’s multiplier is 22.0x (22.0x133=2926MHz), and the 965 has a multiplier of 24.0x. Memory multipliers of 6.0x and 8.0x are also selectable in BIOS (6.0x133=800MHz DDR3, 8.0x133=1066MHz DDR3), as well as 10.0x (1333MHz DDR3) and 12.0x (1600MHz DDR3). The latter two multipliers were only selectable for our Extreme Edition CPU, however.
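
Since both the CPU clock and the memory speed are derived from the same base clock, raising it in BIOS moves everything at once. The short Python sketch below runs the multiplication for the multipliers listed above; the helper names are ours, and the 180MHz base clock in the second call is just an example overclock setting.

```python
def clock_mhz(base_clock_mhz, multiplier):
    """Clock (or DDR3 data rate) produced by a base clock and a multiplier."""
    return base_clock_mhz * multiplier


STOCK_BASE_CLOCK_MHZ = 133  # rounded figure used in the text

CPU_MULTIPLIERS = {"Core i7-920": 20, "Core i7-940": 22, "Core i7-965": 24}
MEMORY_MULTIPLIERS = (6, 8, 10, 12)


def report(base_clock_mhz):
    """Print CPU clocks and memory speeds for one base clock setting."""
    for name, multiplier in CPU_MULTIPLIERS.items():
        print(f"{name}: {clock_mhz(base_clock_mhz, multiplier)}MHz")
    for multiplier in MEMORY_MULTIPLIERS:
        print(f"memory {multiplier}x: {clock_mhz(base_clock_mhz, multiplier)}MHz DDR3")


report(STOCK_BASE_CLOCK_MHZ)  # stock settings
report(180)                   # example overclock: base clock raised to 180MHz
```

Because 133MHz is a rounded figure, the stock memory results come out at 798 and 1064 rather than the nominal 800 and 1066. At a 180MHz base clock even the 6.0x memory multiplier lands at 1080MHz, so memory settings need a second look whenever the base clock goes up.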

In terms of voltages, the board provides CPU voltage settings up to 1.6V in 0.0125V increments, chipset voltages up to 1.50V (0.025V increments) and voltages for the QuickPath Interconnect up to 1.8V (0.025V increments). Memory voltage settings up to 2.5V are available in increments of 0.04V. The QPI data rate is also adjustable.


Overclocking

We were pleasantly surprised with how far we were able to push our Core i7 processors. The Core i7-920 managed to hit 3.6GHz (20.0 multiplier x 180MHz host bus) with 1.4875V of juice, with the chip pushing 3.9GHz thanks to Turbo Mode. At stock voltage the chip maxed out at 3.1GHz (20x155MHz bus).

We were actually able to run the system at even higher speeds within Windows, but the system wasn’t 100% stable at those speeds and suffered the occasional Windows crash. With more voltage we’re pretty sure we could’ve got the PC to run stable, but we weren’t willing to crank the voltage up beyond 1.5V with our shiny new Core i7 processor.


The Core i7-965 EE went even further, hitting 4.08GHz (30.0 x 136) with 100% stability. Once again we needed 1.4875V to get everything running stable, although in this chip’s case we’re pretty confident we hit the ceiling of its capabilities. At any higher speeds Windows failed to load.

To cool the processor, we used a Thermalright Ultra-120 eXtreme RT for all our OC attempts.




System Setup

Intel Core 2 Extreme Edition QX9770
Intel Core 2 Quad Q9650
Intel Core 2 Quad Q6700
Intel Core 2 Duo E8600
Intel Core 2 Duo E6400

ASUS P5E3 Premium

4GB (4x1GB) OCZ DDR3 PC3-16000 Platinum

Intel Core i7-965 Extreme Edition
Intel Core i7-920

3GB (3x1GB) Qimonda 1067 CL7 non-ECC

AMD Athlon X2 6000+
AMD Phenom 9950

ASUS M3A32-MVP Deluxe

4GB (4x1GB) OCZ DDR2 PC2-8500 Platinum

80GB Intel X25-M Solid State HDD

Windows Vista Ultimate 64-bit w/Service Pack 1


Benchmarks

Lost Planet
World In Conflict
Crysis
Far Cry 2
PCMark Vantage



Synthetic benchmarks

PCMark Vantage

SiSoft Sandra 2009

Valve Particle Simulation Benchmark



Power/Media Encoding/Rendering Benchmarks



World In Conflict Performance

World In Conflict – Direct3D



Company Of Heroes

Company of Heroes – Direct3D



Crysis

Crysis – Direct3D



Lost Planet

Lost Planet – Direct3D



Overclocking



Conclusion


As glowingly as we all raved about Conroe and its Penryn successor, however, things weren’t as rosy for Intel in the server space. While Phenom has been a lackluster performer on the desktop, its server equivalent, Barcelona, is highly popular among the IT crowd, particularly as you ramp up the number of CPUs. In this realm AMD is much more competitive with Intel. Nehalem is designed from the ground up to counter this very real threat.

Nehalem’s QuickPath interconnect is Intel’s answer to HyperTransport, while the chip also sports an integrated memory controller and L3 cache just like AMD. The second level TLB and branch predictor should improve Nehalem’s performance when dealing with large data sets and the chip also features improved virtualization; all these goodies inside Nehalem should improve Intel’s standing in the server segment.

But what about us gamers?

Fortunately some of these enhancements also benefit gaming. The integrated memory controller and QPI reduce latency and improve peak bandwidth, while the triple-channel memory improves overall memory bandwidth. Hyper-Threading is another new feature that could pay dividends if the app is multi-threaded. The only problem is that most games are only dual-threaded, with only a handful of RTS and FPS titles using four or more threads. In this article we tested most of them: World In Conflict, Far Cry 2, Crysis, and Lost Planet. In the case of Far Cry 2, the Core i7-965 Extreme Edition ran 7% faster than Intel’s fastest quad-core Penryn, the QX9770 (this is the same margin as the multi-threaded RTS WiC), while Lost Planet ran up to 32% faster on the i7-965. Finally, the Core i7-965 Extreme Edition ran 8% faster than the Core 2 Extreme QX9770 in Crysis.

Other than Lost Planet, this probably isn’t the earth shattering performance improvement some gamers may have been hoping for.

At the same time however, the Core 2 Extreme QX9770 is one blazing-fast chip. Our benchmarks were run with DDR3-1600MHz memory and obviously a 1600MHz FSB. When you compare Core i7’s performance against more conventional Penryn CPUs and the Core 2 Quad Q6700, the Nehalem CPUs really begin to shine.

What’s really remarkable is the performance showing of Intel’s $284 Core i7-920. Despite its pedestrian 2.66GHz clock speed, this chip was able to give the QX9770 a run for its money in most of our gaming benchmarks. This is without a doubt the chip we’d wholeheartedly recommend to our readers interested in upgrading to the Core i7 platform. With a little bit of OC’ing, this sub-$300 chip becomes even more of a screamer.

The biggest downside to Core i7 is probably the cost. Keep in mind we’re not referring to the price of the CPUs themselves; in fact, we feel Intel has priced the CPUs very aggressively considering the performance you’re getting. The Core i7-920 is only a little slower than the QX9770 yet it costs significantly less, while the Core i7-940 is also priced to move at $562. You can even make an argument that the Core i7-965 is a steal at $999. It is, after all, the world’s fastest processor, and it's priced $400 less than the QX9770.

The real problem Core i7 faces is the cost of its underlying platform. X58 motherboards are expected to sell for $300+ when they go on sale later this month, while triple-channel memory kits currently start at $125. That’s over $400 that you’ll have to spend to upgrade to Core i7 before you even pick up the processor (assuming you don’t already have DDR3 memory).

Fortunately Core i7’s enhancements can really pay dividends with the right software, and for some users a Core i7 upgrade would be a worthwhile investment. While Lost Planet was the only game that showed a substantial performance improvement thanks to Hyper-Threading, our 3D rendering apps are all multi-threaded, and here Core i7 blew away the QX9770. Over time multi-threaded apps like these will continue to become more prevalent, eventually becoming the norm rather than the exception. If you’re the type of user who only upgrades his processor once every few years, you should definitely keep this in mind.

So there you have it, our take on Core i7. Unlike Conroe, Intel’s latest microarchitecture delivers an evolutionary rather than revolutionary performance increase over its predecessor, although in some apps it has the potential to deliver performance that’s truly groundbreaking. Core i7 is without a doubt the finest processor Intel’s ever produced and we don’t see AMD delivering anything that’s performance competitive with this CPU in the near future.

The only downside is we wish Intel offered a lower cost alternative to X58 at launch. As it stands now, the Core i7 CPU we’re recommending most, the Core i7-920, will probably end up selling for about the same price as the X58 motherboard underneath it. The cost of upgrading to the Core i7 platform is probably going to keep a lot of enthusiasts on a budget from upgrading today, and that’s a shame in our opinion, as it’s certainly a fun platform to play with. Turbo Mode in particular is a really exciting feature.

In any case, Intel’s done it again boys and girls. Core i7 is indeed a pretty sweet CPU. If Intel continues to execute on their roadmap like this, AMD could have a hard time playing catch up at the high-end of the CPU market. Intel’s clearly the king when it comes to CPU performance.


© Copyright 2003 FS Media, Inc.