Does This Look Familiar?
Although Intel’s Gelsinger made no reference to his competition, he had to know the press would, at some point, put some of the design elements of Nehalem up against AMD’s Barcelona and Phenom architectures. As far as we see it, there’s nothing wrong with that. It’s been said over and over that AMD has an extremely elegant solution. Even if it isn’t winning gold medals for performance right now, the company’s engineering principles are solid.
The die shot Intel showed us featured four cores. However, Gelsinger says Nehalem will scale from two to eight cores. Each core has its own L1 and L2 cache, along with access to a central pool of shared, inclusive L3 cache. And the execution resources are fed with data piped in through an integrated three-channel DDR3 memory controller. If that isn’t Phenom-ish enough for you, logic for a HyperTransport-like QPI (QuickPath Interconnect) officially replaces the front-side bus Intel has relied on for so long.
Nehalem, Intel's 45nm tock, due in early '09
One of the keys to Nehalem, according to Gelsinger, is its scalability. As mentioned, Intel can build the 45nm chips with as few as two or as many as eight cores. It can implement one QPI link or more, if the bandwidth is needed. The L3 cache and memory controller are also separate components. Hopefully, as memory technology evolves, the modularity of the memory controller will allow Intel to adjust the logic in kind.
The last component of Intel’s scalability story is an integrated graphics block. High-end Nehalem-based processors will naturally rely on discrete graphics for optimal performance. But the more mainstream models will include integrated graphics on the CPU package (not on-die). Gelsinger wasn’t ready to elaborate on the potency of Intel’s upcoming integrated graphics solution. However, he did say it’d be an evolutionary step forward from what we see built-in today and not related to the company’s work with Larrabee. Given what we’ve seen from G35, Intel has significant work to do before it’s able to compete with the performance of AMD’s graphics technology.
Intel plans to use Nehalem's modularity to create a complete product family
A Micro-Architecture Unveiled
Intel is singing a much different tune today than it was during NetBurst’s heyday. Back then the story was all about clock speed and tweaking the execution core in whatever way enabled the fastest frequencies. Now Intel is focused on maximizing IPC and managing power.
Nehalem retains the ability to process four instructions per clock cycle and in that way is similar to the Core 2 Quad preceding it. Intel is bolstering performance by bringing back SMT (we knew it as Hyper-Threading back in the day), extending the SSE4 instruction set, adding more cache, as mentioned, and improving the way that data moves through the entire platform, ideally delivering two to three times more peak bandwidth, according to Intel. At the same time, Nehalem chips will sport dynamically managed cores, threads, cache, and interfaces, which we interpret to mean the processor’s building blocks will be throttled up and down, or even turned off completely, in response to load. This is really good news if you’re one of those power users eyeballing the virtualization features in Windows Server 2008, for example. One powerful Nehalem processor, complemented by several gigs of memory, can drive three or four virtualized operating systems without breaking a sweat. Then, when the heavy lifting is done, it can scale back to a much more energy-efficient state. It remains to be seen just how granular Intel gets with power management in its next-gen architecture.
There are a handful of on-chip changes that’ll speed things up as well. Nehalem boasts increased parallelism, boosting the number of micro-ops (pieces of x86 instructions) that can be in flight at any given time. Intel enhanced commonly used algorithms to help minimize “dead cycles,” too. Expect to see gains in threaded software, where faster synchronization primitives improve performance. Intel’s branch predictor (a tool for guessing whether a conditional branch is taken or not) should be more accurate thanks to a new second-level branch target buffer and a new renamed return stack buffer. Gelsinger says to expect the second-level BTB to improve branch predictions in apps with large code footprints, like databases. The RSB should help Nehalem avoid return instruction mispredictions.
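For readers wondering how a return stack buffer avoids return mispredictions in the first place, here is a minimal toy model in Python. It is a generic textbook-style sketch, not Intel's design: the class name, the trace format, and the 16-entry depth are all our own assumptions, and the "renamed" aspect of Nehalem's RSB is not modeled at all. The idea is simply that each call pushes its return address and each return pops the most recent one as the prediction.

```python
from collections import deque


class ReturnStackBuffer:
    """Toy return stack buffer. Illustrative only; not Intel's implementation."""

    def __init__(self, depth=16):  # depth is a hypothetical value
        # A bounded deque silently drops the oldest entry when it overflows.
        self._stack = deque(maxlen=depth)

    def on_call(self, return_address):
        # A call records where the matching return should land.
        self._stack.append(return_address)

    def predict_return(self):
        # A return is predicted to go to the most recently pushed address.
        return self._stack.pop() if self._stack else None


def count_mispredictions(trace, depth=16):
    """Replay (kind, address) events and count wrongly predicted returns."""
    rsb = ReturnStackBuffer(depth)
    misses = 0
    for kind, address in trace:
        if kind == "call":
            rsb.on_call(address)
        elif kind == "ret" and rsb.predict_return() != address:
            misses += 1
    return misses


# Two nested calls followed by their two returns (made-up addresses).
trace = [("call", 0x100), ("call", 0x200), ("ret", 0x200), ("ret", 0x100)]
print(count_mispredictions(trace))           # buffer deep enough for the nesting
print(count_mispredictions(trace, depth=1))  # buffer too shallow for the nesting
```

With enough entries, both returns are predicted correctly. Shrink the buffer to a single entry and the outer return address is lost when the inner call overwrites it, which is the kind of misprediction a better RSB is meant to avoid.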
Nehalem's cache structure, employing L1, L2, and L3 repositories
By adding Hyper-Threading to Nehalem, Intel is going to make it possible for a single processor equipped with four cores to operate on eight threads at the same time. In the past, Hyper-Threading received mixed reviews because it didn’t always spit back higher performance numbers. Perhaps it was ahead of its time, though. Threaded software was still rare and the only real way to show it off was in a multi-tasked environment. Threading is much more prevalent now, though, and Intel’s other enhancements (bigger cache, higher bandwidth) make Hyper-Threading a more attractive feature.
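The thread math is easy to extend across the two-to-eight-core range Gelsinger described. The short Python sketch below assumes two hardware threads per core, which is what the four-core, eight-thread figure implies; the function name and the choice of core counts are ours, and Intel has not said which configurations will actually ship.

```python
THREADS_PER_CORE = 2  # implied by the article: four cores, eight threads


def hardware_threads(cores, smt_enabled=True):
    """Number of threads a chip can run at once, under the assumption above."""
    return cores * (THREADS_PER_CORE if smt_enabled else 1)


# End points of the stated range, plus the four-core part Intel showed.
for cores in (2, 4, 8):
    print(cores, "cores:", hardware_threads(cores), "threads")
```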
Speaking of cache, Nehalem sticks to the same 32KB instruction / 32KB data L1 cache configuration as existing Core processors. Each core gets its own 256KB L2 repository. And there’s an 8MB shared L3 cache available to all four cores.
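To put those figures in one place, here is a rough tally in Python for the four-core chip Intel showed. The cache sizes come straight from the numbers above; the constant and function names are ours, and we only use the four-core case because Intel has not said how L3 capacity changes with other core counts.

```python
L1_INSTRUCTION_KB = 32      # per core
L1_DATA_KB = 32             # per core
L2_KB = 256                 # per core
L3_SHARED_KB = 8 * 1024     # one 8MB pool shared by all four cores


def total_cache_kb(cores=4):
    """Sum of every cache array on the chip, in KB."""
    per_core = L1_INSTRUCTION_KB + L1_DATA_KB + L2_KB
    return cores * per_core + L3_SHARED_KB


total = total_cache_kb()
print(total, "KB, or", total / 1024, "MB")
```

Keep in mind that Intel describes the L3 as inclusive, so whatever sits in the per-core L1 and L2 caches is also held in L3. The sum above counts cache arrays on the chip, not the amount of unique data they can hold.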