
Intel 8-core Xeon X5365 V8 Performance Preview
May 07, 2007 Alexis Dang

Summary: With two quad-core Xeon X5365 processors running in tandem at 3.0GHz, Intel's V8 eight-core system packs quite a punch. Join us as Alan and Alexis take a look at the performance potential of this platform in a range of benchmarks.


Introduction

The arms race continues. Today, we are taking a look at Intel’s latest weapon, dual quad-core Intel Xeon X5365 processors. Intel has affectionately dubbed this the V8 platform, as you’ve effectively got eight cores of processing power at your disposal. Quad-core Xeon processors have been available from Intel since the end of last year, while the Xeon X5365 is available in select systems today and broad availability is expected in Q3’07.

Back in the day, a car with a V-8 engine got everyone’s attention; now, you need at least 12, or maybe 16 cylinders to get us interested. Similarly, only a year or two ago, having more than one CPU in a computer was out of the reach of an ordinary consumer. Advances in both technical manufacturing and design have played a part in allowing for this power growth, but the software engineers are also to thank for writing code that makes the best use of this power. Within the past 18 months, the growth in computing power has shattered Gordon Moore’s famous law. This has come through more efficient computing, not just faster computing.

Will we get to the point where we can have too much power? With cars, I’ll tell you that having nearly 300 horsepower in stop-and-go San Francisco traffic doesn’t help anyone except the gas companies. With airplanes, transportation at over Mach 1 could not be sustained, as evidenced by the retirement of the Concorde. Well, the timing of the V8 comes on the heels of Microsoft Vista.

We’re playing with fire today.


Intel Xeon X5365 Quick Specs
Core: Clovertown
Core Frequency: 3.0GHz
System Bus Frequency: 1333MHz
TDP: 150W
Stepping: B-3
# of CPU Cores: 4
L2 Cache: 8MB (2x4MB)
Core to bus ratio limit: 1:9
Max processor input voltage: 1.4125V
PECI Enabled: Yes
Enhanced Intel Speedstep Technology: Yes
Extended Halt State (C1E) Enabled: Yes
Execute Disable Bit (XD) Enabled: Yes
Intel 64 Technology: Yes
Intel Virtualization Technology: Yes
Package/Socket: FC-LGA6






The platform

The Intel V8 “Media Creation PC” isn’t actually a new platform. In fact, it’s nothing more than an Intel workstation motherboard paired with two 3GHz “Clovertown” Xeon CPUs. The Clovertown CPU is based upon Intel’s Core 2 platform, with the key architectural advantages being a 1333MHz FSB and four cores per chip. Unlike the Apple Mac Pro, Intel’s reference motherboard designs have a single PCIe x16 slot. For this reason, Intel positions the V8 as a Media Creation PC, although a comparison to AMD’s 4x4 (two dual-core CPUs with four PCI Express graphics slots, effectively giving you Quad Core with Quad SLI) can be made.

Motherboard: Intel S5000XVN dual-socket motherboard

The Intel S5000XVN is designed as a motherboard that provides server-class performance and reliability while providing PCIe x16 support for high-end graphics. This isn’t just marketing speak.

Most significantly, the S5000XVN has a Serial Attached SCSI controller. Although SAS features similar connectors to SATA, there are several improvements to the interface. Like the SCSI vs IDE debate, SAS is all about higher performance and higher reliability. Although SAS runs at 3.0Gbps, communication can be performed in full-duplex mode, allowing a maximum of 6.0Gbps of aggregate transfers. You can have 128 devices on a single bus, and most importantly, it implements the SCSI TCQ protocol.

Like SATA NCQ, SCSI TCQ allows multiple I/O requests to be accessed in an optimal fashion on a drive. What makes TCQ better are deeper queue lengths and queue prioritization, allowing more efficient data access. Besides the performance benefits, SAS also features more robust error recovery and reporting and 8m cable lengths. The debate between SAS and SATA is no different from UW-SCSI and the IDE generation; SCSI is always better, but the price/performance ratio puts it out of the reach of consumers.
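To make the benefit of command queuing concrete, here is a minimal sketch that compares servicing requests in arrival order against a simple one-direction sweep. It is only an illustration: the block addresses, the starting head position, and the function names are hypothetical, and real drive firmware also weighs rotational position and request priority, which this ignores.

```python
def seek_distance(start, order):
    """Total head travel (in abstract block units) when servicing requests in the given order."""
    position, total = start, 0
    for block in order:
        total += abs(block - position)
        position = block
    return total


def elevator_order(start, pending):
    """Service everything at or beyond the head position in ascending order, then sweep back."""
    ahead = sorted(b for b in pending if b >= start)
    behind = sorted((b for b in pending if b < start), reverse=True)
    return ahead + behind


# Hypothetical queue of pending requests and a hypothetical head position.
pending = [900, 100, 850, 150, 800]
head = 500

fifo_travel = seek_distance(head, pending)
queued_travel = seek_distance(head, elevator_order(head, pending))
print(fifo_travel, queued_travel)
```

With a deeper queue, the drive has more requests to choose from on each sweep, which is where TCQ's longer queue lengths help.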

The S5000XVN also features many of the same capabilities as the Intel vPro platform, namely Intel Gigabit Ethernet controllers with hardware TCP checksum offload, receive-side scaling (allows packet processing to occur across multiple CPUs), fully integrated server management capabilities (remote power on, flash the BIOS over the network, etc.) and a Trusted Platform Module.

Memory: 4GB Samsung DDR2 FB-DIMM @ 667MHz

Although DDR2 from companies such as OCZ and Corsair runs at up to 1GHz, servers and workstations continue to run at slower rates. This has less to do with reliability than it has to do with capacity.

FB-DIMMs are a natural evolution of Registered DDR. Traditional RAM requires that the data lines from the memory controller (be it the CPU or North Bridge) be connected to every memory module. When dealing with capacities of 32GB and the like, this becomes a problem. In fact, during the Opteron’s heyday, one of the strengths of the platform was its ability to “seamlessly” handle large memory capacities as compared to the Intel platform. With FB-DIMMs there is an intermediate chip, the “Advanced Memory Buffer” or AMB. The memory controller speaks to the AMB, which can then buffer and resend the signal in a serial fashion on its own. This repeater function makes it possible to handle larger memory capacities (imagine the difference between memorizing every telephone number in your company and calling the operator). Compared with AMD’s Direct Connect architecture and integrated CPU memory controller, the Xeon may run into memory bandwidth problems earlier and face higher latency.

CPU: Dual Xeon quad core X5365 @ 3.0GHz, Clovertown cores. FSB 1333MHz

The chips in this setup are difficult to find, as Intel has been giving Apple priority. They run on a 1333MHz bus and draw about 150 watts apiece, more than other Clovertown CPUs. The included heatsinks are a little wimpy compared to what we can get on the aftermarket for Core 2 Duo and Athlon64. Each is just a solid chunk of copper with thin fins and a loud fan. You would think that if you paid $1200 for a CPU, it would come with something better than a $20 heatsink and fan.

Power Supply: OCZ ProXStream 1000W Power Supply

Normally, beefy power supplies are overkill for gamers. In this case, we definitely needed more power. The Clovertown CPUs pull a peak of 150W each. We have a pair, so that sets our power budget at 300W from the get-go. Compare this to the Core 2 Duos, which only require 65W.

If you’re going to have an octo-core system, you might as well have a flagship GPU. Right now, that means you’re looking at another 175 to 185W. Each FB-DIMM runs a little over 10W, and the 750GB perpendicular recording drives from Seagate in our system eat up another 13W each during seek. You’re looking at about 75W for the motherboard, case fans, keyboard/mouse, USB hubs, etc. That’s about 625W of potential peak load. Add in 15% for surge compensation and a 20% “fudge factor” for electrolytic aging, and the need for a 1kW power supply seems pretty clear. Remember, we’re not even using 10,000 or 15,000 RPM drives in this machine.
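The power budget above can be tallied in a few lines. This is a rough sketch rather than a sizing tool: the article only gives the 4GB memory total, so the count of four FB-DIMMs at a flat 10W each is an assumption, the GPU is taken at the top of its quoted range, and the two margins are simply added together, which is one possible reading of the text.

```python
# Figures quoted in the article, in watts.
CPU_WATTS, CPU_COUNT = 150, 2
GPU_WATTS = 185                      # upper end of the 175-185W range
DRIVE_WATTS, DRIVE_COUNT = 13, 2     # Seagate 750GB drives during seek
MISC_WATTS = 75                      # motherboard, fans, USB devices, etc.

# ASSUMPTION: four FB-DIMMs at roughly 10W each (only the 4GB total is stated).
FBDIMM_WATTS, FBDIMM_COUNT = 10, 4


def peak_load_watts():
    """Sum the worst-case draw of every component."""
    return (CPU_WATTS * CPU_COUNT
            + GPU_WATTS
            + FBDIMM_WATTS * FBDIMM_COUNT
            + DRIVE_WATTS * DRIVE_COUNT
            + MISC_WATTS)


def recommended_psu_watts(peak, surge=0.15, aging=0.20):
    """Pad the peak load with the surge and capacitor-aging margins (added, not compounded)."""
    return peak * (1 + surge + aging)


peak = peak_load_watts()
print(peak, round(recommended_psu_watts(peak)))
```

Under these assumptions the padded figure lands in the mid-800W range, so the next common supply size up is a kilowatt unit.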

Several manufacturers have kilowatt power supplies, but we’ve gone with OCZ for now. Although OCZ had a bit of a slow start with their original power supplies, the new ProXStream and EvoStream are engineered by 3Y Power Technology, a California-based company that was recently acquired by Fortron in order to gain the R&D experience for producing high-end PSUs. Remember, Fortron is one of the better PSU manufacturers and is behind PSUs such as Zalman’s.

One drawback of the TJ07 chassis is that it works best with a standard-sized power supply; the kilowatt OCZ ProXStream is a perfect solution, as it fits in an ordinary form factor. In contrast, although we love PC Power and Cooling power supplies, their kilowatt power supply is too big for standard cases.

Chassis: Silverstone Temjin TJ-07

The TJ-07 continues to be the Bugatti Veyron of the PC chassis industry. The monoblock unibody design has no trouble supporting all the components, and the extensive cooling zones keep everything running within spec.

Video Card: ATI Radeon X1950 XT

Vista x64 has compatibility issues with the combination of NVIDIA GPUs, the latest ForceWare drivers (including 158.18), and our professional-grade NEC LCD monitors. If you use a different LCD monitor, it’s OK. If you use the Vista-bundled drivers, it’s OK. If you use regular Windows XP, it’s OK too. It’s just this combination of ForceWare 100+, Vista x64, and certain monitors that causes problems. The Radeon has no such problem, so we’re going with the fastest ATI card on the market (at the time of this article).

Hard Drive: Seagate Barracuda 7200.10 750GB SATA2 x2

Going with Serial Attached SCSI would have offered the best performance with our system; however, the argument for high-capacity 7200rpm SATA drives is hard to beat. We keep bouncing back and forth between Seagate and Hitachi drives in our system. Currently, we like the 7200.10 Barracudas with perpendicular recording.

OS: Windows Vista Ultimate 64-bit

As if we were going to run anything else…


The system build went relatively smoothly. Besides the NVIDIA problem with the NEC monitor, we also ran into a problem where the system would suspend but never wake up. We were never able to fix that problem.

I will also have to add that I do not know how Microsoft calculates its performance index, as the V8 system generated a score of 5.9 while an Intel Core 2 Duo E4300 PC gets a 5.2 at stock speed. Unless this is a log scale, the numbers are quite misleading. Since you cannot spec a similar system on a PC, we spec’d it out on a Mac and it came out to $5,743.




The benchmarks

Benchmark Systems:

Intel V8 – As described in the platform section
Intel “I4” – Same system, single CPU active

C2D E4300 Stock 1.8GHz
- Intel Core 2 Duo E4300
- GigaByte GA-965P-DS3
- Corsair 2GB XMS6400 DDR2 (800 MHz)

C2D E4300 OC 3.0 GHz (333MHz x9)
- Intel Core 2 Duo E4300
- GigaByte GA-965P-DS3
- Corsair 2GB XMS6400 DDR2 (833 MHz)


SiSoft Sandra




It’s always easy to start off a CPU review with a purely synthetic test. SiSoft Sandra is a perfect option for that. In this purely synthetic, non-memory limited test, the power of the Intel V8 platform is undeniable. Bigger is better, and the Intel Blue leads the way. It’s downright amazing to look at this.

When NVIDIA launched the GeForce 4, they claimed it had more computational power than the entire planet had in 1985. I wonder what could be said about this system...

CineBench 9.5




CineBench is also a well-known multi-core benchmark. Although this is still a synthetic test, it is based upon a real-world test set. The Intel V8 continues to do well with this benchmark. Although the scaling is not completely linear, 3D professionals will still benefit from the added performance of the extra CPU cores. Due to the nature of CineBench, things are almost “embarrassingly parallel,” allowing maximum benefit from each CPU core.



Benchmarks (cont’d)

Microsoft Excel 2007


New to our test suite is Excel 2007. Have you ever read that OpenOffice is awesome? Or read a review that said “for Microsoft Office, you don’t need a fast CPU?” These statements are only true for casual users.

When it comes to true power users of Excel (or for that matter, Microsoft Word) there is no alternative to Microsoft Office. Microsoft has the monopoly because it’s that good. In the case of Excel 2007, Microsoft has taken the initiative of implementing multi-core calculation support. This provides tremendous performance benefits for financial users who rely on complex macros and spreadsheets.




In the Excel 2007 “Common Calculations” stress test, approximately 28,000 sets of calculations are performed (addition, subtraction, division, rounding, and square root) along with min, max, and median. Although 28,000 sets is a larger number than what most Excel users will need, it’s not unreasonable to have a dataset with 10,000 cells. In that regard, the Intel V8 platform means perceptibly instantaneous calculations versus waiting 7 seconds on Core 2 Duo 3.0GHz. Depending on your workload, those extra 7 seconds may be worth several thousand dollars.

The second data set is a more complex model of Black-Scholes pricing, which again is designed to be more extensive than typical real-world use (though it is a reasonable data set for an Economics grad student). Again, the advantage of 8 cores is obvious.
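For readers unfamiliar with the model, here is a minimal sketch of the standard Black-Scholes formula for a European call option. We do not know which formulas the benchmark spreadsheet actually uses, so treat this as an illustration of the kind of arithmetic each row repeats; the function names and the sample inputs are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Price of a European call under the standard Black-Scholes assumptions."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * math.sqrt(years))
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Hypothetical inputs: at-the-money option, 5% rate, 20% volatility, one year to expiry.
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, vol=0.20, years=1.0)
print(round(price, 4))
```

Each option is priced independently of the others, which is why a sheet full of these rows spreads so well across cores.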

If you’re using Excel to do heavy computation, make sure you’ve upgraded to version 2007 to get the benefit of multicore computation, and ask your boss for the fastest computer you can get your hands on.



LS-DYNA benchmarks

In the above cases, we saw superb scaling of the V8 core over quad core and dual core systems from Intel. The story was different as our benchmarks got more complex.



LS-DYNA is a general-purpose transient finite-element solver. It’s used to simulate all sorts of things, ranging from metal forming applications and structural analysis to large deformation studies like bird strike simulations in aerospace applications. It has its roots at Lawrence Livermore National Laboratory as the solver used to simulate nuclear warhead design. Its auto-parallelization is superb, and this is a great real-world test of something other than “embarrassingly parallel” computational science.

In this benchmark, I chose to perform a car crash simulation using a dataset available in the public domain. To make testing easier, I only simulated a single time step of a 535,000 element model of a Plymouth Neon crashing into a barrier.


There are a few interesting points to be drawn from the graph. There was a near-linear increase in performance going from the 1.8GHz Core 2 Duo to the 3.0GHz Core 2 Duo. Going from two 3GHz cores to four 3GHz cores represented only a 66% improvement, and doubling the number of cores to eight only added another 35%.

Based upon data from Sun Microsystems, even AMD Opteron with InfiniBand scales substantially better on the same benchmark. Going from one 3GHz Opteron to two single-core 3GHz Opterons improved performance 99%. Doubling that to four single-core 3GHz Opterons resulted in an 87% improvement. Doubling that to eight single-core Opteron 156s resulted in a 93% improvement. Doubling that to sixteen single-core Opteron 156s resulted in an 81% improvement...

In other words, for a memory-bandwidth-intensive application such as LS-DYNA, going from dual core to octo-core on the Intel platform meant improving productivity 2.25x. Going from two single-core Opterons to eight single-core Opterons with InfiniBand interconnects improves productivity by 3.6x, and this number should be even higher with native HyperTransport interconnects.
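The cumulative figures follow from compounding the step-by-step gains quoted above. A short sketch of that arithmetic, with a parallel-efficiency figure added for comparison (the function names are ours, and efficiency here simply means speedup divided by the increase in core count):

```python
def cumulative_speedup(step_gains_percent):
    """Compound successive percentage gains into one overall speedup factor."""
    total = 1.0
    for gain in step_gains_percent:
        total *= 1.0 + gain / 100.0
    return total


def parallel_efficiency(speedup, core_ratio):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / core_ratio


# Gains quoted in the article for 2 -> 4 -> 8 cores (Intel) and CPUs (Opteron).
intel = cumulative_speedup([66, 35])
opteron = cumulative_speedup([87, 93])

print(round(intel, 2), round(parallel_efficiency(intel, 4), 2))
print(round(opteron, 2), round(parallel_efficiency(opteron, 4), 2))
```

Compounding gives roughly 2.24x for the Intel platform, in line with the 2.25x figure, and about 3.6x for the Opterons, or around 56% versus 90% of ideal scaling for a fourfold increase in cores.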



Photo processing

We unleashed the Intel V8 on our next test application, Bibble 4.9.5. Bibble is a RAW processing tool used by digital photographers. It is well-known for highly optimized code and multi-core support. In fact, when I originally tested the dual dual-core Opteron systems, Bibble was processing RAW files so quickly that I had to run my benchmarks several times to make sure everything was working. It was too fast.



Although Bibble is 8-core aware and sends data to all eight processor cores, the software was unable to saturate the CPU. This was certainly a disappointment given the prior performance that was seen with the Opteron platform. Importantly, our tests from several years ago have shown that memory bandwidth and latency play a significant role in RAW processing.

Although the memory bandwidth of the Intel V8 was pushing 6GB/sec in Windows Vista x64, AMD’s 4x4 platform pushes closer to 14.5GB/sec of memory bandwidth. Suffice it to say, the Intel V8 platform is memory bandwidth limited.
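Dividing those bandwidth figures by core count shows how thin the V8's share per core is. This is back-of-the-envelope arithmetic only; it assumes the measured bandwidth is shared evenly among cores, and it uses the four cores of the 4x4 platform (two dual-core CPUs) described earlier.

```python
def bandwidth_per_core(total_gb_per_s, cores):
    """Evenly divide the measured memory bandwidth among the cores (a simplifying assumption)."""
    return total_gb_per_s / cores


v8_share = bandwidth_per_core(6.0, 8)         # Intel V8: about 6GB/sec across eight cores
quad_fx_share = bandwidth_per_core(14.5, 4)   # AMD 4x4: about 14.5GB/sec across four cores

print(v8_share, quad_fx_share, round(quad_fx_share / v8_share, 1))
```

On these numbers each V8 core gets about 0.75GB/sec, while each core on the 4x4 gets a little over 3.6GB/sec, nearly five times as much.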


What about Games?

Dual-core games are now in the mainstream; however, it remains unclear when we will begin to see games taking advantage of quad and octo-core systems. We did have troubles with our NVIDIA cards in Windows Vista x64, and so we will have to revisit these numbers at a later date.

Intel does have a technology demo titled “Ice Storm Fighters” that was produced by FutureMark. Ice Storm Fighters was designed to showcase Intel’s Quad Core CPUs. The game is designed to calculate artificial intelligence and physics (using Ageia PhysX) on as many CPUs as possible.

On a dual-core CPU, you can have 10 robot hovercrafts active before things get choppy. On a quad-core machine, you can reach 20 hovercrafts, and on the eight-core machine, you can have 40 hovercrafts. This is perfect linear scaling. Intel only allows you to adjust the number of hovercrafts in increments of 5; however, it does show the potential of future multi-core games. You can download the tech demo at:
http://www.intelcapabilitiesforum.net/ISF_demo?s=a



Conclusion


For the 3D modeler or power Excel 2007 user, the Intel V8 is amazing. However, this system is almost too fast for today’s software. We aren’t seeing the across the board increase in performance that we saw in the past with the move from single to dual core processing. On some of the benchmarks, the 8 cores were not being saturated. Was this caused by limitations in the multi-core code in the software? Memory bandwidth limitations? Or both?

One aspect was slightly disappointing: the memory bandwidth. In its current design, Intel’s memory bandwidth doesn’t scale as well with the processor as AMD’s design. For certain applications, including computational science and, based upon our best analysis, advanced digital photography, AMD’s memory controller on the CPU will have the advantage. We will have to reserve any conclusions until we get some AMD OctoFX action into our labs. Likewise, it’s important to realize that even though the Intel architecture doesn’t “scale as well” as AMD’s architecture, the dual Clovertowns at 3GHz will still give you the best performance in most applications. Equally important, since most of us are only deciding between dual and quad core, the scalability limit isn’t there yet.

In terms of practicality, the V8 in its current form probably isn’t going to make it into many homes. One detail we haven’t mentioned thus far is that with the Intel workstation motherboard, boot time is delayed by at least 35 seconds as the motherboard goes through its diagnostic checks, even in the “fast boot” mode. FB-DIMMs are still too costly and of limited availability. 4GB remains a sweet spot for power users, and the AMD platform has reached 16 and 32GB without the same problems.

There is a solution to this though: time. Give it another year or so and we might be reviewing a small form factor V8.

A few years ago, gamers had to choose between the Pentium III and the Athlon XP. It was a fierce battle between AMD and Intel and things were exciting. We saw the race to 1GHz CPUs, with AMD edging out Intel by just a matter of days. With the Pentium 4 Northwood, the balance of power slowly shifted away. The scalability of the Pentium 4 core when it came to clock speed helped Intel pull further and further away. Just when AMD’s future seemed doomed, the Athlon64 entered the scene and, in the blink of an eye, the AMD64 platform became the unanimous choice for enthusiasts. With Athlon64 X2, it seemed like AMD was unstoppable... until Core 2 Duo entered the scene. Since then, Intel has changed the course of the war and the Core 2 Duo has become the de facto CPU for hardware enthusiasts.

So our conclusion? With the Intel V8, we are seeing two things. The Intel V8 is fast. Faster than anything we’ve ever tested before. However, for the first time, we are seeing limitations of the platform behind Intel’s Core 2 architecture. This is monumental because we are seeing the potential for AMD’s memory architecture to play a major role in the upcoming 8-core world and beyond. Suddenly, CPUs have gotten interesting again.

© Copyright 2003 FS Media, Inc.