The High Performance Computing Center
NVIDIA also houses a world-class high-performance computing cluster. Besides the usual rack after rack of Intel Clovertowns, NVIDIA has several Unisys ES7000/one racks. These machines have 768 GB of RAM. Yeah, that’s actual accessible RAM, not something like “96 nodes with 8 GB of RAM each”.
The several hundred Clovertown nodes are, believe it or not, connected using 100 Mbps Ethernet and GigE; no fancy Myrinet or InfiniBand needed.
To back up the data, NVIDIA has several large tape libraries. They don’t need to back up all of the data (for example, the intermediate computations), just the critical elements.
NVIDIA tries to keep their cluster at near full capacity. Idle computers mean wasted money, but running at maximum capacity means that there’s no room to handle increased demands. NVIDIA’s limitations are power and cooling. Although today’s Clovertown CPUs offer exceptional performance per watt, a full rack of Clovertown nodes produces more heat than air cooling can remove. NVIDIA is seriously looking at water cooling as their computational demands increase.
Believe it or not, NVIDIA tries to keep their compute nodes for 3 or 4 years. They’re still in the process of swapping their Pentium 4-based compute nodes for Clovertowns.
So what’s the point of this compute cluster? All of these systems working in synchrony allow NVIDIA to simulate their chips in near real time. That is, they can validate a chip, run performance metrics, debug, and even start writing drivers before any silicon is actually produced. This ensures that when a chip is taped out, it has already been validated and proven in a simulated environment. The silicon failure analysis lab then bridges the gap between the theory of the simulation and the actual product, ensuring the fastest possible turnaround from paper to product.
For you and me, it’s these two labs that let NVIDIA keep pace with our demands for faster hardware and more immersive graphics.