Instruction Prefix
Why is the Athlon a speed champ?
So now let's look at what any self-respecting next-generation technology needs to make a fast CPU. First and foremost, we need an efficient (and fast) way to decode slow x86 instructions into a fast, efficient RISC-like set. The Athlon has the ability to decode up to 6 CISC instructions per clock cycle.
This is nothing new. CISC-to-RISC conversion was already being implemented back in late 1995, with Intel's introduction of the Pentium Pro (and with NexGen's NX586!). Beyond the more esoteric talk of "superpipelining," "dual independent bus" and "dynamic execution," a major distinction between the Pro and the original Pentium CPU was that it converted long, complex CISC operations into easier-to-manage RISC micro-operations, or micro-ops.
RISC has its benefits
The main benefit of converting to RISC is uniform instruction length. As the name implies, CISC (complex instruction set computing) instructions tend to be long, variable-length affairs. Moving to RISC allows a CPU to sort and schedule operations more efficiently, especially when dispatching ops out of order to individual execution units. There are drawbacks, however. First and foremost, decoding to a reduced instruction set incurs a performance hit, and is probably the single biggest bottleneck in current processors. The faster your CPU can decode, the faster it will run.
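To see why uniform length matters, here is a minimal Python sketch. It is not how any real decoder is built, and the byte lengths and the 4-byte fixed width are made-up values for illustration. With variable-length instructions, the start of each instruction depends on the length of every instruction before it. With fixed-length ones, every start can be computed independently.

```python
def variable_length_offsets(lengths):
    """Start offset of each instruction in a variable-length stream.

    Each offset depends on all earlier lengths, so the walk is serial.
    """
    offsets = []
    position = 0
    for length in lengths:
        offsets.append(position)
        position += length
    return offsets


def fixed_length_offsets(count, width=4):
    """Start offset of each instruction when all share one width.

    Every offset is a simple multiplication, independent of the others.
    The 4-byte width is a hypothetical value, not a RISC86 figure.
    """
    return [index * width for index in range(count)]


# Hypothetical instruction lengths in bytes, chosen for illustration.
lengths = [1, 3, 6, 2, 5]
print(variable_length_offsets(lengths))     # [0, 1, 4, 10, 12]
print(fixed_length_offsets(len(lengths)))   # [0, 4, 8, 12, 16]
```

The serial walk in the first function is a rough stand-in for the extra work a CISC front end has to do before scheduling can even begin.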
The important thing to note is that the exact number of decodes per clock depends on what kind of code is pumped through the processor. Just as some code is particularly suited (or optimized) for 3DNow!, certain types of code require fewer or more cycles to decode. Realistically, it looks like the Athlon will be able to decode between 2.5 and 3 real-world instructions per clock cycle.
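The gap between a peak decode rate and a real-world one can be illustrated with a toy model. This is only a sketch under invented rules, not a description of the Athlon's actual decoders: it assumes "simple" instructions can share a clock cycle up to the peak width, while a "complex" instruction takes a whole cycle to itself. The instruction mix below is hypothetical too.

```python
def cycles_to_decode(instructions, peak_per_cycle):
    """Count clock cycles needed to decode a stream in this toy model.

    Assumed rules (not taken from any real CPU):
      - up to `peak_per_cycle` "simple" instructions decode in one cycle
      - a "complex" instruction uses a full cycle by itself
    """
    cycles = 0
    slots_left = 0
    for kind in instructions:
        if kind == "complex":
            cycles += 1
            slots_left = 0        # nothing else decodes in this cycle
        else:
            if slots_left == 0:   # start a new cycle
                cycles += 1
                slots_left = peak_per_cycle
            slots_left -= 1
    return cycles


def decode_rate(instructions, peak_per_cycle):
    """Average instructions decoded per cycle for the given stream."""
    return len(instructions) / cycles_to_decode(instructions, peak_per_cycle)


# Best case: nothing but simple instructions, at the article's peak of 6.
ideal = ["simple"] * 12
# Hypothetical mix: one complex instruction after every five simple ones.
mixed = (["simple"] * 5 + ["complex"]) * 2

print(decode_rate(ideal, peak_per_cycle=6))
print(decode_rate(mixed, peak_per_cycle=6))
```

In this model a handful of awkward instructions is enough to pull the average well below the peak, which is the general effect the real-world estimate reflects.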
Instruction Execution
Once x86 instructions have been "decoded" to RISC, they have to be buffered before they can be executed by the CPU. Instruction operands are handled by "execution units." This is where you hear a lot of low-level microprocessor buzzwords. High-performance CPUs such as the K6-3, Athlon, and Pentium III (all the way back to the Pentium Pro in fact) make use of multiple execution units for parallel processing of certain operations.
This is what is known as a superscalar instruction pipeline. The Athlon has a 9-issue superscalar architecture for its general functions, which breaks down into 3 pipes for integer operations, 3 floating-point units, and 3 addressing units. Compare this to the 2-issue pipelining on the Pentium II processor, which executes in 12 stages.
Buffering Data
With so many execution units, it would seem difficult for the processor to keep each one filled and fully utilized. While this usually isn't the limiting factor in determining CPU performance, your processor won't achieve its full potential if its execution units aren't kept busy. Think of it like a 6-lane freeway. It can carry a continuous single lane of cars, but it can handle 6 lanes of continuous traffic just as well, and in the latter case, you're getting a lot more processing accomplished.
This is where L1 cache and internal buffering come into play. The Athlon contains 72 buffers to store RISC86 micro-ops waiting to be fed to its 9 execution units. By pushing instructions into the buffer, the Athlon should be able to keep its pipelines full and churning at full speed.
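A rough way to picture the payoff from buffering is the Python sketch below. It is a deliberately simplified model, not the Athlon's real scheduler: it ignores data dependencies and latencies, treats the buffer as a window onto the next pending micro-ops, and lets each cycle issue at most 3 ops of each kind (integer, floating point, address). The op stream is hypothetical.

```python
# Units available per cycle, following the article's 3 + 3 + 3 breakdown.
UNITS_PER_KIND = {"int": 3, "fp": 3, "addr": 3}


def cycles_to_issue(ops, window):
    """Cycles needed to issue every op when only the first `window`
    pending ops are visible to the scheduler each cycle.

    Simplifying assumptions: no dependencies, every op takes one cycle.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pending = list(ops)
    cycles = 0
    while pending:
        free = dict(UNITS_PER_KIND)   # all units free at start of cycle
        remaining = []
        for index, kind in enumerate(pending):
            if index < window and free[kind] > 0:
                free[kind] -= 1       # op issues to a free unit this cycle
            else:
                remaining.append(kind)
        pending = remaining
        cycles += 1
    return cycles


# Hypothetical stream: 9 integer ops, then 9 floating point, then 9 address.
ops = ["int"] * 9 + ["fp"] * 9 + ["addr"] * 9

print(cycles_to_issue(ops, window=3))    # tiny buffer
print(cycles_to_issue(ops, window=72))   # buffer size from the article
```

With the small window the scheduler only ever sees one kind of op at a time, so two thirds of the units sit idle. With the 72-entry window it can pull ops of all three kinds every cycle, which is the freeway with all its lanes in use.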