L2 cache implementation
Integrating the L2 cache on the processor die increases L2 cache bandwidth by 300% over the original Athlon processor. Hit latency also drops, because data can be read from the processor die faster than from an external cache chip.
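How much an on-die, full-speed L2 helps depends on the clock divider the older off-die cache ran at. The sketch below computes peak bandwidth as interface width times L2 clock, assuming one transfer per L2 clock. The 1000 MHz core clock is illustrative, and the 1/2, 2/5 and 1/3 dividers are assumed values for different original Athlon speed grades. The helper `peak_bandwidth_mb_s` is hypothetical.

```python
from fractions import Fraction

def peak_bandwidth_mb_s(bus_bits: int, core_mhz: int, l2_ratio: Fraction) -> Fraction:
    """Peak L2 bandwidth in MB/s, assuming one bus_bits-wide transfer per L2 clock."""
    bytes_per_transfer = bus_bits // 8
    l2_mhz = core_mhz * l2_ratio          # L2 clock as a fraction of the core clock
    return bytes_per_transfer * l2_mhz    # bytes per transfer x million transfers/s = MB/s

CORE_MHZ = 1000  # illustrative core clock (assumption)
BUS_BITS = 64    # L2 interface width, as stated in the article

# On-die L2 running at the full core clock.
on_die = peak_bandwidth_mb_s(BUS_BITS, CORE_MHZ, Fraction(1))

# Assumed off-die dividers used on various original Athlon speed grades.
for label, ratio in [("1/2", Fraction(1, 2)), ("2/5", Fraction(2, 5)), ("1/3", Fraction(1, 3))]:
    off_die = peak_bandwidth_mb_s(BUS_BITS, CORE_MHZ, ratio)
    print(f"off-die at {label} speed: {float(off_die):.0f} MB/s; on-die gain x{float(on_die / off_die):.2f}")
# off-die at 1/2 speed: 4000 MB/s; on-die gain x2.00
# off-die at 2/5 speed: 3200 MB/s; on-die gain x2.50
# off-die at 1/3 speed: 2667 MB/s; on-die gain x3.00
```

Exact fractions are used so the gain ratios come out as clean multiples rather than floating-point approximations.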
Improvements in the cache implementation
The Thunderbird's L2 cache is 16-way set associative, eight times the associativity of the original Athlon's 2-way L2. Higher associativity gives each block of memory more possible locations in the cache, which reduces conflicts between blocks and increases the chance that the processor finds the data it needs in cache memory. Finding the data in cache is known as a "hit"; failing to find it is known as a "miss."
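To see why associativity matters, here is a minimal sketch of an N-way set-associative cache with LRU replacement. The class, its parameters (256 lines of 64 bytes) and the LRU policy are hypothetical choices for illustration, not a model of the actual Thunderbird or Athlon design. The dictionary lookup stands in for the tag comparison that real hardware performs across all ways of a set in parallel.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Toy N-way set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, total_lines: int, ways: int, line_bytes: int = 64):
        assert total_lines % ways == 0
        self.ways = ways
        self.num_sets = total_lines // ways
        self.line_bytes = line_bytes
        # One OrderedDict per set: keys are tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        line = addr // self.line_bytes
        index = line % self.num_sets      # the only set this line may live in
        tag = line // self.num_sets       # identifies the line within that set
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)            # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)         # evict the least recently used way
        s[tag] = None
        return False

def hit_rate(ways: int, addrs) -> float:
    cache = SetAssociativeCache(total_lines=256, ways=ways)  # same capacity for every config
    for a in addrs:
        cache.access(a)
    return cache.hits / (cache.hits + cache.misses)

# Four addresses that map to the same set in both configurations, reused ten times.
STRIDE = 128 * 64
pattern = [i * STRIDE for i in range(4)] * 10

print(f"2-way:  {hit_rate(2, pattern):.0%}")   # 0%: four lines fight over two ways
print(f"16-way: {hit_rate(16, pattern):.0%}")  # 90%: only the four cold misses
```

With identical total capacity, the 2-way cache thrashes because four frequently used lines compete for two slots in one set, while the 16-way cache holds all four after the first pass.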
On a miss, the processor must fetch the data (or instruction) from system memory, and if it isn't there either, it must be read from the hard drive. Each step down this hierarchy takes longer and can stall the processor.
The downside is that higher associativity means more tags must be checked on every lookup, which adds circuitry and can lengthen access time. If a miss occurs, a highly associative cache may take longer to discover it than a less associative cache would.
Put simply, a highly associative cache searches its contents more thoroughly, at some cost in speed. Because fetching data from cache is much faster than fetching it from system memory, AMD chose a 16-way set associative design to raise the odds that the processor finds its data in cache rather than in slower system memory.
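A common way to reason about this trade-off is average memory access time (AMAT): hit time plus miss rate times miss penalty. The sketch below uses made-up cycle counts and miss rates, purely to show how a slightly slower but more associative cache can still come out ahead; none of the figures are measurements of Thunderbird or the original Athlon.

```python
from dataclasses import dataclass

@dataclass
class CacheConfig:
    name: str
    hit_time: float      # cycles to return data on a hit
    miss_rate: float     # fraction of accesses that miss
    miss_penalty: float  # extra cycles to fetch from the next level on a miss

    def amat(self) -> float:
        """Average memory access time = hit time + miss rate x miss penalty."""
        return self.hit_time + self.miss_rate * self.miss_penalty

# Hypothetical figures chosen only to illustrate the trade-off, not measured values.
low_assoc = CacheConfig("2-way", hit_time=10, miss_rate=0.10, miss_penalty=100)
high_assoc = CacheConfig("16-way", hit_time=11, miss_rate=0.05, miss_penalty=100)

for cfg in (low_assoc, high_assoc):
    print(f"{cfg.name}: {cfg.amat():.1f} cycles")
# 2-way: 20.0 cycles
# 16-way: 16.0 cycles
```

The extra hit cycle only pays off if associativity removes enough misses: with a 9.5% miss rate instead of 5%, the 16-way cache would average 11 + 0.095 × 100 = 20.5 cycles, slightly worse than the 2-way case.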
In contrast, the Pentium III features an 8-way set associative L2 cache.
More cache details
The final difference between the Thunderbird L2 cache and the original Athlon's is that Thunderbird uses an exclusive cache architecture, whereas Athlon "classic" and the Pentium III use an inclusive one.
With an inclusive architecture, every block held in L1 cache is also duplicated in L2. While this provides redundancy, it reduces the amount of unique data the L2 cache can store. Since every Athlon processor has 128K of L1 cache, Athlon classic processors must devote 128K of their 512K L2 cache to copies of L1 contents, leaving only 384K for data not already in L1. Keep in mind that on the original Athlon the L2 cache also runs at only half the processor core speed, or less on faster models.
With Thunderbird's exclusive cache architecture, all 256K of L2 cache holds data not present in L1. Although this is less than the Athlon classic's 384K, Thunderbird's L2 cache runs at the same speed as the processor core, resulting in greater overall performance and efficiency.
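The capacity arithmetic for the two designs can be checked with a few lines of Python. This is a simplified sketch that assumes both cache levels are completely full and ignores replacement and coherence details; the function name `unique_capacity_kb` is hypothetical.

```python
def unique_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> dict:
    """Distinct data (in KB) an L1/L2 pair can hold when both levels are full."""
    if exclusive:
        l2_unique = l2_kb            # exclusive: a line lives in L1 or L2, never both
    else:
        l2_unique = l2_kb - l1_kb    # inclusive: L2 also keeps a copy of everything in L1
    return {"l2_unique": l2_unique, "total_unique": l1_kb + l2_unique}

classic = unique_capacity_kb(l1_kb=128, l2_kb=512, exclusive=False)
thunderbird = unique_capacity_kb(l1_kb=128, l2_kb=256, exclusive=True)
print("Athlon classic:", classic)       # {'l2_unique': 384, 'total_unique': 512}
print("Thunderbird:   ", thunderbird)   # {'l2_unique': 256, 'total_unique': 384}
```

Counting L1 as well, Thunderbird holds 384K of distinct data against the classic Athlon's 512K, so the case for Thunderbird rests on its full-speed L2 rather than on raw capacity.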
Unfortunately, the interface between the processor and the L2 cache is still only 64 bits wide. Intel's Coppermine Pentium III, by contrast, uses a 256-bit interface to keep the processor fed with data. If Thunderbird's cache interface were wider, AMD would get even more performance out of its CPU.
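Interface width sets how many bytes cross the bus on each L2 clock, and therefore how many clocks it takes to move a whole cache line. The sketch below assumes one transfer per clock and a 64-byte line for both processors. That is a simplification, since line sizes and clock rates differ between the two designs. `transfers_per_line` is a hypothetical helper.

```python
def transfers_per_line(line_bytes: int, bus_bits: int) -> int:
    """L2 clocks needed to move one cache line, assuming one bus-wide transfer per clock."""
    bytes_per_transfer = bus_bits // 8
    return -(-line_bytes // bytes_per_transfer)   # ceiling division

LINE_BYTES = 64  # assumed line size for illustration

for name, bits in [("Thunderbird (64-bit)", 64), ("Coppermine (256-bit)", 256)]:
    print(f"{name}: {bits // 8} bytes/clock, "
          f"{transfers_per_line(LINE_BYTES, bits)} clocks per {LINE_BYTES}-byte line")
# Thunderbird (64-bit): 8 bytes/clock, 8 clocks per 64-byte line
# Coppermine (256-bit): 32 bytes/clock, 2 clocks per 64-byte line
```

At equal L2 clock rates, a 256-bit path moves four times as many bytes per clock as a 64-bit one, which is the gap the paragraph above describes.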