New 3.0 shaders
The RADEON X1800 is ATI’s first shader model 3.0 graphics part. As we learned with the GeForce 6800 launch a year ago, shader model 3.0 supports more instructions, allowing developers to write more complex shader programs. Another important feature that shader model 3.0 added was dynamic branching (flow control), which allows developers to add loops to their programs.
This particular feature was designed to make writing shaders easier for developers. One commonly cited example is multiple light sources. In previous shader models, the developer would have to write a separate shader for each light. Dynamic branching makes it possible to write one shader that loops through a certain number of vertex lights and exits once all the lights have been processed. This reduces both the number and the variety of shaders a developer has to maintain (i.e. one shader versus many different ones).
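The multiple-light case can be sketched in a few lines. This is a CPU-side illustration in Python, not actual shader code; the function name, the light representation (a direction vector plus an intensity) and the max_lights cap are all assumptions made for the example.

```python
def shade_vertex(normal, lights, max_lights=8):
    """Accumulate simple diffuse lighting from a variable number of lights.

    Hypothetical stand-in for a single shader that loops over its lights
    instead of needing one shader variant per light.
    """
    total = 0.0
    for index, (direction, intensity) in enumerate(lights):
        if index >= max_lights:
            break  # exit once the supported number of lights is processed
        # Dot product of the surface normal and the light direction.
        n_dot_l = sum(n * d for n, d in zip(normal, direction))
        total += max(0.0, n_dot_l) * intensity
    return total


lights = [
    ((0.0, 0.0, 1.0), 1.0),   # facing the surface
    ((0.0, 0.0, -1.0), 1.0),  # behind the surface, contributes nothing
    ((0.0, 0.0, 1.0), 0.5),
]
brightness = shade_vertex((0.0, 0.0, 1.0), lights)
```

The same routine handles one light or several, which is the point of the loop: the light count becomes data rather than a reason to write another shader.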
Besides eased development, shader model 3.0 also presents potential performance improvements. For example, developers can use dynamic branching to skip large portions of code that are determined to be unnecessary, and thus help to speed up the shader.
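As a rough illustration of skipping unnecessary work, the sketch below returns early for pixels that do not need the costly lighting path. It is plain Python rather than shader code, and shade_pixel, the lit flag and the 0.1 ambient factor are hypothetical names and values chosen for the example.

```python
def shade_pixel(lit, base_color, expensive_lighting):
    """Return a shaded value, skipping the costly path for unlit pixels."""
    if not lit:
        # Early out: only a cheap ambient term (assumed factor of 0.1).
        return base_color * 0.1
    return base_color * expensive_lighting()


counter = {"calls": 0}


def expensive_lighting():
    """Placeholder for a long lighting computation; counts its invocations."""
    counter["calls"] += 1
    return 0.8


results = [shade_pixel(lit, 1.0, expensive_lighting)
           for lit in (True, False, False, True)]
```

Of the four pixels, only the two lit ones invoke the expensive function; the other two take the short path.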
If not used carefully, however, branching can hurt performance. With RADEON X1800, ATI sought to improve both branching and texture fetching. After all, if a pixel shader needs to look up a texture value that is not located in the texture cache, it must fetch it from graphics memory, which can introduce hundreds of cycles of latency.
To improve flow control, ATI breaks down the pixel processing workload into a large number of small threads. ATI refers to this as ultra-threading. Each thread consists of a small 4x4 block of pixels (16 in total) on which the same shader code is executed.
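One way to see why small threads matter for branching is to count how many pixels end up paying for an expensive code path at different block sizes. The sketch below assumes a simplified model that the article does not spell out: if any pixel in a block takes the branch, the whole block executes it. The function names and the 16x16 test image are illustrative only.

```python
def tiles(width, height, size):
    """Yield the pixel coordinates of each size x size block covering the image."""
    for ty in range(0, height, size):
        for tx in range(0, width, size):
            yield [(x, y)
                   for y in range(ty, min(ty + size, height))
                   for x in range(tx, min(tx + size, width))]


def pixels_paying_for_branch(needs_branch, width, height, size):
    """Count pixels whose block must run the expensive path.

    Assumption: a block runs the branch if any one of its pixels needs it.
    """
    return sum(len(block)
               for block in tiles(width, height, size)
               if any(needs_branch(x, y) for x, y in block))


def left_strip(x, y):
    """Only a 4-pixel-wide strip on the left needs the expensive path."""
    return x < 4


small = pixels_paying_for_branch(left_strip, 16, 16, size=4)   # 4x4 blocks
large = pixels_paying_for_branch(left_strip, 16, 16, size=16)  # one big block
```

Under this model, 4x4 blocks confine the expensive path to the 64 pixels that need it, while a single 16x16 block drags all 256 pixels through it.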
ATI also adds dedicated flow control logic. The RADEON X1800 features an ultra-threading dispatch processor, a central dispatch unit that tracks and distributes up to 512 threads across the RADEON X1800’s shader processors. Each shader processor consists of four pixel shaders, an arrangement traditionally referred to as a “quad”. Each processor is autonomous and contains its own dedicated branch unit to help eliminate flow control overhead.
Whenever the dispatch processor determines that a core has become idle, it assigns that core a new thread to execute. If the thread previously running on the core is waiting for data, it is temporarily suspended until that data becomes available, freeing the core's ALUs to work on other threads. ATI claims that this enables the RADEON X1800 pixel shader cores to maintain over 90% utilization in practice, with negligible idle time regardless of the shader code being run.
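The latency-hiding idea can be illustrated with a toy scheduler. This is a deliberately simplified model, not a description of ATI's hardware: it assumes a single core, threads that alternate between a short burst of ALU work and a texture fetch, and made-up figures of 4 ALU cycles per burst and 100 cycles of fetch latency.

```python
from collections import deque


def simulate(num_threads, alu_cycles=4, fetch_latency=100, phases=10):
    """Return ALU utilization for one core fed by a simple dispatcher.

    Each thread runs `phases` ALU bursts with a texture fetch between bursts.
    All cycle counts are illustrative assumptions.
    """
    ready = deque([phases] * num_threads)  # bursts left, per runnable thread
    waiting = deque()                      # (wake_cycle, bursts_left), in wake order
    cycle = busy = 0
    while ready or waiting:
        # Wake any suspended threads whose data has arrived.
        while waiting and waiting[0][0] <= cycle:
            ready.append(waiting.popleft()[1])
        if ready:
            remaining = ready.popleft() - 1
            busy += alu_cycles
            cycle += alu_cycles
            if remaining > 0:
                # Suspend the thread until its texture fetch completes.
                waiting.append((cycle + fetch_latency, remaining))
        else:
            # Nothing runnable: the ALU idles until the earliest fetch returns.
            cycle = waiting[0][0]
    return busy / cycle if cycle else 0.0


single = simulate(1)   # one thread: the core mostly waits on memory
many = simulate(32)    # enough threads to cover the fetch latency
```

With one thread the modeled core spends most of its time waiting on fetches; with 32 threads there is always another thread ready to run, so the ALU stays busy. The real chip juggles up to 512 threads across several shader processors, which this sketch does not attempt to model.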
In closing, ATI feels that by breaking the pixel processing workload into smaller threads, the RADEON X1800 works more efficiently. Ultra-threading also hides the latency normally encountered with texture fetching. Meanwhile, the X1800’s dedicated flow control logic minimizes shader processor idle times and wasted cycles. All this adds up to improved flow control, which will become increasingly important as developers continue to implement branching in their code.