Unified shaders (cont’d)
As you can imagine, the key to any unified architecture is a controller that can efficiently assign tasks to the generalized shading units. The controller has to balance the load evenly among the shaders, or else the efficiency gains realized from the unified architecture go to waste.
In the Xbox 360, this controller is dubbed the load balancing unit.
The load balancing unit is responsible for organizing the flow of data to the shading units. It’s designed to hand out tasks so that the shading units stay fully utilized, while also processing data in the best order so that the scene can be rendered as quickly as possible.
We got an excellent explanation of how the load balancing unit works, and of its relationship with the thread arbiter (which lies above the shading units), from ATI’s Dave Baumann:
“In a traditional architecture there is a fixed resource split between geometry and pixel processing, and the ratio of these splits often varies depending on what portion of the market the processor is aimed at. The workloads of games are rarely uniform though – the level of geometry and pixel processing can vary significantly depending on what is occurring within the game, not just from one frame to the next, but within a frame. If there is a section of processing required that is very geometry heavy then it can mean that some or all of the potential pixel processing power in the GPU is wasted as it is bottlenecked by the geometry processing capabilities, and vice versa.
With a unified structure there is a single pool of ALUs that can all execute either geometry or pixel instructions, so it is the task of the load balancing unit to allocate the workload over any or all of them in order to maximize the efficiency. At a basic level, the load balancing logic sits on top of the entirety of the command queues, with an overview of all the workloads that need to be processed, and monitors both the vertex and pixel buffers - before the loads on these get close to being inefficient (i.e. before they are starved or full) it will try to schedule either VS or PS workloads accordingly.”
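To put rough numbers on Baumann's point, here is a toy model of our own, not anything supplied by ATI. It assumes every shading unit retires one unit of work per cycle, that vertex and pixel work can overlap perfectly, and that there is no latency or dependency between the two. The 8/40 fixed split and the workload sizes are hypothetical values picked purely for illustration; only the total of 48 units comes from Xenos.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def cycles_fixed(vertex_work, pixel_work, vertex_units, pixel_units):
    """Cycles needed when each unit is dedicated to one shader type.

    The slower of the two dedicated groups sets the pace, while the
    other group sits idle once its own work is done.
    """
    return max(ceil_div(vertex_work, vertex_units),
               ceil_div(pixel_work, pixel_units))


def cycles_unified(vertex_work, pixel_work, total_units):
    """Cycles needed when any unit can run either kind of work."""
    return ceil_div(vertex_work + pixel_work, total_units)


if __name__ == "__main__":
    # Hypothetical geometry-heavy section: equal vertex and pixel work.
    vertex_work, pixel_work = 480, 480
    print(cycles_fixed(vertex_work, pixel_work, 8, 40))   # 60
    print(cycles_unified(vertex_work, pixel_work, 48))    # 20
```

In this made-up case the 40 dedicated pixel units finish after 12 cycles and then wait 48 more for the 8 vertex units, which is exactly the kind of waste a unified pool is meant to avoid. Real hardware is far messier, so treat the ratio as illustrative only.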
It’s important to note that the shading units can operate independently of each other. That way, if the GPU is waiting for data from a vertex program or a vertex array, the load balancing unit can assign the shading units to work on a pixel program or on other vertex work instead. This further helps to ensure that the shaders are constantly being fed with data.
All this is invisible to the software developer; no special programming is required.
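The idea of keeping units busy while some work is stalled can be sketched in a few lines. This is our own simplified illustration and not a description of the actual Xenos scheduling logic: the thread records, the `waiting` flag, and the `assign_ready_threads` function are all hypothetical names, and the policy shown is plain first-come, first-served.

```python
from collections import deque


def assign_ready_threads(num_units, threads):
    """Hand ready threads to shading units, skipping stalled ones.

    `threads` is a list of dicts with a "name" and a "waiting" flag.
    A thread that is waiting on data is passed over, so a unit is
    never parked behind a stalled vertex or pixel program.
    Returns a dict mapping unit index to thread name.
    """
    ready = deque(t for t in threads if not t["waiting"])
    assignment = {}
    for unit in range(num_units):
        if not ready:
            break  # nothing left to run; remaining units stay idle
        assignment[unit] = ready.popleft()["name"]
    return assignment


if __name__ == "__main__":
    threads = [
        {"name": "vs0", "waiting": True},   # stalled on a vertex fetch
        {"name": "ps0", "waiting": False},
        {"name": "vs1", "waiting": False},
    ]
    print(assign_ready_threads(2, threads))  # {0: 'ps0', 1: 'vs1'}
```

Here the stalled vertex thread `vs0` is skipped and both units pick up other work straight away, which is the behavior described above in miniature.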
For the Xbox 360’s Xenos GPU, ATI employs 48 shading units. Each of these shaders is general purpose and shares the same instruction set; in other words, no one shading unit is more functional than another. This allows the shaders to operate on any type of work, whether it’s a pixel program or a task for the vertex shader. The 48 shaders are grouped into three “banks” of 16 shaders each.
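For readers who like to see the layout spelled out, the grouping is simple arithmetic: 3 banks of 16 units gives 48. The snippet below is just a sketch of that arithmetic; the flat 0 to 47 numbering and the `locate_unit` helper are our own hypothetical convention, not something ATI has documented.

```python
NUM_BANKS = 3
UNITS_PER_BANK = 16
TOTAL_UNITS = NUM_BANKS * UNITS_PER_BANK  # 48 shading units in total


def locate_unit(unit_index):
    """Map a flat unit index (0..47) to a (bank, slot) pair.

    The flat numbering is an assumption made for illustration only.
    """
    if not 0 <= unit_index < TOTAL_UNITS:
        raise ValueError("unit index out of range")
    return divmod(unit_index, UNITS_PER_BANK)


if __name__ == "__main__":
    print(TOTAL_UNITS)       # 48
    print(locate_unit(0))    # (0, 0)
    print(locate_unit(47))   # (2, 15)
```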
Unified shading in DirectX 10
Because the Xbox 360’s Xenos GPU utilizes a unified shader architecture, and ATI and Microsoft have promoted unified shading so heavily in comparison to the PlayStation 3’s RSX GPU, which isn’t unified, it has generally been assumed that DirectX 10 also requires a unified shader architecture. However, it turns out that this is not the case. In fact, we’ve pored over numerous DirectX 10 documents and none of them even discuss a unified architecture!
Microsoft leaves this aspect of the DirectX 10 API entirely up to the hardware developer.
This means that a hardware manufacturer like ATI, S3, or NVIDIA could develop a GPU with distinct pixel, vertex, and geometry shaders and still claim 100% DirectX 10 Shader Model 4.0 compliance. This is because with DirectX 10, Microsoft only defines the specifications of the API; it is then up to hardware manufacturers to determine what they feel is the best method to meet those specifications.
In ATI’s case, the company decided to go with a unified architecture for Xbox 360 and its upcoming R600 GPU because it felt this made the most sense, particularly since the shaders all share the exact same functionality anyway. In the words of ATI’s Dave Baumann, the decision to go unified came down to one simple question: “if all these parts of the pipeline [geometry, pixel, and vertex shaders] have to have the same capabilities, does it make sense to have a traditional pipeline with discrete units or a single pool that can execute all shader program types?”
We think a lot of the confusion around unified shaders came from what Microsoft describes as DirectX 10’s “unified shader core”. This refers to the fact that in DirectX 10, all shaders rely on the same instruction set. In previous shader models there were restrictions in functionality between the pixel and vertex shaders. This is no longer the case in DirectX 10: pixel, vertex, and geometry shaders all share the same unified programming model.