My personal hell of translating DXIL to SPIR-V – part 5

Time for despair

It has taken way too long to get to this point, but we must address the final challenge of translating DXIL, a claustrophobic mess that haunts me to this day, the most nightmarish set of poorly understood heuristics which come together to put pretty pixels on screen. It is, of course, the act of untangling spaghetti and turning unstructured control flow into structured control flow.

True impostor syndrome kicks in

Doing compiler things without knowing what you’re doing is agonizing. You can never trust yourself, and whatever I did for DXIL structurization has turned out to fail 30+ times; each time I fix a critical bug and move on. The only thing I can say for sure is that it’s becoming asymptotically correct. I have been able to make 100k+ AAA shaders compile and validate correctly, but every new insane scenario DXC (LLVM) throws at me could be the one to break everything, because if you cannot mathematically prove a compiler algorithm correct, you might be extremely screwed; the edge cases are truly insane. The true compiler nerds are probably giggling in the corner 🙂

Goal for this post

The goal for this post is to go through the basic theory needed to understand the structurizer. The topic itself will take many posts to fully cover, so this is a good start.

GraphViz – visual debugging

GraphViz is a cornerstone of debugging control flow, so the very first thing we need is the ability to dump out a CFG to a PDF. Fortunately, there is just the tool for the job.

For example, we can start with an extremely basic shader from the test suite:

RWByteAddressBuffer Buf : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 index : SV_DispatchThreadID)
{
  uint result = 0;

  // Single branch
  [branch]
  if (index.x == 10)
    result = 40;
  else
    result = 50;

  // 2-level branch
  [branch]
  if (index.x == 20)
  {
    [branch]
    if (index.y == 30)
      result += 20;
    else
      result *= result;
  }
  else
  {
    [branch]
    if (index.y == 40)
      result += 70;
    else
      result *= 10;
  }

  Buf.Store(0, result);
}

If we dump this input to GraphViz we get:

digraph {
1 [label="1", shape="box"];
2 [label=".entry", shape="box"];
2 -> 1;
3 [label="2", shape="box"];
2 -> 3;
3 -> 1;
4 [label="3", shape="box"];
1 -> 4;
5 [label="4", shape="box"];
1 -> 5;
6 [label="7", shape="box"];
5 -> 6;
7 [label="8", shape="box"];
5 -> 7;
8 [label="9", shape="box"];
7 -> 8;
6 -> 8;
9 [label="5", shape="box"];
4 -> 9;
10 [label="6", shape="box"];
4 -> 10;
10 -> 8;
9 -> 8;
}

Then, we have this trivial Python script to dump to PDF:

#!/usr/bin/env python3

from graphviz import Source
import sys

def main():
  s = Source.from_file(sys.argv[1])
  s.view()

if __name__ == '__main__':
  main()

For debugging purposes, we can dump out a GraphViz after every major transformation step, of which there are many. For reference, here’s the final structured output, which we will study in greater detail. Of course, this is baby mode compared to what we have to deal with later, but gotta start somewhere.

SPIRV-Tools also has a tool to extract a GraphViz from a SPIR-V module.

spirv-cfg -o test.graph test.spv

This is super handy as well! The graph can be visualized with the same Python script.

SPIRV-Tools is also critical since we lean heavily on the SPIR-V validator to verify that our SPIR-V output conforms to structured rules.

Basic theory for the non-compiler nerds

When trying to learn this stuff, you will encounter terminology from graph theory, and any learning material on this topic usually comes from CS textbooks or lectures which assume you have all that background knowledge (I have an EE background, not CS or fancy math). These concepts become the basic primitives of any algorithm we have to perform on the CFG. None of this is particularly difficult, sure, but it needs to be understood at a deep level.

Reachability

Asks the question of “can block A reach block B”? In the example above, 4 can reach 9 by branching via either 7 or 8, but 4 cannot reach 6. This is quite intuitive.

Back-edge

In loops, a back-edge branches back to the beginning of the loop. This is detected when A can reach B and B branches to A.

Here is an example where 3 has a back-edge to itself.

Note: I tend to mix up back-edge as a block (continue statement) and the branch itself.

This makes the CFG cyclic. We can eliminate back-edges from a DXIL CFG and achieve an acyclic CFG.

I believe this property is what makes the CFG reducible. Irreducible CFGs can have completely random gotos that make it impossible to ensure an acyclic graph without some extra preprocessing. DXIL fortunately does not support irreducible CFGs. It is also its only saving grace …

In our implementation, back-edges are treated as special branches that happen in loops, and for the sake of reachability analysis we only consider acyclic reachability (back-edges are ignored).

There can only be one back-edge branch to a particular block. I think this is what LLVM defines. I have never seen a counter example, but if it happens, we can trivially rewrite the graph to only have one back-edge.

Predecessors

Usually abbreviated “preds”. If A branches directly to B, A is a predecessor of B. Back-edges are not counted in our implementation; they are recorded separately as a pred_back_edge.

Successors

Usually abbreviated “succs”. If A branches directly to B, B is a successor of A. Back-edges are not counted in our implementation; they are recorded separately as a succ_back_edge.

Domination

Domination happens when we have two blocks A and B, and in order to reach B, control flow must pass through A as well. A block dominates itself. Strict domination is when A != B.

Post-domination

Post-domination happens when we have two blocks A and B, and if control flow reaches block A, it must eventually pass through B as well.

Immediate dominator

Among all blocks which strictly dominate a block A, find the closest one. Often called idom.

Immediate post-dominator

Same as idom, but post-dominance rather than dominance.

Domination frontier

Assume we have a block A which dominates a set of blocks B. If a block in B branches to a block C, and A does not dominate C, C is in the domination frontier set of A. This is somewhat abstract, but it’s an extremely useful analysis tool for later. E.g.:

For block 4, 9 is in the domination frontier. For block 7, 9.4.ladder is in the domination frontier.

Post-domination frontier

Same idea, but for post-dominance. E.g. for block 7, 4 would be in the post-domination frontier. For 9.4.ladder, 1 would be in the frontier, etc. The post-domination frontier is particularly useful as an analysis tool to figure out which control flow decisions contribute to a block being executed.

Common (post-)dominator

Given candidates B and C, find the closest block A which (post-)dominates both B and C.

Loop analysis

Loops in unstructured control flow can be very awkward to reason about, but we have a way to specify loops precisely.

The loop header

A loop header is any block which has a back-edge branching to it. In reducible CFGs, which DXIL requires, the loop header must dominate the back-edge block.

The continue block

The back-edge block is the continue block. SPIR-V allows a “continuing construct” which means it can span multiple blocks, but we don’t use that.

Defining the loop body

When we have a loop header and a continue block, we can deduce the loop body as any block which is dominated by the header and can reach the continue block. Once we can no longer reach the continue block, we have broken out of the loop. We call these blocks “loop exits”.

For example, 3 is a loop header, 6 is a continue block. The loop body is {3, 4, 5, 7, 8, 6}. 9 is the only loop exit.
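
As a rough sketch, assuming the dominates() and reachability helpers we build later in this post, collecting the loop body could look something like this (illustrative only, not the actual dxil-spirv pass):

std::vector<CFGNode *> collect_loop_body(CFGNode *header, CFGNode *continue_block)
{
  std::vector<CFGNode *> body;
  for (auto *node : forward_post_visit_order)
  {
    // In the loop: dominated by the header and still able to reach the
    // continue block. Blocks which cannot reach the continue block anymore
    // are loop exits instead.
    if (header->dominates(node) && query_reachability(*node, *continue_block))
      body.push_back(node);
  }
  return body;
}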

Issues for later …

As we will see later, this definition of a loop is fundamentally flawed when we consider subgroup operations in non-uniform control flow. A single-threaded CPU execution model that LLVM caters to does not care about any of this, but GPU execution certainly does since semantics change depending on how we define convergence.

Computing the CFG

In the implementation, we will recompute the CFG at many points, usually after performing transformations on the CFG. Here are some of the things we need to do:

Compute post-visit order

Essentially, a depth-first traversal where each node is assigned a visitation index. The rule is that if A can reach B, then post_visit_index(A) > post_visit_index(B). We also detect back-edges this way. A node is only visited once.
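
To make the traversal concrete, here is a minimal sketch of how such a depth-first visit might look. This is not the exact dxil-spirv implementation; visited, on_stack and post_visit_counter are illustrative names:

void CFGStructurizer::visit_post_order(CFGNode &node)
{
  node.visited = true;
  node.on_stack = true; // Currently on the DFS stack.

  for (auto *s : node.succ)
  {
    if (s->on_stack)
    {
      // Branching to a block we are still traversing: a back-edge.
      // (The real implementation records these separately as
      // succ_back_edge / pred_back_edge rather than keeping them in succ.)
      node.succ_back_edge = s;
      s->pred_back_edge = &node;
    }
    else if (!s->visited)
      visit_post_order(*s);
  }

  node.on_stack = false;
  // Successors are numbered before us, so anything we can reach
  // gets a lower index.
  node.forward_post_visit_order = post_visit_counter++;
  forward_post_visit_order.push_back(&node);
}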

If we don’t visit a node, it means it’s not reachable anymore (can happen after rewriting the CFG), and we remove the node, which means purging dead predecessors.

We can also build an array of nodes which we can traverse and know that we will walk the tree in an appropriate order. Iterating over the post-visit order backwards ensures, for example, that dominators are visited before the blocks they dominate. This is used when we …

Compute immediate dominators

The algorithm for this is pretty easy once we have post-visit orders in place.

void CFGNode::recompute_immediate_dominator()
{
  if (pred.empty())
  {
    // For entry block only.
    immediate_dominator = this;
  }
  else
  {
    immediate_dominator = nullptr;

    for (auto *edge : pred)
    {
      if (immediate_dominator)
        immediate_dominator =
          CFGNode::find_common_dominator(immediate_dominator, edge);
      else
        immediate_dominator = edge;
    }
  }
}

Finding a common dominator (simplified) is done by:

while (a != b)
{
  if (a->post_visit_order < b->post_visit_order)
    a = a->immediate_dominator;
  else
    b = b->immediate_dominator;
}
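
The dominates() query used by the dominance frontier code below can be sketched in the same spirit. It leans on the invariant that a dominator always has a higher post-visit index than the blocks it dominates (again a sketch, using the same simplified post_visit_order field as above):

bool CFGNode::dominates(const CFGNode *other) const
{
  // Walk other's immediate dominator chain until we pass our own index.
  while (other->post_visit_order < post_visit_order)
    other = other->immediate_dominator;
  return other == this;
}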

Compute reachability

Here we build a simple LUT where we can test if post_visit_order A can reach post_visit_order B.

void CFGStructurizer::visit_reachability(const CFGNode &node)
{
  uint32_t *dst_reachability =
    &reachability_bitset[node.forward_post_visit_order *
                         reachability_stride];

  for (auto *succ : node.succ)
  {
    // Inherit reachability from all successors.
    const uint32_t *src_reachability =
      &reachability_bitset[succ->forward_post_visit_order *
                           reachability_stride];
    for (unsigned i = 0; i < reachability_stride; i++)
      dst_reachability[i] |= src_reachability[i];
  }

  // We can reach ourselves.
  dst_reachability[node.forward_post_visit_order / 32] |=
    1u << (node.forward_post_visit_order & 31u);
}

void CFGStructurizer::build_reachability()
{
  reachability_stride = (forward_post_visit_order.size() + 31) / 32;
  reachability_bitset.clear();
  reachability_bitset.resize(reachability_stride *
                             forward_post_visit_order.size());
  for (auto *node : forward_post_visit_order)
    visit_reachability(*node);
}

It’s nice how the structures we build feed into each other. The composition rule is simply that if A can reach B, and B can reach C, then A can also reach C.
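
A query against this bitset is then a single bit test. Something along these lines (a sketch; the exact helper in dxil-spirv may look slightly different):

bool CFGStructurizer::query_reachability(const CFGNode &from, const CFGNode &to) const
{
  uint32_t word = to.forward_post_visit_order / 32;
  uint32_t bit = to.forward_post_visit_order & 31u;
  const uint32_t *row =
    &reachability_bitset[from.forward_post_visit_order * reachability_stride];
  return (row[word] & (1u << bit)) != 0;
}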

Compute dominance frontiers

Similarly, the dominance frontier is built bottom-up.

void CFGStructurizer::recompute_dominance_frontier(CFGNode *node)
{
  // If we don't dominate our successor, it's a frontier.
  for (auto *succ : node->succ)
    if (succ->immediate_dominator != node)
      node->mark_dominance_frontier(succ);

  // Inherit frontiers.
  if (auto *idom = node->immediate_dominator)
    for (auto *frontier_node : node->dominance_frontier)
      if (!idom->dominates(frontier_node))
        idom->mark_dominance_frontier(frontier_node);
}

The recursive definition here is that if A dominates B, then any block in B’s dominance frontier is also in A’s dominance frontier, unless A dominates that block itself (which is exactly what the inner check guards against).

Congratulations! You can now say dominance frontier casually with a straight face :V

The backwards traversal

To compute post-dominance and post-dominance frontiers, we have to do the same thing, just backwards: flip successors into predecessors and vice versa. However, this introduces some problems. The normal traversal has one unique entry node, but there can be multiple leaf nodes (think returns from a branch)! This means there are many entry blocks when we traverse backwards, and that breaks the algorithm.

Another problem is loops. Consider this example, which demonstrates both cases:

Here, 6 is not reachable through backwards traversal from 1. 7 is a return block. What we need to solve the backwards traversal is to ensure we have a unique start node. For this purpose, we will invent an EXIT node. Any return will be considered to branch to that EXIT node. Back-edges are special. For backwards traversal, we will instead invent a “fake” branch from the back-edge to the common post-dominator of all loop exits. (That is a mouthful.)

This is the first of many silly ideas I came up with … After this point, assume I have no idea what I’m talking about. 😐

Dotted lines are fake branches we can use for backwards traversal.
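
A rough sketch of that setup (every helper name here is made up for illustration; the real code also has to keep these fake branches out of the forward traversal):

// Give the backwards traversal a unique start node.
CFGNode *exit_node = create_synthetic_node("EXIT");

for (auto *node : forward_post_visit_order)
{
  // Return blocks have no successors; pretend they branch to EXIT.
  if (node->succ.empty())
    node->add_fake_succ(exit_node);

  // A back-edge block gets a fake branch to the common post-dominator of
  // all the loop's exits, so the loop does not dead-end when walking the
  // graph backwards.
  if (node->succ_back_edge)
    node->add_fake_succ(find_common_post_dominator_of_loop_exits(node->succ_back_edge));
}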

What does it mean to be structured?

A structured CFG basically means the control flow closely maps to higher level languages without gotos. Rather than spaghetti, the control flow is like an onion. It’s somewhat difficult to precisely define it (study the SPIR-V specification), but the basic gist is something like:

Every block exists in a scope

A block can be enclosed by many scopes. A scope could be a selection construct or a loop construct.

Selection construct

When a node has more than one successor, we begin a selection construct, where that node is the header. A selection construct has a merge node where execution reconvenes. A header must dominate its merge node. The selection construct is whatever the header dominates minus whatever the merge node dominates. Intuitively:

// HEADER
if (blah)
{
  // IN CONSTRUCT
  if_true();
}
else
{
  // IN CONSTRUCT
  if_false();
}
// MERGE, not part of construct

No crossing streams

A block from outside a construct must not branch into the construct, except through the header block.

No shared merge block

A block cannot be the merge block for multiple headers.

Loop construct

A loop header begins the construct. The loop construct contains all blocks which are dominated by the loop header, minus whatever is dominated by the merge block. See example earlier.

Notice how this is (fatally) different from unstructured loops. Unstructured loops leave the construct when they can no longer reach the continue block. Structured loops don’t exit until they reach the merge block.
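
Intuitively, mirroring the selection construct example above (schematic code, not output from any tool):

// HEADER, declares MERGE and CONTINUE for the loop construct
for (;;)
{
  // IN LOOP CONSTRUCT
  if (should_break)
    break; // Branch to MERGE, the loop's break target
  body();
  // CONTINUE, branches back to HEADER
}
// MERGE, not part of construct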

Leaving a construct

A loop defines a “break” target as well as a “continue” target. The only valid ways to leave a selection construct are through its own merge block, the innermost break target, or the innermost continue target.

Unlike some languages, SPIR-V does not allow “multi-break”. A loop construct functions as a barrier to how far we can break. We can only break out of one loop construct at a time (except with return, but DXIL code is all inlined). Solving this is … an ordeal 🙂

Switch constructs

Switch blocks are selection constructs, but they have a special property: the merge target of the switch also serves as the break target. Oddly enough, you can break directly out of a switch and a loop at the same time, which is not allowed in normal high level languages, but oh well.

Edge cases for selection constructs

It is possible that one path of a conditional branch is a “break” or “continue”, while the other path is a normal branch. This special case does not require declaring a selection construct.

Another situation is when both paths of a selection construct end up breaking. Execution will never merge in this case, so it is possible to set up the merge block as an unreachable block with OpUnreachable instead.

Why do merge blocks matter?

In the SIMT-style execution world of GPUs, two sets of invocations may branch off, but at some point they might want to join back again. Defining this can be done naturally through merge blocks. Without a notion of merge blocks, there is no way to meaningfully distinguish

from

In a scalar world such as CPUs these have the same behavior, but certainly not in a SIMT world. If 4 does anything using subgroups or just normal texturing with implicit LOD, well … everything breaks.

From my understanding, OpenCL has a notion of divergence and convergence, but any well defined convergence must come through a merge block. In completely unstructured control flow, convergence is not well defined. From SPIR-V:

Uniform Control Flow: Uniform control flow (or converged control flow) occurs if all invocations (in the invocation group, unless otherwise stated) execute the same dynamic instance of an instruction. Uniform control flow is the initial state at the entry point, and lasts until a conditional branch takes different control paths for different invocations (non-uniform or divergent control flow). Such divergence can reconverge, with all the invocations once again executing the same control-flow path, and this re-establishes the existence of uniform control flow. If control flow is uniform upon entry into a structured loop or selection, and all invocations leave that loop or selection via the header block’s declared merge block, then control flow reconverges to be uniform at that merge block.

So here, we have explicit wording that we must use merge blocks to get guaranteed re-convergence. I couldn’t find any strong wording about this in CUDA, but the closest from skimming the docs was this:

For divergent control flow, the optimizing code generator automatically determines points of re-convergence

This is interesting, because this is exactly what we are going to have to do as well in dxil-spirv, but this is not exactly a strong guarantee. In practice, it should work fine, but I think this presents a good reason why structured control flow is a thing. I just wish IRs did not take the lazy route of just derping everything into unstructured control flow and expect implementations to guess the semantics. Sigh …

The classic subgroup breakage – maximal convergence gone awry

It’s time to go through the classic example of why we can reach catastrophic derpiness with unstructured control flow.

StructuredBuffer<uint> RO : register(t0);
RWStructuredBuffer<uint> RW : register(u0);

[numthreads(16, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
    uint v = RO[thr];
    uint result;
    while (true)
    {
        uint first = WaveReadLaneFirst(v);
        if (v == first)
        {
            result = WaveActiveSum(v);
            break;
        }
    }

    RW[thr] = result;
}

It’s crystal clear what the programmer’s intention is here. We partition a wave into sets, which then participate in a wave-op in isolation, then they exit the loop. The loop is iterated N times. This is a very common waterfall pattern, but any experienced shader programmer will avoid this exact implementation of said pattern because it breaks over half the compilers out there … The real problem here is that according to unstructured control flow, the WaveActiveSum() is not inside the loop, because it cannot reach the continue block. Therefore we end up with a completely broken:

define void @main() {
...
  br label %6

; The loop, it does not do anything.
  %7 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %5)
  %8 = icmp eq i32 %5, %7
  br i1 %8, label %9, label %6

; Outside the loop.
  %10 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %5, i8 0, i8 1)
  call void @dx.op.bufferStore.i32(...)
  ret void
}

The wave operation now happens outside the loop, with all threads active, no matter what. It is impossible to deduce the programmer’s intention at this point. Structured control flow does not have these ambiguities, because the loop does not end until we reach the merge block.

The high level code would look something like:

// A degenerate loop.
for (;;)
{
  if (RO[thr] == subgroupBroadcastFirst(RO[thr]))
  {
    break;
  }
  else
  {
  }
}
RW[thr] = subgroupAdd(RO[thr]);

Indeed, the ACO ISA is devoid of branches here as the loop does nothing.

Funnily enough, while writing this, I found that the latest DXC versions actually go to great lengths to try (and fail!) to make this work somehow, through an absolutely disgusting trick:

@dx.break.cond = internal constant [1 x i32] zeroinitializer

define void @main() {
  %1 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @dx.break.cond, i32 0, i32 0)
  %2 = icmp eq i32 %1, 0
  %3 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 1, i32 0, i32 0, i1 false)
  %4 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
  %5 = call i32 @dx.op.threadId.i32(i32 93, i32 0)
  %6 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %4, i32 %5, i32 0)
  %7 = extractvalue %dx.types.ResRet.i32 %6, 0
  br label %8

; <label>:8 ; preds = %13, %0
  %9 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %7)
  %10 = icmp eq i32 %7, %9
  br i1 %10, label %11, label %13

; <label>:11 ; preds = %8
  %12 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %7, i8 0, i8 1)
  br i1 %2, label %14, label %13

; <label>:13 ; preds = %11, %8
  br label %8

; <label>:14 ; preds = %11
  call void @dx.op.bufferStore.i32(i32 69, %dx.types.Handle %3, i32 %5, i32 0, i32 %12, i32 undef, i32 undef, i32 undef, i8 1)
  ret void
}

What on earth is going on here? Somehow DXC recognizes the horrible mistake of unstructured control flow, and tries to hack around it by forcing the breaking path to have a reachable path to the continue block. It does this by conditionally branching to continue or breaking depending on a value that is always 0. It seems to do this in the most ridiculous way possible, by reading from a 1-element Private array which is zero-initialized. This is probably done in an attempt to avoid optimizations which remove the branch. The high-level equivalent now avoids the maximal convergence.

const uint _13[1] = uint[](0u);

bool _21 = _13[0u] == 0u; // Always evaluates to true.
for (;;)
{
  _52 = _46.value == subgroupBroadcastFirst(_46.value);
  if (_52)
  {
    _53 = subgroupAdd(_46.value);
    if (_21)
      break;
    else
      continue;
  }
}

However, we now have a situation where convergence depends entirely on how smart the backend compiler is. A compiler can detect that _21 is constant and eliminate the continue branch as dead code. If that happens, we regress back to maximal convergence and everything breaks.

This particular case works as expected on AMD/Intel D3D12 drivers, but apparently NVIDIA’s D3D12 driver optimizes this back to maximal convergence and even this nasty hack fails. Fun times … I’m not even sure I can call that a bug. It’s not like any of this is well defined in DXIL 🙂

This isn’t a fantasy example; Microsoft’s own examples trip this scenario as well:

// Is equivalent to writing a loop like this:
while (true) {
  if (WaveReadLaneFirst(expr) == expr) {
    sum = WavePrefixSum(value);
    break;
  }
}

Sigh … Loops are a real PITA.

Conclusion

This concludes CFG 101. Unfortunately, we still have not reached the actual juicy algorithms, but it took way too many words just to go through the bare minimum of CFG theory.

 

My personal hell of translating DXIL to SPIR-V – part 4

It has been quite a while since the last post where I discussed the binding model shenanigans of D3D12, and as promised, it’s time to look at AMD code-gen for various API patterns we find in D3D12 when translated to Vulkan and the ACO compiler in RADV.

At a fundamental level, the AMD ISAs haven’t changed all that much since the GCN 1.0 days, for good reason. Arguably, it’s still the cleanest and most flexible descriptor model of any GPU architecture and it’s a very useful architecture to understand at a low level.

As discussed last time, the same DXIL code can translate to many different code patterns in SPIR-V. The Root Signature object in D3D12 not only concerns itself with mapping binding points, but also controls the physical implementation of how resources are accessed. This means we have to deal with all of this junk in dxil-spirv somehow.

Using Fossilize as our cross-vendor RGA – fossilize-synth and fossilize-disasm

For the ISA experiments we conduct here, we’ll use fossilize-synth and fossilize-disasm. These CLI tools are part of the Fossilize project and come in very handy for this kind of work. Unlike RGA, fossilize-synth can automatically generate a compatible VkPipelineLayout, VkRenderPass and PSO state for given shaders and record this as a Fossilize archive. We can then replay it with fossilize-disasm, targeting ISA output, on any driver that fully supports VK_KHR_pipeline_executable_properties.

$ dxil-spirv --output test.spv test.dxil
$ fossilize-synth --output test.foz --comp test.spv
Fossilize INFO: Successfully synthesized a FOZ archive to /tmp/test.foz.
$ fossilize-disasm --target isa --output /tmp test.foz
Fossilize INFO: Dumping disassembly to: /tmp/888210fc26602b56.main.a2d87808e64b9164.comp

How many ways can we implement Constant Buffer View?

Welcome to hell.

Constant buffers can be implemented a million different ways, so let’s have a look at what we generate for an RDNA2 chip and some alternatives we have to use for other vendors. As a test shader, we start with this HLSL compute shader:

cbuffer Cbuf : register(b0)
{
  uint4 v;
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  RW[thr] *= v;
}

1. Root constant – Global Root Signature

Root constants translate naturally to push constants in Vulkan. The Vulkan GLSL representation of dxil-spirv output is:

#version 450
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _13;

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
} registers;

void main()
{
  uvec4 _33 = uvec4(registers._m0, registers._m1,
    registers._m2, registers._m3);
  _13._m0[gl_GlobalInvocationID.x] = 
    uvec4(_13._m0[gl_GlobalInvocationID.x].x * _33.x,
          _13._m0[gl_GlobalInvocationID.x].y * _33.y,
          _13._m0[gl_GlobalInvocationID.x].z * _33.z,
          _13._m0[gl_GlobalInvocationID.x].w * _33.w);
}

s_mov_b32 s0, s3
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], 0x0
v_lshl_add_u32 v0, s7, 6, v0
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)

// Interesting work begins here
buffer_load_dwordx4 v[4:7], v0, s[8:11], 0 offen
s_waitcnt vmcnt(0)
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s4
v_mul_lo_u32 v10, v6, s5
v_mul_lo_u32 v11, v7, s6
buffer_store_dwordx4 v[8:11], v0, s[8:11], 0 offen
s_endpgm

The first thing to understand about GCN and RDNA is the execution model. On basically any modern GPU, threads are executed in groups. “Waves”, “warps” and “subgroups” are all names for the same thing. “Wave” is the preferred AMD naming, so we’ll use that. An instruction here is either a wide vector instruction or a scalar instruction. Scalar instructions run once in the small scalar unit, and vector instructions run wide.

Registers are likewise split into scalar and vector registers. v registers are per-lane (called VGPRs), and scalar s registers are stored once per wave (called SGPRs). To translate this to x86-speak, SGPRs are normal registers like rax, rbx, etc. VGPRs would be AVX-2048 monsters, crunching 64 operations in parallel.

A critical performance consideration when targeting AMD is “scalarization”. When data is the same across a wave, we can place it in scalar registers. This keeps register pressure down. We have scalar registers in abundance, but using too many vector registers leads to poor occupancy. Shifting work from vector registers to scalar registers is therefore a sound strategy.

With this in mind, let’s look at the interesting bit:

s_mov_b32 s0, s3
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s4
v_mul_lo_u32 v10, v6, s5
v_mul_lo_u32 v11, v7, s6

This is the meat of the work. When execution begins, the root constants are already placed in s3, s4, s5 and s6. This is ideal; no need to load anything explicitly. This is possible because of “User SGPRs”. There is, however, a strict limit to how much “User SGPR” data we can hold. If we have too many constants, the driver is forced to insert indirection.

For example, consider:

cbuffer Cbuf : register(b0)
{
  uint4 v0;
  uint4 v1;
  uint4 v2;
  uint4 v3;
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  RW[thr] *= v0;
  RW[thr] *= v1;
  RW[thr] *= v2;
  RW[thr] *= v3;
}

Now we end up with:

s_mov_b32 s0, s3
s_movk_i32 s1, 0x8000
s_clause 0x1
s_load_dwordx4 s[16:19], s[0:1], 0x20
s_load_dwordx4 s[20:23], s[0:1], 0x30
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[0:3], s[2:3], 0x0
v_lshl_add_u32 v0, s12, 6, v0
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)
buffer_load_dwordx4 v[4:7], v0, s[0:3], 0 offen
s_waitcnt vmcnt(0)
v_mul_lo_u32 v4, v4, s4
v_mul_lo_u32 v5, v5, s5
v_mul_lo_u32 v6, v6, s6
v_mul_lo_u32 v1, v7, s7
v_mul_lo_u32 v4, v4, s8
v_mul_lo_u32 v5, v5, s9
v_mul_lo_u32 v2, v6, s10
v_mul_lo_u32 v1, v1, s11
v_mul_lo_u32 v3, v4, s16
v_mul_lo_u32 v4, v5, s17
v_mul_lo_u32 v2, v2, s18
v_mul_lo_u32 v1, v1, s19
v_mul_lo_u32 v8, v3, s20
v_mul_lo_u32 v9, v4, s21
v_mul_lo_u32 v10, v2, s22
v_mul_lo_u32 v11, v1, s23
buffer_store_dwordx4 v[8:11], v0, s[0:3], 0 offen
s_endpgm

Some of the root constants were promoted to User SGPR, but some were not and the driver had to spill those to memory somewhere, and load them.

s_load_dwordx4 s[16:19], s[0:1], 0x20
s_load_dwordx4 s[20:23], s[0:1], 0x30

Since the constants are the same for all threads, we can load them directly into SGPRs instead of doing silly broadcasts.

Scalar loads on AMD have special properties. First, they target the constant cache (K$) instead of the normal data caches. The K$ is not coherent with the data caches, so we can only use it if the address to load is scalar and we know we do not have to observe aliased writes. SSBO loads can actually use this path too if these constraints are met, which will come in handy in the next section.

2. Root descriptor – Global Root Signature – Buffer Device Address (BDA)

Root descriptors in D3D12 are designed to be raw pointers to data and they take up 64 bits in the root signature. It is up to the implementation to figure out what to do about the raw pointer, and in vkd3d-proton we prefer to translate this to push constant BDA on AMD.

Let’s first try with dynamically uniform address:

cbuffer Cbuf : register(b0)
{
  uint4 v[8];
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= v[wg];
}

The output Vulkan GLSL is … funky:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

struct AddCarry
{
  uint _m0;
  uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerUint4NonWrite;
layout(buffer_reference, buffer_reference_align = 16, std430)
readonly buffer PhysicalPointerUint4NonWrite
{
  uvec4 value;
};

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14;

layout(push_constant, std430) uniform RootConstants
{
  uvec2 _m0;
} registers;

void main()
{
  AddCarry _33;

  // Pointer arithmetic is fun >:|
  _33._m0 = uaddCarry(registers._m0.x, gl_WorkGroupID.x * 16u, _33._m1);
  PhysicalPointerUint4NonWrite _40 =
    PhysicalPointerUint4NonWrite(uvec2(_33._m0, registers._m0.y +
    _33._m1));
  _14._m0[gl_GlobalInvocationID.x] =
    uvec4(_14._m0[gl_GlobalInvocationID.x].x * _40.value.x,
          _14._m0[gl_GlobalInvocationID.x].y * _40.value.y,
          _14._m0[gl_GlobalInvocationID.x].z * _40.value.z,
          _14._m0[gl_GlobalInvocationID.x].w * _40.value.w);
}

In the BDA path of dxil-spirv, we encode the BDA in push constants, perform address calculation, cast that to a pointer and load. The critical part here is the NonWritable decoration. This signals to the compiler that the memory region can be assumed not to have been written to, even with aliasing in this shader module. This is subtly different from C const: a const pointer in C can still observe aliased writes through other pointers, but not so in SPIR-V. This allows ACO to generate ideal code even from raw pointers:

s_mov_b32 s0, s3                   // S[3:4] holds BDA
s_lshl4_add_u32 s0, s5, s0         // S5 holds SV_GroupID.x here
s_add_u32 s1, s4, src_scc          // Handle carry for 64-bit + 32-bit
s_load_dwordx4 s[0:3], s[0:1], 0x0 // Scalar load from pointer
s_waitcnt vmcnt(0) lgkmcnt(0)
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s1
v_mul_lo_u32 v10, v6, s2
v_mul_lo_u32 v11, v7, s3

Nice. If we remove the NonWritable from the SPIR-V assembly we get the depressing vector load instead, because the compiler can no longer prove there is no potential aliased write (we’re doing pointer arithmetic after all), so the load has to happen in the vector unit:

 global_load_dwordx4 v[4:7], v[2:3], off

Of course, if the shader is loading uniforms with varying offset we also need vector loads, e.g.:

cbuffer Cbuf : register(b0)
{
  uint4 v[8];
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= v[thr & 7];
}

v_lshl_add_u32 v0, s5, 6, v0
v_and_b32_e32 v1, 7, v0
v_add_nc_u32_e32 v2, s0, v1                // Vectorized 64-bit + 32-bit
v_add_co_u32 v1, s[0:1], s0, v1
v_add_co_ci_u32_e64 v3, vcc, 0, s4, s[0:1]
global_load_dwordx4 v[4:7], v[2:3], off

The code-gen isn’t perfect here, but it accomplishes what we need for Root CBV. I am considering rewriting the pointer arithmetic for CBV to just use access chains instead, but that’s for another time …

(The existing codegen was designed for Root SRV / UAV which can do unaligned loads and stores in weird ways, and it is easier to do it this way with pointer arithmetic ._.)

(The reason we use uvec2 here is quirky. At the time, ACO only promoted 32-bit uints to User SGPRs, not 64-bit.)

3. Root constant – Global Root Signature – INLINE_UNIFORM_BUFFER workaround

If the driver does not support 256 byte push constants, we might be forced to fall back to inline UBO blocks instead of using push constants for massive root signatures. The generated code would look more or less the same as the BDA scenario shown above. Inline UBO on AMD is implemented in a way where the UBO data is stored at an offset from the descriptor set pointer.

Now that BDA is widespread we might want to nuke this workaround and instead implement the workaround as a push constant BDA to memory we allocate ourselves. This removes the need to deal with annoying descriptor sets just to handle this case.

4. Root descriptor – Global Root Signature – Plain Descriptor

Push descriptors aren’t very special on AMD, so they get lumped in with the others here. It just means the driver needs to allocate descriptor memory behind the scenes, which means extra CPU overhead. Descriptor paths on AMD have two indirections. This is why we prefer using BDA. For the example above with a dynamically uniform address:

 RW[thr] *= v[wg];
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], 0x0
s_buffer_load_dwordx4 s[0:3], s[8:11], s0

Pointers to descriptor sets can be pushed in SGPRs. Descriptor memory lives in a 4 GB virtual memory range, so RADV only needs to push a 32-bit pointer per descriptor set; the upper half is synthesized from a constant. From here, we load a UBO descriptor. With the descriptor in SGPRs, we load the actual UBO data.

A thing to note here is how trivial it is to extend this to bindless / descriptor indexing. Just adjust the offset when loading the UBO descriptor and that’s all we need. No magic involved, descriptors are just memory in this model.
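
To illustrate the “descriptors are just memory” point with a conceptual sketch (this is the mental model, not actual driver code):

#include <cstdint>

// On GCN/RDNA a buffer descriptor (a V#) is simply 4 dwords living in memory.
struct BufferDescriptor { uint32_t words[4]; };

// Bindless access is nothing more exotic than indexing descriptor memory
// before the actual data load; this is the s_load_dwordx4 we saw above.
BufferDescriptor load_descriptor(const BufferDescriptor *set_base, uint32_t index)
{
  return set_base[index];
}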

This path is preferred on NVIDIA however. Push descriptors are more “magical” there and the BDA path is very slow compared to that. 🙁

5. Root table – Global Root Signature

In root tables, we need to tap into descriptor indexing. The table parameter in D3D12 works like an offset into the ID3D12DescriptorHeap. This offset is placed in push constants, which we then use to offset into the heap. E.g. given this HLSL shader:

struct C { uint4 value; };
ConstantBuffer<C> CBVs[] : register(b10);
RWStructuredBuffer<uint4> RW : register(u1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBVs[wg].value;
}

we end up with Vulkan GLSL looking like:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// If VK_VALVE_mutable_descriptor_type, SSBO and BindlessCBV can share
// same descriptor binding.
layout(set = 4, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14[];

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _22[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
  uint _m4;
  uint _m5;
  uint _m6;
  uint _m7;
} registers;

void main()
{
  uint _29 = registers._m4 + 1u;
  uvec4 _52 = floatBitsToUint(_22[registers._m5 +
    (gl_WorkGroupID.x + 10u)]._m0[0u]);
  _14[_29]._m0[gl_GlobalInvocationID.x] =
    uvec4(_14[_29]._m0[gl_GlobalInvocationID.x].x * _52.x,
          _14[_29]._m0[gl_GlobalInvocationID.x].y * _52.y,
          _14[_29]._m0[gl_GlobalInvocationID.x].z * _52.z,
          _14[_29]._m0[gl_GlobalInvocationID.x].w * _52.w);
}

The ISA looks like:

v_lshl_add_u32 v0, s6, 6, v0 // Compute DispatchThreadID.x
s_add_u32 s0, s6, s5         // Add GroupID.x + root table offset
s_lshl_b32 s0, s0, 4         // sizeof(UBO) == 16
s_mov_b32 s6, s3             // Build pointer to set
s_movk_i32 s7, 0x8000

// Load descriptor by index.
  // Missing constant offset here. LLVM disassembler bug I think ...
  // Raw opcode is f4080203 000000a0
s_load_dwordx4 s[8:11], s[6:7], s0

...
s_buffer_load_dwordx4 s[4:7], s[8:11], 0x0

Fundamentally, this is no different from legacy descriptors. The offset is just no longer constant.

6. Root table – Global Root Signature – NonUniformResourceIndex

Adding more spice, we’ll make the resource index non-uniform. Descriptors on AMD have to live in scalar registers for good reasons, so for this case we will see a common pattern in advanced compute: waterfall loops.

struct C { uint4 value; };
ConstantBuffer<C> CBVs[] : register(b10);
RWStructuredBuffer<uint4> RW : register(u1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBVs[NonUniformResourceIndex(thr % 3)].value;
}

This just translates to nonuniformEXT() in Vulkan GLSL, so there’s nothing interesting to see here, but the ISA is unrecognizable.

BB0:
s_lshl_b32 s4, s4, 4
s_mov_b32 s0, s3
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], s4
v_lshl_add_u32 v0, s6, 6, v0
v_mul_hi_u32 v1, v0, 0xaaaaaaab
v_lshrrev_b32_e32 v1, 1, v1
v_lshl_add_u32 v1, v1, 1, v1
v_sub_nc_u32_e32 v1, v0, v1
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)
buffer_load_dwordx4 v[4:7], v0, s[8:11], 0 offen
v_add_lshl_u32 v1, s5, v1, 4
v_add_nc_u32_e32 v1, 0xa0, v1

// REGION OF INTEREST - BEGIN
s_mov_b64 s[2:3], exec
BB1:
v_readfirstlane_b32 s1, v1
v_cmp_eq_i32_e32 vcc, s1, v1
s_and_saveexec_b64 s[4:5], vcc
s_cbranch_execz BB9
BB2:
s_mov_b32 s6, s0
s_movk_i32 s7, 0x8000
s_load_dwordx4 s[12:15], s[6:7], s1
s_waitcnt lgkmcnt(0)
s_buffer_load_dwordx4 s[12:15], s[12:15], 0x0
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v1, s15
v_mov_b32_e32 v2, s14
v_mov_b32_e32 v3, s13
v_mov_b32_e32 v8, s12
s_andn2_b64 s[4:5], s[4:5], exec
s_cbranch_scc0 BB10
BB9:
s_mov_b64 exec, s[4:5]
s_branch BB1

// REGION OF INTEREST - END
BB10:
s_mov_b64 exec, s[2:3]
s_waitcnt vmcnt(0)

// Actually do the math.
v_mul_lo_u32 v8, v4, v8
v_mul_lo_u32 v9, v5, v3
v_mul_lo_u32 v10, v6, v2
v_mul_lo_u32 v11, v7, v1

// Store result.
buffer_store_dwordx4 v[8:11], v0, s[8:11], 0 offen
s_endpgm

This is … a bit more complicated, but now we get to study the cornerstone of SIMD-style control flow – execution masking – in more detail.

The principle behind waterfall loops is that we need to partition a wave into sets which have the same value. The more divergence in a wave, the more loop iterations are needed. For sake of simplicity in the example below, we can assume a wave size of 8 (it’s actually 64 here).

The descriptor indices could be something like: index = { 0, 1, 2, 0, 1, 2, 0, 1 }. The waterfall loop looks like this in pseudo-code:

uint dynamic_index = thr % 3;
for (;;)
{
  uint index = subgroupBroadcastFirst(dynamic_index);
  if (index == dynamic_index)
  {
    desc = load_descriptor(index);
    data = load_data_from_descriptor(desc);
    break; // Stay tuned for structurization blogs, harharharhar :V
  }
}

In the first iteration, we look at our active thread mask (0xff), find the first bit set, which is 0. Lane 0 holds the value 0.

v_readfirstlane_b32 s1, v1 // A compute enthusiast's best friend!

Any other lane with the same value can go ahead, so lanes 0, 3 and 6 will enter the branch.

v_cmp_eq_i32_e32 vcc, s1, v1
s_and_saveexec_b64 s[4:5], vcc

The index is now known to be constant over the wave, so we can load the UBO descriptor and its data in the scalar unit, but the result must be moved to vector registers, since across the whole wave the resource index is not uniform:

s_buffer_load_dwordx4 s[12:15], s[12:15], 0x0
// The v_mov is only executed for active lanes.
v_mov_b32_e32 v1, s15
v_mov_b32_e32 v2, s14
v_mov_b32_e32 v3, s13
v_mov_b32_e32 v8, s12

The lanes which participate are now masked out, and the exec mask would be 0xff & ~(0x01 | 0x08 | 0x40). In the next iteration, we look at the first active lane, which has the value 1, and so we roll on until all unique values have been processed. Neat!

s_cbranch_scc0 BB10 // Keep going until we've masked out all exec bits.

7. ResourceDescriptorHeap – SM 6.6

This is trivially similar to the root table case, so I’ll omit ISA here. The main problem is DXIL having to define a completely different set of opcodes for SM 6.6, but that’s another story <_<. Overall it is simpler. Root tables are translated to SM 6.6 automatically in dxil-spirv after all.

struct C { uint4 value; };
RWStructuredBuffer<uint4> RW : register(u0, space1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  ConstantBuffer<C> CBV = ResourceDescriptorHeap[wg];
  RW[thr] *= CBV.value;
}

#version 450
#extension GL_EXT_nonuniform_qualifier : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _13[];

layout(set = 1, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _18;

void main()
{
  uvec4 _34 = floatBitsToUint(_13[gl_WorkGroupID.x]._m0[0u]);
  _18._m0[gl_GlobalInvocationID.x] =
    uvec4(_18._m0[gl_GlobalInvocationID.x].x * _34.x,
          _18._m0[gl_GlobalInvocationID.x].y * _34.y,
          _18._m0[gl_GlobalInvocationID.x].z * _34.z,
          _18._m0[gl_GlobalInvocationID.x].w * _34.w);
}

8. Root table / ResourceDescriptorHeap – Pascal workaround

On older NVIDIA hardware, we cannot use bindless UBO at all, as hardware does not support it. The workaround is to use bindless SSBO instead, with hilarious performance loss as a result :’)

9. Root constant – Local Root Signature

Local root signatures are an abomination. There is no fossilize-synth support for RTPSOs at this time, so I’ll resort to just showing the Vulkan GLSL equivalent. In Vulkan, the shader record block works just like an SSBO, but D3D12 insists on the abstracted model of root signatures, so when emitting code we need to consider if a resource belongs to the local root signature or global root signature.

struct CBVData { float4 v; };
ConstantBuffer<CBVData> SBTRootConstant : register(b0, space15);
ConstantBuffer<CBVData> SBTRootConstant2 : register(b1, space15);

struct Payload
{
  float4 color;
  int index;
};

[shader("miss")]
void RayMiss(inout Payload payload)
{
  payload.color = SBTRootConstant.v;
  payload.color += SBTRootConstant2.v;
}

#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct _19
{
  vec4 _m0;
  uint _m1;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  // ... Local root signature parameters.
  // The exact binary layout is implied by the root signature
  // parameter order.
} SBT;

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  // ... Global root signature parameters
} registers;

layout(location = 0) rayPayloadInEXT _19 payload;

vec4 _67;

void main()
{
  vec4 _41 = uintBitsToFloat(uvec4(SBT._m0[0u], SBT._m0[1u], SBT._m0[2u], SBT._m0[3u]));
  vec4 _57 = uintBitsToFloat(uvec4(SBT._m1[0u], SBT._m1[1u], SBT._m1[2u], SBT._m1[3u]));
  // Ah, yes, the classic LLVM composite insert pattern from undef vector.
  // You thought DXIL didn't support composites? Think again! :D
  vec4 _66 = _67;
  _66.x = _41.x + _57.x;
  vec4 _68 = _66;
  _68.y = _41.y + _57.y;
  vec4 _69 = _68;
  _69.z = _41.z + _57.z;
  vec4 _70 = _69;
  _70.w = _41.w + _57.w;
  payload._m0 = _70;
}

10. Root descriptor – Local Root Signature

Unlike the global root signature, we are forced to use BDA here. It is physically impossible to translate this to push descriptors since we have no chance to observe this pointer on the CPU side; the GPU can generate this pointer on its own. It’s a bit hilarious that we have been able to use raw pointers like this in D3D12 since forever, but we cannot actually use the pointer directly in the shader 😐

The SPIR-V is similar to the global root signature case; we just have to source the BDA from the shader record block instead.

AddCarry _31;
_31._m0 = uaddCarry(SBT._m6.x, 0u * 16u, _31._m1);
PhysicalPointerFloat4NonWrite _38 =
  PhysicalPointerFloat4NonWrite(uvec2(_31._m0, SBT._m6.y + _31._m1));
vec4 _57 = _58;
_57.x = payload._m0.x + _38.value.x;
vec4 _59 = _57;
_59.y = payload._m0.y + _38.value.y;
vec4 _60 = _59;
_60.z = payload._m0.z + _38.value.z;
vec4 _61 = _60;
_61.w = payload._m0.w + _38.value.w;
payload._m0 = _61;

11. Root table – Local Root Signature

Cursed mode. In global root signatures, a table costs 1 DWORD, but in local it’s 2, because we are expected to consume a full GPU VA which points to descriptors. Since we work with offsets, we synthesize a fake GPU VA pointer instead in ID3D12DescriptorHeap::GetGPUVirtualAddress(), which we can trivially convert to an offset and then sample the global heap, SM 6.6 style … The GPU VA is faked with the formula (unique_value << 32) + (offset * DescriptorIncrement)
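
As an illustrative sketch of that formula (function names and the exact field layout are made up here; the GLSL below uses a shift of 5 for the descriptor increment):

#include <cstdint>

// CPU side, in ID3D12DescriptorHeap::GetGPUVirtualAddress(): synthesize a
// fake VA from a per-heap unique value and the descriptor offset.
static uint64_t make_fake_gpu_va(uint64_t unique_value, uint32_t offset,
                                 uint32_t descriptor_increment)
{
  return (unique_value << 32) + uint64_t(offset) * descriptor_increment;
}

// Shader side: only the lower 32 bits matter, and dividing by the descriptor
// increment (a power of two, so just a shift) recovers the heap offset.
static uint32_t heap_offset_from_fake_va(uint64_t fake_va, uint32_t increment_log2)
{
  return uint32_t(fake_va) >> increment_log2;
}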

struct CBVData { float4 v; };
ConstantBuffer<CBVData> SBTCBVs[] : register(b4, space15);

struct Payload
{
  float4 color;
  int index;
};

[shader("miss")]
void RayMiss(inout Payload payload)
{
  payload.color += SBTCBVs[payload.index].v;
}

#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct _25
{
  vec4 _m0;
  uint _m1;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  uvec2 _m2;
  uvec2 _m3;
  uvec2 _m4;
  uvec2 _m5;
  uvec2 _m6;
  uvec2 _m7;
  uvec2 _m8;
  uvec2 _m9;
  uvec2 _m10;
} SBT;

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _24[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  // ... Global root parameters
} registers;

layout(location = 0) rayPayloadInEXT _25 payload;

vec4 _62;

void main()
{
  // Extract offset in descriptor heap from VA.
  // The shift amount is configurable.
  uint _42 = ((SBT._m9.x >> 5u) + 13u) + payload._m1;
  vec4 _61 = _62;
  // Force nonuniformEXT here when indexing from SBT
  // since it's a bit unclear what dynamically uniform even means for RT.
  _61.x = payload._m0.x + _24[nonuniformEXT(_42)]._m0[0u].x;
  vec4 _63 = _61;
  _63.y = payload._m0.y + _24[nonuniformEXT(_42)]._m0[0u].y;
  vec4 _64 = _63;
  _64.z = payload._m0.z + _24[nonuniformEXT(_42)]._m0[0u].z;
  vec4 _65 = _64;
  _65.w = payload._m0.w + _24[nonuniformEXT(_42)]._m0[0u].w;
  payload._m0 = _65;
}

So there we have it, at least 11 different ways we can implement CBV in Vulkan from the exact same DXIL input … Oh never mind, make that 12 if we add the -no-legacy-cbuf-layout flag to DXC.

12. CBufferLoad vs LoadLegacy

Non-legacy loads force scalar loads for everything 🙁 It’s also horribly buggy in DXC and drivers, but that’s a story for another time. With the legacy model, we only really had to consider loading 128 bits at once with a clean index into a float4 array, but now we have a byte-offset based interface. Fortunately, basically no content uses this feature. It’s still opt-in in DXC despite being introduced with SM 6.0.

struct C { uint4 value; };
struct C64 { uint64_t4 value; };
RWStructuredBuffer<uint4> RW : register(u2);
ConstantBuffer<C> CBV : register(b0);
ConstantBuffer<C64> CBV64 : register(b1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBV.value;
  RW[thr] *= uint4(CBV64.value);
}

#version 460
#extension GL_ARB_gpu_shader_int64 : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_scalar_block_layout : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 4, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14[];

// We got descriptor aliasing, no problem!
// We also have actual scalar packing of uniforms :)
layout(set = 5, binding = 0, scalar) uniform BindlessCBV
{
  float _m0[16384];
} _21[];

layout(set = 5, binding = 0, scalar) uniform _25_28
{
  double _m0[8192];
} _28[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
  uint _m4;
  uint _m5;
  uint _m6;
  uint _m7;
} registers;

void main()
{
  uint _35 = registers._m4 + 2u;
  uint _42 = registers._m5 + 1u;

  // Please make it stop ...

  _14[_35]._m0[gl_GlobalInvocationID.x] = uvec4(
    _14[_35]._m0[gl_GlobalInvocationID.x].x *
      floatBitsToUint(_21[registers._m5]._m0[0u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].y *
      floatBitsToUint(_21[registers._m5]._m0[1u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].z *
      floatBitsToUint(_21[registers._m5]._m0[2u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].w *
      floatBitsToUint(_21[registers._m5]._m0[3u]));

  _14[_35]._m0[gl_GlobalInvocationID.x] = uvec4(
    _14[_35]._m0[gl_GlobalInvocationID.x].x *
      uint(doubleBitsToUint64(_28[_42]._m0[0u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].y *
      uint(doubleBitsToUint64(_28[_42]._m0[1u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].z *
      uint(doubleBitsToUint64(_28[_42]._m0[2u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].w *
      uint(doubleBitsToUint64(_28[_42]._m0[3u])));
}

Ugh … There isn’t too much we can do though, the DXIL is hardcoded to be scalar here despite all other buffer load interfaces being vectorized to some degree at least.

%5 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 0, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
// Note the broken alignment, cute. :')
%6 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 4, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%7 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 8, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%8 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 12, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

Without robustness, ACO manages to vectorize this garbage at least:

s_buffer_load_dwordx4 s[4:7], s[8:11], 0x0
s_buffer_load_dwordx4 s[8:11], s[12:15], 0x0
s_buffer_load_dwordx4 s[12:15], s[12:15], 0x10

This also holds even for robustness2, but if we start dynamically indexing the CBV like this, things collapse into noise, since the DXIL is doing insane things with the byte offset here.

%5 = shl i32 %4, 4
%6 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %5, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%7 = or i32 %5, 4
%8 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %7, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

// ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ. I'm out!
%9 = add nsw i32 %7, 4

%10 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %9, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%11 = or i32 %5, 12
%12 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %11, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

Sad compiler noises …

Bonus round

Other descriptor types?

At a fundamental level, once we have studied CBVs, there isn’t that much more to cover, except for some key edge cases. The only real differences are how big the descriptors are and which opcodes consume them, like image_sample, image_load, image_write, etc., but that’s not terribly interesting for this post, so I’ll leave it here.

Raw SRV / UAV vectorization

DXIL raw buffer opcodes are in general quite unfriendly to human readable representations, and they map very poorly to SSBOs. The robustness rules are also horrible to work with, and making this all work was not fun. After a significant rework, dxil-spirv is somewhat competent now, I think.

The raw buffer interfaces do make sense from an ISA-targeting compiler point-of-view though; I just wish they were a bit more structured for the sake of my sanity … :V

Byte address buffers

ByteAddressBuffers are very annoying. In HLSL, only scalar alignment is required for a load-store, with no way for HLSL shaders to signal the intended alignment. There is also a very strange robustness rule: even for a vector load-store, robustness is per-component at a 16 byte granularity. (why …)

With a simple shader like this, we can do a good job:

RWByteAddressBuffer RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  float4 values = RW.Load<float4>(thr * 16 + 16);
  values += 1.0;
  RW.Store<float4>(thr * 16 + 16, values);
}

When we attempt to vectorize these loads and stores, we must be able to prove that the SSA expression can be divided by the load-store size, here 16. We can split the addition and compute the result: thr + 1. We then declare an aliased SSBO variant as float4[].

#version 460
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _10;

void main()
{
  vec4 _29 = uintBitsToFloat(_10._m0[gl_GlobalInvocationID.x + 1u]);
  _10._m0[gl_GlobalInvocationID.x + 1u] = uvec4(
    floatBitsToUint(_29.x + 1.0),
    floatBitsToUint(_29.y + 1.0),
    floatBitsToUint(_29.z + 1.0),
    floatBitsToUint(_29.w + 1.0));
}
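
The divisibility proof itself boils down to checking that both the index multiplier and the constant offset divide cleanly by the vector size. A simplified sketch of the idea (the real pass works on SSA expressions rather than plain integers):

#include <cstdint>

// thr * 16 + 16 with a 16 byte vec4 load-store becomes element index thr + 1.
static bool try_rewrite_as_element_index(uint32_t multiplier, uint32_t constant_offset,
                                         uint32_t vector_size_bytes,
                                         uint32_t &out_multiplier, uint32_t &out_offset)
{
  if ((multiplier % vector_size_bytes) != 0 ||
      (constant_offset % vector_size_bytes) != 0)
    return false;

  out_multiplier = multiplier / vector_size_bytes;
  out_offset = constant_offset / vector_size_bytes;
  return true;
}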

The way we used to implement this was through 4 unrolled scalar load-stores. E.g. if address is changed to something weird like thr * 4 + 20 for the store, we cannot prove vectorization and we fall back into the old and bad code paths.

#version 460
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// These are declared with Aliased decorations in SPIR-V.
layout(set = 0, binding = 0, std430) buffer SSBO
{
  uint _m0[];
} _9;

layout(set = 0, binding = 0, std430) buffer _12_14
{
  uvec4 _m0[];
} _14;

void main()
{
  uint _22 = gl_GlobalInvocationID.x << 4u;
  uvec4 _30 = _14._m0[gl_GlobalInvocationID.x + 1u];
  vec4 _33 = uintBitsToFloat(_30);
  uint _46 = gl_GlobalInvocationID.x + 5u;
  _9._m0[_46] = floatBitsToUint(_33.x + 1.0);
  _9._m0[_46 + 1u] = floatBitsToUint(_33.y + 1.0);
  _9._m0[_46 + 2u] = floatBitsToUint(_33.z + 1.0);
  _9._m0[_46 + 3u] = floatBitsToUint(_33.w + 1.0);
}

This is a case where robustness2 can matter quite a lot actually … Without robustness, ACO will generate:

v_add_f32_e32 v8, 1.0, v4
v_add_f32_e32 v9, 1.0, v5
v_add_f32_e32 v10, 1.0, v6
v_add_f32_e32 v11, 1.0, v7
buffer_store_dwordx4 v[8:11], v0, s[4:7], 0 offen offset:20

Memory operations on AMD are very flexible w.r.t. alignment fortunately! With robustness2 however, we cannot safely vectorize 🙁

buffer_store_dword v4, v0, s[4:7], 0 offen offset:20
buffer_store_dword v5, v0, s[4:7], 0 offen offset:24
buffer_store_dword v6, v0, s[4:7], 0 offen offset:28
buffer_store_dword v1, v0, s[4:7], 0 offen offset:32

If we emit the vectorized load-store ourselves, we get the vectorized code as expected.

The case with float3[] vectorization is very special. With scalar block layout we can implement it, but we need to consider the 16 byte robustness rule. On AMD, we get per-component robustness in hardware, but not on NVIDIA it seems, so for ByteAddressBuffer, float3[] can only conditionally be vectorized. We’re probably going a bit out of spec on Vulkan here I think, but it’s backed up by tests in the vkd3d-proton test suite.

RW.Store<float3>(thr * 12, values.xyz);

ends up as

layout(set = 0, binding = 0, scalar) buffer _12_14
{
  uvec3 _m0[];
} _14;

_14._m0[gl_GlobalInvocationID.x] = uvec3(
  floatBitsToUint(_41.x + 1.0),
  // Individual components might be sliced here. >_<
  floatBitsToUint(_41.y + 1.0),
  floatBitsToUint(_41.z + 1.0));

On AMD, this compiles to a single vectorized store:

buffer_store_dwordx3 v[4:6], v0, s[4:7], 0 offen

Structured buffer

Structured buffers are a bit simpler, but still annoying in practice. While the HLSL declares something like:

struct T { float4 a; float3 b; float c; };
RWStructuredBuffer<T> Buf;

This information is lost in the raw DXIL form except in some reflection metadata that we cannot fully trust. The interface for structured buffers is just:

  • Index (usually not constant)
  • Offset into structured element (usually constant)
  • Load-store size
  • Alignment (ignored, DXC emits nonsense here)

Ideally, we would be able to emit the equivalent in SSBO form, but the raw nature of DXIL no longer cares about the data types. We can only get effective vectorization for composites like this if the stride aligns with the vector size.
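
In other words, with the parameters from the list above (stride, constant element offset, load-store size), the vectorization condition is a pure divisibility check. A minimal sketch with hypothetical names:

#include <cstdint>

struct StructuredAccess
{
    uint32_t stride;         // element stride in bytes (32 in the example below)
    uint32_t element_offset; // constant offset into the element (0, 16, 28 below)
    uint32_t size;           // load-store size in bytes (16, 12, 4 below)
};

// Can we declare the SSBO as uvecN[] (N = size / 4) and index it directly?
// Both the stride and the constant offset must be multiples of the access size,
// otherwise consecutive elements would not land on vector boundaries.
static bool can_vectorize(const StructuredAccess &a)
{
    return a.size >= 8 && a.stride % a.size == 0 && a.element_offset % a.size == 0;
}

// If this holds, the vectorized index is: index * (stride / size) + element_offset / size.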

An example here would be:

struct T { float4 a; float3 b; float c; };
RWStructuredBuffer<T> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  T t = RW[thr];
  t.a += 1.0;
  t.b += 2.0;
  t.c += 3.0;
  RW[thr] = t;
}

The loads in DXIL look like this:

// Alignment of 4? Sure ... Except we trivially know it's 32 ...
// Told you we cannot trust DXC here! <_<
%3 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 0, i8 15, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

%8 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 16, i8 7, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

%12 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 28, i8 1, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

Since the stride is 32, we can vectorize the float4, but not the float3:

// float4 load
vec4 _29 = uintBitsToFloat(_14._m0[gl_GlobalInvocationID.x * 2u]);

// float3 load
uint _37 = (gl_GlobalInvocationID.x * 8u) + 4u;
uint _40 = _9._m0[_37];
uint _44 = _9._m0[_37 + 1u];
uint _47 = _9._m0[_37 + 2u];
vec3 _50 = uintBitsToFloat(uvec3(_40, _44, _47));

// float load
uint _58 = _9._m0[(gl_GlobalInvocationID.x * 8u) + 7u];

// float4 store
_14._m0[gl_GlobalInvocationID.x * 2u] = uvec4(
  floatBitsToUint(_29.x + 1.0), floatBitsToUint(_29.y + 1.0),
  floatBitsToUint(_29.z + 1.0), floatBitsToUint(_29.w + 1.0));

// float3 store
uint _79 = (gl_GlobalInvocationID.x * 8u) + 4u;
_9._m0[_79] = floatBitsToUint(_50.x + 2.0);
_9._m0[_79 + 1u] = floatBitsToUint(_50.y + 2.0);
_9._m0[_79 + 2u] = floatBitsToUint(_50.z + 2.0);

// float store
_9._m0[(gl_GlobalInvocationID.x * 8u) + 7u] =
  floatBitsToUint(uintBitsToFloat(_58) + 3.0);

On AMD, the stores compile down to:

buffer_store_dwordx4 v[12:15], v0, s[4:7], 0 offen
buffer_store_dwordx4 v[4:7], v0, s[4:7], 0 offen offset:16

ACO is being an absolute champ here though, and even with robustness2 it can handle it. I think this happens because it can prove the address increments cannot overflow due to a POT index multiplier.

One saving grace of structured buffers is that it is undefined to straddle an element boundary. This means there is no byte address buffer hell of having to support per-component robustness. We can safely vectorize e.g.

RWStructuredBuffer<float3> Buf;

Root descriptor of ray tracing acceleration structure

If you ever wonder why OpConvertUToAccelerationStructureKHR exists, you can thank yours truly for creating a truly horrible monster.

struct Payload
{
  float4 color;
};

RaytracingAccelerationStructure AS_RootDesc : register(t0, space0);

[shader("raygeneration")]
void RayGen()
{
  RayDesc ray;
  ray.Origin = float3(1, 2, 3);
  ray.Direction = float3(0, 0, 1);
  ray.TMin = 1.0;
  ray.TMax = 4.0;

  Payload p;
  p.color = float4(1, 2, 3, 4);
  TraceRay(AS_RootDesc, RAY_FLAG_NONE, 0, 0, 0, 0, ray, p);
}

This translates to:

#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_nonuniform_qualifier : require

struct _16
{
  vec4 _m0;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  uvec2 _m2; // Why hello there. I'm a cursed RTAS.
} SBT;

layout(location = 0) rayPayloadEXT _16 _18;

void main()
{
  _18._m0 = vec4(1.0, 2.0, 3.0, 4.0);
  traceRayEXT(accelerationStructureEXT(SBT._m2),
    0u, 0u, 0u, 0u, 0u,
    vec3(1.0, 2.0, 3.0), 1.0,
    vec3(0.0, 0.0, 1.0), 4.0, 0);
}

Oh no!

Conclusion

Now you probably know a little too much about descriptors, and you can go bug IHVs about this with confidence. D:

There’s a million different ways to do things when it comes to descriptors and D3D12 and Vulkan took completely different approaches here. Developers asked how things work and D3D12 said “… yes”. In response I say ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ.

My personal hell of translating DXIL to SPIR-V – part 3

It’s time we tackle one of the big problems of DXIL, the binding model. The D3D12 binding model is completely foreign to most people who know the Vulkan binding model, and vice versa. I don’t think there are that many people in the world who can claim to fully grasp the binding models in both APIs. Translating every last detail of the D3D12 binding model to Vulkan is extremely painful, and I feel D3D12 made some critical design mistakes which bite us (and native drivers?) hard. Whenever I hear people naively claim D3D12 and Vulkan are basically the same API, I cringe hard and cry a little inside. Translating low level APIs is hellish when the details don’t map 1:1 exactly, and the binding model is the perfect vehicle to demonstrate it.

I hope this blog post can serve as a definitive document on the insanity we need to go through in vkd3d-proton to make all of this work well. We have landed on a solution I feel is quite solid for AMD, but perhaps less so on other IHVs … A lot of credit here goes to Doitsujin who went through the insane task of rewriting the entire binding model in vkd3d-proton to the full TIER_3 binding model last year.


The starting point of D3D12 – allow reuse of D3D11-era shaders

Before we can discuss D3D12, we must have a basic understanding of the D3D11 binding style we find in SM 5.0. From there, we will see how this was extended to the frankenstein monster that is SM 5.1 DXBC and how this was cribbed wholesale in SM 6.0+.

Binding resources to slots

The binding model in D3D11 was extremely simple, simple enough that we have no problems implementing it in Vulkan 1.0 as-is. Essentially, you can bind a fixed number of:

  • SRV (t#): Generic read-only resources, sampled images, read-only SSBO and uniform texel buffers
  • UAV (u#): Generic read-write resources, storage images, read-write SSBO and storage texel buffers
  • Samplers (s#): Separate samplers, same as Vulkan
  • CBV (b#): Constant buffer views, i.e. UBOs in Vulkan

Each resource in the HLSL declares which register it binds to, easy enough! This maps very easily to Vulkan. The main difference is that the bindings are per-stage in D3D11, but these details are easy to work around. Another awkward detail of D3D11 is that SRVs and UAVs are catch-alls for any kind of read-only or read-write resource, even though they cover many different descriptor types. Sampled images, raw buffers and typed buffers are quite different things! Of course, with a slot based binding model, the implementation can resolve these details easily since we know all accessed resources up front. No bindless hell to deal with.

Now, we imagine that game developers are sitting on a large repository of D3D11-era shaders and D3D12 rolls around. Very few were going to risk going D3D12 exclusive any time soon, so it had to be possible to ship the exact same DXBC shaders to target both APIs. Even to this day, most new D3D12 games ship DXBC to target both APIs! (Maintaining two completely different shader backends is soooo much fun and productive use of time …) The center of gravity is slowly shifting towards DXIL, and it’ll probably take a few more years before DXIL is the main bytecode games ship. I still haven’t seen a single Unreal Engine title shipping DXIL-exclusive yet for example.

One key aspect of the D3D11 model is that the shaders say nothing about how data is accessed. Buffer data in particular can be implemented in many different ways, and it would have been up to the D3D11 driver to figure out how to implement constant buffers efficiently, for example. Of course, with the highly managed model of that API, the driver had ample opportunity to do so, at the cost of CPU overhead. With explicit APIs like D3D12 and Vulkan, there is far less room for drivers to optimize because the application developer is given certain freedoms, but they are expected to do some work in return. In reality, the IHV will probably add hacks behind your back if your performance is important enough to them, but that’s another story.

With buffer data like constant buffers for example, there are three primary indirection levels where we can trade off speed vs. space:

  • Preloaded registers: The absolute fastest mode of execution. If a constant buffer is tiny (a few u32s), it might as well fit entirely inside the register bank of the shader core, no memory operation required in the shader!
  • Store a pointer in preloaded registers: The second fastest mode. We can take one indirection to be able to access a larger chunk of data with a simple load instruction from pointer.
  • Access a descriptor: The “slowest” mode, with two indirections. First, we need to load a descriptor, then, we load data based on that descriptor. We also get robustness guarantees, which is another special kind of hell to handle.

As we see here, these three styles of buffer access have different trade-offs. We can get more efficient access the smaller data we need to consider. A few u32s of data? Preload registers. A few KiB? Store a pointer to it in registers. Many different buffers with different base pointers? Descriptor is the way to go.

The awkward part of this simplified model is that on certain hardware, using descriptors might be the fast path, so we’ll have to be careful. IHVs have optimized these paths to death over the decades.

The critical part for D3D12 here is that there is no way to express these concerns in the HLSL shader itself, whereas we certainly can in GLSL / SPIR-V!

A “virtual” vs. “physical” binding model

In Vulkan, we can express concerns like these in the SPIR-V for the most part.

  • Preloaded registers: Push constants is designed to map directly to this scheme. Some drivers might also be able to promote INLINE_UNIFORM_BUFFER to registers, especially if the descriptor set is not UPDATE_AFTER_BIND.
  • Store a pointer in preloaded registers: In Vulkan, we can place buffer device addresses in a push constant block and load directly from that. This is somewhat esoteric, but we use this a lot in vkd3d-proton. INLINE_UNIFORM_BUFFER also maps to this scheme. If we consider a descriptor set to be a pointer, inline UBO data can be placed directly in that descriptor memory, and we get one indirection. Implementation details between IHVs tend to be wildly different for inline UBO.
  • Access a descriptor: Just normal descriptor sets. If the descriptor set is not UPDATE_AFTER_BIND, a driver might be able to promote a buffer to a pointer in preloaded registers. This model has two indirections. One to load a descriptor, one to load the data from that descriptor.

Since we’re able to express ourselves fairly explicitly in Vulkan in the shader itself, I’ll call this a “physical” binding model, or at least a close enough approximation. A true physical binding model would be one where descriptors are accessed through raw pointers, but good luck convincing N IHVs to agree on how that should work 🙂 Inline UBO is the odd exception to this rule where the pipeline layout specifies access patterns which are not expressed in SPIR-V.

One fair criticism of Vulkan’s model is that there are a lot of ways to do effectively the same thing. Should you use push constants, inline UBO, UBO, UBO with dynamic offset, push constant buffer device address, push descriptors or normal pool descriptors? Hard to say without lots of experience and profiling. At least we have the tools available in the API, even if it’s not always obvious what the optimal choice is. In vkd3d-proton we use almost all of these, depending on the context.

To lament the state of graphics programming a bit, loading constant data optimally is still an unsolved problem in graphics, but we have accelerated ray traced global illumination at least, so yay? 🙂 To be fair, it’s hard to go wrong with plain old Vulkan 1.0 UBOs. Any perf gains beyond that tends to be minor and highly situational.

The abstract nature of D3D12 binding model

D3D12’s model is far more abstract, but at the same time, it maps to extremely specific restrictions on an implementation. It’s a weird model where it feels like an abstraction, but it actually isn’t. We can see right through it. We’ll explore this point in the root signature section …

Shader model 5.1 – I wanna bindless too!

5.1 is a weird version of DXBC. It changes some critical things:

  • Resources can be declared as arrays with dynamic indexing. Unbounded array size is also supported. Bindless, baby!
  • register() bindings now take an optional space parameter. This seems similar to DescriptorSet in SPIR-V, but it actually isn’t. It’s meaningless on its own. It is necessary however when using a lot of unbounded array size declarations. In DXBC, all t# or u# registers for example would be exhausted by unbounded array sizes, but by using different register spaces we can declare multiple unbounded arrays. When referring to a resource, we need to consider its binding, space, shader stage and type (SRV, UAV, CBV vs Sampler).

5.1 is kinda obsolete now that DXIL (6.0) has been around for a while, but it is still shipped in various titles.

The root signature

As mentioned earlier, HLSL shaders give no hints to the driver about how resources are actually accessed, so this information has to come from somewhere. In D3D12, we provide all this information in the root signature. The root signature is analogous to Vulkan’s VkPipelineLayout, but the details are completely different.

I think it’s helpful to understand D3D12 root signatures from a top-down view, and the point I would like to get across is that we should see the root signature as a very fancy push constant layout declaration.

The root signature defines an ABI for shaders. We lay down up to 256 bytes in memory, and then define how those 256 bytes can be used to access resources in various ways. The weird way root signatures work is that there’s a very explicit limit of a certain number of bytes, but we cannot actually access those bytes ourselves in a shader. The 256 bytes are organized in terms of 64 DWORDs. Each DWORD can be viewed in a certain way.

  • 32_BIT_CONSTANT: Consume a certain number of DWORDs and assign them to a CBV with specific register binding. That CBV’s data cannot be accessed with any dynamic array indices. This enforces the hidden implementation detail that the data can be mapped directly to registers. As an interesting tidbit, Vulkan push constants actually don’t have this restriction. In Vulkan, you can still dynamically index push constants as long as the access is dynamically uniform (yes, some hardware can dynamically index registers, it’s neat!)
  • Root descriptor: Consume 2 DWORDs to store a 64-bit GPU virtual address to any kind of raw buffer. CBV, UAV and SRV all work as long as they are untyped. The key thing to note here is that there is no room for a buffer size, so buffer robustness is completely disabled for root descriptors. All of this is just a very roundabout way to say that we store a pointer and access it as a pointer. This is basically a very restricted form of VK_KHR_buffer_device_address which is hidden from the developer. The driver is of course free to turn this into something like a Vulkan push descriptor if it so chooses. In vkd3d-proton, we’d love if games actually made use of root CBVs, but oh well 🙁 Some engines got the memo though 🙂
  • Table pointer: The cost is 1 DWORD, even though it should be 2 (we’ll get to this later, it gets confusing!). This is where the D3D12 binding model starts to get esoteric and diverge completely from Vulkan. We’ll need a separate section to cover this.
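
To make the DWORD budget concrete, here is a minimal cost-accounting sketch of the three parameter types above (this is not the D3D12 API, just the arithmetic, and the names are made up):

#include <cstdint>
#include <vector>

enum class RootParamType { Constants32Bit, RootDescriptor, TablePointer };

struct RootParam
{
    RootParamType type;
    uint32_t num_32bit_values; // only relevant for Constants32Bit
};

// 64 DWORDs == 256 bytes is the total root signature budget.
static bool fits_in_root_signature(const std::vector<RootParam> &params)
{
    uint32_t dwords = 0;
    for (auto &p : params)
    {
        switch (p.type)
        {
        case RootParamType::Constants32Bit: dwords += p.num_32bit_values; break; // N DWORDs of constants
        case RootParamType::RootDescriptor: dwords += 2; break;                  // 64-bit GPU VA
        case RootParamType::TablePointer:   dwords += 1; break;                  // u32 offset into the heap
        }
    }
    return dwords <= 64;
}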

The global descriptor heaps and table pointers

Descriptors are managed in a completely different way than Vulkan and this is the point that gives us a pretty big headache.

In D3D12, you’re expected to allocate one heap which contains N descriptors. N goes up to 1,000,000 in the highest binding tier. All descriptors you access in a shader must exist inside this heap, and the heap must be explicitly bound to a command list before use. For a given command list, you’re not really expected to change it, but you can suballocate, and 1 million descriptors should be enough unless we take the Nanite meme far enough where we need to render one material per pixel. Samplers have their own heap, and there’s a limit of a few thousand here.

From the global heap, you’re able to bind sub-ranges of that heap. These sub-ranges are represented as the table pointers in the root signature. Essentially, a table pointer encodes a u32 offset into the global heap – even if you give the API a GPU VA – which is why it costs 1 DWORD instead of 2 (for now …).

Apparently, this idea is a workaround for certain GPUs. There exists a literal “descriptor palette” in some hardware and we can only read descriptors from this “palette” essentially. D3D12 seems to cater to that model, at the cost of flexibility. You’re asked as a developer to own that palette / heap and allocate memory on top of that.

View objects work completely differently as well. In D3D12, a view object is created directly into a descriptor heap, rather than having separate VkImageView and VkBufferView objects. Copying descriptors is a critical part of D3D12 descriptor management. It’s possible to create non-shader visible descriptor heaps, which is basically just a fancy malloc(). Here, it’s intended that we can stream out descriptors to the shader-visible descriptor heap by calling CopyDescriptors over and over, often over 10000 times per frame! Making this path efficient enough has been a ton of work and has placed some serious restrictions on our implementation. There is only one way to do it “fast” and correct, it just took 3-4 iterations to get there.

In Vulkan, the approach is quite different. We create view objects as standalone objects, which are then written into descriptor sets with vkUpdateDescriptorSets. If we squint enough, the Vulkan view objects represent the non-shader visible descriptor heaps, and vkUpdateDescriptorSets is closer to CopyDescriptors (non-shader visible → shader visible) than Create*View. Profiling native drivers shows this clearly. Create*View() is ~50-100x slower than CopyDescriptors, similar to Vulkan if you look at vkCreate*View() vs. vkUpdateDescriptorSets.

In the Vulkan model, there is no concept of a “descriptor heap” which has to be bound, however, we can figure out what is really going on under the hood. We can allocate a lot of view objects in Vulkan, but eventually, we will actually hit out-of-memory conditions on some hardware. This is a good indication that we have actually exhausted the internal descriptor “palette”. The views merely contain references to that global palette instead.

Another headache is that view objects in D3D12 do not have a lifetime of their own. The lifetime is tied to the descriptor heap itself, as a descriptor is essentially treated as plain old data. This is not the case with Vulkan. The unfortunate side effect of this scenario is that we need to maintain a per-ID3D12Resource hash-map in vkd3d-proton where we keep view objects alive until we know for sure that we can destroy the views, i.e. at resource destruction time. The older implementation in vkd3d-proton used reference counted views, but as you can expect, when games copy 10000+ descriptors per frame in many threads concurrently, the overhead was unusably large. The shit hit the fan, so to speak, in Death Stranding where we spent >80% of CPU time copying descriptors, not a good look. The only solution was to use the descriptor copy path (VkCopyDescriptorSet) of vkUpdateDescriptorSets, which is considered quite esoteric in Vulkan.

So how do shaders access the global descriptor heap? Well, through a lot of fixed function jank, that’s how! This jank is more or less fixed in SM 6.6 as we’ll get to later, but now we just have yet another API to implement, support and test, sigh …

First, we start with the table parameter that is pushed to the GPU through the command list API. This represents an offset into the descriptor heap. A table entry has a certain number of descriptor ranges associated with it. These ranges specify that a certain subset of the heap will be consumed as a given descriptor type. For example, consider a Texture2D[] bound at register(t10, space4). A table entry in the root signature might say “Parameter #5 has an SRV range which begins at space #4, register #8, contains an unbounded number of descriptors, with a constant offset of 15 descriptors from the table entry.” When accessing this resource, we should access the descriptor heap at descriptor #tableEntry5 + (10 – 8) + 15 + dynamicIndex in the shader. As we see, the registers and spaces in the HLSL have no physical meaning until the root signature gives them meaning.
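
The address computation in that example boils down to a handful of adds. A minimal sketch with hypothetical names, not the actual dxil-spirv code:

#include <cstdint>

struct DescriptorRange
{
    uint32_t base_register; // register where the range starts (t8 in the example)
    uint32_t table_offset;  // constant descriptor offset from the table entry (15 in the example)
};

// table_entry is the u32 heap offset pushed for root parameter #5.
// shader_register is the register the resource was declared with (t10 in the example).
static uint32_t heap_index(uint32_t table_entry, const DescriptorRange &range,
                           uint32_t shader_register, uint32_t dynamic_index)
{
    return table_entry + (shader_register - range.base_register) +
           range.table_offset + dynamic_index;
}

In the codegen below, expressions like registers._m1 + (INDEX + 40u) are this same computation with the constants folded together.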

Codegen to SPIR-V

We can map the entire root signature into two parts, the root parameters and immutable sampler declaration.

For example, take this shader:

cbuffer CPush : register(b0) { float4 cpush; };
cbuffer C : register(b1) { float4 c; };
cbuffer C2 : register(b2) { float4 c2; };

Texture2D<float4> T[] : register(t40);
SamplerState S : register(s30);

float4 main(uint index : INDEX, float uv : UV) : SV_Target
{
	return T[NonUniformResourceIndex(index)].Sample(S, uv) +
		cpush + c + c2;
}

A typical codegen here would be something like (via SPIRV-Cross):

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct AddCarry
{
    uint _m0;
    uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerFloat4NonWrite;
layout(buffer_reference, std430) readonly buffer PhysicalPointerFloat4NonWrite
{
    vec4 value;
};

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
    vec4 _m0[4096];
} _21[];

// The root parameters map to push constants.
layout(push_constant, std430) uniform RootConstants
{
    uvec2 _m0; // Root CBV for b1
    uint _m1;  // Various table offset root parameters
    uint _m2;  // ...
    ...
    uint _m9;  // Root constants
    uint _m10;
    uint _m11;
    uint _m12;
} registers;

layout(set = 0, binding = 0) uniform texture2D _14[];
layout(set = 2, binding = 0) uniform sampler _25[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in float UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    uint _46 = registers._m6 + 2u;

    // Textures can only be accessed through heap.
    // Here we see address computation being based on all three inputs.
    // Push constant + constant + dynamic index
    // All of this is wrapped in a nonuniform thing for extra spice.
    vec4 _66 = texture(nonuniformEXT(sampler2D(_14[registers._m1 + (INDEX + 40u)], _25[registers._m3 + 30u])), vec2(UV));

    // Root constants, directly copy from push constants.
    // The cbufferLoad instruction gives us constant offsets, so we can
    // directly access push constant members here.
    vec4 _84 = uintBitsToFloat(uvec4(
        registers._m8, registers._m9, registers._m10, registers._m11));

    // Root CBV, here implemented with buffer device address,
    // kinda ugly SPIR-V, but gotta go fast.
    // This works great on AMD, not so much on NVIDIA … but we can
    // Use a normal plain UBO instead if we want.
    // Ah, the luxury of giving the implementation a choice … :)
    
    AddCarry _98;
    _98._m0 = uaddCarry(registers._m0.x, 0u * 16u, _98._m1);
    PhysicalPointerFloat4NonWrite _105 = PhysicalPointerFloat4NonWrite(uvec2(_98._m0, registers._m0.y + _98._m1));

    // Bindless UBO action, table CBV here we go …
    // Not a great look for performance on some GPUs.
    // On Pascal in particular, we have to emit bindless SSBO instead, rip perf.
    // Not much we can do sadly.
    SV_Target.x = ((_84.x + _66.x) + _105.value.x) + _21[_46]._m0[0u].x;
    SV_Target.y = ((_84.y + _66.y) + _105.value.y) + _21[_46]._m0[0u].y;
    SV_Target.z = ((_84.z + _66.z) + _105.value.z) + _21[_46]._m0[0u].z;
    SV_Target.w = ((_84.w + _66.w) + _105.value.w) + _21[_46]._m0[0u].w;
}


It’s common to think of the table parameters as equivalents to Vulkan descriptor sets, but that is misleading. The correct way to think about it is that D3D12 has one huge descriptor set (technically two, counting the sampler heap), and for convenience you can push offsets into that heap as desired. In Vulkan, descriptor sets can be allocated and bound in complete isolation. Shader model 6.6 removes all of this pretense and shows it for what it really is. Everything is just an index into the heap, and the pre-6.6 world tried to hide this fact.

My main problems with the binding model

Very loose coupling between root signature and table descriptor access

The only thing the root signature does is define how offsets into the heap are computed. There is no knowledge about the array-ed-ness of a resource, nor whether every resource is even valid at record time (the exception here is STATIC descriptors, but alas, extremely rare in the wild). Unfortunately, this loose nature of root signature and shader matching complicates descriptor hoisting, and when you complicate hoisting, you better pray the target hardware has pristine bindless support (pro-tip, it doesn’t).

A root signature could say that “we have 100 CBVs in this range”, but shaders might just happen to use a few CBVs from that range. Doing descriptor work per shader is pretty gross and goes against the spirit of the modern graphics APIs. Aggressive hoisting and repacking is CPU overhead that we shouldn’t consider except in exceptional circumstances. A native driver might have a much easier time dealing with these things since they can modify their own command streams at the last minute if they want. In D3D12, SIMULTANEOUS_ACCESS command lists don’t exist, so the driver has this option. Perhaps this restriction was put in place precisely for these scenarios? Who knows, we can only armchair these things!

The D3D12 docs also encourage developers to use somewhat generic root signatures and rarely change them, so a root signature is likely going to be quite “fat” compared to an ideal Vulkan pipeline layout, making useful workarounds even more difficult …

Awkward and unnecessary aliasing

In a D3D12 descriptor heap, there is no descriptor type. Descriptor #10 can be an SRV and descriptor #11 can be a CBV, it’s all a few bits between friends. Surely, games are not going to commit the grave sin of placing the wrong descriptor type in the wrong index, right? … Right?! Of course they do. Debugging insanity like this was the prelude to vkd3d-proton’s fast descriptor QA checking mechanism. It’s actually quite fast. In AAA games, we have code like this running at 40 FPS usually. More than good enough I’d say when hunting for bugs.

void descriptor_qa_report_fault(uint fault_type, uint heap_offset, uint cookie, uint heap_index, uint descriptor_type, uint actual_descriptor_type, uint instruction)
{
    uint _63 = atomicAdd(QAGlobalData.fault_atomic, 1u);
    if (_63 == 0u)
    {
        QAGlobalData.failed_cookie = cookie;
        QAGlobalData.failed_offset = heap_offset;
        QAGlobalData.failed_heap = heap_index;
        QAGlobalData.failed_descriptor_type_mask = descriptor_type;
        QAGlobalData.actual_descriptor_type_mask = actual_descriptor_type;
        QAGlobalData.failed_instruction = instruction;
        QAGlobalData.failed_shader_hash = uvec2(291u, 0u);
        memoryBarrierBuffer();
        QAGlobalData.fault_type = fault_type;
    }
}

uint descriptor_qa_check(uint heap_offset, uint descriptor_type_mask, uint instruction)
{
    uint _98 = QAHeapData.descriptor_count;
    uint _100 = QAHeapData.heap_index;
    uvec2 _102 = QAHeapData.cookies_descriptor_info[heap_offset];
    uint _110 = QAGlobalData.live_status_table[_102.x >> 5u];
    uint _121 = (uint(heap_offset >= _98) | (((_102.y & descriptor_type_mask) == descriptor_type_mask) ? 0u : 2u)) | (((_110 & (1u << (_102.x & 31u))) != 0u) ? 0u : 4u);
    if (_121 != 0u)
    {
        descriptor_qa_report_fault(_121, heap_offset, _102.x, _100, descriptor_type_mask, _102.y, instruction);
        return _98;
    }
    return heap_offset;
}

void main()
{
    uint _45 = descriptor_qa_check(registers._m1 + (INDEX + 40u), 1u, 1u);
    vec4 _132 = texture(nonuniformEXT(sampler2D(_14[_45], _18[registers._m3 + 30u])), vec2(UV));
    SV_Target.x = _132.x;
    SV_Target.y = _132.y;
    SV_Target.z = _132.z;
    SV_Target.w = _132.w;
}


With this scheme we can check for:

  • Is the descriptor type correct?
  • Is the resource destroyed already?
  • Did we index the heap out of bounds?

If we hit failures we get a neat atomic cookie in host memory we can sample at regular intervals on CPU. Works surprisingly well and tailor made for our needs, which is likely why it’s so fast compared to generic alternatives. We also only report the first failure. With this mode enabled, we allocate N + 1 descriptors, and reserve the last descriptor as a NULL descriptor which we can OpSelect on failure to avoid GPU hangs. (well, if someone did a faulty atomic compswap loop on a NULL descriptor I guess we’d timeout, but whatever, not much we can do about that).

Forcing the addition of VK_VALVE_mutable_descriptor_type

The combination of weirdly aliased descriptor types and full bindless is pretty deadly since Vulkan has typed descriptors, for good reason! This unfortunate mismatch forced us to create many descriptor sets, one per type, which looks very dumb. It consumes enough memory that we actually ran into serious performance issues on AMD hardware in the beginning. To fix all the immediate issues, we developed VK_VALVE_mutable_descriptor_type, which is kinda gross, but it lets us implement D3D12’s weird aliased descriptors directly, and all the problems went away. It’s a pretty goofy model though and not something Vulkan developers should be using directly. Not all descriptors have the same size in hardware, so there are gaps between descriptors, which I’m pretty sure is the case for native D3D12 as well, rip K$ … The more natural model would have been to have descriptor heaps where equally sized objects were allocated in different heaps, but we have to deal with whatever problems D3D12 throws our way and make it work …

Another “nice” side effect of using mutable was that some GPU hangs went away. If the game used the wrong descriptor type, it seemed to at least not read a descriptor that pointed to already freed memory, but rather just a descriptor of wrong type. Somehow, this helped certain games to run around the time the extension was released. It is deeply disturbing that games can ship in this state. :\

A game engine targeting Vulkan can be smarter about all these things. For example, having 1M sampled images seems reasonable, as that’s the number one thing you’d use bindless for, but do you need 1M CBVs, fully bindless? I doubt it. I’m far more in favor of a model where normal, packable, hoistable descriptors are used by default, and bindless is only tapped into as needed.

Enforces a fully bindless implementation

When implementing this binding model in other APIs, we’re forced to make everything fully bindless. Everything is accessed in terms of the global descriptor heap, and even if 99% of shaders use a very traditional D3D11-style binding model, we cannot do anything smart here because of one critical design flaw of D3D12 …

The great VOLATILE mistake of Root Signature 1.0

Root Signature 1.0 really doubles down on forcing full bindless everywhere, even when shaders don’t actually need it. All descriptors are considered VOLATILE, which means that we require all the Vulkan flags that are opt-in:

  • UPDATE_AFTER_BIND
  • UPDATE_WHILE_PENDING
  • PARTIALLY_BOUND
  • VARIABLE_COUNT

This fact completely blocks us from hoisting descriptors on the fly. Even if the shader is simple enough, we’re royally screwed because there is no guarantee that descriptors are actually valid at the time of command recording. Almost all games still cling to Root Signature 1.0, and probably rely on driver heroics to work around any performance implication. Table CBV on Pascal GPUs for example is disgustingly slow in vkd3d-proton. In Vulkan, these GPUs don’t even support UPDATE_AFTER_BIND UBOs! I can only wonder what utter depravities the D3D12 driver engineers had to do to make this work well … Hoist descriptors late with device generated commands based on the PSO? Bleh. The API forces us to implement these as bindless SSBO, and I die a little inside every time I have to think about this. Root CBV works perfectly fine, yet games insist on using table CBVs with VOLATILE just because …

On AMD, everything is fully bindless anyways, as descriptors are just memory, but it’s unfortunate that the D3D12 API went VOLATILE by default. It really should have been opt-in. It’s not like the poor Pascal souls can actually buy new GPUs these days even if they wanted to 🙂

The second mistake of Root Signature 1.1 – STATIC

The mistake of VOLATILE-by-default was recognized by the time 1.1 rolled around, and STATIC was now made the default. STATIC is the nicest mode possible for drivers, since it fully allows hoisting of descriptors, but you actually lose robustness guarantees (!). Most big game engines probably looked at this for 2 seconds and noped out.

D3D in general has gone to extreme lengths to ensure robustness guarantees, and engines of course rely on every esoteric OOB scenario to not fail. This is hopefully not by design, but engine bugs get masked for years if they never cause issues in practice. Losing that guarantee means no one is going to risk random GPU crashes in the wild for just a potential performance increase.

The actual fix that no game ever uses – STATIC + BOUNDS_CHECKED

The actual fix is fairly obscure. At some point a more reasonable mode was added which preserves bounds checking. Of course, no game I know of actually uses this. vkd3d-proton can take advantage of STATIC / BOUNDS_CHECKED CBVs in some cases where we will hoist them to push descriptors for significant NV perf gains, but alas, when no games use the APIs like intended, it might as well not exist …

Immutable sampler jank

The docs imply that the driver is supposed to implement this with an internal hashmap of sampler objects. Oddly enough, these sampler objects live outside the normal sampler heap you bind to the command list, and in fact, you don’t have to bind anything to use immutable samplers, it’s intended to be implemented through driver magic.

This magic is certainly odd, but in Vulkan we have to allocate and bind immutable samplers to a command buffer. This is because very little hardware actually supports true immutable samplers in the sense that they are embedded in the shader code itself. While we still have to allocate descriptor sets, at least we don’t have to actually write the immutable samplers to the descriptor set.

On AMD, the descriptors are read directly from scalar registers, so there’s nothing stopping a motivated compiler from emitting a bunch of immediate register moves and going to town, but sadly, most hardware doesn’t work like this, and I don’t think drivers even take advantage of this possibility yet.

My big question is how this is even supposed to work on D3D12 if the samplers have to live somewhere in memory. If the shader is simply storing constant offsets into some hidden sampler heap, then surely those constants are not stable across different runs, and thus shader caching fails. Patching in new constants seems pretty gross, but I can only speculate … Either way, in vkd3d-proton we end up creating a VkDescriptorSet per root signature which holds all immutable samplers for a given root signature. This is automatically bound when flushing dirty state in a draw or dispatch call.

Painful and awkward raw buffer types

In Vulkan we are blessed with the SSBO, the flexible and versatile buffer. D3D12 is a bit more weird and has extremely specific alignment and robustness guarantees which don’t map cleanly to anything.

First, we have ByteAddressBuffer.

  • Alignment requirement of buffer binding, 16 bytes (good)
  • Load1,2,3,4 variants, with 4 byte alignment requirement (2 for 16-bit). SM 6.2 improved this a fair bit, with a templated type, but for us in DXIL land, very little changes.
  • Robustness is checked per-component (why ._.)
  • Loads and stores are done at specific byte offsets, which is super ugly, but mostly for shader authors

The only natural way to implement this is a uint[] array, where we unroll loads and stores per-component. Fun times. I would be surprised if hardware actually supported vectorized load-store with per-component robustness at scalar alignment …

The second style is StructuredBuffer which represents a buffer as a T data[] array. One deeply frustrating aspect of this is that data structures are tightly packed, meaning T could be float3, and we’d get 12 byte stride. StructuredBuffers must be bound in terms of elements, not byte offsets, so it’s perfectly valid to bind a structured buffer at 12 byte offset. This cannot be implemented safely on GPUs which report 16 byte SSBO alignment sadly. Due to all this, it’s almost impossible to cleanly declare a vec3 data[]; inside the SPIR-V, even if we have scalar block layout support. Just like ByteAddressBuffer, we kinda have to unroll loads and stores with uint data[] again, shame … We have gone to great lengths to support an “offset” buffer, which means we can bind an SSBO at 16 byte alignment, and nudge the offset inside the shader to fix things up, but it is horrible.

There are some mind-melting optimizations in flight for dxil-spirv which can vectorize these things, but it’s not a perfect solution. Weird robustness and alignment rules force us to do some pretty horrible things, and we only have a few escape hatches we can exploit; dxil-spirv will try hard to squeeze every last legal vectorization opportunity out of the existing rules.

For example, StructuredBuffers have a rule saying that robustness only needs to be checked once, based on the structure element that is being accessed. For dynamic access into the element itself, it is undefined behavior to straddle the element boundary. Think StructuredBuffer<float4> where you do a dynamic access into the components, and index overflows.

I’m wondering if hardware actually has a special descriptor type they use to implement structured buffers. Perhaps the same ones which are used for vertex attribute fetch? And yes, even AMD uses tbuffer with stride pulled from descriptors, so saying it’s fully programmable fetch is a bit misleading I think.

Texel buffers are not a valid solution either. They can be bound at scalar alignment, which is nice, but we cannot safely use vectorized load-store either. Texel buffers can only be read and written in full. No write masks allowed. Another problem is the mix and match of 16-bit load-stores, which completely broke any hope of using texel buffers. In Vulkan, we can redeclare an SSBO multiple times with different data layouts, which is basically a human readable version of ByteAddressBuffer.

It’s a bit frustrating to deal with these problems, because almost no sane application is going to run into the edge cases, or need all the esoteric robustness features, but we have to make every possible edge case work. We cannot take shortcuts. To demonstrate the insanity we have to go through on certain GPUs …

StructuredBuffer<float3> B;

float3 main(uint index : INDEX) : SV_Target
{
	return B[index];
}

Since this can be bound at offset 12, we might see this horrible code. The offset buffer is associated with the descriptor heap, and we allocate a separate side channel buffer for this purpose.

layout(set = 1, binding = 0, std430) restrict readonly buffer SSBO_Offsets
{
    uvec2 _m0[];
} _13;

layout(set = 1, binding = 1, std430) restrict readonly buffer SSBO
{
    uint _m0[];
} _18[];

layout(push_constant, std430) uniform RootConstants
{
   // root parameters
} registers;

layout(location = 0) flat in uint INDEX;
layout(location = 0) out vec3 SV_Target;

void main()
{
    // Load extra offset + actual range. We cannot bind the SSBO tightly enough
    // on some GPUs.
    // Since the access is not nonuniform, we can give a strong hint to compiler
    // that the offsets can be loaded as a broadcast.
    uvec2 _37 = _13._m0[subgroupBroadcastFirst(registers._m1)] >> uvec2(2u);
    uint _41 = INDEX * 3u;
    // Trip explicit OOB if we fail.
    uint _47 = (_41 < _37.y) ? (_41 + _37.x) : 1073741820u;
    // Unroll 3 loads. ;_;
    vec3 _60 = uintBitsToFloat(uvec3(_18[registers._m1]._m0[_47], _18[registers._m1]._m0[_47 + 1u], _18[registers._m1]._m0[_47 + 2u]));
    SV_Target.x = _60.x;
    SV_Target.y = _60.y;
    SV_Target.z = _60.z;
}

The horrible workarounds

We have a fair amount of issues to deal with. A lot of invalid API behavior in D3D12 happens to work on native drivers, and many games ship with subtle bugs which force us to do horrible things. IHVs are going to have a fun time in 5 years when hardware details change 🙂

What is a raw buffer anyways?

A surprisingly common bug is that D3D12 does not complain if you create a typed buffer when you actually intended to create a raw buffer (ByteAddressBuffer), and vice versa. The difference is just one flag away and is easily missed. For us, that means SSBO vs texel buffer, a catastrophic error, but apparently, D3D12 does not care and drivers will happily consume typed buffers as raw and vice versa. I have no idea how this works, but somehow it does (I wrote tests)! Either way, it got bad enough that when emitting buffer descriptors, we have to emit both an SSBO variant and a texel buffer variant, and then the shader can pick the correct descriptor it wants. Of course, this being bindless and decoupled, we cannot pick the correct descriptor at draw time, sigh … Sometimes I think bindless all the things is going to be considered a huge mistake in 10 years from a software ecosystem point of view.

What is a NULL descriptor anyways?

A reasonably common bug is the NULL descriptor with wrong type. Technically in D3D12, NULL descriptors are still typed and you have to pass in a valid resource desc when creating one, but mistakes happen all the time. Using the wrong type works perfectly fine on existing hardware in almost all cases (I wrote a test!). For this reason, vkd3d-proton takes the conservative approach, and writing a NULL descriptor means splatting out NULL descriptors to all the different types of descriptors we have in the heap. For MUTABLE, this means two sets, for non-mutable, six! Voila, various GPU hangs disappeared, just like that.

VOLATILE is really bad for validation

Having simple validation is critical for development, and I think the real hidden mistake of VOLATILE being the default is that only GPU-assisted validation can catch any bugs! GPU validation is very heavy and usually avoided when possible. I feel a ton of these bugs could have been caught if descriptors could have been validated on the CPU timeline. As long as an invalid descriptor is not dynamically accessed by a shader, it’s all OK! Dynamically accessed is the critical wording here. With this requirement, we can only validate at submit time if we can prove ahead of time that a shader will generate invocations which then statically access an invalid descriptor. Good luck proving that fragment shader invocations happen 100% of the time! Don’t get me wrong, bindless is very powerful and great for the use cases it enables, but derping everything into bindless doesn’t seem like the correct approach to me.

Local root signatures – one great kludge to rule them all

Local root signatures add yet another layer of hell to the binding model. It was annoying enough to implement a dozen different ways to access resources, and now we have to do it all over again, but this time, replacing push constants with SBTs + some additional edge cases that are specific to local root signatures!

Local root signatures are used specifically for the shader record tables in ray tracing pipelines. In Vulkan, this data is accessed directly as a buffer, but D3D12 went the way of describing a view instead, where each parameter is laid out in memory one after the other. As usual, table pointers are a thing, but this time, they consume 2 DWORDs instead of 1. Why on earth were table pointers described in terms of GPU VA and not just a simple u32 “offset into heap”? Sometimes I find D3D12’s obsession with GPU VAs quite bizarre and counter-productive. It doesn’t even have buffer device address support, yet we’re using VAs everywhere, even when it doesn’t make much sense at all. In vkd3d-proton, we had to do the janky thing of returning artificial GPU VAs for descriptor heaps so that it was fast to convert the VA into an offset.

While root descriptors in global root signatures can be implemented as descriptors, here we are forced to use buffer device addresses, since the VA is literally sourced from buffer memory. The fun thing is that you can even place RT acceleration structures here as a literal VA! And this is how OpConvertUToAccelerationStructureKHR was born … 🙂

Local root signatures also add the headache of supporting immutable samplers! This time, we have to defer the creation of a second immutable sampler descriptor set to RTPSO creation, which is super annoying. At bind time, we therefore might run the risk of having to bind one immutable sampler set for the global root signature, then one which is tied to the union of all local root signatures for the RTPSO, so much fun! >_<

Implementing SM 6.6 bindless

SM 6.6 is actually straightforward. Our model in vkd3d-proton was always to map pre-6.6 style bindings to a SM 6.6 style “access the heap directly” model, so this translation is actually not that bad at all. The main annoyance is how there are now 5 (!) opcodes in DXIL to deal with resource handle creation, each with their own idiosyncrasies. I’m pretty sure all of this could have been squeezed into 2 …

  • CreateHandle
  • CreateHandleForLib
  • CreateHandleFromHeap
  • CreateHandleFromBinding
  • AnnotateHandle

Fun times indeed …

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	Texture2D<float4> T = ResourceDescriptorHeap[index];
	return T.Load(int3(uv, 0));
}

#version 460
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_samplerless_texture_functions : require

layout(set = 0, binding = 0) uniform texture2D _9[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in vec2 UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _32 = texelFetch(_9[INDEX], ivec2(uvec2(uint(int(UV.x)), uint(int(UV.y)))), int(0u));
    SV_Target.x = _32.x;
    SV_Target.y = _32.y;
    SV_Target.z = _32.z;
    SV_Target.w = _32.w;
}

Nice. I recently completed SM 6.6 support for vkd3d-proton, have fun with that!

How this maps to AMD GPUs (teaser for next post)

If anything, it feels like the D3D12 binding model is tailored pretty well to AMD hardware. It probably makes sense given the Mantle heritage. The binding model on GCN is extremely neat and tidy and was designed with full bindless in mind since the early days, quite forward looking, indeed. Descriptors are just memory that is loaded into scalar registers and passed directly to various operations which take descriptors. Very little magic is going on here, and once you understand how this works, the modern API binding models start making some sense. Root parameters and push constants map nicely to user SGPRs, and as long as we don’t use too many u32s, we have a very direct mental model of how this is going to map to hardware.

Where D3D12 suffers I think is on hardware where there are still remnants of legacy descriptor slots in hardware. We cannot take advantage of this due to poor design decisions made in D3D12, but it is what it is. There is only so much we can do as a translation layer.

We’ll explore this topic in a future post …

Conclusion

It has taken many, many months, if not over a year to get where we are today in vkd3d-proton. Handling all of this insanity pushed us to the breaking point, but somehow it works pretty well. There are no shortcuts we can take.

Despite the mammoth length of this post, I don’t think we’re quite done with the subject of binding models. I’ll probably need another post to cover codegen examples for AMD ISA as well as an overview of the million ways we can convert the same DXIL code to different SPIR-V depending on the root signature.

My personal hell of translating DXIL to SPIR-V – part 2

In the previous blog post, I began a long form exploration of the DXIL shader format, and how it translates to SPIR-V. In this part, we’ll look more closely at the LLVM format, how it is parsed, and how to interpret the parsed result.

The LLVM IR binary format is mostly undocumented. Very early on we have to dig through the source to understand what is going on. LLVM IR was never intended to be used as a “standard” format that is shipped between different software stacks. It’s clearly an ad-hoc serialization format that serves the purpose of LLVM internals. The IR format is backwards compatible at the very least, which is why we can parse DXIL LLVM 3.7 modules with modern LLVM versions.

As we’ll see, LLVM IR is very complex to parse compared to SPIR-V. There are some interesting similarities however, as SPIR-V shares some DNA with LLVM.

Layered architecture

LLVM IR is parsed in multiple layers. At the lowest level is a compression scheme which feels somewhat like LZ compression. The bit-stream teaches the decoder how to decode the stream by emitting “templates” (or “code book entries”), and these templates can then be instantiated to form complete records.

The low-level bit-stream parser

The initial part of the LLVM IR puzzle is pulled from RenderDoc’s implementation. The basic gist of it is documented here. To summarize however, the idea is that a module consists of one top-level “block”. A block is a structure of blocks and records. A record has an ID with an array of uint64_t operands (quite similar to SPIR-V, except SPIR-V uses an array of 32-bit operands).

Storing full uint64_t operands is of course very wasteful, and this is where the primitive types of LLVM IR come in. We can express primitive types compactly with:

  • Variable length integers (configurable chunk size)
  • Fixed width integers (configurable bit width)
  • 6-bit chars (useful for C-style identifiers, i.e. a-z, A-Z, 0-9 and _)

Variable length integers are encoded in a scheme where we look at N bits at a time, N – 1 bits contain useful data, and the MSB marks whether to keep looking at N more bits.
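
A minimal sketch of that decode loop, assuming a read_bits(n) helper that pulls n bits from the stream (the helper name is made up):

#include <cstdint>

// Reads a VBR-encoded integer with chunk size chunk_bits.
// Each chunk holds chunk_bits - 1 payload bits; the top bit of the chunk
// says "another chunk follows".
template <typename ReadBits>
static uint64_t read_vbr(ReadBits &&read_bits, unsigned chunk_bits)
{
    uint64_t result = 0;
    unsigned shift = 0;
    for (;;)
    {
        uint64_t chunk = read_bits(chunk_bits);
        result |= (chunk & ((1ull << (chunk_bits - 1)) - 1ull)) << shift;
        if (!(chunk & (1ull << (chunk_bits - 1))))
            break;
        shift += chunk_bits - 1;
    }
    return result;
}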

Blocks and records are invoked in an esoteric way, which is where abbreviations come in. When we parse, we’re parsing abbreviations one at a time, each of which results in some action:

  • 0 – END_BLOCK – Ends block scope
  • 1 – ENTER_SUBBLOCK – Begins a new scope, can nest arbitrarily
  • 2 – DEFINE_ABBREV – Defines a template for how to build new records. For example, we can specify that a record is [vbr4, char6, literal constant] or something. That abbreviation implicitly gets a new ID, starting with 4, which can be invoked when parsing new abbreviations. When decoding this abbreviation, the parser knows ahead of time how to decode the bits into arguments. For char6 strings in particular, it’s also possible to specify an array abbreviation.
  • 3 – UNABBREV_RECORD – YOLO mode, directly decodes a record with a bunch of variable length integers, fairly inefficient. DXC seems to love to emit these 🙂
  • 4+ – Invoke user abbreviations.

In typical LLVM IR fashion, the number of bits used to encode the abbreviation ID is variable. It starts at 2 bits (since there are no user abbreviations to worry about yet), but can grow as needed. Fun!

The details are not super interesting for this post, but suffice to say, there’s a decent amount of detail that goes into parsing this.

The calling code ends up looking something like this:

LLVMBC::BitcodeReader reader(static_cast<const uint8_t *>(data), size);
LLVMBC::BlockOrRecord toplevel = reader.ReadToplevelBlock();

// The top-level block must be MODULE_BLOCK.
if (KnownBlocks(toplevel.id) != KnownBlocks::MODULE_BLOCK)
   return nullptr;

// We should have consumed all bits, only one top-level block.
if (!reader.AtEndOfStream())
   return nullptr;

The BlockOrRecord struct is fairly straightforward, simplified here as:

struct BlockOrRecord
{
  uint32_t id; // What kind of record or block is this?
  Type type; // block or record
  dxil_spv::Vector<BlockOrRecord> children; // If block
  dxil_spv::Vector<uint64_t> ops; // If record
};

Higher level parser

Now we’re at a level where we have recovered structure from the bit-stream, and we need to turn the BlockOrRecord structs into actual API objects: llvm::Module, llvm::Function, llvm::Value, etc, etc … dxil-spirv implements an LLVM C++ API drop-in replacement to be able to cross-reference our implementation against the reference implementation at any time (which has saved me many times). The implementation only implements exactly what is needed for DXIL however, so don’t expect too much of it. 🙂

Refer to objects by ID

Very similar to SPIR-V, types and values are referred to by a uint64_t ID. The annoying part however is that types and values implicitly allocate their own IDs, meaning that forgetting to parse something can be fatal. On top of this, IDs may refer to other IDs either through deltas relative to their own position or through absolute values. It is somewhat context sensitive which one to use, which gets quite annoying to deal with.

Decoding llvm::Type

LLVM IR has a type hierarchy similar to SPIR-V. You start by declaring fundamental types like ints and floats, and then upgrade them to vectors, arrays, pointers or structs. While parsing, the top-level block can contain TYPE blocks, which contain a bunch of records.

for (auto &child : toplevel.children)
{
   if (child.IsBlock())
   {
      switch (KnownBlocks(child.id))
      {
      case KnownBlocks::TYPE_BLOCK:
         for (auto &entry : child.children)
             parse_type(entry);
         break;
      }
   }
}

bool ModuleParseContext::parse_type(const BlockOrRecord &child)
{
   Type *type = nullptr;
   switch (TypeRecord(child.id))
   {
   case TypeRecord::VOID_TYPE:
   case TypeRecord::HALF:
   case TypeRecord::INTEGER:
   case TypeRecord::POINTER:
   case TypeRecord::ARRAY:
   case TypeRecord::FUNCTION:
   // you get the idea
   }
   types.push_back(type);
}

Integers deserve special mention, because they are somewhat whacky in LLVM. First, they have no signedness associated with them. This kinda makes sense, since signedness only actually matters in certain opcodes, like signed min/max, signed compare, arithmetic vs logical right shift, signed vs unsigned float <-> int conversion, etc. SPIR-V maintains a sign for its integer type, but we can ignore it in most scenarios. (There is an esoteric exception to this however where DXIL kinda breaks down, once we dig into relaxed precision signed integers!) Another annoying exception we have to deal with all the time is stage IO and resource variables, which are explicitly signed or unsigned in DXIL.

As the grizzled C programmer will know, signed overflow is undefined, but unsigned overflow is not. Does LLVM just not care? Well, it does. LLVM can mark operations as being “no signed wrap”, or “no unsigned wrap” for optimization purposes, but we don’t have to care about those at all fortunately.

Booleans are expressed as 1-bit integers, which kind of makes sense, but at the same time feels like a very LLVM thing to do … Logical operations reduce to simple arithmetic operations on 1-bit values instead.

The final whack part is that you can declare non-POT integer sizes. There are shaders in the wild which declare 11-bit integers and rely on wrapping on these values to work! (dear lord … <_<) I even tried to compile this to x86_64 and yes, it does actually deal with it correctly. I’m kind of amazed, and scared at the same time.
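
To make the wrapping requirement concrete, this is the behavior such a shader relies on. Just an illustration of the semantics; how dxil-spirv actually lowers odd bit widths may differ:

// An 11-bit add with wrap-around, emulated in 32-bit arithmetic.
// In SPIR-V terms this would be an OpIAdd followed by masking with 0x7ff.
uint32_t add_i11(uint32_t a, uint32_t b)
{
   return (a + b) & 0x7ffu;
}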

Overall though, type declaration in LLVM IR is pretty easy to understand if you understand SPIR-V.

Decoding constants

Similar to types, constants are records within a block. They can appear at function scope or global scope.

bool ModuleParseContext::parse_constants_record(
    const BlockOrRecord &entry)
{
    llvm::Constant *value = nullptr;
    switch (ConstantsRecord(entry.id))
    {
        case ConstantsRecord::SETTYPE:
        case ConstantsRecord::CONST_NULL:
        case ConstantsRecord::UNDEF:
        case ConstantsRecord::INTEGER:
        // ...
    }
    values.push_back(value);
}

Roughly speaking, this looks very similar, with some quirks. SETTYPE informs subsequent constant records which type is actually used. CONST_NULL is an example of a fully context sensitive constant, similar to OpConstantNull in SPIR-V.

INTEGERs are converted through sign rotations. Since small negative numbers would be horribly inefficient to encode with VBR otherwise, the first bit is the sign bit, encoded in a sign magnitude scheme. -0 is interpreted as INT64_MIN.
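
The decode itself is tiny. This sketch mirrors LLVM's decodeSignRotatedValue(); names are illustrative:

static int64_t decode_sign_rotated(uint64_t v)
{
   if ((v & 1) == 0)
      return int64_t(v >> 1);   // Positive: magnitude in the upper bits.
   if (v != 1)
      return -int64_t(v >> 1);  // Negative: low bit is the sign.
   return INT64_MIN;            // "-0" is reused to mean INT64_MIN.
}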

Where LLVM constants get disgusting however, is the pseudo-specialization constant operation support. It is possible to encode a constant cast operation, or constant access chain into a global object (wtf?) this way. I don’t understand the motivation behind this, but there are lots of super weird edge cases here that took some time to iron out.

AGGREGATE is the first time we start to see how value IDs are referenced.

Vector<Value *> constants;
constants.reserve(entry.ops.size());

for (auto &op : entry.ops)
{
   constants.push_back(get_value(op, element_type,
      /* force absolute IDs */ true));
}

Value *value;
// Ah, yes. Why have VECTOR and ARRAY types when you can
// have a context sensitive one instead.
if (current_constant_type_is_vector)
{
   value = context->construct<ConstantDataVector>(
      get_constant_type(), std::move(constants));
}
else
{
   value = context->construct<ConstantDataArray>(
      get_constant_type(), std::move(constants));
}

get_value() is quite sneaky. In LLVM IR, it is valid to forward reference an ID, as long as the type is known. This leads to ProxyValue objects being created, which are resolved later. get_value() can be relatively indexed, or absolutely indexed depending on the context, which is always fun.
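
Once parsing has caught up, the proxies can be resolved against the final value list. A rough sketch with assumed names (the real dxil-spirv code will differ):

void ModuleParseContext::resolve_forward_references()
{
   for (auto *proxy : pending_forward_references)
   {
      // The proxy remembers the absolute ID it was created for.
      proxy->resolve(values[proxy->get_id()]);
   }
   pending_forward_references.clear();
}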

Global variables

Global variables are declared in top-level records. Typically these are only used for groupshared variables; in DXIL, a special pointer address space is reserved for this purpose. Global variables can also be used for global look-up tables, and they can have optional initializers. This is very similar to SPIR-V overall: the equivalent is an OpVariable with either Workgroup or Private storage class.

Resource handles are declared in a completely different way … unless we’re DXR (more on that later, sigh v_v …)

Function prototypes

We also get to declare function prototypes at this stage. Some functions only have prototypes, and the common case here is various prototypes which declare dx.op intrinsic functions. If a prototype is declared to also have a body, we place that in a queue.

We also have to parse parameter attribute lists (surprisingly tricky!), just in case the function declares LLVM attributes. The only case we have to care about here is FP32 denorm handling. Why that isn’t a metadata entry, I’ll never know. DXIL really likes splitting its implementation across two completely different systems for no good reason …

Parsing functions

A function body is a block, consisting of records (which express normal opcodes), and other blocks (e.g. constant blocks). The first record we’ll typically see is the DECLAREBLOCKS one, which specifies the number of basic blocks in the function.
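
A sketch of how that record can be handled (record and helper names are assumptions, following the same pattern as the other parsing snippets):

case FunctionRecord::DECLAREBLOCKS:
{
   if (entry.ops.empty() || entry.ops[0] == 0)
      return false;
   // Allocate all basic blocks up front; instructions are then
   // appended to them in order as we parse further records.
   for (uint64_t i = 0; i < entry.ops[0]; i++)
      basic_blocks.push_back(context->construct<BasicBlock>());
   break;
}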

Basic blocks

A basic block is a fundamental building block of SSA-based IRs. A basic block enters execution at its first instruction, and executes in a straight line fashion until a terminator instruction executes. A terminator instruction is anything which transfers control: a direct branch, conditional branch, switch statement, return, etc. If you know SPIR-V, this is nothing new. It’s the exact same concept.

Unlike SPIR-V, where we have explicit OpLabel opcodes to begin a new block, LLVM makes this implicit. When we observe a terminator, the next instruction will be added to the next basic block.
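
In other words, add_instruction() can look something like this sketch (member names assumed), where a terminator simply closes the current block:

bool ModuleParseContext::add_instruction(Instruction *inst)
{
   if (!current_bb)
   {
      if (basic_block_index >= basic_blocks.size())
         return false;
      current_bb = basic_blocks[basic_block_index++];
   }

   current_bb->add_instruction(inst);
   if (inst->isTerminator())
      current_bb = nullptr; // Next instruction implicitly opens the next block.
   return true;
}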

Context sensitive parsing

The opcodes in the IR match closely to the type hierarchy of LLVM. Let’s look at parsing llvm::BinaryOperator. A binary operation is any c = op(a, b) kind of instruction; it’s not necessarily just and/or/xor, etc. This is a catch-all for FAdd, IMul, IAdd, Xor, etc.

case FunctionRecord::INST_BINOP:
{
   unsigned index = 0;
   auto lhs = get_value_and_type(entry.ops, index);
   if (!lhs.first)
      return false;
   auto *rhs = get_value(entry.ops, index, lhs.second);
   if (!lhs.first || !rhs)
      return false;
   if (index == entry.ops.size())
      return false;
   auto op = BinOp(entry.ops[index++]);
   auto *value = context->construct<BinaryOperator>(lhs.first, rhs, translate_binop(op, lhs.second));
   if (index < entry.ops.size())
   {
      // Only relevant for FP math,
      // but we only look at fast math state for
      // FP operations anyways.
      auto fast_math_flags = entry.ops[index];
      bool fast = (fast_math_flags &
          (FAST_MATH_UNSAFE_ALGEBRA_BIT |
           FAST_MATH_ALLOW_CONTRACT_BIT)) != 0;
      value->set_fast_math(fast);
   }
   if (!add_instruction(value))
      return false;
   break;
}

In SPIR-V a binary operation would be encoded as:

%id = OpMyBinOp %type %operand_a %operand_b

Very explicit and understandable. LLVM IR on the other hand is more clever, for better or worse.

First, the result %id is implicit, and is allocated linearly as new opcodes come in. The type of an instruction is context sensitive. We parse %operand_a first: if we have seen this ID already, %type is deduced directly from the operand. If it is a forward reference, the type of %operand_a is encoded explicitly in the record.

IDs in most opcodes are encoded with a relative scheme. Since SSA requires that the declaration of an ID dominates all its uses, the common case is that uses of an ID come after its declaration, so this is a decent compression scheme. The implementation of get_value_and_type() is something like:

std::pair<Value *, Type *> ModuleParseContext::get_value_and_type(
   const Vector<uint64_t> &ops, unsigned &index)
{
   if (index >= ops.size())
      return {};

   uint64_t op = ops[index++];
   // Context sensitive, for backwards compat mostly, but
   // modules can choose to use absolute or relative encoding.
   if (use_relative_id)
      op = uint32_t(values.size() - op);

   if (op < values.size())
   {
      // Normal reference.
      return { values[op], values[op]->getType() };
   }
   else
   {
      // Forward reference, the type is encoded in the next element.
      if (index >= ops.size())
         return {};

      auto *type = get_type(ops[index++]);
      auto *proxy = context->construct<ValueProxy>(type, *this, op);
      pending_forward_references.push_back(proxy);
      return { proxy, type };
   }
}

I had tons of bugs where I didn’t handle the possible forward references. Very awkward. I was of the impression that only PHI instructions could possibly have forward references, but of course, it’s never that simple.

Speaking of PHI, get_value() changes here to a signed-aware variant, where the relative value ID is encoded with sign rotation, just like for INTEGER constants. This is because we expect that PHI inputs are forward referenced just as often as they are backward referenced.
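
A sketch of that variant, reusing the sign rotation decode sketched earlier (names are illustrative):

// PHI incoming values: the operand is sign-rotated, and the decoded
// delta is applied relative to the current value count, so both
// backward and forward references stay small.
uint64_t raw = entry.ops[index++];
int64_t delta = decode_sign_rotated(raw);
uint64_t value_id = uint64_t(int64_t(values.size()) - delta);
// value_id then goes through the same proxy machinery as get_value().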

Overall, it’s just a grind to implement all relevant opcodes. DXIL only uses a subset, but it’s not well documented which subset of LLVM IR is actually used. We just have to implement new stuff as it comes in. DXIL.rst has a list of which LLVM instructions are supported, but this list cannot be trusted: in DXR, DXC emits various vector instructions (so much for being a scalar IR format) as well as the unreachable terminator, which is missing from the table.

Metadata

Metadata lives in its own block hierarchy and has a completely different set of types, llvm::MDNode, llvm::MDOperand, llvm::ConstantAsMetadata, etc.

At the top of the hierarchy we can declare NamedMDNodes, which we see in the LLVM assembly as:

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.entryPoints = !{!7}

NamedMDNodes contain a list of MDNodes, which nest into more MDNodes, or terminate in constant values. These correspond to NAMED_NODE, NODE and VALUE record types; not too many surprises here.

Emitting SPIR-V opcodes

After we get through the parsing step, LLVM IR and SPIR-V have many similarities, and translating opcodes isn’t particularly difficult. For each LLVM basic block, we emit a basic block in SPIR-V and translate the opcodes one by one. We preserve the SSA nature as-is. There are of course a lot of details here, but they’re not very interesting, and there are too many to enumerate. The two biggest problems we need to focus on are:

Resource access

Accessing textures, constant buffers, structured buffers and the weird and wonderful zoo of resource types is insanely intricate, and I’ll need to dedicate an entire blog post to this.

Control flow structurization

Another massive issue is the control flow. In LLVM, there is no structurization information whatsoever, and we’ll have to reconstruct this somehow. This is at least one more blog post. After emitting SPIR-V code into basic blocks, we need to rewrite the control flow and annotate the basic blocks with merge information, which then allows us to emit a final SPIR-V module, ready for driver consumption.

Conclusion

This was a rough overview of LLVM IR and how it is parsed from scratch. I think it’s safe to say it’s far more difficult to parse than SPIR-V, which is literally just a stream of [N x uint32_t] opcodes. Parsing IR is not the most exciting part of DXIL to SPIR-V conversion, but it had to be done. On the other hand, it might be useful starting knowledge for other projects.

For next post, we’ll look at how to translate the D3D12 binding model into Vulkan.

My personal hell of translating DXIL to SPIR-V – part 1

In vkd3d-proton, we’re translating D3D12 to Vulkan, and the single biggest piece I’ve contributed so far is translating shader model 6 to Vulkan. We already had DXBC support, but what is a graphics API without two completely incompatible IR formats, right? Getting working DXIL support in vkd3d-proton was the biggest target when I began working on vkd3d at the time. The result of this work is a standalone library and tool, dxil-spirv.

Introduction

In this blog series I’d like to go through the DXIL format, explain the problems I’ve had to solve and hopefully serve as an introduction to basic compiler theory, explained by yours truly who has no idea what they’re talking about. What could go wrong!?

I never learned any of this formally, but through working on SPIRV-Cross and dxil-spirv, I just had to learn this by trying and failing. Most of this theory is locked behind academic and computer science jargon, because it is a domain where rigor is actually necessary, and the algorithms for most things have been well known since the 70s (60s?). This isn’t graphics programming where we can be “close enough” and hand wave problems away. The smallest inaccuracy will break anything you come up with. Mix this with recursive tree traversal algorithms and all hell breaks loose when you least expect it. I wonder where compiler engineers pick this stuff up. Is there a secret club I’m not invited to? :p

DXIL has been, and continues to be, extremely painful to translate correctly, with new edge cases popping up every week it seems. The main reasons boil down to a few core problems, which this blog series intends to address over its course.

Rewrite goto soup to strict structured control flow

Of the three core problems, unstructured control flow, aka goto soup, is the really difficult one, and solving it is one of the most painful problems I’ve ever worked on. When I think I’ve solved it, the next AAA game we see finds a new way to break it. At the very least, it seems to get a little easier with every iteration, so we’re getting asymptotically correct!

It’s very likely that the existing solution in dxil-spirv is just garbage, and it needs to be rewritten from scratch at some point, but at least the current implementation can play some vidya, so, eh.

LLVM 3.7 IR

LLVM is another problem, since we need to consume a barely documented IR format which was never designed to be consumed as a standard format to begin with. The official DXC compiler is literally a fork of LLVM, forever frozen at the 3.7 version.

It is of course possible (and intended) to use the LLVM library directly, parse the IR that way, and then iterate over the llvm::Module to create SPIR-V, but this requires shipping yet another vendored LLVM copy, and we all love those don’t we … It is just not acceptable to ship that from a practical point of view. It clocks in at a nice 40 MB+ blob, compared to the 2 MB d3d12.dll binary vkd3d-proton compiles to.

The first iteration of dxil-spirv did indeed target LLVM APIs directly for practical reasons. Not having to spend months figuring out the arcane bytecode format certainly helped! Later, I had to make an LLVM API replacement – with some help from RenderDoc’s low-level parser to get started – which does exactly what is required to parse DXIL. Overall, it has been very helpful to keep a native LLVM port alive through API replication, since I can always cross check the parsing results against the real deal with -DDXIL_SPIRV_NATIVE_LLVM=ON. It worked out very well in the end!

Virtualized resource hell

HLSL, and by extension DXIL, does not concern itself one bit with how resources are actually accessed by the implementation. Given similar code, there are about 8 ways to codegen it depending on context obtained from the global or local root signatures. Fun! On top of this, we need vendor-specific hackery for perf and feature workarounds.

In this post

As this post is just an introduction to the format, no difficult problems are presented yet. 🙂

Anatomy of a shader blob

In our journey we begin at the output from DXC, a raw shader binary.

// $ dxc -Tps_6_0 -Fo test.dxil test.frag
float4 main(float4 a : A) : SV_Target
{
    return a;
}

When creating a PSO in D3D12, you hand the API this blob of data. This blob is basically the same as DXBC. The only real difference is the DXIL tag. Based on this, we know how to dispatch compilation in vkd3d-shader.

00000000 44 58 42 43 00 00 00 00 00 00 00 00 00 00 00 00 |DXBC............|
.......
00000140 b0 04 00 00 60 00 00 00 2c 01 00 00 44 58 49 4c |....`...,...DXIL|
....
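
As I understand the container layout (the authoritative definition lives in DXC's DxilContainer.h), detecting DXIL boils down to walking the chunk table and looking for the 'DXIL' fourcc. A sketch, not vkd3d-shader's actual code:

#include <cstdint>
#include <cstring>

static bool blob_contains_dxil(const uint8_t *data, size_t size)
{
   // DXBC container header: fourcc (4) + hash (16) + version (4) +
   // total size (4) + part count (4) = 32 bytes, then a part offset table.
   if (size < 32 || memcmp(data, "DXBC", 4) != 0)
      return false;

   uint32_t part_count;
   memcpy(&part_count, data + 28, sizeof(part_count));

   for (uint32_t i = 0; i < part_count; i++)
   {
      size_t table_entry = 32 + 4 * size_t(i);
      if (table_entry + 4 > size)
         return false;

      uint32_t offset;
      memcpy(&offset, data + table_entry, sizeof(offset));
      // Each part starts with its own fourcc + size.
      if (offset + 8 <= size && memcmp(data + offset, "DXIL", 4) == 0)
         return true;
   }
   return false;
}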

For development, it is then useful to extract the raw LLVM IR part. We can do this with dxil-extract from dxil-spirv:

$ dxil-extract --output test.bc test.dxil # Extract raw LLVM bytecode
$ llvm-dis test.bc # Disassemble to LLVM assembly
$ cat test.ll
; ModuleID = 'test.bc'
source_filename = "test.bc"
target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

define void @main() {
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%2 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 2, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 3, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %2)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %3)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %4)
ret void
}

; Function Attrs: nounwind readnone
declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0

; Function Attrs: nounwind
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1

attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.viewIdState = !{!4}
!dx.entryPoints = !{!5}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{[6 x i32] [i32 4, i32 4, i32 1, i32 2, i32 4, i32 8]}
!5 = !{void ()* @main, !"main", !6, null, null}
!6 = !{!7, !11, null}
!7 = !{!8}
!8 = !{i32 0, !"A", i8 9, i8 0, !9, i8 2, i32 1, i8 4, i32 0, i8 0, !10}
!9 = !{i32 0}
!10 = !{i32 3, i32 15}
!11 = !{!12}
!12 = !{i32 0, !"SV_Target", i8 9, i8 16, !9, i8 0, i32 1, i8 4, i32 0, i8 0, !10}

Convenient! This is the same stuff we would see from dxc if we don’t write the output to a file. The next question is what all of this means. The first entry point into understanding what is going on is DXIL.rst. Unfortunately, this documentation is only good for the bring-up phase of DXIL compilation. The documentation just … stops eventually. At that point we’re on our own, but it’s not like it’s that hard to figure out what is going on. We have the source after all, and we can correlate test HLSL with the output.

What is DXIL, really?

At a fundamental level, DXIL is just LLVM 3.7 with some sugar on top. The sugar adds whatever magic normal LLVM code cannot express, using the generic extension mechanisms LLVM provides. The solutions feel very clunky at times, but it’s clearly a compromise to shoehorn a shading language into something as generic as LLVM.

Intrinsics

In this simple case, we’re starting to see code like:

declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)

If we cannot express features in raw LLVM, we use dx.op intrinsics, declared with external linkage. The first argument is a constant int, which encodes the actual opcode to use. The loadInput or storeOutput names are not significant here. There are over 200 opcodes like this, which is not surprising given how much junk we’ve accumulated in shading languages over the years. Some of the opcodes are documented, some are not. *shrug* 🙂
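
Which means translation typically keys off that first constant argument rather than the function name. A hedged sketch (enum names are illustrative, not the actual dxil-spirv code):

// call is an llvm::CallInst calling some dx.op.* declaration.
auto opcode = DXIL::OpCode(
   llvm::cast<llvm::ConstantInt>(call->getArgOperand(0))->getZExtValue());

switch (opcode)
{
case DXIL::OpCode::LoadInput:   /* emit stage input load */   break;
case DXIL::OpCode::StoreOutput: /* emit stage output store */ break;
// ... a couple hundred more of these.
default: break;
}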

Scalar code

In SPIR-V, there is explicit support for vectors and matrices, but DXIL is flat and scalar (except when it isn’t, of course). With modern GPU architectures, 32-bit arithmetic is scalar anyways, but 16-bit and 8-bit arithmetic ops are packed operations on all modern GPUs I know of, so it feels a bit weird to go full scalar. I guess backend compilers just need to suck it up and learn the dark art of re-vectorization. We also end up with quite bloated code, since any vector operation has to be unrolled. On the upside, there are fewer cases to implement and test in dxil-spirv. As an example, we have a shader:

Texture2D<float4> T;
SamplerState S;

float4 main(float2 uv : TEXCOORD) : SV_Target
{
    return T.Sample(S, uv);
}

In places where we cannot ignore vectors, DXIL sometimes reaches for structs instead of the far more natural vector type.

// The resource API is quite peculiar, for later ...
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
%2 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 3, i32 0, i32 0, i1 false)

// Load UV, one component at a time
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)

// Sample
%5 = call %dx.types.ResRet.f32 @dx.op.sample.f32(i32 60, %dx.types.Handle %1, %dx.types.Handle %2, float %3, float %4, float undef, float undef, i32 0, i32 0, i32 undef, float undef)

// Extract components one by one ...
%6 = extractvalue %dx.types.ResRet.f32 %5, 0
%7 = extractvalue %dx.types.ResRet.f32 %5, 1
%8 = extractvalue %dx.types.ResRet.f32 %5, 2
%9 = extractvalue %dx.types.ResRet.f32 %5, 3
// There is actually a 5th member! If that member is statically used
// the sample opcode is actually a sparse residency query.
// Isn't this fun? Welcome to my personal hell :3

// Store result one by one ...
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %6)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %7)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %8)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %9)

With a simple spot check, we get binary sizes:

  • DXC (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug): 2063 bytes
  • FXC (-Tps_5_0 test.frag -Qstrip_reflect -Qstrip_debug): 268 bytes
  • DXC SPIR-V (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug -spirv): 744 bytes

Fun … DXBC is very compact, at least it has that going for it.

Metadata

In SPIR-V, we have execution modes, decoration and various instructions to encode metadata. These instructions are instructions like any other opcode, but LLVM has a separate system that lives on the side, the metadata nodes. In the IR assembly, we see a spider web of weird data structures.

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.viewIdState = !{!10}
!dx.entryPoints = !{!11}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{!5, null, null, !8}
!5 = !{!6}
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}
!7 = !{i32 0, i32 9}
!8 = !{!9}
!9 = !{i32 0, %struct.SamplerState* undef, !"", i32 4, i32 3, i32 1, i32 0, null}
!10 = !{[4 x i32] [i32 2, i32 4, i32 15, i32 15]}
!11 = !{void ()* @main, !"main", !12, !4, null}
!12 = !{!13, !17, null}
!13 = !{!14}
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}
!15 = !{i32 0}
!16 = !{i32 3, i32 3}
!17 = !{!18}
!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}
!19 = !{i32 3, i32 15}

What does this even mean? These nodes encode the entry point(s), which resources exist, the stage IO variables, and everything like that. It is at least documented, just awkward to read.

!dx.entryPoints = !{!11}
!11 = !{
   void ()* @main, /* Pointer to function */
   !"main", /* Name of entry point. Not relevant before DXR. */
   !12 /* List of stage IO types */,
   !4 /* Resource lists */,
   null}

!12 = !{
   !13, /* Stage input */
   !17, /* Stage output */
   null /* Patch IO for tessellation */}

/* Encode things like semantic, register offsets, component offset, etc ... */
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}

!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}

!4 = !{
   !5, /* SRVs */
   null, null, /* UAV, CBV */
   !8 /* Samplers */}

/* Encode register space, register #t, component types, etc, etc ... */
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}

The instructions which interact with resources work very differently from SPIR-V. In SPIR-V, resources are represented as plain variables, but it is a bit more indirect in DXIL. For example:

%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)

Here we specify:

  • Refer to stage input by index (directly refer to metadata)
  • Refer to row (if the stage input is an array)
  • Refer to component (must be a compile time constant, handy!)

I kinda like this approach actually. It gives us a specific place to deal with stage IO translation, so we don’t have to analyze and track random memory instructions in LLVM, which would be kinda horrible. Builtin semantics especially can get pretty weird in translation, and being able to handle all the addressing logic in one go is a life saver. I think this is also done so that we avoid the problem of having to deal with vectors. (But again, as we shall see later, someone missed the memo when implementing DXR.)

Resources work similarly:

%1 = call %dx.types.Handle @dx.op.createHandle(i32 57,
   i8 0, /* SRV */
   i32 0, /* Index into metadata */
   i32 0, /* Array offset */
   i1 false) /* NonUniformResourceIndex? */

Again, we request a handle where we refer directly to the metadata. In this case, we request SRV with metadata index #0. This isn’t bad. (But again, as we shall see, there is a completely separate system for accessing resources in DXR … ._. Why …)

As a convenient feature, dxil-spirv can emit SPIR-V -> GLSL through SPIRV-Cross’s API, so we can see the resulting codegen, which in this case closely resembles a 1:1 instruction conversion.

#version 460

layout(set = 1, binding = 2) uniform texture2D _8;
layout(set = 3, binding = 4) uniform sampler _11;

layout(location = 0) in vec2 TEXCOORD;
layout(location = 0) out vec4 SV_Target;

void main()
{
  vec4 _31 = texture(sampler2D(_8, _11), vec2(TEXCOORD.x, TEXCOORD.y));
  // le sigh <_<
  SV_Target.x = _31.x;
  SV_Target.y = _31.y;
  SV_Target.z = _31.z;
  SV_Target.w = _31.w;
}

I’ve certainly seen much worse (DXBC cross compilation comes to mind … *shudder*), but this is not the true codegen that vkd3d-proton would use, since we need to transform this into fully bindless code. That’s for later.

Conclusion

This is a very basic view of how the format is put together. Next time, I’ll probably have a deeper look into the raw LLVM IR and how that is parsed. Better bring bleach for your eyes!

Compressed GPU texture formats – a review and compute shader decoders – part 3/3

This is a long overdue blog post. Last year, I began a blog series where I explored the weird and wonderful world of texture compression formats. Unfortunately, I didn’t have time and motivation to finish it at the time, but it’s time I get back to finish these posts.

Up until now, I’ve covered the major families: DXT (BC 1-5), ETC2 and BPTC (BC 6-7) in part 1 and part 2. Complexity has been steadily increasing, but now we’re finally going to tackle the end boss of texture compression, ASTC.

ASTC

ASTC, or Adaptive Scalable Texture Compression, is the result of a fever dream which attempts to answer the question of “how complex can we make a texture compression format and still trick IHVs into putting it in silicon”. Complexity in codec design generally translates to better quality / bit at the cost of more expensive decode/encode. As hardware designs evolve, we can afford more complex codecs. Finding a sweet spot is incredibly hard, and ASTC cranked it to 11, for better or worse.

It began life on the Mali Midgard GPU series way back in the day, but has since been adopted across the mobile GPU ecosystem by all relevant vendors in the Khronos sphere, at least for the core LDR profile.

Profiles you say, in my texture formats? What is this, H.264? Well … we’ll get to that 🙂

Notably, it’s also supported on the Nintendo Switch (Tegra supports Android after all), causing all sorts of fun issues for emulators … There are actually multiple independent compute shader decoders of ASTC out there in the wild!

ASTC (released in 2012) is still considered the state of the art, and I don’t think anyone else has made a serious attempt at introducing a new hardware accelerated texture compression format since. The state of the art in ASTC compression is still being developed, with recent improvements from Pete Harris on astc-encoder, the official reference encoder; to my knowledge that is the only encoder implementation of the ASTC HDR/3D profiles.

Desktop IHVs other than Intel have so far refused to expose support for ASTC, which is annoying, since it seems we’re forever stuck with two different worlds of texture compression, BC variant zoo on desktop and ASTC on mobile.

The compute shader decoder

Last year, around the time I wrote blogs on BPTC, I also implemented an ASTC LDR/HDR decoder that produced bit-exact results compared to the reference implementation and real hardware in my validation suite. Compared to the BC6 and 7 shaders, the shader complexity is ridiculous, and somehow it works. It’s integrated in Granite such that I can load ASTC compressed scenes transparently, which is useful for testing in a “because I can!” sense.

The shader is certainly not optimized for speed. I focused on readability and debuggability, since it was hard enough to implement this as-is. It’s more than fast enough for my purposes at least.

Enter the horror here. Here be dragons.

One format to bind them all

ASTC has a goal of better quality / bit than every existing format at the same time. This is not easy, considering that BC went the route of many different specialized formats designed to solve specific texture compression needs, and you would expect each format to excel at its specific use case. As far as I know, ASTC succeeds here, given a sufficiently sophisticated encoder.

To supplant every format, you’d need to support these features at the very least:

  • R (BC4, EAC), RG (BC5, EAC), RGB (BC1, ETC2), RGBA (BC3, ETC2, BC7)
  • 4bpp (BC1, ETC2 RGB, BC4), 8bpp (BC3, BC5, ETC2 RGBA, BC6, BC7), and even lower (2bpp PVRTC?)
  • LDR vs HDR (BC6)
  • Decent multi-partition support
  • Decorrelated channel support

ASTC goes beyond these requirements in an attempt to future-proof the format. Bit-rate is fine grained, from sub-1bpp up to 8bpp in small steps. 2D is boring, so ASTC also supports 3D blocks, although adoption of this is non-existent outside Arm as far as I know. For SDF rendering, 3D block compression seems useful in theory at least.

Codec concepts

The ASTC specification in the Khronos Data Format document is 35 pages (!). There’s a lot of ground to cover when implementing this format, but fortunately, the codec features are mostly orthogonal, i.e. the complexity is additive, not exponential, making it somewhat manageable.

Bit-rate scalability

The ASTC format, like every other format we’ve seen up until now, is block based, where each block consumes a fixed number of bits. ASTC chose 128 bits, but bit-rate scalability is achieved through different block sizes. No longer can we rely on nice 4×4 blocks :(. Instead, we have ridiculous things like 8×6 blocks or 5×5 blocks, all the way up to 12×12 blocks. Very little actually changes depending on block size; we just get more weights to decode.
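
Since every block is 128 bits, the bit-rate is purely a function of the block footprint. A trivial helper (not from the decoder, just to make the range obvious):

// 4x4 -> 8.0 bpp, 6x6 -> ~3.6 bpp, 8x8 -> 2.0 bpp, 12x12 -> ~0.9 bpp.
static inline double astc_bits_per_texel(unsigned block_width, unsigned block_height)
{
   return 128.0 / double(block_width * block_height);
}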

In practice from what I have seen, encoders tend to focus on specific block sizes like 4×4, 6×6 and 8×8. In an ideal world, we’d be able to change the block size dynamically within a texture (think H.264 macro blocks being split into smaller blocks adaptively), but we need random access to stay sane.

Good old color endpoint + weight architecture

At a fundamental level, ASTC does not change how texture compression works. There’s still the concept of encoding color endpoints on a block level, and then per-texel weights are applied to interpolate between them. However, as we’ll see, a ton of complexity goes into efficiently encoding these color endpoints and weights.

If you’re familiar with video encoding, there are some fun parallels to draw here, since H.264 and beyond have just piled on more and more complexity on top of the basic motion compensated DCT blocks since the 80s. Texture compression is very similar. Keep piling codec features on top of the same fundamental architecture forever, what could go wrong!

Bits are boring, trits and quints is where it’s at!

Up until now, color endpoints and weights have been encoded with bits, but especially for weights, it’s very hard to achieve fine grained bit rate deltas. What if we could spend a fractional number of bits instead? ASTC achieves this through trinary and quintary (is this a word?) numbers, encoded into a binary representation.

Every encoded weight or endpoint is given N bits as LSBs, and optionally one trit or quint as the MSBs.

The binary encoding works by grouping values together in blocks. For trits, we’re aiming to encode 5 values together into a single number, which is 3^5 = 243 combinations. If we encode this into 8 bits, we’re pretty close to the theoretically optimal encoding.
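
A quick back-of-the-envelope check shows how little is wasted by this packing:

#include <cmath>
#include <cstdio>

int main()
{
   // 5 trits carry 5 * log2(3) ~= 7.92 bits of information; we spend 8.
   // 3 quints carry 3 * log2(5) ~= 6.97 bits of information; we spend 7.
   printf("5 trits:  %.2f bits of entropy, 8 bits spent\n", 5.0 * std::log2(3.0));
   printf("3 quints: %.2f bits of entropy, 7 bits spent\n", 3.0 * std::log2(5.0));
}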

This is where we start running into sprinkles of incomprehensible code snippets in the spec, which are a mix of C and Verilog, designed to be fast in HDL but unreadable nonsense in software. I have no idea what this code is trying to do, so to decode, I just build a LUT which converts 8 bits into 5 trits and call it a day. Quints are similar, where we encode 3 values together, which is 5^3 = 125 combinations and fits snugly into 7 bits.

As an example, here’s what I ended up with in GLSL:

// Trit-decoding.
// quant.x = number of bits per value
int block = idiv5_floor(index); // Classic cute trick: (v * 0x3334) >> 16
int offset = index - block * 5;
start_bit += block * (5 * quant.x + 8);

int t0_t1_offset = start_bit + (quant.x * 1 + 0);
int t2_t3_offset = start_bit + (quant.x * 2 + 2);
int t4_offset = start_bit + (quant.x * 3 + 4);
int t5_t6_offset = start_bit + (quant.x * 4 + 5);
int t7_offset = start_bit + (quant.x * 5 + 7);

// ;__;
int t = (extract_bits(payload, t0_t1_offset, 2) << 0) |
    (extract_bits(payload, t2_t3_offset, 2) << 2) |
    (extract_bits(payload, t4_offset, 1) << 4) |
    (extract_bits(payload, t5_t6_offset, 2) << 5) |
    (extract_bits(payload, t7_offset, 1) << 7);

// LUT magic
t = int(texelFetch(LUTTritQuintDecode, t).x);
t = (t >> (3 * offset)) & 7;

int m_offset = offset * quant.x;
m_offset += idiv5_ceil(offset * 8); // (1) Explanation below

if (quant.x != 0)
{
    int m = extract_bits(payload, m_offset + start_bit, quant.x);
    ret = (t << quant.x) | m;
}

… and similar garbage code for quints.

Note for (1): The reason the T value is scattered around is a special feature where the bit stream can be terminated early if the block is only partially filled. Every trit value requires less than 8/5 bits to encode, so after a value is emitted, we know that we must have emitted ceil(8 * count / 5) bits to encode the trits in binary.

Complex weight un-quantization

For purposes of orthogonality, it’s generally desirable that one part of the decoding process does not affect other parts of the decoding process. Weights are one such case. The color interpolator shouldn’t have to care how many bits the weights were encoded with, and thus we decode weights to a fixed range. In our case, weights have a range of [0, 64].

This is very similar to BPTC. In BPTC, the codec defines un-quantization LUTs for 2bpp, 3bpp and 4bpp like this:

const int weight_table2[4] = int[](0, 21, 43, 64);
const int weight_table3[8] = int[](0, 9, 18, 27, 37, 46, 55, 64);
const int weight_table4[16] = int[](0, 4, 9, 13, 17, 21, 26, 30, 34, 38, 43, 47, 51, 55, 60, 64);

Why not just bit-replication you ask? Well, dividing by non-POT values after weight scale is a PITA, that’s why, and we cannot bit-replicate trits and quints either way. ASTC un-quantizes in a more roundabout way, but uses the same idea from BPTC where weight range is normalized to [0, 64].

Again, the un-quantization step is gnarly enough that I just made a LUT for that as well, because why not:

int decode_weight(uvec4 payload, int weight_index, ivec4 quant)
{
    int primary_weight = decode_integer_sequence(payload, 0,
        weight_index, quant.xyz);
    // quant.w is offset into unquant LUT.
    primary_weight = int(texelFetch(LUTWeightUnquantize,
        primary_weight + quant.w).x);
    return primary_weight;
}

Color endpoints

Color endpoints are weird in that there are multiple phases to them:

  • Decode N values of integer sequence; which N values to use depends on which partition the texel belongs to
  • Un-quantize N values to 8 bits in range [0, 0xff]
  • Interpret N 8-bit values in some magical way depending on the endpoint format (of which there are 16!)
  • Emit multiple RGBA values in UNORM8 (LDR) or UNORM12 (HDR)

void decode_endpoint(out ivec4 ep0, out ivec4 ep1, out int decode_mode,
    uvec4 payload, int bit_offset, ivec4 quant, int ep_mode,
    int base_endpoint_index, int num_endpoint_bits)
{
    num_endpoint_bits += bit_offset;
    // Need to explicitly terminate bitstream for color end points.
    payload &= build_bitmask(num_endpoint_bits);

    // Could of course use an array,
    // but that doesn't lower nicely to indexed registers on all GPUs.
    int v0, v1, v2, v3, v4, v5, v6, v7;

    // End point mode is designed so that the top 2 bits encode
    // how many value pairs are required.
#define DECODE_EP(i) \
    int(texelFetch(LUTEndpointUnquantize, quant.w + \
        decode_integer_sequence(payload, bit_offset, i + \
        base_endpoint_index, quant.xyz)).x)

    int hi_bits = ep_mode >> 2;
    v0 = DECODE_EP(0); v1 = DECODE_EP(1);
    if (hi_bits >= 1) { v2 = DECODE_EP(2); v3 = DECODE_EP(3); }
    if (hi_bits >= 2) { v4 = DECODE_EP(4); v5 = DECODE_EP(5); }
    if (hi_bits >= 3) { v6 = DECODE_EP(6); v7 = DECODE_EP(7); }

    switch (ep_mode) { ... }
}

The endpoint modes are defined in a table in the spec.

Every endpoint mode has code snippets explaining how to take the N values and turn them into RGBA values, which are then fed to the weight interpolator stage. As expected, we just need a horrible switch statement which handles every case. :<

Each partition can even select its own encoding format, which is pretty wild. Poor encoder that has to consider all these scenarios …

LDR endpoints

The endpoint decoding process starts off with trivial cases like luminance endpoints.

Mode 0 (direct)

e0 = (v0, v0, v0, 0xFF);
e1 = (v1, v1, v1, 0xFF);

Since we have an explicit decode step here, ASTC also allows us to take advantage of redundancy between the endpoint values, e.g.:

Mode 1 (base + offset)

L0 = (v0 >> 2) | (v1 & 0xC0);
L1 = L0 + (v1 & 0x3F);
if (L1 > 0xFF) { L1 = 0xFF; }
e0 = (L0, L0, L0, 0xFF);
e1 = (L1, L1, L1, 0xFF);

This encoding scheme improves precision when L0 and L1 are sufficiently close, i.e. correlated. BC7 doesn’t really attempt to take advantage of correlation between endpoints, outside of the minimal shared endpoint / subset bits. BC6 does however, through its use of delta bits.

One question that immediately pops into my head is how this is supposed to work in practice for trit and quint encoding. Since bits are reinterpreted and shuffled around like this, any un-quantization that’s not just bit replication would probably create unexpected results in the MSBs.

Luminance + Alpha (Mode 4/5)

ASTC is a little awkward in how 2-component textures are encoded. The common case for 2-component textures is normal maps, which we would normally encode as RG, à la BC5 or EAC. ASTC only supports luminance alpha, so in order to efficiently encode these formats, we have to pre-swizzle the texture into (R, R, R, G), and apply a (R, W, x, x) swizzle in the VkImageView.

Similar to luminance-only, there is a direct mode, and a correlated mode.

RGB

Other highlights from the LDR formats are base + scale, where the two endpoints are encoded as RGB and RGB * A. Seems very nice for luminance gradients, and quite compact! We also have direct modes, with a special feature which takes some ideas from YUV: blue contraction improves color accuracy close to gray. Of course, there’s a base + offset mode that we already saw for the luma and luma + alpha formats.

RGBA

RGBA encoding is very similar to RGB, with two tacked on alpha values, nothing particularly interesting here.

HDR endpoint insanity

While the LDR encoding modes are pretty straightforward once you stare at them long enough, HDR is just incomprehensible. Where BC6 was fairly naive w.r.t. endpoints, ASTC is the complete opposite.

Similar to BC6, HDR is implemented in a way where we interpret a floating point format as UNORM. This means we’re interpolating in a pseudo-logarithmic domain, i.e. the perceptual domain.

While BC6 basically directly interpolates in U/SNORM16 with a direct conversion to FP16, ASTC is a little more … particular about it.

When decoding HDR endpoints, UNORM12 values are generated. It starts innocently enough in mode 2, where 8-bit inputs are just shifted into place:

int y0, y1;
if (v1 >= v0)
{
    y0 = v0 << 4;
    y1 = v1 << 4;
}
else
{
   // Oh hai thar BPTC shared bits +
   // BC1 symmetry exploitation
   y0 = (v1 << 4) + 8;
   y1 = (v0 << 4) - 8;
}

ep0 = ivec4(ivec3(y0), 0x780 /* 1.0 */);
ep1 = ivec4(ivec3(y1), 0x780 /* 1.0 */);

but eventually you just have to give up trying to reason about the spec once you see the wall of code that greets you for the other HDR modes.

This is the point where you yell WTF out loud, cry a little inside, contemplate your life decisions and copy paste the spec.

Partition hell

Like BPTC, ASTC supports partitions. As we’ve seen before, this feature is designed to deal with sharp transitions in the block. BPTC made it pretty simple: spend 6 bits to select one of 64 partition layouts. This works fine for BPTC since it’s locked to 4×4 blocks and using LUTs makes sense.

ASTC does not have fixed block sizes, so what to do? One LUT for every possible combination? No. ASTC went the route of procedurally generated partition assignments using a 10 bit seed. This works for any block size and partition count, which “solves” that problem. Again, the spec has a long, incomprehensible code snippet defining the process.

Screw this. As usual, look-up texture it is. ASTC can support up to 4 partitions which is pretty wild. No idea how useful this is in practice, as we’ll probably end up spending all bits just encoding color endpoints at that rate …

This seems like a nightmare for encoders. Most of the resulting partitions are worthless garbage, and exhaustively testing over 1000 partition combinations per block does not seem very practical …

Mode hell

Speaking of many combinations, there are a lot of different modes. BC7 has only 8 modes, and BC6H has 14. From what I’ve seen, even BPTC encoders just focus on 2 or 3 modes at most. ASTC has several thousand modes if we follow the same logic! 😀

Most of the mode bits are spent on encoding things like:

  • Quantization mode of weight grid
  • Resolution of weight grid [see below]
  • Decorrelated colors
  • Void-extent [see below]
  • Reserved [see below]

Once we have looked at the mode bits, we can compute how many bits are required to encode the weights, and based on that result, we also know how many bits are required to encode the color endpoints. The color endpoint quantization level is thus implicit (so much fun …). There are also tons of edge cases to account for, and error cases to handle …

Weight grid interpolation

A special ASTC feature is that the weight grid isn’t 1:1 with texels, unlike every other format we’ve seen so far. The weight grid can be much lower resolution than the block itself, and the real weights are reconstructed with bi-linear interpolation. This seems quite useful for scenarios where the block encodes a smooth gradient and the weights are highly predictable; might as well spend more bits on encoding color endpoints instead. The spec outlines a specific fixed-point scheme that must be followed exactly, and finally the code snippets in the spec actually make some sense!

This feature is a must-have for larger block sizes. There just isn’t room for a full weight grid when we start to hit big boy block sizes like 8×8. At this low bit-rate, the only tool we have to encode high frequency features is partitions …

Void extent

Encoding constant high-precision colors can actually be kind of awkward in most texture compression formats with limited endpoint precision, but ASTC has a special mode for it, where exact FP16 values can be encoded if desired. The feature even goes so far as to specify that a region outside the block also has the same color, which theoretically allows a texture filtering unit to short circuit. This doesn’t sound very practical, and I’m not sure if this feature is actually used.

Painful error handling

Another “feature” of ASTC is how errors are handled. Unlike other texture compression formats, there are many encodings which are illegal, and a correct decoder must detect all error scenarios and return the proper error color. This error color is either saturated magenta or NaN (ææææææææ!) depending on the implementation and whether or not HDR is supported.

Inconsistent decode formats

ASTC decoders will either decode to UNORM8 or FP16, depending on the profile supported by the GPU. This is kinda annoying, especially when we consider formats like SRGB8 where apparently the RGB portion is supposed to decode to UNORM8 viewed as SRGB, but alpha is still FP16? (._.) Not even the reference decoder seems to get that right, so I just decode alpha as UNORM8 in that case.

Interpolation and final encode

Finally, a breather. Once color endpoints and weights are fully decoded, we enter interpolation. It’s not all trivial though. That would be boring.

The first step is always to expand the UNORM8 or UNORM12 results into UNORM16. This is context sensitive on things like the endpoint format as well as how we’re encoding the value after interpolation. This final encode is the raw data that is fed into the texture filtering unit.

if (decode_mode == MODE_HDR)
{
    ep0 <<= 4; // Simple expansion 12 -> 16-bit
    ep1 <<= 4;
}
else if (decode_mode == MODE_HDR_LDR_ALPHA)
{
    ep0.rgb <<= 4;
    ep1.rgb <<= 4;
    ep0.a *= 0x101; // Bit replicate UNORM8 to UNORM16,
    ep1.a *= 0x101; // for FP16 conv later
}
else if (DECODE_8BIT)
{
    ep0 = (ep0 << 8) | ivec4(0x80); // Treat the data as 8.8 fixed point.
    ep1 = (ep1 << 8) | ivec4(0x80); // Also bake in 0.5 rounding factor.
}
else
{
    ep0 *= 0x101;
    ep1 *= 0x101;
}

// This is why weights have [0, 64] range and not [0, 63].
ivec4 color = (ep0 * (64 - weight) + ep1 * weight + 32) >> 6;

Now we have a complete UNORM16 color value that needs to be encoded.

Encode to UNORM8 / SRGB8

For 8-bit decode, we make use of the 8.8 fixed point expansion, and just shift the result down. Easy!

 imageStore(OutputImage, coord.xy, uvec4(final_color >> 8));

Encode LDR to FP16

This one is annoying. When interpolating, we did bit replication, and we need to convert the UNORM16 value to FP16. It’s not as simple as just dividing by 0xffff, of course. We explicitly need to treat 0xffff as 1.0, and for other values, we divide by 2^16 with round-to-zero semantics. We have to do this by hand of course. Soft-float is so much fun! <_<

// ASTC has a very peculiar way of converting the decoded result to FP16.
// 0xffff -> 1.0, and for everything else we get
// roundDownQuantizeFP16(vec4(c) / vec4(0x10000)).
ivec4 msb = findMSB(color);
ivec4 shamt = msb;
ivec4 m = ((color << 10) >> shamt) & 0x3ff;
ivec4 e = msb - 1;
uvec4 decoded = m | (e << 10);
uvec4 denorm_decode = color << 8;
decoded = mix(decoded, uvec4(denorm_decode), lessThan(e, ivec4(1)));
decoded = mix(decoded, uvec4(0x3c00), equal(color, ivec4(0xffff)));
return decoded;

Encode HDR to FP16

The intention is that the interpolation is logarithmic, but floats are just piecewise logarithmic. BC6 doesn’t care at all and just reinterprets the resulting UNORM16 interpolation as FP16, but ASTC tries to be a little smarter. The mantissa is tweaked a bit.

// Interpret the value as FP16, but with some extra fixups along the way to make the interpolation more
// logarithmic (apparently). From spec:
ivec4 e = color >> 11;
ivec4 m = color & 0x7ff;
ivec4 mt = 4 * m - 512;
mt = mix(mt, ivec4(3 * m), lessThan(m, ivec4(512)));
mt = mix(mt, ivec4(5 * m - 2048), greaterThanEqual(m, ivec4(1536)));

From what I can tell, there is no signed float support in ASTC, except for the void extent.

Decode format extensions

In Vulkan, there is a weird extension called VK_EXT_astc_decode_mode which allows a VkImageView to select which format an ASTC block should decode to. Why is this even a thing? One quirk of ASTC is that even UNORM blocks full of LDR data must be decoded to bit-exact FP16. This means that if a GPU architecture decodes compressed textures into L1 cache, we would end up consuming a lot more cache than strictly necessary. The whole point of the decode mode extension is to allow us to explicitly decode to a lower precision format to improve cache hit rates, presumably. (That makes me wonder what GPUs actually do in practice …)

In Granite, I setup the decode mode as:

// ...
if (format_is_srgb(create_info.format))
   return true;

if (format_is_compressed_hdr(create_info.format))
{
   auto &features = device->get_device_features();
   if (features.astc_decode_features.decodeModeSharedExponent)
      astc_info.decodeMode = VK_FORMAT_E5B9G9R9_UFLOAT_PACK32; // Nice
   else
      astc_info.decodeMode = VK_FORMAT_R16G16B16A16_SFLOAT;
}
else
{
   astc_info.decodeMode = VK_FORMAT_R8G8B8A8_UNORM;
}
// ...

Fortunately, there are now special SFLOAT format variants for ASTC in the VkFormat enum, which are tied to the ASTC HDR extension, and can be used to signal to applications that a texture likely contains HDR blocks. The awkward part of ASTC is that even when using the UNORM format, you can still decode HDR blocks, and it just happens to work on GPUs which support ASTC HDR. Profiles are fun, right? ._. With decode mode, we can at least clamp down on these shenanigans. HDR blocks fail to decode in UNORM8 mode.

Conclusion

So far, this is the endgame of texture compression, and I really don’t want to look at this again. 🙂 It’s been a year since I looked into this last time, so I might have missed some details.

There’s tons of smart ideas in this format, but I also feel it’s too clever for its own good. Writing a decoder was difficult enough, and I don’t even want to think about how painful an encoder would be. A format is only as good as its encoding ecosystem after all.

Compressed GPU texture formats – a review and compute shader decoders – part 2

This is the second part of the blog series I started in part 1. We have covered the S3TC, RGTC and ETC family of formats. This served as a good introduction to the topic of texture compression, but from here, the complexity will explode.

BPTC

This post will be dedicated to the BPTC compressed formats. These are the BC6 and BC7 formats in Vulkan, and they represent the state of the art in texture compression on desktop GPUs, rounding out RGB and RGBA compression. BC1 and BC3 were the only proper desktop alternatives before these formats came along, and the quality of BC1 and BC3 isn’t … great.

BPTC also adds support for HDR. This is a big deal as before BPTC it was not possible to properly compress HDR data.

I pondered a bit over which format I should present first, but I think it’s appropriate to start with BC6 since it is much simpler than BC7 in terms of overall complexity. I feel like when BC6 was designed it sacrificed complexity from BC7 in order to add HDR.

Common ideas

Partitions

As we saw in ETC2, there was a very crude and early attempt to add partitions to the block format. The T and H modes in particular let you specify two different color endpoint pairs which could be selected at will using 1 bit per texel. Rather than letting the partition be dictated by the 2×4 / 4×2 sub-block, you could select any partition scheme.

Specifying an entire bit per texel to select a partition is quite overkill however, especially for a format which just gives you 64 bits for color. This makes the T and H modes crude hacks on top of ETC1. BPTC instead adds the concept of a pre-made table of partition shapes, and support for more than 2 partitions. Rather than specifying the partition index for each texel, we specify, say, 64 common shapes, and just spend up to 6 bits per block (6 / 16 bits per texel) to encode the information. Of course, this idea falls flat if the exact partition we need doesn’t exist, but BPTC’s partitions seem well thought out. This is just a pain to implement since it means copy-pasting over a lot of tables ._.

Variable bit depth for endpoints and weights

Earlier formats were extremely rigid in how the blocks were to be encoded. You get X number of bits to specify endpoints and 2 bits per texel for the weights, a neat 32/32 bit split between the control block and texel data. Overall, this is a good setup for 64 bit block formats like BC1 and ETC, but it falls a bit short for 128 bit block formats like BPTC. Now we have more freedom and headroom to either spend lots of bits on endpoint accuracy, more weight bits per texel, or perhaps more partitions (which requires a lot more bits to encode multiple endpoints). Essentially, everything becomes a trade-off. More partitions can deal better with uncorrelated texels, more endpoint precision can deal with smooth gradients, and more weight bits improve general PSNR when the dynamic range in the block is large.

How all of this is configured is done through …

Mode hell

As we also saw with ETC2, there were multiple “modes” a block could enter into, depending on the encoded bits. We have the “differential”, “individual”, “T”, “H” and “Planar” modes. This already adds a lot of complexity to an encoder (which needs to figure out which mode is best through heuristics and trial and error) and to a decoder (which has to implement everything).

BPTC makes the mode idea more explicit and a certain number of bits are reserved to express the “mode” a block will be in.

Exploiting endpoint symmetry

While earlier formats used endpoint symmetry as a way to enable different modes (e.g. BC1), BPTC has already encoded the mode explicitly. Instead, we exploit symmetry to save one weight bit. We can simply assume that the MSB weight bit of the first texel is 0; we can always flip the order of the endpoints to make this work. Actually, we can save one bit per partition, since one texel per partition can be assumed to have an MSB weight bit of 0. Which texels to treat specially is all done through another look-up table (sigh …) defined by the specification. Unfortunately, this means bitfield extraction from the 128-bit payload becomes very awkward and irregular … For this purpose I had to make a bitfield helper function. I doubt it’s very efficient, but you gotta do what you gotta do …

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bitextract.h

Normalized weight un-quantization

Up until now there hasn’t been any good way of interpolating between endpoints in any format. Interpolation in S3TC and RGTC involves a non-POT divider, which is awkward to make bit-exact, and ETC technically does not interpolate between endpoints, as there is an offset table (which is basically the same thing in practice, but it doesn’t concern itself with a divider). While ETC has bit-exact decoding, S3TC and RGTC do not, as the interpolation is defined in terms of floating point.

With the number of weight bits being variable in BPTC, it makes sense to normalize the weights to a range which is easier to work with, and can easily support bit-exact decoding. Values in the range of [0, 64] were chosen. BPTC can use either 2, 3 or 4-bit weights, and the specification defines a table which normalizes the weights onto the [0, 64] range. Interpolation can then be easily done in fixed point as:

interpolated = (a * (64 - weight) + b * weight + 32) >> 6;

This avoids the non-POT divider headache we’ve seen earlier. I tried finding a neat arithmetic expression to re-normalize the weights without a LUT, but I gave up pretty quickly.

No planar mode?

One thing I found rather puzzling was the lack of a planar mode in BPTC. ETC2 has it and seems kinda useful …

BC6 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc6.comp

BC6 is laser focused on compressing HDR data (FP16). There is only RGB support, no RGBA at all.

Representing floating point without floating point

In the other formats we have looked at, we have used normalized integers as a way to represent endpoints and colors. Interpolation of integer endpoints is very easy since it’s just fixed point math. When we start introducing floating point into the mix, there is a question of how we represent this efficiently. As we have seen, we cannot spend too many bits on representing endpoints, even with 8-bit color. With FP16, representing endpoints with a full 16 bits per channel is extremely wasteful. As with 8-bit color, we need a simple and efficient solution for quantizing FP16 values with an arbitrary number of bits.

A hypothetical solution for FP16 could be to store the exponent (5 bits) and just quantize the mantissa accordingly. BC6 isn’t far from this idea, although it is much simpler than that. BC6 exploits the fact that the floating point representation of finite numbers is monotonic for positive numbers.

For example, if we consider that the internal representation of endpoints is 16-bit unsigned, we interpolate endpoints as 16-bit integers, and perform a scaling operation and bitcast to FP16:

fp16_bits = (interpolated_value * 31) / 64;
fp16_texel = bitcast<fp16>(fp16_bits);

What this effectively does is to map 0 to 0.0 in FP16 and 0xffff to 0x7bff, which is 65504.0, the largest finite representable value in FP16 (0x7c00 is +Inf). Everything in-between is monotonically increasing. The interpolation itself ends up being non-linear (closer to logarithmic), but for HDR data this should be fine. Human perception of linear light is roughly logarithmic anyways, so it might actually make more sense to interpolate in a logarithmic space rather than a linear one here, a very nice hack indeed! Of course, this is just a crude approximation as the FP16 representation is only piece-wise logarithmic.

For the signed format (BC6H_SFLOAT), we do a very similar fix-up step, but it is a bit more involved since we need to take care of the sign bit. The interpolated value is assumed to be a signed integer in [-0x8000, 0x7fff] range.

signed = interpolated_value < 0;
fp16_bits = (abs(interpolated_value) * 31) / 32;
fp16_bits |= signed ? 0x8000 : 0;
fp16_texel = bitcast<fp16>(fp16_bits);

A recognized bug in the specification (or feature depending on the situation) is that -0x8000 will be translated to -Inf (0x7c00 | sign_bit). It is not possible to represent +Inf.

Transformed endpoints

A thing BC6 can do is to assume correlation between two endpoints. This is similar to the “differential” mode in ETC2, but BC6’s variant of it is far more flexible. Rather than giving us X number of bits per component, we encode the first endpoint with more bits, and then the other endpoint as an offset from the first, with fewer bits. This combines very well with multiple partitions, since the other partition’s endpoints are also encoded as differentials from the base endpoint.
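
A rough sketch of what reconstructing a transformed endpoint looks like (hedged; base_bits and delta_bits are just illustrative names, and the signed format needs an extra sign-extend of the result which is glossed over here):

// Reconstruct a transformed endpoint from the base endpoint and a
// sign-extended delta. The addition wraps within the base endpoint precision.
ivec3 apply_transformed_endpoint(ivec3 base, ivec3 delta, int base_bits, int delta_bits)
{
    ivec3 d = bitfieldExtract(delta, 0, delta_bits); // sign-extend the delta
    return (base + d) & ((1 << base_bits) - 1);      // wrap to base precision
}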

Mode “hell”

BC6 has a lot of modes to choose from, 14 to be specific. However, that is only at first glance. I like to separate these into 2 major types:

  • 2 partition modes
  • 1 partition modes

After selecting how many partitions you have, most of the modes just let you specify how many bits are spent on the base endpoint color, and how many bits are spent for each transformed endpoint (delta bits). Essentially, modes 0, 1, 2, 6, 10, 14, 18, 22, 26 are all the same, with different bit-allocation schemes. Mode 30 appears to be slightly different at first glance since it does not have transformed endpoints, but that’s just because delta bits == endpoint bits at this point, so it’s meaningless to transform the endpoints. The story is the same for the modes where the number of partitions is 1.

Based on the mode, we get either 3 or 4 bits of weights per texel (3 bits in the 2-partition modes, 4 bits with a single partition).

The major issue in decoding BC6 is that the bit layout for each mode is completely nonsensical and irregular. The bits are packed in seemingly random places, so unfortunately the decoder ends up with:

if ((mode & 2) == 0)
{
    if ((mode & 1) != 0)
        interp = decode_bc6_mode1(payload, linear_pixel, part, anchor_pixel);
    else
        interp = decode_bc6_mode0(payload, linear_pixel, part, anchor_pixel);
}
else
{
    switch (mode)
    {
    case 2:
        interp = decode_bc6_mode2(payload, linear_pixel, part, anchor_pixel);
        break;
    case 3:
        interp = decode_bc6_mode3(payload, linear_pixel);
        break;
    case 6:
        interp = decode_bc6_mode6(payload, linear_pixel, part, anchor_pixel);
        break;
    case 7:
        interp = decode_bc6_mode7(payload, linear_pixel);
        break;
    case 10:
        interp = decode_bc6_mode10(payload, linear_pixel, part, anchor_pixel);
        break;
    case 11:
        interp = decode_bc6_mode11(payload, linear_pixel);
        break;
    case 14:
        interp = decode_bc6_mode14(payload, linear_pixel, part, anchor_pixel);
        break;
    case 15:
        interp = decode_bc6_mode15(payload, linear_pixel);
        break;
    case 18:
        interp = decode_bc6_mode18(payload, linear_pixel, part, anchor_pixel);
        break;
    case 22:
        interp = decode_bc6_mode22(payload, linear_pixel, part, anchor_pixel);
        break;
    case 26:
        interp = decode_bc6_mode26(payload, linear_pixel, part, anchor_pixel);
        break;
    case 30:
        interp = decode_bc6_mode30(payload, linear_pixel, part, anchor_pixel);
        break;
    default:
        interp = DecodedInterpolation(ivec3(0), ivec3(0), 0);
        break;
    }
}

Not pretty 🙁 Most of the code in bc6.comp is used to decode all the weird variants.

Endpoint quantization

Once we have the quantized endpoints, we un-quantize them by scaling them up to the full 16-bit integer range. At this point, we can interpolate the endpoints using the normalized weights, and apply the BC6 fixup to turn it into a final FP16 value.

ivec3 unquantize_endpoint(ivec3 ep, int bits)
{
    ivec3 unq;
    // Specialization constant
    if (SIGNED)
    {
        // Sign-extend
        ep = bitfieldExtract(ep, 0, bits);
        if (bits < 16)
        {
            ivec3 sgn = 1 - ((ep >> 30) & 2); // 1 or -1
            ivec3 abs_ep = abs(ep);
            unq = ((abs_ep << 15) + 0x4000) >> (bits - 1);
            // Special cases. Boolean mix FTW.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0x7fff), greaterThanEqual(abs_ep, ivec3((1 << (bits - 1)) - 1)));
            unq *= sgn;
        }
        else
            unq = ep;
    }
    else
    {
        // Zero-extend
        ep = ivec3(bitfieldExtract(uvec3(ep), 0, bits));
        if (bits < 15)
        {
            unq = ((ep << 15) + 0x4000) >> (bits - 1);
            // Special-cases.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0xffff), equal(ep, ivec3((1 << bits) - 1)));
        }
        else
            unq = ep;
    }
    return unq;
}

Summary

BC6 introduces a lot of new concepts that BC7 will make use of as well. The main complication is all the different modes and the awkward bit packing layouts. After extracting the quantized endpoints, the process from there is quite straightforward.

BC7 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc7.comp

BC7 is similar to BC6 in many ways, and uses many of the same ideas.

8-bit endpoints

Just like BC6 endpoints are un-quantized to 16 bits before interpolation, BC7 un-quantizes to 8-bit integers. This is done through the typical “bit-replication” algorithm that we often see when converting RGB444 and RGB565 up to RGB888. The weight interpolation is done the exact same way as BC6 with a [0, 64] range of weights.
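
The bit-replication itself is just a shift and an OR, roughly like this:

// Expand an n-bit endpoint value to 8 bits by shifting it up and replicating
// its top bits into the vacated LSBs. Works for n >= 4, which covers BC7's
// endpoint precisions.
int bit_replicate_to_8(int value, int bits)
{
    value <<= (8 - bits);
    return value | (value >> bits);
}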

Support for 3 partitions

BC6 only supports 2 partitions, but BC7 can support 3. There are separate tables for 3 partition modes compared to 2 partition modes. (Fun …)

Flexible alpha channel

Unlike BC6, BC7 fully supports alpha, and finally we have a way to encode RGB and A together. This is a big improvement over the older formats like S3TC and ETC2 which encode alpha completely separately in different block formats. When we encode RGB and A together, we can allocate bits between them as we please, e.g.:

  • No alpha. If alpha is constant 1.0 inside a block, we can use all 128 bits for RGB.
  • Correlated RGB and A. If these components are correlated, we can encode endpoints as RGBA with one weight per texel. This saves a lot of weight bits.
  • Uncorrelated RGB and A. This is the case essentially assumed by BC2, BC3 and ETC2. However, BC7 will let us tune a little bit how many bits we spend on color and how many bits we spend on alpha.

Uncorrelated color channel support

Since we can encode RGB and A as uncorrelated endpoints, what if we could freely select which component is uncorrelated? BC7 can do this, and this is expressed through the rotation bits. After decoding, we can swap A out with any other component. That way we can technically encode (R, GB), (RB, G) or (GB, R) for example. This seems rather niche, but it’s there … I suspect the designers happened to have 2 extra bits lying around they could use for this purpose.
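
In code, applying the rotation after decode is just a swizzle (a sketch; the exact mapping of the 2-bit field is per the specification, this is how I remember it):

// Swap the decoded alpha result with one of the color channels.
vec4 apply_rotation(vec4 rgba, int rotation)
{
    if (rotation == 1)
        rgba = rgba.agbr; // swap A and R
    else if (rotation == 2)
        rgba = rgba.rabg; // swap A and G
    else if (rotation == 3)
        rgba = rgba.rgab; // swap A and B
    return rgba;
}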

No transformed endpoints

In a somewhat curious design choice, there is no support for transformed (correlated) endpoints in BC7. I suspect that for this reason, the number of modes could be kept somewhat sensible. I also suspect the need for transformed endpoints isn’t as great since we’re not working with HDR values anymore.

Modes

There are 8 modes in BC7, which all seem to be carefully chosen based on their use cases. Each mode makes some choices like:

  • How many partitions? (1 – 3)
  • Is there alpha? (Modes [0, 3] don’t have alpha, modes [4, 7] do.)
  • How many bits per endpoint component? (Color and alpha have separate bit depth)
  • How many bits per index?
  • Is alpha uncorrelated or correlated? (Mode 4 and 5 have two weight indices)
  • Which component is uncorrelated? (For mode 4 and 5)

This is all tabulated by the specification, and from there the algorithms are similar for all the modes. Other than this, there might be some leftover bits, and I believe the BC7 designers just invented some use for them. In some modes, these bits can be used as a shared LSB of the endpoints for example, or to select how many weight bits the color and alpha channels receive respectively in mode 4.
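
The shared LSB idea can be sketched like this (hedged; whether the bit is shared between both endpoints of a subset depends on the mode):

// The stored endpoint drops its LSB; a separately stored "P-bit" supplies it
// before the usual bit-replication to 8 bits (helper from the earlier sketch).
int apply_p_bit(int quantized_endpoint, int p_bit, int endpoint_bits)
{
    int with_lsb = (quantized_endpoint << 1) | p_bit;
    return bit_replicate_to_8(with_lsb, endpoint_bits + 1);
}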

Summary

BPTC is a fairly advanced texture compression format. BC6 and BC7 share many ideas as expected and introduce a large, but not unmanageable, configuration space. From an implementation point of view though, the large reliance on look-up tables is annoying, but understandable.

BPTC is focused on perfecting color texture encoding and leaves the other kind of encoding to RGTC, which I think does a great job with 1 and 2 channel textures already.

Up next?

In the third (probably not final) part we tackle the monster that is ASTC. Complexity will jump another 10x.

Compressed GPU texture formats – a review and compute shader decoders – part 1

Compressed texture formats are one of the more esoteric aspects of graphics programming that almost no one cares all that much about. Neither did I, but I’ve recently taken an academic interest in the zoo of compressed texture formats.

During development in Granite, I occasionally find it useful to test scenes which target mobile on desktop and vice versa, and in Vulkan, where there are no fallback paths for unsupported compression formats, we gotta roll our own decompression.

While it really isn’t all that useful to write a decoder for these formats, my goal is to create a suite of reasonably understandable compute shader kernels which can decode all of the standard formats I care about. Of course, I could just use a Frankenstein decoder which merges together a lot of C reference decoders and call it a day, but that’s not aesthetically pleasing or interesting to me. By implementing these formats straight from the Khronos Data Format specification, I learned a lot of things I would not otherwise know about these formats.

There are several major families of formats we can consider multi-vendor and standardized. Each of them fills its own niche. Unfortunately, desktop and mobile each have their own timelines with different texture compression standards, which is not fully resolved to this day in GPU hardware. (Basis Universal is something I will need to study eventually as well as it aims to solve this problem in software.)

By implementing all these formats, I got to see the evolution of block compression formats, and the major differences and design decisions that went into each format.

The major format families

First, it is useful to summarize all the families of texture compression I’ve looked at.

S3TC / DXT

The simplest family of formats. These formats are also known as the “BC” formats in Vulkan, or rather, BC1, 2 and 3. This is the granddad of texture compression, similar to how I view MPEG1 in the video compression world.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs, probably for historical patent reasons.

RGTC

A very close relative of S3TC. These formats are very simple formats which specialize in encoding 1 and 2 uncorrelated channels, perfect for normal maps, metallic-roughness maps, etc. It is somewhat questionable to call these a separate family of formats (the Data Format specification separates them), since the format is essentially identical to the alpha format of S3TC, except that it also supports SNORM (-1, 1 range) alongside UNORM. These formats represent BC4 and BC5 in Vulkan.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs.

ETC

The ETC family of formats is very similarly laid out to S3TC in how different texture types are supported, but the implementation detail is quite different (and ETC2 is quite the interesting format). To support encoding full depth alpha and 1/2-component textures, there is the EAC format, which mirrors the RGTC formats.

These formats are firmly rooted in mobile GPUs. ETC1 was originally the only mandated format for OpenGLES 2.0 implementations, and ETC2 was mandated for OpenGLES 3.0 GPUs. It has almost no support on desktop GPUs. Intel iGPU is an exception here.

BPTC

This is where complexity starts to explode and where things get interesting. BC6 and BC7 are designed to compress high quality color images at 8bpp. BC6 adds support for HDR, which is, to this day, one of only two ways to compress HDR images.

On desktop, BPTC is the state of the art in texture compression and was introduced around 2010.

ASTC

ASTC is the final boss of texture compression, and is the current state of the art in texture compression. Its complexity is staggering and aims to dominate the world with 128 bits. Mere mortals are not supposed to understand this format.

ASTC’s roots are on Mali GPUs, but it was always a Khronos standard, and is widely supported now on mobile Vulkan implementations (and Intel iGPU :3), at least the LDR profile. What you say, profiles in a texture compression format? Yes … yes, this is just the beginning of the madness that is ASTC.

PVRTC?

PVRTC is a PowerVR-exclusive format that has had some staying power due to iOS and I will likely ignore it in this series. However, it seems like a very different kind of format to all the others and studying it might be interesting. That said, there is zero reason to use this format in Granite, and I don’t want to bite off more than I can chew.

What is a texture compression format anyway?

In a texture compression format, the specification describes a process for taking random bits given to it, and how to decode the bit-soup into texels. There are fundamental constraints in texture compression which are unique to this problem domain, and these restrictions heavily influence the design of the formats in question.

Fixed block size

To be able to randomly access any texel in a texture, there must be an O(1) mapping from texture coordinate to memory address. The only reasonable way to do this is to have a fixed block size. In all formats, 4×4 is the most common one. (As you can guess, ASTC can do odd-ball block sizes like 6×5).

Similarly, for reasons of random access, the number of bits spent per block must be constant. The typical block sizes are 64 bits and 128 bits, which is 4bpp and 8bpp respectively at a 4×4 block size.
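
For example, finding the block which contains texel (x, y) is a trivial O(1) computation. A sketch for a plain linear layout of 128-bit blocks (real GPUs add their own tiling on top of this):

// Byte offset of the 4x4 block containing texel (x, y), assuming a linear
// layout of 16-byte (128-bit) blocks.
uint block_byte_offset(uint x, uint y, uint width)
{
    uint blocks_per_row = (width + 3u) / 4u;
    uint block_index = (y / 4u) * blocks_per_row + (x / 4u);
    return block_index * 16u;
}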

Image and video compression has none of these restrictions. That is a major reason why image and video compression is so much more efficient.

A set of coding tools

Each format has certain things it can do. The more complex the operations the format can do, the more expensive the decoding hardware becomes (and the more complex a software decoder becomes), so there’s always a challenge to balance complexity with quality per bit when standardizing a format. The most typical way to add coding tools is to be able to select between different modes of operation based on the content of the block, where each mode is suited to certain patterns of input. Use the right tool for the job! As we will see in this study, the number of coding tools will increase exponentially, and it starts to become impossible to make good use of all the tools given to you by the format.

Encoding becomes an optimization task where we aim to figure out the best coding tools to use among the ones given to us. In simpler formats, there are very few things to try, and approaching the optimal solution becomes straight forward, but as we get into the more esoteric formats, the real challenge is to prune dead ends early, since brute forcing our way through a near-infinite configuration space is not practical (but maybe it is with GPU encode? :3)

Commonalities across formats

Image compression and video compression use the Discrete Cosine Transform (DCT) even to this day. This fundamental compression technique has been with us since the 80s and refuses to die. All the new compression formats just keep piling on complexity on top of more complexity, but in the center of it all, we find the DCT, or some equivalent of it.

Very similarly, texture compression achieves its compression through interpolation between two color values. In some form or another, the formats all let us specify two endpoints, which are constant over the 4×4 block, and interpolation weights, which vary per texel. Most of the innovation in the formats comes down to how complicated and esoteric we can make the process of generating the endpoints and weights.

The weight values are typically expressed with very few bits of precision per texel (usually 2 or 3), and this is the main way we will keep bits spent per pixel down. This snippet is the core coding tool in all the formats I have studied:

decoded_texel = mix(endpoint0, endpoint1, weight_between_0_and_1);

To correlate, or not to correlate?

The endpoint model blends all components in lock-step. Typically the endpoint will be an RGB value. We call this correlated, because this interpolation will only work well if chrominance remains fairly constant with luminance being the only component which varies significantly. For uncorrelated input, say, RGB with an alpha mask, many formats let us express the decorrelated components with two sets of weights.

decoded_rgb = mix(endpoint0_rgb, endpoint1_rgb, rgb_weight);
decoded_alpha = mix(endpoint0_alpha, endpoint1_alpha, alpha_weight);

This costs a lot more bits to encode since alpha_weight is very different from rgb_weight, but it should be worth it.

Many formats let us express if there is correlation or not. Correlation should always be exploited.

Working around the horrible endpoint interpolation artifacts

Almost all formats beyond the most trivial ones try really hard to come up with ways to work around the fact that endpoint interpolation leads to horrible results in all but the simplest input. The most common approach here is to split the block into partitions, where each partition has its own endpoints.

S3TC – The basics

A compute shader decoder:

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/s3tc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC1 – 4×4 – 64 bits

The BC1 format is extremely simple and a good starting point. 32 bits is used to encode two RGB endpoints in RGB565 format. The other 32 bits encode 16 weights, with 2 bits allocated to each texel.

This lets us represent interpolation weights of 0, 1/3, 2/3 and 1.

Since there is a symmetry in this design, i.e.:

mix(a, b, l) == mix(b, a, 1.0 - l)

there would be two ways to specify the same block, where we swap endpoints and invert the weights to compensate. This is an extra bit of information we can exploit. Based on the integer representation of the two endpoints, we can check if one is greater than the other, and use a different decoding mode based on that information. This exploitation of symmetry will pop up again in many formats later! In the secondary mode, we add support for 1-bit alpha, called a punch-through in most formats. In this mode, the interpolation weights become 0, 1/2, 1 and BLACK. This lets us represent fully transparent pixels. However, color becomes BLACK, so this will only work with pre-multiplied alpha schemes, otherwise there will be black rings around textures. I don’t think this mode is used all that much these days, but it is an option.
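
Putting the two modes together, building the 4-entry color palette looks roughly like this (a sketch, not the Granite decoder; the interpolation is written in floating point precisely because the format has no bit-exact requirement):

// Build the BC1 palette from the 32-bit endpoint word of a block.
void decode_bc1_palette(uint color_word, out vec3 palette[4], out bool punch_through)
{
    uint ep0 = color_word & 0xffffu;
    uint ep1 = color_word >> 16u;

    // RGB565 -> normalized RGB.
    vec3 c0 = vec3((ep0 >> 11u) & 31u, (ep0 >> 5u) & 63u, ep0 & 31u) / vec3(31.0, 63.0, 31.0);
    vec3 c1 = vec3((ep1 >> 11u) & 31u, (ep1 >> 5u) & 63u, ep1 & 31u) / vec3(31.0, 63.0, 31.0);

    palette[0] = c0;
    palette[1] = c1;
    punch_through = ep0 <= ep1;

    if (!punch_through)
    {
        // Opaque mode: weights 0, 1/3, 2/3, 1.
        palette[2] = mix(c0, c1, 1.0 / 3.0);
        palette[3] = mix(c0, c1, 2.0 / 3.0);
    }
    else
    {
        // Punch-through mode: weights 0, 1/2, 1, and index 3 is transparent BLACK.
        palette[2] = mix(c0, c1, 0.5);
        palette[3] = vec3(0.0);
    }
}

Each texel then just picks one of the four entries with its 2-bit weight from the other 32-bit word.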

That is it for this format, it really is that simple.

One thing to note is that the specification is defined in terms of floating point with under-specified requirements for precision, and thus there is no bit-exact representation of the decoded values. Almost all hardware decoders of this format will give slightly different results, which is unfortunate. MPEG1 and MPEG2 also made the same mistake back in the day, where the DCT is specified in terms of floating point.

BC2 – 4×4 – 128 bits

BC2 is a format which adds alpha support by splicing together two blocks. A BC1 block describes color, and a second block adds an alpha plane with 4-bit UNORM. This format is quite obscure since the next format, BC3, generally does a much better job at compressing alpha. A curious side effect of BC2 and 3 is that the punch-through mode in the BC1 block no longer exists, i.e. the endpoint symmetry is left unexploited, so we lose 1 bit of information. I wonder why that information bit was not used to toggle between BC2 alpha decode (noisy alpha) and BC3 alpha decode (smooth alpha) …

BC3 – 4×4 – 128 bits

The alpha encoding of BC3 is very similar to how BC1 works. It also forms the basis of RGTC. The 64 bits of the alpha block spends 16 bits to encode 2 8-bit endpoints for alpha. We now get 3 bits as interpolation weights instead of 2 since there’s 48 bits left for this purpose. Similar to BC1, we can exploit symmetry and get two different modes based on the endpoints. In the first mode, the interpolation weights are as expected, [0, 1, 2, …, 7] / 7. In the second mode, we have 6 interpolated values [0, 1, 2, …, 5] / 5, and two values are reserved to represent 0.0 and 1.0. This can be very useful. This is essentially a very early attempt to introduce partitions into the mix, as we can essentially split up a block into 3 partitions on demand: (Fully opaque texels, fully transparent texels, the in-betweens). This can let us specify a tighter range for the endpoints as there is never a need to use a full [0, 0xff] endpoint range.
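
The resulting 8-entry alpha palette can be sketched like this (illustrative only, since hardware decoders of this format are not bit-exact anyway):

// Build the BC3/BC4 alpha palette from the two endpoints (normalized here).
// mode_8 is selected when a0 > a1, i.e. the endpoint symmetry trick.
void decode_alpha_palette(float a0, float a1, bool mode_8, out float palette[8])
{
    palette[0] = a0;
    palette[1] = a1;
    if (mode_8)
    {
        // First mode: 6 interpolated values, weights [1 .. 6] / 7.
        for (int i = 2; i < 8; i++)
            palette[i] = mix(a0, a1, float(i - 1) / 7.0);
    }
    else
    {
        // Second mode: 4 interpolated values, weights [1 .. 4] / 5,
        // plus explicit 0.0 and 1.0 entries.
        for (int i = 2; i < 6; i++)
            palette[i] = mix(a0, a1, float(i - 1) / 5.0);
        palette[6] = 0.0;
        palette[7] = 1.0;
    }
}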

Summary

The S3TC formats are very simple, but there are certainly things to note about them. Alpha support is just bolted on, it is not integrated into the format. This means that even though the block is 128 bits, there is no way to spend more than 64 bits on color, even if the alpha plane has a completely flat value.

RGTC

RGTC (red-green) is basically BC3’s alpha block format turned into its own thing. Their main use is with non-color textures, e.g. normal maps, metallic-roughness maps, luminance maps, etc. It is very simple, and quality is quite good.

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC4 – 4×4 – 64 bits

This is BC3’s alpha block format as its own format, which returns one component. The only real difference from BC3 alpha is that it also supports an SNORM variant, which is very useful for normal maps, although I only bothered with UNORM, since my shaders need to assume input can be from any format.

BC5 – 4×4 – 128 bits

RGTC assumes uncorrelated channels, and thus the only sensible choice was to just slap together two BC4 blocks side-by-side, and voila, we can encode 2 channels instead of 1.

Summary

RGTC is simple and nice. It only needs to consider single channels of data, writing encoders for it is very easy, and it is probably the simplest format out there. For what they do, I really like these formats.

Like S3TC, there is no bit-exact decoding, which is rather unfortunate.

ETC – Refining S3TC

ETC, or Ericsson Texture Compression, is a family with multiple generations. ETC2 is backwards compatible with ETC1 in that all valid ETC1 blocks will decode the same way in ETC2, but ETC2 exploits some undefined behavior of ETC1 to extend the format into something more interesting.

ETC has a bit-exact decode, which makes verification very easy. 😀

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/etc2.comp

ETC1 – 4×4 – 64 bits

In many ways, ETC1 is quite similar to BC1, but there are some key differences. Just like BC1, 32 bits are spent to encode endpoints, and 32 bits are spent to give 16 texels 2 bits of weight information each. The main difference between ETC1 and BC1 is how endpoints work.

Sub-blocks

As a very crude form of partitioning, ETC1 allows you to split a block into either 2×4 or 4×2 sub-blocks, where each sub-block has its own endpoints. To do this, endpoints are expressed in a more compact way than BC1. Rather than specifying two RGB values, ETC in general likes to express endpoints as RGB +/- delta-intensity, where delta-intensity is described by a table. This makes things far more compact since we enforce constant chrominance. By saving so many bits, we can express 4 endpoints in total, 2 for each sub-block.
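
Per texel, the decode then boils down to a base color plus a table-driven offset, clamped (a sketch; the modifier values are from memory, so double-check them against the specification, and the mapping from the raw 2-bit weight bits to a column in the table is glossed over here):

// ETC1 intensity modifiers; each 3-bit codeword selects a row of 4 modifiers
// (shown in ascending order, the spec defines the actual index-to-column mapping).
const int etc1_modifiers[32] = int[](
      -8,  -2,   2,   8,
     -17,  -5,   5,  17,
     -29,  -9,   9,  29,
     -42, -13,  13,  42,
     -60, -18,  18,  60,
     -80, -24,  24,  80,
    -106, -33,  33, 106,
    -183, -47,  47, 183);

ivec3 decode_etc1_texel(ivec3 base_rgb8, int codeword, int modifier_index)
{
    int modifier = etc1_modifiers[codeword * 4 + modifier_index];
    return clamp(base_rgb8 + modifier, ivec3(0), ivec3(255));
}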

Uncorrelated or correlated endpoints?

Since we have to specify two sets of endpoints, the format gives us a way to specify if the two endpoints have completely different colors, or if the endpoints should be specified in base + offset form. This is controlled with a single bit, which changes the encoding from ep0 = RGB444, ep1 = RGB444 to ep0 = RGB555, ep1 = ep0 + sign_extend(RGB333). These values are not allowed to overflow in any way, which is something ETC2 exploits to great effect later.

NOTE: I found it more instructional to call it uncorrelated and correlated endpoint modes, but the specification calls it “individual” and “differential” modes.

Summary

ETC1 is somewhat different than BC1, but overall, it’s quite similar. It turns out, if you add a few restrictions on top of ETC1, you get ETC1S, which can trivially be transcoded to BC1. Basically, enforce the correlated endpoint mode, set the RGB333 bits to 0, enforce that delta-luma is the same for both sub-blocks.

There is also no way to express alpha with ETC1, which is unfortunate, but ETC1 is completely obsolete for my use cases either way.

ETC2 RGB – 4×4 – 64 bits

As mentioned earlier, ETC1 is a sub-set of ETC2. ETC2 adds a bunch of new and curious modes which give us a small glimpse into more flexible ways to express endpoints, and even adds a mode which I have never seen in any other format since.

Exploiting undefined behavior

When you select the correlated (differential) endpoint mode, there are some restrictions on overflow. We can exploit this fact in order to gain 3 new modes of operation for the ETC2 color codec!

First, we check if R + sign_extend(dR) is outside [0, 31] range. If so, we activate the so-called “T” mode. In this mode, we essentially add a partitioning scheme to the codec. We now remove the concept of two sub-blocks and let all texels access all available endpoints. We encode two RGB444 values (A, B), and a delta value (d). We form a T-shape by specifying 4 possible color values as A, B, B + d, B – d. This can be useful if the block is smooth, except for some weird outliers. A would be the outlier color, and B represents the middle of the smooth colors.

If G + sign_extend(dG) overflows, we enter a very similar “H” mode. In this mode, we do the exact same thing, except that the 4 possible colors become A + d, A – d, B + d, B – d.

If B + sign_extend(dB) overflows, we enter a very interesting mode, which I have never seen again in later formats. I’m not sure why, since it seems very useful for expressing smooth gradients. Essentially, in this mode we don’t encode weights per texel, but rather express RGB at texel (0, 0), at texel (4, 0) and texel (0, 4), and just bilinearly interpolate across the block to obtain the actual color. This is very different from the other endpoint interpolation we’ve seen earlier, because that flattens everything into a single line in the color space, but now we can access an entire 2D plane in the space instead.
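
My understanding of the planar interpolation boils down to this (a sketch; O, H and V are the three encoded colors, already expanded to 8 bits per channel):

// ETC2 planar mode: O, H and V are the colors at texels (0, 0), (4, 0) and
// (0, 4); every texel is a bilinear blend of the three.
ivec3 decode_planar_texel(ivec3 O, ivec3 H, ivec3 V, int x, int y)
{
    ivec3 rgb = (x * (H - O) + y * (V - O) + 4 * O + 2) >> 2;
    return clamp(rgb, ivec3(0), ivec3(255));
}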

Punch-through alpha

ETC2’s punch-through scheme is very similar to BC1’s 1-bit alpha. When this format is enabled, we remove the capability for uncorrelated endpoints (individual mode), and replace that bit with a flag which selects whether all texels are opaque, or potentially transparent. This idea is the exact same as BC1. In the transparent mode, code == 2 marks the texel as being transparent black. It does not work in planar mode though; this bit is ignored there.

Alpha support

Very similar to BC3, ETC2 also supports full 8-bit alpha by slapping together a separate block alongside the color block. The way this works is very similar to how RGTC works, but instead of two endpoints, ETC2 encodes a center point, and then uses tables to expand that range into 8 possible values using a table selector and a multiplier. These 8 possible values for the block are then selected with 3bpp indices. We lose the capability to cleanly represent 0.0 and 1.0 though, which is somewhat curious.

EAC – 4×4 – 64 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/eac.comp

EAC is ETC2’s version of RGTC. It is designed as a way to encode 1 and 2 de-correlated channels, basically the exact same approach as RGTC, where the alpha block format is reused for the 1/2-channel formats. EAC is a bit different in that the internal precision is 11 bits (for some reason).

Unfortunately, EAC is kinda awkward since it’s technically bit-exact in the final fixed point value between [0, 2047], but it specifies many different ways in which this can be converted to a floating point value.

Next up, BPTC

S3TC, RGTC and ETC represent the simpler formats, and hopefully I’ve summarized what these formats can do. Next up, I’ll go through the BPTC formats, which significantly increase complexity.

Clustered shading evolution in Granite

Shading many lights in a 3D engine is kinda hard once you step outside the bubble of classic deferred shading. Granite supports a fair amount of different ways to do lighting, mostly because I like to experiment with different rendering structures.

I presented some work on this topic at SIGGRAPH 2018 in the Moving Mobile Graphics course when I still worked at Arm: https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-20-66/5127.siggraph_2D00_2018_2D00_mmg_2D00_3_2D00_rendering_2D00_hanskristian.pdf. Unfortunately, some slides have videos embedded, and they didn’t seem to have made the transition unscathed to PDF.

Since then I’ve looked into VK_EXT_descriptor_indexing in order to remove some critical limitations with my older implementation and I ended up with some uncommon implementation details.

Classic deferred

Just to summarize what I mean by this, this is the good old method of rendering light volumes on top of a G-buffer and light is accumulated per-light by blending on top of the frame buffer. This method is considered completely obsolete on desktop these days, but it’s still quite viable on mobile with on-chip G-buffers.

Where classic deferred breaks down

  • Very high bandwidth on desktop, since every light volume re-reads the G-buffer and blends on top of the frame buffer
  • No forward shading support (well, duh)
  • No transparent objects
  • No volumetric lighting

Thus, I explored some alternatives …

Why I don’t like tile deferred shading

This is a somewhat overloaded term, but this is the method where you send the G-buffer to a compute shader. The G-buffer is split into tiles, assigned to a workgroup, depth range (or multiple ranges) is found for the tile, which then forms a frustum. Lights are then culled against this frustum, and shaded in one go. This technique was a really great fit for the PS3 SPUs, as shown by Battlefield 3 back in the day.

It still doesn’t really solve the underlying issues of classic deferred except that it is far more bandwidth efficient on desktop-class GPUs.

Forward shading isn’t feasible unless you split the algorithm into multiple stages with Z-prepass -> build light list per tile -> resubmit geometry and shade, but then it’s probably called something entirely different … (Is this what’s known as Forward+ perhaps?)

Transparency isn’t doable unless you render all transparent geometry in a separate pass to find min/max depth per pixel or something, and forget about volumetrics.

On mobile TBDR architectures you lose all bandwidth savings from staying on-tile, and doing FRAGMENT -> COMPUTE barriers on a tile-based architecture is usually a terrible idea for pipelining. The exception here is tile shaders on recent iPhone hardware which seems almost designed to do this algorithm in hardware.

Clustered shading – the old implementation

Clustered shading is really nice in that it is completely agnostic to a depth buffer, so all the problems mentioned earlier just go away. Lights are assigned in 3D-space rather than screen-space. The original paper on the subject is from around the same time tile deferred was getting popular.

Abandoning the frustum structure

In Granite, I chose early on to abandon the “frustum” layout. Culling spot lights and point lights against elongated frustums analytically is very hard and computationally expensive, and getting the Z slices correct is also very fiddly.

A common workaround for culling is using conservative rasterization, but that is a feature I cannot rely on. I figured that using a more grid-like structure I could get away with much simpler culling math. Since all elements in the grid-based cluster are near-perfect cubes, I could get away with treating the elements as spheres, and https://bartwronski.com/2017/04/13/cull-that-cone/ makes spot light-to-sphere culling very cheap and quite tight. Here’s a visualization of the structure. This structure is stored in a 3D texture. Each “mip-level” is packed in the Z dimension as the resolution of each level is the same.

Bitscan loops instead of light lists

Light lists are the approach where each element in the cluster contains a (start, count) pair and all lights found in that range are shaded. Computing this list on the GPU is rather messy. The memory footprint for a single element is unknown on the CPU timeline, and we cannot deal properly with worst-case scenarios. This is easier when we cluster on the CPU instead, but that’s boring!

I really wanted to cluster lights on the GPU, so I landed on a bitmask approach instead. The worst case storage is just 1 bit per light per element rather than 32 bit.

The main limitation of this technique is still the number of lights we could feasibly support. With a bitmask structure we need to allocate for the worst case, and it can get out of hand when we consider a worst case with 1000+ lights. I only had modest ideas in the beginning, so I supported 32 spot lights and 32 point lights, which were encoded in RG32UI per element in a 3D texture. With a cluster resolution of 64x32x16x9, culling on the GPU is very fast, even on mobile. We can set the ceiling higher of course if we expand to RGBA32UI or use more texels per element.

Bitscan loops are great for scalarization

A thing I realized quickly when doing clustered forward shading is the importance of keeping VGPRs down on AMD hardware. The trick to move VGPRs to more plentiful SGPRs is to ensure that values are uniform across a subgroup. E.g. instead of doing this:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data must be loaded into VGPRs since lsb is a VGPR.
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

we can do a simple trick with subgroup operations:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

// OR over all active threads.
// As this is the same value for all threads, compiler promoted to SGPR.
light_bitmask = subgroupOr(light_bitmask);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data can be loaded into SGPRs instead.
    // Far better occupancy, much amaze, wow!
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

Uniformly loading light data from buffers is excellent. I’ve observed up to 15% uplift on AMD by doing this. The light list approach mentioned earlier has a much harder time employing this kind of optimization. We would have to scalarize on the cluster element itself, which could lead to very bad worst-case performance.

No bindless – ugly atlasing

Another problem with clustered shading (and tile deferred for that matter) is that we need to shade a lot of lights in one go, and those lights can have shadow maps. Without bindless, all shadow maps for spot lights must fit into one texture, and point lights must fit into one texture. Atlasing is the classic solution here, but it is a little too messy for my taste. As the number of lights was rather low, I just had a plain 2D texture for spot lights, and a cube array for point lights. Implementing variable resolution with an atlas is also rather annoying, and for point lights, I would be forced to flatten the cube down to 6 2D rects and do manual cube lookup instead without proper seam filtering, ugh.

Scaling to “arbitrary” number of lights

While performance for reasonable number of lights was quite excellent compared to alternative techniques, I couldn’t really scale number of lights arbitrarily, and it has been nagging me a bit. Memory becomes a concern, and while the “list of lights” approach is likely less memory hungry in the average case, it has even worse worst-case memory requirements, and it’s not very friendly for GPU culling.

Kinda clustered shading? New bindless hotness

The talk on Call of Duty’s renderer in Advances in Real-Time Rendering 2017 presents a very fresh idea on how to do shading, and it hits all the right buttons. Culling on GPU, bitscan loops, scalarization, scales to a lot of lights, a lot of things to like here.

I spent some time this holiday season implementing a new path for clustered shading based on this technique, and I ended up deviating in a few places. I’ll go through my implementation of it, but it will make more sense once you study the presentation first.

Decoupling XY culling from Z

The key feature of the Call of Duty implementation is how it partitions space. Rather than a full 3D cluster, we decouple XY and Z dimensions, so rather than O(X * Y * Z) we get O(X * Y + Z).

Z is also binned linearly, and we can have several thousand bins in Z. This makes everything much nicer to deal with later. Culling here is trivial, since we compute min/max Z in view space for each light, which is very simple.

For each Z-slice, we just need to figure out the minimum light index and maximum light index which hits our Z-slice. Of course, to make the ranges as tight as possible, we sort lights by Z distance.
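
A rough sketch of what the binning amounts to (hypothetical names and example constants; the real implementation runs on the GPU and would build the ranges with atomics or a scan rather than this serial loop):

const int NUM_Z_SLICES = 1024;              // example value
const float inv_z_slice_size = 1.0 / 0.5;   // example: 0.5 view-space units per slice

// .x = min light index, .y = max light index which overlaps the slice.
uvec2 z_slice_range[NUM_Z_SLICES];

void bin_light(uint sorted_light_index, float view_z_min, float view_z_max)
{
    int lo = clamp(int(view_z_min * inv_z_slice_size), 0, NUM_Z_SLICES - 1);
    int hi = clamp(int(view_z_max * inv_z_slice_size), 0, NUM_Z_SLICES - 1);
    for (int z = lo; z <= hi; z++)
    {
        z_slice_range[z].x = min(z_slice_range[z].x, sorted_light_index);
        z_slice_range[z].y = max(z_slice_range[z].y, sorted_light_index);
    }
}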

Data structures

  • Each tile in XY needs a bitmask array, u32[ceil(num_lights / 32)] for each tile. This can be tightly packed in a single buffer.
  • A buffer containing Z-slices as described above.
  • Per-light information: position/radius/cone/light type/shadow matrix/etc

Going back to frustum culling

Now that we cluster in XY and Z separately, I went back to frustum partitioning, and now we need a way to do frustum culling against spot and point lights … Conservative rasterization really is the perfect extension to use here, just a shame it’s not widely available yet on all relevant hardware.

The presentation has an alternative for conservative rasterization as current consoles do not support this feature, which is mostly to render light volumes a-la classic deferred (not completely dead yet!) at full-resolution and splatting out bits with atomics as fragment threads are spawned. If you have depth information you can eliminate coverage using classic deferred techniques. However, I went in a completely different direction without using depth at all.

Compute shader conservative rasterization

It felt natural to do all the culling work in compute shaders. This is where most of the “fun” is. This is also a rather esoteric way of doing it, but I like doing esoteric stuff with compute shaders, see https://themaister.net/blog/2019/10/12/emulating-a-fake-retro-gpu-in-vulkan-compute/ as proof of that.

Conservative sphere rasterization

To solve this problem, we’re going to tackle it geometrically instead. First, we solve the screen-space bounding box problem. Fortunately, this problem is separable, so we can compute screen-space bounds in X and Y separately.

(Behold glorious Inkscape skillz ._.)

What we want to do is to figure out where P_lo and P_hi intersect the near plane. P can be rotated in 2D by the half-angle in two directions. This way we find tangent points on the circle. sin(theta) is conveniently equal to r / length(L), so building a 2×2 rotation matrix is very easy. After rotating P, we can project P_lo and P_hi, and now we have clip space bounds in one dimension. Compute separately for XZ and YZ dimensions and we have screen-space boundaries.
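
In code, the bounds for one dimension come out to something like this (my reading of the approach above, so treat it as a sketch rather than the actual culling shader):

// Conservative clip-space bounds of a sphere along one dimension.
// L is the sphere center in view space, here in the XZ plane so L.y holds view Z,
// and r is the radius. Assumes the sphere does not cross the near plane.
vec2 sphere_clip_bounds_1d(vec2 L, float r)
{
    float sin_theta = r / length(L);
    float cos_theta = sqrt(max(1.0 - sin_theta * sin_theta, 0.0));

    // Rotate the center direction by +/- the half-angle to hit the tangent points.
    vec2 p0 = mat2(cos_theta, sin_theta, -sin_theta, cos_theta) * L;
    vec2 p1 = mat2(cos_theta, -sin_theta, sin_theta, cos_theta) * L;

    // Project (x over z); a real implementation also folds in the projection scale.
    float a = p0.x / p0.y;
    float b = p1.x / p1.y;
    return vec2(min(a, b), max(a, b));
}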

Projecting a sphere to clip space creates an ellipse, and to compute the ellipse, we need to rotate view space such that the sphere center lies perfectly on the X or Y axis. For simplicity, we orient it on the +X axis. We can then perform the range test, and an ellipse is formed. We can now test any point directly against this ellipse. If the sphere intersects the near plane in any way, we can fall back to the screen-space bounding box.

Here’s a ShaderToy demonstrating the math. Of course, Inigo Quilez did it first :p

In the real implementation, we only need to compute setup data once for each point light; to rasterize a pixel, we apply a transform and perform a conservative ellipse test, which is rather straightforward.

Conservative spot light rasterization

I tried an analytical approach, but I gave up. Spot-frustum culling is really hard if you want tight culling, so I went with the simpler approach of just straight up rasterizing the 6 triangles which form a spot light. We can rasterize in floating point since we don’t care about water-tight rasterization rules, and it’s conservative anyways. It’s not the prettiest thing in the world to do primitive clipping inside a shader, but you gotta do what you gotta do …

The … peculiar shader can be found here.

Binning shader

Once we have setup data for point lights and spot lights, we do the classic culling optimization where we bin N lights in parallel over a tile, broadcast the results to all threads, which then compute the relevant lights per pixel. Subgroup ballots are a nice trick here, replacing the old shared memory approach. Each workgroup preferably works on 32 lights at a time to compute a 32-bit bitmask.

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless_binning.comp

The shading loop

The final loop to shade becomes something like:

uint cluster_mask_range(uint mask, uvec2 range, uint start_index)
{
	range.x = clamp(range.x, start_index, start_index + 32u);
	range.y = clamp(range.y + 1u, range.x, start_index + 32u);

	uint num_bits = range.y - range.x;
	uint range_mask = num_bits == 32 ?
		0xffffffffu :
		((1u << num_bits) - 1u) << (range.x - start_index);
	return mask & uint(range_mask);
}

vec3 shade_clustered(Material material, vec3 world_pos)
{
    vec3 result = vec3(0.0);
    ivec2 cluster_coord = compute_clustered_coord(gl_FragCoord.xy);
    int linear_cluster_coord = linearize_coord(cluster_coord);
    int z = compute_z_slice(dot(world_pos - camera_pos, camera_front));

    uvec2 z_range = cluster_range[z];

    // Find min/max light we need to consider when shading slice Z.
    // Make this uniform across subgroup. SGPR.
    int z_start = int(subgroupMin(z_range.x) >> 5u);
    int z_end = int(subgroupMax(z_range.y) >> 5u);

    for (int i = z_start; i <= z_end; i++)
    {
        // SGPR
        uint mask =
            cluster_bitmask[linear_cluster_coord *
                            num_lights_div_32 + i];

        // Restrict to lights within our Z-range. VGPR now.
        mask = cluster_mask_range(mask, z_range, 32u * i);
        // SGPR again.
        mask = subgroupOr(mask);

        // SGPR
        uint type_mask = cluster_transforms.type_mask[i];

        // Good old scalarized loop <3
        while (mask != 0u)
        {
            int bit_index = findLSB(mask);
            int light_index = 32 * i + bit_index;
            if ((type_mask & (1 << bit_index)) != 0)
            {
                result += compute_point_light(light_index,
                                              material,
                                              world_pos);
            }
            else
            {
                result += compute_spot_light(light_index,
                                             material,
                                             world_pos);
            }

            // Clear the bit we just shaded so the loop terminates.
            mask &= ~(1u << bit_index);
        }
    }

    return result;
}

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless.h

Potential performance problems

By decoupling XY and Z in culling there’s a lot of potential for false positives where large lights might dominate how Z-ranges are computed and trigger a lot of over-shading. I haven’t done much testing here though, but this is probably the only real weakness I can think of with this technique. Regular tile-deferred has similar issues.

Culling tightness

I’m using smaller lights here to demonstrate. Red or green light signifies that a light was computed for that pixel:

Placing a point light in volumetric fog for good measure. The weird red/green “artifacting” around the edges is caused by the forward shading and subgroupOr logic when shading to ensure subgroup uniform behavior.

Potential improvements

Clipping Z-range calculation against view frustum might help a fair bit, since Z-range can be way too conservative for large positional lights like these. Especially spot lights which point to the side like the image above. Classic deferred has a very similar problem case unless ancient stencil culling techniques are used to get double sided depth tests.

Conclusion

I’m happy with the implementation. Performance seems very good, but I haven’t dug deep into analysis there yet. I was mostly concerned with getting it to work. Just waiting for mobile GPU vendors to support bindless, so I can test there as well I guess.

YUV sampling in Vulkan – a niche and complicated feature – VK_KHR_sampler_ycbcr_conversion

Sometimes I like to implement Vulkan extensions in Granite just because. This time around we’re looking at YUV.

For anyone who has worked with multimedia before, YUV (or YCbCr as it’s also referred to) is a nightmare. The act of converting YUV to RGB is very simple as it’s just a color space transform with a 3×3 matrix, but YUV means many things. There is no end to how overloaded YUV can be, and making sure you know exactly which YUV flavor you’re dealing with can be quite tricky. From a system integration point of view, YUV is a serious pain.

When dealing with video content in Vulkan, you will have to consume YUV somehow.

YUV has some characteristics which all make sense for video compression, but complicate things.

  • Planar: Each color component is often packed in different 2D images.
  • Luma/Chroma: Y refers to luminance, UV (or CbCr) refers to chrominance (color). This is critical for video compression.
  • Downsampled chroma. Human eyes have higher resolution for luminance vs color, so less bandwidth on color is an easy way to save space.

There is a lot of variance here between different YUV formats, as there is no obvious standard for these kinds of things.

  • How many planes? 2 or 3 are common (Y, U, V) vs (Y, packed UV). Packed single plane can be used in some other obscure scenarios. (YUYV and variations on that).
  • For packed representations, which component comes first?
  • How many bits per component? 8 is the most common, but 10-bit YUV content can be found in the wild.
  • How much is chroma downsampled? 2x horizontally and vertically is by far the most common, often referred to as YUV420p (the naming convention in YUV makes no sense, don’t try to find any).
  • Where is the texel center for the chroma samples? Common values are co-sited at every other luma sample, or in the mid-point between groups of 2×2 luma samples.
  • What is the exact color space conversion matrix from YUV to RGB?
  • How is chroma reconstructed to full resolution?

Dealing with YUV without fancy extensions

To render a YUV frame in RGB is not necessarily a difficult task, but depending on how many formats you need to deal with, shader variants can quickly get out of hand. You’ll typically do something like this:

layout(binding = 0) uniform sampler2D TexLuma;
layout(binding = 1) uniform sampler2D TexCb;
layout(binding = 2) uniform sampler2D TexCr;

layout(location = 0) out vec3 FragColor;
layout(location = 0) in vec2 TexCoord;

const mat3 yuv_to_rgb_matrix = mat3(....);

void main()
{
    float Luma = textureLod(TexLuma, TexCoord, 0.0).x;
    float Cb = textureLod(TexCb, TexCoord, 0.0).x; // For mid-point chroma
    float Cr = textureLod(TexCr, TexCoord, 0.0).x;
    vec3 yuv = vec3(Luma, Cb, Cr);
    // Possibly expand range here if using TV YUV range and not PC YUV range.
    yuv = rescale_yuv(yuv);
    FragColor = yuv_to_rgb_matrix * yuv;
}

This is fine, but there is some motivation to do better here. Since watching video is one of the most common operations a GPU does, it would be nice if we could do this more efficiently, especially on low-power mobile devices with batteries attached to them. It would also be really nice if we could sample a video texture in our shader and have the GPU just “deal with it”.

The planar texture formats

VK_KHR_sampler_ycbcr_conversion adds a lot of new texture formats to Vulkan. For YUV420p, we’re going to look at this format, VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM.

In planar formats, we see that 3 color components are spread out across 3 planes, with a “420” here to mean that the second and third components are half resolution. “GBR” is a bit strange, but G refers to Y, and B and R refer to Cb and Cr respectively. Green is the strongest contributor to the luma Y channel after all, so I guess it kinda makes sense like that …

By having planar texture formats, a single texture unit of the GPU is able to fetch from all three planes at once, meaning we will put a lot less stress on the GPU texturing unit. If we think of rendering video, this is a large part of the work done on the GPU. For lower-end mobile devices rendering 4K video without melting, this is actually really important.

The image aspects

In order to be able to refer to each plane separately when copying data in and out of the texture, we can use VK_IMAGE_ASPECT_PLANE_{0,1,2}_BIT. When copying to and from plane 1 and 2 in YUV420p, the resolutions of those planes are halved. VK_IMAGE_ASPECT_COLOR_BIT refers to the whole “GBR” as a whole, and it’s only useful when sampling the image.

Disjoint image allocation

Since we have 3 image planes, it is not unreasonable to want to combine three separate images into one planar image. E.g. we can allocate three R8_UNORM images and alias them with a planar image later. This is supported, and we’ll have to create and allocate images in a particular way.

First, we create the R8_UNORM images with VK_IMAGE_CREATE_ALIAS_BIT. This means that we will alias the image meaningfully with another image, even when using OPTIMAL tiling, and image layouts are shared across the aliases! We’ll use this to alias with a plane inside the planar texture later.

Be aware of alignment requirements. I’ve found that planar textures can need larger alignment than the single-component textures might need, so either using standalone allocations per plane, or bumping alignment to something like 64k works around that.

When we create the planar texture, we specify DISJOINT_BIT and ALIAS_BIT. For disjoint, it means we need to query allocation requirements and bind memory separately for each plane. To do this, we need to use vkGetImageMemoryRequirements2 and vkBindImageMemory2. Here, we just bind the same memory we used for our separate textures.

Setting up a sampler conversion object

The vkCreateSamplerYcbcrConversion function creates an object which encodes exactly how we will convert the planar components into RGB values.

VkSamplerYcbcrConversionCreateInfo info = { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_CREATE_INFO };

// Which 3x3 YUV to RGB matrix is used?
// 601 is generally used for SD content.
// 709 for HD content.
// 2020 for UHD content.
// Can also use IDENTITY which lets you sample the raw YUV and
// do the conversion in shader code.
// At least you don't have to hit the texture unit 3 times.
info.ycbcrModel = VK_SAMPLER_YCBCR_MODEL_CONVERSION_YCBCR_709;

// TV (NARROW) or PC (FULL) range for YUV?
// Usually, JPEG uses full range and broadcast content is narrow.
// If using narrow, the YUV components need to be
// rescaled before it can be converted.
info.ycbcrRange = VK_SAMPLER_YCBCR_RANGE_ITU_NARROW;

// Deal with order of components.
info.components = {
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
};

// With NEAREST, chroma is duplicated to a 2x2 block for YUV420p.
// In fancy video players, you might even get bicubic/sinc
// interpolation filters for chroma because why not ...
info.chromaFilter = VK_FILTER_LINEAR;

// COSITED or MIDPOINT? I think normal YUV420p content is MIDPOINT,
// but not quite sure ...
info.xChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;
info.yChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;

// Not sure what this is for.
info.forceExplicitReconstruction = VK_FALSE;

// For YUV420p.
info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;

VkSamplerYcbcrConversion conversion;
table->vkCreateSamplerYcbcrConversionKHR(device, &info, nullptr,
                                         &conversion);

Passing along to VkImageView and VkSampler

Both an image view and sampler which are used with KHR_sampler_ycbcr_conversion must have this sampler conversion object passed into it via pNext. This is because part of this information will be split up between the two objects. Planar and swizzle information is likely part of image view, while filtering and chroma siting is likely part of sampler object.

Immutable sampler

Some information is also relevant for the shader compiler, so for YCbCr sampling there are some restrictions. You have to use a COMBINED_IMAGE_SAMPLER and the sampler must be immutable in the descriptor set layout. Since that immutable sampler contains the conversion object, this allows the shader compiler to see how to complete the transform where hardware support stops. This might just be performing the YUV to RGB conversion, or it lets the shader implement almost everything on its own.

In Granite, I pass along immutable sampler information in a rather crude way where I look at the declared name of a combined image sampler in the shader.

Putting it all together I can write shader code like this in Granite and sample YUV420p content directly:

#version 450

// "LinearYUV420P" guides reflection to use an immutable sampler.
// Not ideal since I need to know which YUV conversion to use ...
// But it works for now.
layout(set = 0, binding = 0) uniform sampler2D uSamplerLinearYUV420P;
layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;

void main()
{
    // It's just like sampling a regular texture :D
    FragColor = textureLod(uSamplerLinearYUV420P, vUV, 0.0);
}

Getting fancy with hardware video decoding

The natural way to extend this is to use hardware video decoding, export each plane as external memory, and import it in Vulkan. DISJOINT allocation comes in handy here since we don’t have to rely on a video decoder exporting a full planar image as one large unit. I haven’t looked into this yet though, but I suggest looking at MPV here. There seems to be some support for this kind of zero-copy flow with Vulkan.

Was all of this really all that useful?

Not really, but I have some ideas to add a “video texture material” in the future to Granite, so it might come in handy later. There might be some use for it in graphics programming where bandwidth is saved by encoding post-process render targets to YUV. That way, sampling the YUV post targets shouldn’t require a lot of manual hackery.