My personal hell of translating DXIL to SPIR-V – part 5

Time for despair

It has taken way too long to get to this point, but we must address the final challenge of translating DXIL, a claustrophobic mess that haunts me to this day, the most nightmarish set of poorly understood heuristics which come together to put pretty pixels on screen. It is, of course, the act of untangling spaghetti and turning unstructured control flow into structured control flow.

True impostor syndrome kicks in

Doing compiler things without knowing what you’re doing is agonizing. You can never trust yourself, and whatever I did for DXIL structurization has turned out to fail 30+ times; each time I fix a critical bug and move on. The only thing I can say for sure is that it’s becoming asymptotically correct. I have been able to make 100k+ AAA shaders compile and validate correctly, but every new insane scenario DXC (LLVM) throws at me could be the one to break everything. If you cannot mathematically prove a compiler algorithm, you might be extremely screwed, because the edge cases are truly insane. The true compiler nerds are probably giggling in the corner 🙂

Goal for this post

The goal for this post is to go through the basic theory needed to understand the structurizer. The topic itself will take many posts to fully cover, so this is a good start.

GraphViz – visual debugging

GraphViz is a cornerstone of debugging control flow, so the very first thing we need is the ability to dump a CFG to a PDF. Fortunately, there is just the tool for the job.

For example we can start with an extremely basic shader from the test suite:

RWByteAddressBuffer Buf : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 index : SV_DispatchThreadID)
{
  uint result = 0;

  // Single branch
  [branch]
  if (index.x == 10)
    result = 40;
  else
    result = 50;

  // 2-level branch
  [branch]
  if (index.x == 20)
  {
    [branch]
    if (index.y == 30)
      result += 20;
    else
      result *= result;
  }
  else
  {
    [branch]
    if (index.y == 40)
      result += 70;
    else
      result *= 10;
  }

  Buf.Store(0, result);
}

If we dump this input to GraphViz we get:

digraph {
1 [label="1", shape="box"];
2 [label=".entry", shape="box"];
2 -> 1;
3 [label="2", shape="box"];
2 -> 3;
3 -> 1;
4 [label="3", shape="box"];
1 -> 4;
5 [label="4", shape="box"];
1 -> 5;
6 [label="7", shape="box"];
5 -> 6;
7 [label="8", shape="box"];
5 -> 7;
8 [label="9", shape="box"];
7 -> 8;
6 -> 8;
9 [label="5", shape="box"];
4 -> 9;
10 [label="6", shape="box"];
4 -> 10;
10 -> 8;
9 -> 8;
}

Then, we have this trivial Python script to dump to PDF:

#!/usr/bin/env python3

from graphviz import Source
import sys

def main():
  s = Source.from_file(sys.argv[1])
  s.view()

if __name__ == '__main__':
  main()

For debugging purposes, we can dump out a GraphViz after every major transformation step, of which there are many. For reference, here’s the final structured output, which we will study in greater detail. Of course, this is baby mode compared to what we have to deal with later, but gotta start somewhere.
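
The dump pass itself does not need to be fancy. Here is a minimal sketch of what such a dumper could look like; the CFGNode stand-in only carries the name and successor list we need for the graph, while the real implementation obviously tracks far more state:

#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-in for the real CFGNode; for dumping we only need a
// name and the successor list.
struct CFGNode
{
  std::string name;
  std::vector<CFGNode *> succ;
};

static void dump_cfg(FILE *file, const std::vector<CFGNode *> &nodes)
{
  std::unordered_map<const CFGNode *, unsigned> ids;
  auto get_id = [&](const CFGNode *node) -> unsigned {
    auto itr = ids.find(node);
    if (itr != ids.end())
      return itr->second;
    // First time we see a node, emit its label declaration.
    unsigned id = unsigned(ids.size()) + 1;
    ids[node] = id;
    fprintf(file, "%u [label=\"%s\", shape=\"box\"];\n", id, node->name.c_str());
    return id;
  };

  fprintf(file, "digraph {\n");
  // Declare every node up front so isolated nodes also show up.
  for (auto *node : nodes)
    get_id(node);
  for (auto *node : nodes)
    for (auto *succ : node->succ)
      fprintf(file, "%u -> %u;\n", get_id(node), get_id(succ));
  fprintf(file, "}\n");
}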

SPIRV-Tools also has a tool to extract a GraphViz from a SPIR-V module.

spirv-cfg -o test.graph test.spv

This is super handy as well! The graph can be visualized with the same Python script.

SPIRV-Tools is also critical since we lean heavily on the SPIR-V validator to verify that our SPIR-V output conforms to structured rules.

Basic theory for the non-compiler nerds

When trying to learn this stuff, you will encounter terminology from graph theory, and any learning material surrounding this topic usually comes from CS textbooks or lectures which assume you have all that background knowledge (I have an EE background, not CS or fancy math). These concepts become the basic primitives of any algorithm we have to perform on the CFG. None of this is particularly difficult, sure, but it needs to be understood at a deep level.

Reachability

Reachability asks the question: can block A reach block B? In the example above, 4 can reach 9 by branching via either 7 or 8, but 4 cannot reach 6. This is quite intuitive.

Back-edge

In loops, a back-edge branches back to the beginning of the loop. This is detected when A can reach B and B branches to A.

Here is an example where 3 has a back-edge to itself.

Note: I tend to use “back-edge” interchangeably for the block which takes the branch (the continue block) and for the branch itself.

This makes the CFG cyclic. We can eliminate back-edges from a DXIL CFG and achieve an acyclic CFG.

I believe this property is what makes the CFG reducible. Irreducible CFGs can have completely random gotos which make it impossible to ensure an acyclic graph without some extra preprocessing. DXIL fortunately does not support irreducible CFGs. That is also DXIL’s only saving grace …

In our implementation, back-edges are treated as special branches which only happen in loops, and for the sake of reachability analysis we only consider acyclic reachability.

There can only be one back-edge branch to a particular block. I think this is what LLVM defines. I have never seen a counterexample, but if it happens, we can trivially rewrite the graph to only have one back-edge.

Predecessors

Usually abbreviated “preds”. If A branches directly to B, A is a predecessor of B. Back-edges are not counted in our implementation, they are recorded separately as a pred_back_edge.

Successors

Usually abbreviated “succs”. If A branches directly to B, B is a successor of A. Back-edges are not counted in our implementation, they are recorded separately as a succ_back_edge.
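
To make the later code snippets easier to follow, here is roughly the per-node state they assume. This is a heavily simplified sketch of CFGNode (the real one in dxil-spirv carries a lot more), but the field names match what the snippets use:

#include <cstdint>
#include <string>
#include <vector>

struct CFGNode
{
  std::string name;

  // Normal (acyclic) edges.
  std::vector<CFGNode *> pred;
  std::vector<CFGNode *> succ;

  // Back-edges are recorded separately and never appear in pred/succ.
  CFGNode *pred_back_edge = nullptr;
  CFGNode *succ_back_edge = nullptr;

  // DFS bookkeeping for the traversal pass described later.
  bool visited = false;
  bool traversing = false;

  // Filled in by the CFG recomputation passes described later.
  CFGNode *immediate_dominator = nullptr;
  uint32_t forward_post_visit_order = 0;
  std::vector<CFGNode *> dominance_frontier;
};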

Domination

Domination happens when we have two blocks A and B, and in order to reach B from the entry block, control flow must pass through A as well. A block dominates itself. Strict domination is when A != B.

Post-domination

Post-domination happens when we have two blocks A and B, and if control flow reaches block A, control flow must then eventually pass through B. In that case, B post-dominates A.

Immediate dominator

Among all blocks which strictly dominate a block A, the immediate dominator is the closest one. Often called idom.
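
Once immediate dominators are computed (code further down), a dominance query can be written as a simple walk up the idom chain. A minimal sketch, using the post-visit order (also explained further down) as an early-out:

bool CFGNode::dominates(const CFGNode *other) const
{
  // A dominator always has a higher post-visit order than the blocks it
  // dominates, so once we walk past our own order we can give up.
  while (other->forward_post_visit_order <= forward_post_visit_order)
  {
    if (other == this)
      return true;
    if (other->immediate_dominator == other)
      break; // Reached the entry block without finding "this".
    other = other->immediate_dominator;
  }
  return false;
}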

Immediate post-dominator

Same as idom, but post-dominance rather than dominance.

Domination frontier

Assume we have a block A which dominates a set of blocks B. If a block in B branches to a block C, and A does not dominate C, C is in the domination frontier set of A. This is somewhat abstract, but it’s an extremely useful analysis tool for later. E.g.:

For block 4, 9 is in the domination frontier. For block 7, 9.4.ladder is in the domination frontier.

Post-domination frontier

Same idea, but for post-dominance. E.g. for block 7, 4 would be in the post-domination frontier. For 9.4.ladder, 1 would be in the frontier, etc. The post-domination frontier is particularly useful as an analysis tool to figure out which control flow decisions contribute to a block being executed.

Common (post-)dominator

Given candidates B and C, finds the closest block A which (post-)dominates both B and C.

Loop analysis

Loops in unstructured control flow can be very awkward to reason about, but we have a way to specify loops precisely.

The loop header

A loop header is any block which has a back-edge branching to it. In reducible CFGs, which DXIL requires, the loop header must dominate the block the back-edge originates from.

The continue block

The back-edge block is the continue block. SPIR-V allows a “continuing construct” which means it can span multiple blocks, but we don’t use that.

Defining the loop body

When we have a loop header and a continue block, we can deduce the loop body as any block which is dominated by the header and can reach the continue block. Once we can no longer reach the continue block, we are breaking out of the loop. We call these blocks “loop exits”.

For example, 3 is a loop header, 6 is a continue block. The loop body is {3, 4, 5, 7, 8, 6}. 9 is the only loop exit.
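
Expressed in code, this boils down to a dominance check plus a reachability query against the continue block for every candidate block. A rough sketch (collect_loop_body is a made-up helper name; query_reachability is sketched further down, and forward_post_visit_order is the post-visit ordered list of all nodes from the next section):

std::vector<CFGNode *> CFGStructurizer::collect_loop_body(CFGNode *header)
{
  // The continue block is the block which takes the back-edge to header.
  CFGNode *continue_block = header->pred_back_edge;
  std::vector<CFGNode *> body;

  for (auto *node : forward_post_visit_order)
  {
    // In the loop: dominated by the header and still able to reach the
    // continue block. A block which can no longer reach the continue
    // block has broken out of the loop (a loop exit).
    if (header->dominates(node) && query_reachability(*node, *continue_block))
      body.push_back(node);
  }
  return body;
}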

Issues for later …

As we will see later, this definition of a loop is fundamentally flawed once we consider subgroup operations in non-uniform control flow. The single-threaded CPU execution model that LLVM caters to does not care about any of this, but GPU execution certainly does, since semantics change depending on how we define convergence.

Computing the CFG

In the implementation, we will recompute the CFG at many points, usually after performing transformations on the CFG. Here are some of the things we need to do:

Compute post-visit order

Essentially, this is a depth-first traversal where each node is assigned a visitation index once the traversal has finished with it. The rule is that if A can reach B, post_visit_index(A) > post_visit_index(B). We also detect back-edges this way. A node is only visited once.

If we don’t visit a node, it means it’s not reachable anymore (can happen after rewriting the CFG), and we remove the node, which means purging dead predecessors.
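
A minimal sketch of that traversal, using the visited/traversing flags from the CFGNode sketch earlier (the real pass does more bookkeeping, e.g. splitting detected back-edges out of the normal pred/succ lists afterwards):

void CFGStructurizer::visit(CFGNode &node)
{
  node.visited = true;
  node.traversing = true; // Currently on the DFS stack.

  for (auto *succ : node.succ)
  {
    if (succ->traversing)
    {
      // Branching to a block which is still on the DFS stack -> back-edge.
      succ->pred_back_edge = &node;
      node.succ_back_edge = succ;
    }
    else if (!succ->visited)
      visit(*succ);
  }

  node.traversing = false;

  // Assign the index only after all successors are done, so that if A can
  // reach B, post_visit_index(A) > post_visit_index(B).
  node.forward_post_visit_order = uint32_t(forward_post_visit_order.size());
  forward_post_visit_order.push_back(&node);
}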

We also build an array of the nodes in post-visit order, so we can later traverse the graph in a known, useful order. Iterating over the post-visit order backwards, for example, ensures that dominators are visited before the blocks they dominate. This is used when we …

Compute immediate dominators

The algorithm for this is pretty easy once we have post-visit orders in place.

void CFGNode::recompute_immediate_dominator()
{
  if (pred.empty())
  {
    // For entry block only.
    immediate_dominator = this;
  }
  else
  {
    immediate_dominator = nullptr;

    for (auto *edge : pred)
    {
      if (immediate_dominator)
        immediate_dominator =
          CFGNode::find_common_dominator(immediate_dominator, edge);
      else
        immediate_dominator = edge;
    }
  }
}

Finding a common dominator (simplified) is done by:

while (a != b)
{
  if (a->post_visit_order < b->post_visit_order)
    a = a->immediate_dominator;
  else
    b = b->immediate_dominator;
}

Compute reachability

Here we build a simple LUT where we can test if post_visit_order A can reach post_visit_order B.

void CFGStructurizer::visit_reachability(const CFGNode &node)
{
  uint32_t *dst_reachability =
    &reachability_bitset[node.forward_post_visit_order *
                         reachability_stride];

  for (auto *succ : node.succ)
  {
    // Inherit reachability from all successors.
    const uint32_t *src_reachability =
      &reachability_bitset[succ->forward_post_visit_order *
                           reachability_stride];
    for (unsigned i = 0; i < reachability_stride; i++)
      dst_reachability[i] |= src_reachability[i];
  }

  // We can reach ourselves.
  dst_reachability[node.forward_post_visit_order / 32] |=
    1u << (node.forward_post_visit_order & 31u);
}

void CFGStructurizer::build_reachability()
{
  reachability_stride = (forward_post_visit_order.size() + 31) / 32;
  reachability_bitset.clear();
  reachability_bitset.resize(reachability_stride *
                             forward_post_visit_order.size());
  for (auto *node : forward_post_visit_order)
    visit_reachability(*node);
}

It’s nice how the structures we build feed into each other. The composition here is simply that if A can reach B, and B can reach C, then A can reach C.
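
Querying the LUT afterwards is just a bit test. A sketch of the helper (the name query_reachability is illustrative):

bool CFGStructurizer::query_reachability(const CFGNode &from, const CFGNode &to) const
{
  // Can "from" reach "to"? Test the corresponding bit in from's row.
  uint32_t index = to.forward_post_visit_order;
  return (reachability_bitset[from.forward_post_visit_order * reachability_stride +
                              index / 32] & (1u << (index & 31u))) != 0;
}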

Compute dominance frontiers

Similarly, the dominance frontier is built bottom-up.

void CFGStructurizer::recompute_dominance_frontier(CFGNode *node)
{
  // If we don't dominate our successor, it's a frontier.
  for (auto *succ : node->succ)
    if (succ->immediate_dominator != node)
      node->mark_dominance_frontier(succ);

  // Inherit frontiers.
  if (auto *idom = node->immediate_dominator)
    for (auto *frontier_node : node->dominance_frontier)
      if (!idom->dominates(frontier_node))
        idom->mark_dominance_frontier(frontier_node);
}

The recursive definition here is that if A immediately dominates B, then any dominance frontier node of B which A does not itself dominate is also in the dominance frontier of A.

Congratulations! You can now say dominance frontier casually with a straight face :V

The backwards traversal

To compute post-dominance and post-dominance frontiers, we have to do the same thing, just backwards: flip successors into predecessors and vice versa. But we have some problems. The normal traversal has one unique entry node, but there can be multiple leaf nodes (think returns from a branch)! This means there are many entry blocks when we traverse backwards, and this breaks the algorithm.

Another problem is loops. Consider this example, which demonstrates both cases:

Here, 6 is not reachable through backwards traversal from 1, and 7 is a return block. What we need to fix the backwards traversal is a unique start node. For this purpose, we will invent an EXIT node. Any return is considered to branch to that EXIT node. Back-edges are special: for backwards traversal, we instead invent a “fake” branch from the back-edge block to the common post-dominator of all loop exits. (That is a mouthful.)

This is the first of many silly ideas I came up with … After this point, assume I have no idea what I’m talking about. 😐

Dotted lines are fake branches we can use for backwards traversal.
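
A hedged sketch of the EXIT node part of this preprocessing (create_node and add_fake_succ are made-up helpers; the fake branch for back-edges is left out since it needs the post-dominance information we are in the middle of building, which makes the ordering more delicate in the real implementation):

void CFGStructurizer::prepare_backwards_traversal()
{
  // Invent a single EXIT node so the backwards traversal has a unique
  // start node, just like .entry is for the forward traversal.
  CFGNode *exit_node = create_node("EXIT");

  for (auto *node : forward_post_visit_order)
  {
    // Any block which terminates the shader (no successors and no
    // back-edge) is considered to branch to EXIT.
    if (node->succ.empty() && !node->succ_back_edge)
      add_fake_succ(node, exit_node);
  }

  // Back-edge blocks instead get a fake branch to the common
  // post-dominator of all loop exits, as shown with dotted lines above.
}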

What does it mean to be structured?

A structured CFG basically means the control flow closely maps to higher level languages without gotos. Rather than spaghetti, the control flow is like an onion. It’s somewhat difficult to precisely define it (study the SPIR-V specification), but the basic gist is something like:

Every block exists in a scope

A block can be enclosed by many scopes. A scope could be a selection construct or a loop construct.

Selection construct

When a node has more than one successor, we begin a selection construct, where that node is the header. A selection construct has a merge node where execution reconverges. A header must dominate its merge node. The selection construct is whatever the header dominates, minus whatever the merge node dominates. Intuitively:

// HEADER
if (blah)
{
  // IN CONSTRUCT
  if_true();
}
else
{
  // IN CONSTRUCT
  if_false();
}
// MERGE, not part of construct
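
In SPIR-V assembly, the header declares its merge block explicitly right before the branch, e.g. (block names here are made up):

%header = OpLabel
          OpSelectionMerge %merge None
          OpBranchConditional %cond %if_true %if_false

Execution reconverges at %merge, which is not part of the construct.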

No crossing streams

A block from outside a construct must not branch into the construct, except through the header block.

No shared merge block

A block cannot be the merge block for multiple headers.

Loop construct

A loop header begins the construct. The loop construct contains all blocks which are dominated by the loop header, minus whatever is dominated by the merge block. See example earlier.

Notice how this is (fatally) different from unstructured loops. Unstructured loops leave the construct when they can no longer reach the continue block. Structured loops don’t exit until they reach the merge block.
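
In SPIR-V assembly, the loop header declares both its merge block and its continue target up front (again, block names are made up):

%loop_header = OpLabel
               OpLoopMerge %merge %continue_block None
               OpBranch %body

Everything %loop_header dominates, minus whatever %merge dominates, is inside the loop construct, regardless of whether it can still reach %continue_block.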

Leaving a construct

A loop defines a “break” target as well as a “continue” target. The only valid ways to leave a selection construct are through its own merge block, the innermost break target, or the innermost continue target.

Unlike some languages, SPIR-V does not allow “multi-break”. A loop construct functions as a barrier to how far we can break. We can only break out of one loop construct at a time (except with return, but DXIL code is all inlined). Solving this is … an ordeal 🙂

Switch constructs

Switch blocks are selection constructs, but with the special property that the merge target of the switch also serves as a break target. Oddly enough, you can break directly out of a switch and a loop at the same time, which is not allowed in normal high-level languages, but oh well.

Edge cases for selection constructs

It is possible that a conditional branch has one path being a “break” or “continue”, while the other path continues as normal. This special case does not require its own selection construct.

Another situation is when both paths of a selection construct end up breaking. Execution will never merge in this case, so it is possible to set up the merge block as an unreachable block with OpUnreachable instead.
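
For the case where both paths break, the emitted SPIR-V can look roughly like this sketch, assuming the selection sits inside a loop whose merge block is %loop_merge (names made up):

%header = OpLabel
          OpSelectionMerge %dummy_merge None
          OpBranchConditional %cond %break_a %break_b

%break_a = OpLabel
          OpBranch %loop_merge
%break_b = OpLabel
          OpBranch %loop_merge

%dummy_merge = OpLabel
          OpUnreachable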

Why do merge blocks matter?

In a world with SIMT-style execution on GPUs, two sets of invocations may branch off, but at some point they might want to join back again. Defining this can be done naturally through merge blocks. Without a notion of merge blocks, there is no way to meaningfully distinguish the two cases shown in the figures from each other. In a scalar world such as CPUs they have the same behavior, but certainly not in a SIMT world. If 4 does anything using subgroups or just normal texturing with implicit LOD, well … everything breaks.

From my understanding, OpenCL has a notion of divergence and convergence, but any well defined convergence must come through a merge block. In completely unstructured control flow, convergence is not well defined. From SPIR-V:

Uniform Control Flow: Uniform control flow (or converged control flow) occurs if all invocations (in the invocation group, unless otherwise stated) execute the same dynamic instance of an instruction. Uniform control flow is the initial state at the entry point, and lasts until a conditional branch takes different control paths for different invocations (non-uniform or divergent control flow). Such divergence can reconverge, with all the invocations once again executing the same control-flow path, and this re-establishes the existence of uniform control flow. If control flow is uniform upon entry into a structured loop or selection, and all invocations leave that loop or selection via the header block’s declared merge block, then control flow reconverges to be uniform at that merge block.

So here, we have explicit wording that we must use merge blocks to get guaranteed re-convergence. I couldn’t find any strong wording about this in CUDA, but the closest from skimming the docs was this:

For divergent control flow, the optimizing code generator automatically determines points of re-convergence

This is interesting, because this is exactly what we are going to have to do in dxil-spirv as well, but it is not exactly a strong guarantee. In practice, it should work fine, but I think this presents a good reason why structured control flow is a thing. I just wish IRs did not take the lazy route of just derping everything into unstructured control flow and expecting implementations to guess the semantics. Sigh …

The classic subgroup breakage – maximal convergence gone awry

It’s time to go through the classic example of why we can reach catastrophic derpiness with unstructured control flow.

StructuredBuffer<uint> RO : register(t0);
RWStructuredBuffer<uint> RW : register(u0);

[numthreads(16, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
    uint v = RO[thr];
    uint result;
    while (true)
    {
        uint first = WaveReadLaneFirst(v);
        if (v == first)
        {
            result = WaveActiveSum(v);
            break;
        }
    }

    RW[thr] = result;
}

It’s crystal clear what the programmer’s intention is here. We partition a wave into sets, which then participate in a wave-op in isolation before exiting the loop. The loop is iterated N times. This is a very common waterfall pattern, but any experienced shader programmer will avoid this exact implementation of said pattern because it breaks over half the compilers out there … The real problem here is that according to unstructured control flow, the WaveActiveSum() is not inside the loop, because it cannot reach the continue block. Therefore we end up with this completely broken result:

define void @main() {
...
  br label %6

; The loop, it does not do anything.
  %7 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %5)
  %8 = icmp eq i32 %5, %7
  br i1 %8, label %9, label %6

; Outside the loop.
  %10 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %5, i8 0, i8 1)
  call void @dx.op.bufferStore.i32(...)
  ret void
}

The wave operation now happens outside the loop, with all threads active, no matter what. It is impossible to deduce the programmer’s intention at this point. Structured control flow does not have these ambiguities, because the loop does not end until we reach the merge block.

The high level code would look something like:

// A degenerate loop.
for (;;)
{
  if (RO[thr] == subgroupBroadcastFirst(RO[thr]))
  {
    break;
  }
  else
  {
  }
}
RW[thr] = subgroupAdd(RO[thr]);

Indeed, the ACO ISA is devoid of branches here as the loop does nothing.

Funnily enough, while writing this, I found that the latest DXC versions actually go to great lengths to try (and fail!) to make this work somehow, through an absolutely disgusting trick:

@dx.break.cond = internal constant [1 x i32] zeroinitializer

define void @main() {
  %1 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @dx.break.cond, i32 0, i32 0)
  %2 = icmp eq i32 %1, 0
  %3 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 1, i32 0, i32 0, i1 false)
  %4 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
  %5 = call i32 @dx.op.threadId.i32(i32 93, i32 0)
  %6 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %4, i32 %5, i32 0)
  %7 = extractvalue %dx.types.ResRet.i32 %6, 0
  br label %8

; <label>:8 ; preds = %13, %0
  %9 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %7)
  %10 = icmp eq i32 %7, %9
  br i1 %10, label %11, label %13

; <label>:11 ; preds = %8
  %12 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %7, i8 0, i8 1)
  br i1 %2, label %14, label %13

; <label>:13 ; preds = %11, %8
  br label %8

; <label>:14 ; preds = %11
  call void @dx.op.bufferStore.i32(i32 69, %dx.types.Handle %3, i32 %5, i32 0, i32 %12, i32 undef, i32 undef, i32 undef, i8 1)
  ret void
}

What on earth is going on here? Somehow, DXC recognizes the horrible mistake of unstructured control flow and tries to hack around it by forcing the breaking path to have a reachable path to the continue block. It does this by conditionally branching to either the continue block or the break block based on a value which is always 0. It seems to do this in the most ridiculous way possible: by reading from a 1-element Private array which is zeroinitialized, presumably in an attempt to avoid optimizations which would remove the branch. The high-level equivalent now avoids maximal convergence:

const uint _13[1] = uint[](0u);

bool _21 = _13[0u] == 0u; // Always evaluates to true.
for (;;)
{
  _52 = _46.value == subgroupBroadcastFirst(_46.value);
  if (_52)
  {
    _53 = subgroupAdd(_46.value);
    if (_21)
      break;
    else
      continue;
  }
}

However, we now have a situation where convergence depends entirely on how smart the backend compiler is. A compiler can detect that _21 is constant and eliminate the continue branch as dead code. When that happens, however, we regress back to maximal convergence and everything breaks.

This particular case works as expected on AMD/Intel D3D12 drivers, but apparently NVIDIA’s D3D12 driver optimizes this back to maximal convergence and even this nasty hack fails. Fun times … I’m not even sure I can call that a bug. It’s not like any of this is well defined in DXIL 🙂

This isn’t a fantasy example; Microsoft’s own examples trip this scenario as well:

// Is equivalent to writing a loop like this:
while (true) {
  if (WaveReadLaneFirst(expr) == expr) {
    sum = WavePrefixSum(value);
    break;
  }
}

Sigh … Loops are a real PITA.

Conclusion

This concludes CFG 101. Unfortunately, we still have not reached the actual juicy algorithms, but it took way too many words just to go through the bare minimum of CFG theory.