My personal hell of translating DXIL to SPIR-V – finale

It has been quite a while since the last post in this series, for good reason. Finding the energy to write this down simply does not happen overnight. It is not a coincidence that these kinds of blog posts tend to appear at the end of a long summer vacation!

As teased in the previous post, it is time to finally tackle the real meat of structured control flow. Reading that introductory post is mandatory to have any hope understanding what is going on here, and to be fair, even with that knowledge, I’m not confident I can convey all the bizarre algorithms I made up to make this somewhat work.

Most of what I’m trying to explain here are ideas that have grown organically over time with little to no theoretical backing, and with hindsight there should be better approaches, but taking advantage of that likely needs a complete rewrite. Take this post with a grain (or three) of salt if you intend to write your own structurizer.

Ideally, I would have used an off-the-shelf implementation, but I never found one that satisfies:

  • Standalone
  • Convinces me that it will generate fast code, i.e. does not confuse the hell out of existing compilers with ridiculous branching patterns
  • Does not collapse into pure gibberish on mildly weird code (and DXC sure loves to emit weird code)
  • Understands the SPIR-V flavor of structured control flow
  • Has a concept of convergence
  • An algorithm I can actually understand and debug

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
— Brian W. Kernighan

My little DXIL hell blog series – previous entries

What the algorithm needs to do

The two major requirements of the structurizer is the need to rewrite control flow, and marking required structured information in SPIR-V, such as:

    • Selection merge: For any block that does conditional branch or switch
    • Loop merge + continue block: For any block that has a back-edge

If we can insert these constructs and make the SPIR-V validator pass, we’ve done our job and the code tends to “just werk”. In terms of quality of implementation, it is imperative that the control flow is as natural as possible. The output control flow (passed through SPIRV-Cross) is the only help we have in vkd3d-proton when debugging shaders, so anything that can help make the CFG more readable is extremely important. In more concrete terms, emitting the simplest possible structured construct is the goal.

Unfortunately, it is impossible to simply take the DXIL control flow, analyze it and insert structured information as necessary. Instead, we have to (heavily) modify the control flow graph first, so that it is possible to insert structured information while satisfying SPIR-V requirements later. Modifying a control flow graph is a massive topic, but before falling in that rabbit hole, I’ll try to outline some of the concepts at a higher level to help ramp up the brain clock frequency. In the sections after this, I’ll aim to provide concrete examples with graphviz illustrations.

Dummy basic blocks aiding merge constructs

I refer to these blocks as ladders. A requirement of SPIR-V is that a basic block can only be the merge target of one header. If multiple scopes want to merge to a specific block, we need ladders which can merge them one at a time.

There are exceptions to this rule. A loop merge can serve as a merge block for any selection construct contained within in the sense that selection constructs can just OpBranch to a loop merge as it pleases. It might be tempting to cheese the SPIR-V requirements by turning every complex selection construct into a block-like loop, but a significant hurdle is that there is no multi-break support. This is consistent with all higher level shading languages, since they don’t support goto. Even if we could, it would lead to unreadable garbage code which we need to avoid at all cost. Some structurization algorithms rely on this feature (WebAssembly), but we’ll need to tackle this on hard mode.

Remove fake merge blocks

Any block which has two or more predecessors is potentially hazardous and carries some semantic meaning. The implication is that we need to merge control flow. Of course, DXIL is completely unstructured and it is not well defined when control flow actually converges, but maximal convergence is a good initial assumption, where every block that has two or more predecessors must be a merge block. Unfortunately, this assumption will break down in certain edge cases. When we run into “impossible” scenarios that make no sense from a high level language point of view, we can take the foot off the pedal and reconsider if we really need to merge.

DXC tends to “merge” control flow in bizarre ways that does not necessarily reflect the intent of the original code. As discussed in the previous post, I covered the issue of subgroup operations in loops, but the issue can manifest in scenarios unrelated to subgroups as well.

Fortunately, the scenarios that I’ve seen are limited to basic blocks that exist only to emit a PHI opcode or two, and then actually perform the break. The solution is to simply split the block. It is generally safe to do this as long as the block is not control dependent, i.e. it has no derivative, barrier or subgroup operations.

Promote to loop blocks

A fairly common pattern is:

do
{
  blah();
  blah();
  if (foo) break;
  blah();
  blah();
} while(false);

There is no loop here, but in some situations we are kind of forced to insert a “block” construct. A block construct is a loop with a continue block that is not reachable in the CFG. This code is inevitable with inlined functions that return early.

A cute trick spirv-opt uses is an OpSwitch with one default case. It is also possible to break out of the switch this way. dxil-spirv did not go this route since we have to handle multi-break anyways, and a switch does not help with that.

Rewriting multiple breaks

This gets extremely hairy, and dealing with it is just another Tuesday.

This is also a problem spirv-opt deals with. When inlining, it is possible to return from two inner loops like such:

void func()
{
  while (cond1)
  {
    a();
    while (cond2)
    {
      b();
      if (foo) return; // y u do dis ;____;
      c();
    }
    d();
  }
}

I have seriously contemplated “un-inlining” code to support shenanigans like this, but that is a level of insanity I am not willing to engage with.

What is a break? A miserable pile of heuristics

In selection constructs, we’ll end up in a situation where there is no clear merge block, i.e. the header does not dominate the immediate post-dominator, yet we have to emit a merge block for any selection. If our block branches to A and B, we might have to select one of three situations:

  • The branch to A is a “breaking construct”, B is not. Merge to B. Path in A will leave the selection construct through an outer switch or loop break.
  • Vice-versa above.
  • Both paths can be considered to be breaking constructs. Merge to an OpUnreachable dummy node.

This is harder than it sounds. Tie-breaking between A and B paths can be quite the ordeal.

Transposing control flow

There are some extremely rare scenarios where we need to transpose code. For example, code that lives inside a loop must be moved outside the loop for structurization to be possible. These shaders cause serious brain damage, be warned!

Note that transposing control flow is a term I pulled out of the aether. Such is life when you don’t really know what you’re doing.

If we hit this scenario, we must give up any hope of debugging the code, it will look incomprehensible at this point.

Switch blocks with free-flowing fall-through

Switch blocks in unstructured control flow are truly cursed. They can fall-through case labels, but we also find scenarios where multiple blocks fall-through to the same case label. Spicy!

Rewrite PHI nodes

When rewriting control flow, we will break any PHI nodes. One solution would have been to de-SSA the shader, but I went with the more principled approach and retain SSA throughout. I’m not sure in hindsight that it was the correct solution due to its complexity, but de-SSA is not trivial either, especially with switch blocks that fall-through. SPIRV-Cross has some truly cursed code to deal with that, and it’s horrible enough that I did not want to replicate that here.

Infinite loop shenanigans

Infinite loops can be quite tricky to analyze since it is somewhat vague what it means to exit such a loop. Typically, it would be a return inside the loop. There are various horrible scenarios like these.

The transform passes

The structurizer is organized as a series of transform passes more or less. Once we have computed the CFG, we can perform a pass, at which point we need to recompute the CFG before the next pass. When we rewrite the CFG in any way, we technically have to recompute the CFG, but this is quite troublesome, since this kind of code would break:

for (auto &node : post_visit_order)
{
  if (something(node))
  {
    modify_cfg();
    recompute_cfg(); // Would affect post_visit_order
  }
}

Recomputing the CFG every time a branch changes would just be too slow, so the implementation tries to avoid recomputing the CFG until we have traversed through the entire CFG. To make this work, we have to recompute what we need in-place while modifying the CFG. For example, we have to patch in which blocks are immediate dominators and post-dominators. Reachability information can be inherited from other blocks by copying the post-visit order index. This can indeed lead to many subtle bugs, but I don’t know what a better solution would look like. It seems to work fine for simple scenarios where we don’t significantly rewrite the control flow and dominance and post-dominance relationships aren’t affected globally.

Some algorithms have extremely awkward transforms that must be implemented as a loop that terminates ones there are no special cases left to handle, but these scenarios are also extremely rare, e.g.:

do
{
  did_work = false;
  for (auto &node : post_visit_order)
  {
    if (need_complex_weird_thing(node))
    {
      rewrite_branches_in_a_horrible_way();
      recompute_cfg();
      did_work = true;
      break;
    }
  }
} while (did_work);

As you’ll quickly figure out reading this, this is really just a series of jank heuristics layered on top of each other. There is no strong unifying algorithm to speak of here, but the advantage is that we slowly and surely chip away at the insanity LLVM emits into something that ends up looking acceptable. If we throw generic algorithms at it, I think the result will end up looking nothing like the input whatsoever and collapse into noise.

Stage 1 – Cleanup breaking PHI constructs

void CFGStructurizer::cleanup_breaking_phi_constructs()

As the very first thing we do, we aim to remove merge blocks that should not be considered to be a merge as these blocks will create problems for us later. Let’s consider this shader:

Block 3 here is problematic. Both 2 and 4 break, but it is illogical for there to be a block 3 here that merges execution. There is no way to express this merge in high level code without rewriting everything, so what we want to do instead is to split up the node.

The heuristic here is fairly simple. We’re looking for a block that has no operations associated with them, only PHI nodes. It must have exactly one successor. We also avoid removing blocks that are good loop merge candidates. If a back edge branches to the candidate block, we should not remove it, since it is a clear loop merge candidate. With these checks in mind, we can rewrite the PHI in 3 where the inputs from 2 and 4 are added directly to 5. The merge is now gone and we don’t have to consider it for later.

Stage 2 – Create continue block ladders

void CFGStructurizer::create_continue_block_ladders()

It is often the case that continue block (i.e. back edges) have multiple predecessors. Making the continue block a merge target does not seem to be valid, so we need to create a dummy predecessor instead.

Using this shader as an example:

6 is the back-edge and the continue block (they both mean the same thing really). The 4 to 6 branch is a typical continue statement. We’d also like to merge execution in 6 since it has two preds, but to do this we’ll need a helper pred block of 6, as otherwise spirv-val complains. I’m not 100% sure if it’s illegal to merge to a continue construct, but it is very awkward for sure. Two quotes from SPIR-V 2.11.2 jump out that might explain this:

  • a loop construct: the blocks structurally dominated by a loop header, excluding both the loop header’s continue construct and the blocks structurally dominated by the loop header’s merge block
  • if a structured control-flow construct S contains the header block for a selection, loop or switch construct different from S, then S must also contain that construct’s merge block

Since the continue block is excluded from the loop construct, the inner selection construct cannot merge to it, since all merge targets must be contained within the enclosing loop construct.

In any case, the final shader ends up looking like:

As we can see here, 6.pred takes on the role of the merge block before we continue. 3.succ takes on the role of a loop block where continue is translated to inner break.

(Circles are loop headers, triangles are selection headers, dashed lines are merge blocks and dotted lines are continue blocks.)

There are also scenarios where we don’t do this. We abandon the idea of having merged execution in the continue construct, since we really need to use continue as a way to break out of weird control flow.

bool CFGStructurizer::continue_block_can_merge(CFGNode *node) const

The heuristics here are … questionable at best. The comments say it all:

// This algorithm is very arbitrary and should be seen as a nasty heuristic which solves real shaders
// we see in the wild. It's probably safe to block continue merge in far more cases than this, but we
// want to be maximally convergent as often as we can.

Here we see a real-world shader, where 17 is a continue block we probably shouldn’t consider to be a merge block as well. 24 is deduced to be a far more reasonable candidate for merging and we’d have to emit extremely bizarre code to make the merge-to-continue style work. Note that 24 can break out of the loop (27) as well as continue to 17. If we enclose 17 in a loop block, we would not be allowed to break anymore without some horrible rewrites due to the double break, and you probably see why the heuristic is in place. Generally, it’s very hard to say if a continue block should be maximally convergent or not (it’s not in SPIR-V as-is). We’re okay with hand-wavy heuristics here I think.

Final output:

Stage 3 – De-interleave unrolled loop breaks

This is a recent addition and I’ll defer explaining it for later due to its high complexity. I don’t want to cause brain damage just yet 🙂

Stage 4 – Recover nested selection constructs

void CFGStructurizer::split_merge_scopes()

This is where things start to get kinda difficult, but I’ll stick to simple examples. The gist of this pass is to study all blocks which are merge block candidates, and introduce intermediate merge blocks for nested selection constructs. Consider a basic nested if-else pattern:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  [branch]
  if (thr & 4)
  {
    [branch]
    if (thr & 6)
      RW[0] = 1;
    else
      RW[0] = 2;
  }
  else
  {
    [branch]
    if (thr & 12)
      RW[0] = 3;
    else
      RW[0] = 4;
  }
  RW[1] = 40;
}

void CFGStructurizer::rewrite_selection_breaks(
    CFGNode *header, CFGNode *ladder_to)

We could just give up and treat .entry as a loop header, where all branches to 7 are treated as break, but that won’t do. This is where we study any potential merge block. In this case, we find block 7. We identify .entry as its immediate dominator. From here we carve out a subset of the CFG that is dominated by .entry and can reach 7.

template <typename Op>
void CFGNode::traverse_dominated_blocks(const Op &op) const

This utility is used quite a bit, and we’re interested in finding selection headers that have a post dominance frontier of .entry. There are some other checks to avoid wild edge cases, but see code for details.

if (node->succ.size() >= 2 && !branch_is_loop_or_switch)
{
   auto *outer_header =
     get_post_dominance_frontier_with_cfg_subset_that_reaches(
       node, ladder_to, nullptr);

   if (outer_header == header)
      construct.insert(node);
}

We’re only interested in the next scope level here. The candidates we find are 1 and 2. For 1, we create a ladder block, and rewrite all branches to 7 which are dominated by 1, similar for 2.

// Stop rewriting once we hit a merge block.
traverse_dominated_blocks_and_rewrite_branch(inner_block,
    ladder_to, ladder, [inner_block](CFGNode *node) -> bool {
       return inner_block->selection_merge_block != node;
    });

This looks like:

There are some edge cases here. For example, if our input CFG was this one, we would end up creating redundant ladder blocks. To avoid this, we stop the traversal early if the selections in 1 and 2 end up merging their scopes. Ahead of time, we have done some basic analysis to mark selection merge blocks candidates which is used for this purpose.

// Do we keep going?
return inner_block->selection_merge_block != node;

This analysis then recurses into the inner selection constructs to resolve multiple nested scopes.

In hindsight, this approach of recursive traversal is not efficient at all, but there are also a lot of particular edge cases to consider where we skip rewriting branches. The details are a bit too esoteric for this kind of post, so again I’ll refer to the code instead.

Another common case is if-else ladders:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
   uint val;
   [branch]
   if (thr & 4)
      val = 1;
   else if (thr & 3)
      val = 2;
   else if (thr & 8)
      val = 5;
   else if (thr & 16)
      val = 21;
   else if (thr & 32)
      val = 20;
   else
      val = 50;
   RW[0] = val;
   RW[1] = 40;
}

Both of these scenarios are perfectly recovered, which is nice. When possible, we always want to avoid adding loop blocks, for our own sanity.

Stage 5 – Remove degenerate blocks

void CFGStructurizer::eliminate_degenerate_blocks()

This stage isn’t very interesting, but when adding ladders in stage 4, it’s very easy to end up with ladders that introduce merges that we did not intend to exist. Very similar ideas to stage 1. We might also introduce blocks which don’t do anything useful, i.e. they might have one predecessor and one successor with no opcodes and no PHI nodes. For sanity’s sake, we nuke those blocks as well.

if (noop && single_pred && single_succ)
   easy_choice();
else if (noop && merge_candidate_is_on_breaking_path(node))
   danger_zone();

A random example from a real world shader is 10.16.ladder.19.ladder.21.ladder which was introduced by stage 4. This node is on a breaking path, and we really should not merge execution here. There’s also a bunch of useless single pred/succ blocks we can nuke as well.

We have to be careful though. Not all no-op blocks are redundant, and we can only remove degenerate merge blocks if they are on a breaking path. I’ll cover it in more detail in the intermission.

Stage 6 – Duplicate impossible merge constructs

void CFGStructurizer::duplicate_impossible_merge_constructs()

Breaking constructs are a blight on my soul, and we’re still not done dealing with this. We’re now entering the kind of edge case where CFG attempts to merge execution in breaks, but this block contains actual code … what to do!?

Here’s another random example I found:

228 is a loop header, and 319 attempts to break out with a double break (._.). As discussed already, merging execution on a breaking path is questionable at best, and it is impossible to express this without completely rewriting the CFG. This time, there is actual code in 319, so we have to be extremely careful.

To solve this, we bring out the big guns and just duplicate the block. Fortunately, I have not run into a scenario where this kind of block is control dependent …

If the words here did not cause concern, the code comments probably do:

// WARNING: This check is EXTREMELY sensitive and microscopic changes to the implementation
// will dramatically affect codegen.
bool breaking = merge_candidate_is_on_breaking_path(node);

if (breaking && !node->ir.operations.empty() &&
    !block_is_control_dependent(node))
   duplicate_queue.push_back(node);

Intermission – what is a breaking path anyways?

So far, we have done a ton of work to consider breaking paths, trying to classify code into a “non-breaking” path where control flow can merge with maximal convergence, and breaking paths where we can avoid merging until we hit the outer merge block. As hinted to earlier, it can get very tricky to identify what a breaking path even is. For loops, this is somewhat easy, but for selection constructs, it gets pretty complicated.

bool CFGStructurizer::control_flow_is_escaping(
    const CFGNode *node, const CFGNode *merge) const

This function is load bearing for a lot of the code in the structurizer, so I should probably try to explain the rationale behind it. The merge node in this context is a post-dominator of node and the construct we’re analyzing. Node does not dominate merge.

Loops

If we recall what an unstructured loop is, the loop body is where we are dominated by a loop header, and we can still reach the back edge. The moment a branch enters a path of no return, we can consider it a break. The rules I came up with are:

  • Node has a loop header which dominates it (duh)
  • Node cannot reach the back-edge (otherwise it would be in the loop)
  • Back-edge cannot reach node (we have not merged the loop yet)
  • All post-domination frontiers of node can reach back-edge (a breaking construct can have control flow inside of it, and we only care about the break itself)

Using these rules is fairly straight forward, but it gets trickier!

Selections

In a selection, there is no guiding star like the loop header and back edge to analyze. Rather, we need heuristics to deduce if a branch looks and smells like a break. This is the kind of cruft that accumulates over time as games ship more and more difficult shaders …

Load bearing blocks – are we important?

If we’re a break block with two or more predecessors, we can consider that CFG reachability does not change in meaningful ways if we remove the block. In particular, we look at the immediate dominator and the merge block. We then perform a query: given reachability chain A -> B -> C, does removing B change reachability for A -> C? If we’re load bearing like this, it’s not a break.

bool CFGStructurizer::exists_path_in_cfg_without_intermediate_node(
   const CFGNode *start_block,
   const CFGNode *end_block,
   const CFGNode *stop_block) const

Analyze domination frontiers

Next heuristic is to see if node can reach the merge target without encountering any unrelated domination frontiers along the way. If there are domination frontiers other than merge, it probably means we’re still on a structured path and we end the heuristic.

If we get to this point, we need further checks to be sure we make the right decision. Heuristics are so much fun!

Did we just branch to a continue block?

If we didn’t emit a special block in stage 2, branching to the continue block is definitely a breaking construct. Accept immediately.

Post-dominance shenanigans

Here we adopt the same idea as loop analysis. If a post-dominance frontier can reach a domination frontier before reaching merge block, that is a very strong indication we’re on a breaking path. For example, if we consider this CFG:

Node 6 here attempts to break out to 9.pred, but 8 does not. This is obvious to human inspection, but why is that the case here? The key lies in post-domination frontiers. For 6, we have 4 as a PDF. This node has a domination frontier 8, so 4 is in a structured flow, thus we have shown that we went from a normal path (4) to a breaking path (6). For 8, 4 is in the post-domination frontier as well, but 4 does not dominate 8, so we don’t consider it. To understand what 8 is, we need to analyze further …

Do we post-dominate any useful work?

This is an esoteric check. This is mostly to catch more scenarios in stage 5, since accidental ladder merge blocks can get really weird. Essentially, if we don’t post-dominate anything useful, it’s probably a break construct.

Tie-break patterns

If we still don’t know what is going on (node 8), we attempt a final check. We back-trace from 9.pred and try to see if we can back-trace to a node that is not our candidate but can end up reaching 8. In this case, we will back-trace through 6, and find 4. 4 can reach 8. In some sense, we’re just trying to query if our neighbor predecessors of 9.pred are break blocks instead.

Stage 3 – De-interleave unrolled loop breaks

Time for more gnarly code that will cause mild brain damage (sorry!). Dealing with breaks isn’t bad enough, what if I told you it was possible to have 2 or more different break merge scopes? This is a situation where nothing will help except a complete rewrite of the CFG. Here’s an example:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint id : SV_DispatchThreadID)
{
   uint v;
   uint w = 1;
   uint dummy = 0;

   [unroll]
   for (int i = 0; i < 4; i++)
   {
      InterlockedAdd(RW[0], w, v); w = v;

      [branch]
      if (w & 13)
      {
         InterlockedAdd(RW[0], w, v); w = v;
         dummy = 1;
         break;
      }

      [branch]
      if (w & 1)
      {
         [branch]
         if (w & 4)
            InterlockedOr(RW[0], w, v); w = v;
         dummy = 2;
         break;
      }

      [branch]
      if (w & 2)
      {
         InterlockedOr(RW[0], w, v); w = v;
         dummy = 3;
         break;
      }
   }

   InterlockedAdd(RW[0], w, v); w = v;
   InterlockedAdd(RW[0], dummy, v);
}

Spaghetti, yum! Fun fact, if I were to unroll this to 32 or more loop iterations, graphviz will take minutes to complete and without a proper fix dxil-spirv will explode, creating endless ladder blocks. It is truly the Stairway to Heaven of CFGs! I ran into this with a game that used 128 iterations instead, and the shader never completed. Of course, this was an SSR shader, and I cannot fathom why anyone would want to unroll a massive loop with several breaks 128 times, it’s pure and utter chaos, but here we are …

5 is the merge block of our “loop”. 2, 4 and 7 are all break blocks, and the predecessors interleave for every iteration. This is simply impossible to figure out with structured control flow as-is, so it’s time to take out the hammer, we simply have to rewrite this CFG into something extremely ugly instead.

bool CFGStructurizer::serialize_interleaved_merge_scopes()

We start similarly as stage 4 where we find selection merge candidates, identify the immediate dominator, and analyze the inner scopes. What we are looking for is two or more merge candidates that cannot reach each other with a post-domination frontier that partially overlaps. For example, 6 and 11 are PDFs of 2 and 10 is a PDF of 7. This is a bizarre case where we’re breaking in different directions all the time. In this case, we simply have to transpose the entire shader into this.

5.pred is a dispatch switch block. It creates a PHI variable where the predecessors signal which break block to dispatch to.

Technically, we could consider doing something similar in stage 6, where any merging break constructs are transposed outside the loop, but this transposition obliterates any hope of emitting readable code, so we reserve this only as a last resort.

Stage 7 – Rewrite transposed loops

More brain damage incoming, but just like stage 3, I’ll defer this until the bonus section since it’s ridiculous nonsense 🙂

Structurization passes

void CFGStructurizer::structurize(unsigned pass)

In all the passes above, we mostly focused on rewriting the CFG, tweaking it, etc, but now we’re mostly focused on emitting structure information. This is done twice. In pass 0, we focus on analysis, rewrite the CFG to resolve any issues, and in pass 1, we only mark structurization information. If we are forced to modify the CFG at that point, we have a bug in the implementation.

The natural merge and actual merge

Before digging into conditional branches, switches and loops, we have to elaborate on the system used in dxil-spirv to deal with multi-break.

Switches and loops are easy to identify, but in terms of structurization there is a difficult requirement that any merge block we decide must be dominated by the header. For loops, the merge block must also post-dominate the header, discounting any block that terminate the CFG like returns and DXR intersection termination. Switch blocks are a little more relaxed since it’s legal in SPIR-V to break from inside a switch block to the outer loop merge block.

For a more concrete example of what I’m blabbing on about:

3 is our loop header. 4 is clearly a breaking construct that breaks to the exit block, 5 is the back-edge and 6 is the “natural” loop merge if all breaks were removed. As expected, we cannot just say that 6 is the merge block of 3 and call it a day, since the branch from 4 to 1 is simply illegal in structured control flow.

For every loop, we essentially start considering two kinds of merges.

  • The natural merge (CFGNode::loop_ladder_block), where the loop should merge.
  • The actual merge (CFGNode::loop_merge_block), equal to the common post dominator of all loop exits.

If these two don’t match, we’ve got trouble. The idea is that in pass 0, we find these issues, fix them up, and if we do this analysis again in pass 1, these nodes must match.

Switch blocks have similar considerations where we find the merge point through common post domination analysis, and a natural merge, where it makes sense to merge.

auto *merge = find_common_post_dominator(node->succ);
auto *natural_merge = find_natural_switch_merge_block(node, merge);

For example, here is a monster switch where the natural merge seems to be 86, but 24 is the post-dominator. This is a funny case with multi-way case fall-through, which is not legal in structured control flow.

I’m sorry about these examples being a bit random and unfocused, but I find these by adding breakpoints where needed and run the entire game repro suite over it until some shader triggers it. The lucky winner is publicly shamed here 🙂 It’s very hard to come up with HLSL that will trigger these exact scenarios. AAA shaders are unhinged enough to trip all these cases.

Marking header and merge block relationships

The actual merge location needs to be recorded so that we can fix things up if there is a problem. The basic idea is that the header records which kind of merge it is (loop or selection), and points to the block in question. In the first pass, the header does not necessarily need to dominate the merge block.

The merge block has a list of headers associated with it. If that number is greater than two, we have a problem, which needs to be corrected later. With any kind of breaking construct, it is inevitable that we will see a merge block with two or more headers associated with it.

In this example, 1 will be associated with both 3 and .entry as relevant headers.

The passes

void CFGStructurizer::structurize(unsigned pass)
{
   if (find_switch_blocks(pass))
   {
      // Can rewrite the CFG in slightly weird ways.
      recompute_cfg();
      if (find_switch_blocks(pass))
      {
         LOGE("Fatal, detected infinite loop.\n");
         abort();
      }
   }

   find_loops();
   find_selection_merges(pass);
   fixup_broken_selection_merges(pass);
   if (pass == 0)
      split_merge_blocks();
}
Switch blocks

Switch blocks can get extremely tricky since they support fall-through, and fall-throughs look deceptively like a merge. The most interesting bit is analyzing case fall-through.

CFGNode *CFGStructurizer::find_natural_switch_merge_block(
    CFGNode *node, CFGNode *post_dominator) const
  • Sort all case labels (and default) by post visit index order.
  • If a case label can reach another case label, it must be sorted right before. (SPIR-V requirement)
  • If a conflict arises where two case labels can reach another case label, consider that the natural merge, since it’s the only thing that makes sense.
  • If a dominance frontier of a case label can be reached by another case label, and that dominance frontier is not another case label, we need to consider it a natural merge. In the massive example above, 86 is such a dominance frontier. 86 is the dominance frontier of 27, and 86 is also reachable by 26, uh oh!

Certainly, this cannot be robust against all possible insane code you can emit with unstructured switch statements if there are multiple such “natural merge” candidates, but I’ll have to tackle that when I see it. I’d go bald if I had to anticipate every possible scenario like that.

In this particular case, we can take advantage of the fact that it is valid to break out of a loop from inside a switch construct in SPIR-V. This is not allowed in high level shading languages, but SPIRV-Cross works around it.

Selections
void CFGStructurizer::find_selection_merges(unsigned pass)

This is pretty simple. Start at any block with more than 1 predecessor, find the idom and mark the selection. There can be conflicts however, if the idom is already marked as a loop or another selection, we need to do something.

In this shader, 9.pred and 8 are both merge candidates where 3 is the idom. Since 3 is already a loop header, we must create an inner block, 3.succ, which can serve that purpose. 8 is then analyzed, which finds 3.succ to be its idom. Since it already has a selection merge associated with it, we create yet another succ helper, 3.succ.succ. 3.succ is promoted to a loop block header. To remember that 3.succ is indeed a loop header despite having no back-edge, we use the freeze_structured_analysis flag.

In hindsight, it might have been possible to use a single block switch block for 3.succ here instead.

Fixing up broken selection merges
void CFGStructurizer::fixup_broken_selection_merges(unsigned pass)

While we have accounted for all merge blocks, not all selection headers have been marked in any way. This typically happens for break constructs, since these blocks don’t dominate much code at all. In the example above, 4 is a great example. We’re mostly interested in two things here:

  • Figure out if one branch is a “break” and the other path is not. We are forced to merge somewhere, and we must merge in the direction away from the break.
  • Find common post-dominator of all successors so that we can resolve weird breaks properly later. In this case, 4 would be marked as a header for 9.pred.

Unfortunately, there are half a million edge cases here as well, but I’ll omit them here since it’s just endless tie-break rules to figure out which branch (or neither) gets to be the lucky merge.

Loops

Loops are quite horrible since code quite often breaks out of multiple loops at once. A case study that covers a fair amount of interesting scenarios would be:

uint func(uint3 dispatch)
{
   [loop]
   for (int i = 0; i < 10; i++)
   {
      [loop]
      for (int j = 0; j < 20; j++)
      {
         [branch]
         if (dispatch.y < 10)
            return 50;

         [branch]
         if (dispatch.z < 10)
            return 70;
         dispatch.y++;
      }

      dispatch.x++;
   }

   return 80;
}

RWStructuredBuffer<uint> buf : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 dispatch : SV_DispatchThreadID)
{
   buf[dispatch.x] = func(dispatch);
}

To begin, we find all blocks with a back edge. These must be loop headers. The meat of the analysis happens in

auto result = analyze_loop(node);

where we study all exits from the loop body, and deduce what kind of exit it is. As discussed earlier, a loop exit happens when we can no longer reach the back-edge.

We’re mostly interested in the question if an exit is a break or not. We aim to find one block that we consider the natural merge (CFGNode::loop_ladder_block), and the post-dominator of all exits (CFGNode::loop_merge_block). If the back-edge can exit the loop, that exit must be our natural merge, otherwise, we should choose a block that the loop header dominates. This can happen if the loop only exits from the header, or there is an infinite loop that only exits by breaking for example.

In this particular example:

  • Loop header 2, loop_ladder_block = 7, loop_merge_block = 5
  • Loop header 1, loop_ladder_block = 8, loop_merge_block = 5
Resolving merge conflicts
void CFGStructurizer::split_merge_blocks()

In the case where more than one header is associated with a merge block, we have to do some horrible things. 5 is the lucky winner this time and we get to show off triple breaks. The headers are {1, 2, 4}. The basic idea is to work inside out so we start at block 4. Looking at our loop scopes, we know that we can only break so far. Scanning through the header list from right to left, we find 2.

// Start from innermost scope,
// and rewrite all escape branches to a merge block
// which is dominated by the loop header in question.
// The merge block for the loop must have
// a ladder block before the old merge block.
// This ladder block will break to outer scope,
// or keep executing the old merge block.
for (size_t i = node->headers.size() - 1; i; i--)
{
   auto *current_node = node->headers[i];

   // Find innermost loop header scope we can
   // break to when resolving ladders.
   CFGNode *target_header =
     get_target_break_block_for_inner_header(node, i);
   ...
}

The end result ends up looking quite weird as seen below. I should mention that spirv-opt needs to do similar things when inlining functions, but of course, inlining already structured code is quite a bit simpler.

This is where we make use of the distinction of loop_ladder_block versus loop_merge_block. The basic idea is that we can create a pred helper of loop_ladder_block which then dispatches either another break or continues the “normal” path of branching to loop_ladder_block. If the outermost header is not a loop itself, we have to force one. 1.pred here is a helper block that allows the final break. The branch from 4 to 5 (via 3) therefore punches through the intermediate blocks. 7.pred.pred and 8.pred create PHI variables which signal if the break is a multi-break or not.

Mark branches that are expected

// Before we start splitting and rewriting branches,
// we need to know which preds are considered "normal",
// and which branches are considered ladder breaking branches
// (rewritten branches).
// This will influence if a pred block gets false or true
// when emitting ladder breaking blocks later.
Vector<UnorderedSet<const CFGNode *>> normal_preds(node->headers.size());
for (size_t i = 0; i < node->headers.size(); i++)
   if (node->headers[i]->loop_ladder_block)
      for (auto *pred : node->headers[i]->loop_ladder_block->pred)
         normal_preds[i].insert(pred);

Conditionally break

auto *ladder = create_helper_pred_block(loop_ladder);
// Merge to ladder instead.
traverse_dominated_blocks_and_rewrite_branch(header, node, ladder);
ladder->ir.terminator.type = Terminator::Type::Condition;
ladder->ir.terminator.conditional_id = module.allocate_id();
ladder->ir.terminator.false_block = loop_ladder;
// True target is the innermost loop construct.

PHI magic for predecessors after rewriting branches

for (auto *pred : ladder->pred)
{
   IncomingValue incoming = {};
   incoming.block = pred;
   bool is_breaking_pred = normal_preds.count(pred) == 0;
   incoming.id = module.get_builder().makeBoolConstant(is_breaking_pred);
   phi.incoming.push_back(incoming);
}
ladder->ir.phi.push_back(std::move(phi));

This is trivial for compilers to optimize away if they don’t need this pedantic level of structured control flow. The conditional branch in these blocks is a boolean PHI which takes constant inputs based on the predecessors. Thus, we can statically deduce that 4 -> 5 is a direct branch.

Of course, this is a well-behaved example without any serious edge cases, but the basic idea remains the same.

The idea of a ladder block like 7.pred.pred and 8.pred that dispatches shows up several places.

Structurization summary

This mostly concludes the structurization passes. It’s not exhaustive by any means, but the broad strokes should be covered, and I think I somehow managed to explain the overall architecture.

I don’t expect anyone to grasp all of this by just reading this. I can barely understand what is going on myself, and it’s not like any of this is backed by best practices. This approach just happened to work and has been refined one horrible AAA shader at a time.

Rewriting PHI nodes

After rewriting the CFG, PHI nodes will break since the original predecessors do not match reality anymore. Fortunately, this part of the algorithm isn’t as esoteric and has well known algorithms.

void CFGStructurizer::insert_phi()

Fixup broken value dominance

void CFGStructurizer::fixup_broken_value_dominance()

This isn’t related to PHI, but it is related to SSA. SSA values created in one block can be consumed in another as long as the creating block dominates the consumer. If that’s no longer the case, we have to do something. In my case, I just lowered it directly to an OpVariable with explicit OpStore / OpLoad instead. It’s quite rare for this to happen in practice.

This might be a hint that keeping SSA throughout was a mistake. (._.)

Reinserting PHI values

void CFGStructurizer::insert_phi(PHINode &node)

Given that we have N SSA values created in N blocks, we’ll have to insert new PHI nodes such that every predecessor of the original PHI node gets a proper value. Any mem2reg algorithm does this, but I’ll try to summarize the basic ideas. There’s also one special edge case that mem2reg would never consider.

End state

The algorithm is iterative, and we’re done when every predecessor of the target node can find an input value created by a node that dominates the predecessor, at which point we can emit the OpPhi. The algorithm will add more input values until we satisfy this condition.

Finding the next frontier

For all our current input nodes, look at their dominance frontiers. We pick the dominance frontier that’s earliest in the CFG (highest post visit index) and can reach the final node. Reachability is tricky when the PHI node is in a loop header. We have to consider the back-edge as well since it’s also an input.

A frontier is only considered if it makes forward progress, i.e. we ask the question if having a frontier here is useful at all.

bool CFGStructurizer::phi_frontier_makes_forward_progress(
   const PHI &phi, const CFGNode *frontier,
   const CFGNode *end_node) const

For example, if the frontier cannot reach any existing input, it makes forward progress. Also, if a frontier is such that any path from input to final node must go through the frontier, it also makes forward progress, because we can remove that input from consideration in the next iteration.

There’s probably a more elegant solution, but I couldn’t be arsed to study the algorithm in detail.

Placing a new input value at frontier

Look at all preds of the frontier and find a candidate input value. If there is no candidate (quite possible and expected in some cases), use OpUndef.

OpSelect workaround

Due how we rewrite branches through ladders, it’s possible that code which did not reach each other now do. Thus we can have a situation where a frontier dominates an existing input and the only way to resolve this is through OpSelect. Either we keep the existing input or use the PHI’d value from predecessors depending on the intent of the predecessor.

When testing this, I realized that this might be a bogus case that actually never happens, because I was not able to trigger it in any shader in the repro suite … Might as well just remove the dead code and see what happens 😀

Bonus round – brain damaging edge cases

I teased loop transposition as a thing, so I’ll try to explain the problem. I don’t think I can properly explain the solution though.

All of these cases are exceptionally rare. I’ve only seen a single example of each case across the entire repro shader corpus.

Outer loop transposition

void CFGStructurizer::rewrite_transposed_loop_outer(
   CFGNode *node, CFGNode *impossible_merge_target,                                                 
   const LoopMergeAnalysis &analysis)

This code is truly cursed, because it pokes a hole in the model of unstructured loops and structured loops. The loop in question is 6. The loop ladder block is 16 since the back-edge (14) branches there. However, we observe a bizarre merge to 12. Normally, we would be able to consider 12 to be a break target, but 12 does not post-dominate anything, so we’ll never deal with this in split_merge_blocks(). 5 is the post-dominator of 6‘s loop exits. We also cannot just split up 12 and treat it as a break block, because:

  • 16 can reach 12, clearly it’s outside the loop
  • It has a conditional branch on its own, duplicating the block brings its own set of problems

We’re in a situation where 12 is technically inside the loop in structured control flow (we never merged), but outside the loop in unstructured control flow. On top of this, 16 can reach 12, so in a structured sense, we’re branching back into our own loop … So much fun!

This is a similar approach as split_merge_blocks(). Create a ladder block, rewrite all branches there, and dispatch based on the predecessor. Leads to hilariously unreadable code, but it’s our only option here.

The final output also needs to deal with multi-break, and the resulting CFG collapses into white noise.

Inner loop transposition

void CFGStructurizer::rewrite_transposed_loop_inner(
   CFGNode *node, CFGNode *impossible_merge_target,
   const LoopMergeAnalysis &analysis)

This is very similar to outer loop transposition, but it is somehow more cursed.

This CFG is incomprehensible, and I honestly have no idea how this can possibly be generated from HLSL, but here we are. 113 and 121 are inner nested loops. The odd scenario here is that the normal loop exit of 120 actually breaks further out than what we would expect. 129 is a break block that breaks out of the two loops, but it doesn’t break as far as 120 does, since it ends up in 89. This is extremely weird. I solved this similarly to the outer loop scenario. Collect all branches in a ladder and dispatch …

Ladders, ladders everywhere! This CFG is pure non-sense, I’m just including it as an example of how cursed some shaders turn out to be.

Merge to outer loop continue

This is similar to inner loop transposition where we have to be very careful. The loop in 16 can exit in 22, and this confuses the algorithm, since we have to find the common post-dominator of all loop exits. The back-edge 25 does not have any forward successors so it confuses the analysis, but we specifically detect 22 and transpose the loop so that we can merge execution, before breaking out of the outer loop.

I omitted the final CFG here since it looks like a complete mess. A spider web of code if you will (ahem) :V

Conclusion

Joke’s on me, because with my luck, a new game probably ships that breaks everything, but for now, this ends my brain dump of the truly cursed corner of vkd3d-proton which is the DXIL -> SPIR-V conversion.

At some point, all of this probably has to be rewritten and I’ll have to abandon any hope of having somewhat debuggable codegen, but for now, this is my questionable take on structurization.