When AMD’s RDNA4 GPUs launched, FSR 4 was released alongside them to much fanfare. It’s a massive leap in quality over FSR 3.1, and with FSR 4 moving to a machine learning model instead of an analytical one, FSR 3.1 marks the farewell to fully open and grok-able upscalers. It’s truly a shame, but the graphics world cares little about such sentimentality when pretty pixels are at stake.
FSR 4 has not actually been released to developers for integration in engines yet; it exists as a driver toggle that can be enabled for certain FSR 3.1 games. The FSR 3.1 SDK had the foresight to let the runtime peek into the driver DLLs and fish out a replacement implementation. This is of course a horrible mess from any compatibility perspective, but we just have to make it work.
From day 1 of RDNA4’s release, there’s been a lot of pressure from the Linux gaming community to get this working on Linux as well. Buying an RDNA4 GPU would be far less enticing if there were no way to get FSR 4 working, after all …
First battle – how is FSR 4 even invoked?
Given the (currently) proprietary nature of FSR 4, we could easily have been in the DLSS situation, where it’s literally impossible to re-implement it. DLSS uses interop with CUDA, for example, and there’s a 0% chance anyone outside of NVIDIA can deal with that. Fortunately, NVIDIA provides the shims required to make DLSS work on Proton these days, but for FSR 4 we’re on our own for now.
It all started with an issue made on vkd3d-proton’s tracker asking for FSR 4 support.
Somehow, via the OptiScaler project, they had gotten to the point where vkd3d-proton failed to compile compute shaders. This was very encouraging, because it meant that FSR 4 goes through D3D12 APIs somehow. Of course, D3D12’s way of doing vendor extensions is the most disgusting thing ever devised, but we’ll get to that later …
The flow of things seems to be:
- The FSR 3.1 DLL tries to open the AMD D3D12 driver DLL (amdxc64.dll) and queries for a COM interface.
- Presumably, after checking that FSR 4 is enabled in control panel and checking if the .exe is allowed to promote, it loads amdxcffx64.dll, which contains the actual implementation.
- amdxcffx64.dll creates normal D3D12 compute shaders against the supplied ID3D12Device (phew!), with a metric ton of undocumented AGS magic opcodes. Until games start shipping their own FSR 4 implementation, I’d expect that users need to copy over that DLL from an AMD driver install somehow, but that’s outside the scope of my work.
- amdxcffx64.dll also seems to call back into the driver DLL to ask for which configuration to use. Someone else managed to patch this check out and eventually figure out how to implement this part in a custom driver shim.
With undocumented opcodes, I really didn’t feel like trying anything; attempting to reverse something like that is just too much work, and there was a high risk of spending weeks on something that didn’t work out. But then someone found a very handy file checked into AMD’s open source LLPC repo. This file is far from complete, but it’s a solid start.
At this point it was clear that FSR 4 is based on the WMMA (wave matrix multiply accumulate) instructions found on RDNA3 and RDNA4 GPUs. This is encouraging, since Vulkan has VK_KHR_cooperative_matrix, which maps directly to WMMA. D3D12 was supposed to get this feature in SM 6.8, but it was dropped for some inexplicable reason. It’s unknown whether FSR 4 could have used standard WMMA opcodes in DXIL had we gone down that timeline. It certainly would have saved me a lot of pain and suffering …
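For reference, this is roughly what the Vulkan-side equivalent looks like with GL_KHR_cooperative_matrix: a 16×16×16 FP16 multiply-accumulate that the driver can lower to a single WMMA instruction (a minimal standalone sketch, not code from FSR 4):

#version 460
#extension GL_KHR_cooperative_matrix : require
#extension GL_KHR_memory_scope_semantics : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout(local_size_x = 32) in;
layout(set = 0, binding = 0) buffer Buf { float16_t data[]; };

void main()
{
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> A;
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB> B;
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> C =
        coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator>(0.0);

    coopMatLoad(A, data, 0u, 16u, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(B, data, 256u, 16u, gl_CooperativeMatrixLayoutColumnMajor);
    C = coopMatMulAdd(A, B, C); // lowers to one WMMA instruction on RDNA3/4
    coopMatStore(C, data, 512u, 16u, gl_CooperativeMatrixLayoutRowMajor);
}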
Dumping shaders
The next step was to capture the shaders in question. Ratchet & Clank: Rift Apart was on AMD’s list of supported games and the smallest install I had readily available, so I fired it up with RGP on Windows and managed to observe WMMA opcodes. Encouraging!
Then came dumping the actual DXIL shaders. However, it seems the driver blocks FSR 4 if RenderDoc is attached (or some other unknown weirdness occurs), so a different strategy was necessary.
The first attempt was to take an FSR 3.1 application without anti-tampering and hook it somehow. The FSR 3.1 demo app in the SDK was suitable for this task. The driver refused to use FSR 4 here, but when I renamed the demo .exe to RiftApart.exe, it worked. quack.exe lives on!
Now that I had DXIL and some example ISA to go along with it, it was looking possible to slowly piece together how it all works.
Deciphering AGS opcodes
Shader extensions in D3D are a disgusting mess, and this is roughly how they work:
- Declare a magic UAV at some magic register + space combo
- Emit a ton of oddly encoded atomic compare exchanges to that UAV
- DXC has no idea what any of this means, but it must emit the DXIL as is
- Compare exchange is likely chosen because it’s never okay to reorder, merge or eliminate those operations
- Driver compiler recognizes these magic back-doors, consumes the magic stream of atomic compare exchanges, and translates that into something that makes sense
dxil-spirv in vkd3d-proton already had some code to deal with this, as Mortal Kombat 11 has some AGS shaders for 64-bit atomics (in DXIL, before SM 6.6 added them natively).
RWByteAddressBuffer MAGIC : register(u0, space2147420894);

uint Code(uint opcode, uint opcodePhase, uint immediateData)
{
    return (MagicCode << MagicCodeShift) |
           ((immediateData & DataMask) << DataShift) |
           ((opcodePhase & OpcodePhaseMask) << OpcodePhaseShift) |
           ((opcode & OpcodeMask) << OpcodeShift);
}

uint AGSMagic(uint code, uint arg0, uint arg1)
{
    uint ret;
    MAGIC.InterlockedCompareExchange(code, arg0, arg1, ret);
    return ret;
}

uint AGSMagic(uint opcode, uint phase, uint imm, uint arg0, uint arg1)
{
    return AGSMagic(Code(opcode, phase, imm), arg0, arg1);
}
Every WMMA opcode is translated to a ton of these magic instructions back-to-back, a maximum of 21 (!) in fact. dxil-spirv more or less has to pattern match them.
The exact details of how WMMA is represented with these opcodes aren’t super exciting, but to test this functionality, I implemented a header.
It seems like a wave matrix is represented with 8 uints. This tracks: a 16×16 FP32 matrix is 256 × 32 bits = 8192 bits, which spread across a wave32 comes out to 256 bits per lane in the worst case, i.e. exactly 8 uints.
struct WMMA_Matrix { uint v[8]; };
Here’s how WMMA matmul can be represented for example:
WMMA_Matrix WMMA_MatMulAcc(WaveMatrixOpcode op, WMMA_Matrix A, WMMA_Matrix B, WMMA_Matrix C)
{
    // A matrix
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 0, WaveMatrixRegType_A_TempReg), A.v[0], A.v[1]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 0, WaveMatrixRegType_A_TempReg), A.v[2], A.v[3]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 1, WaveMatrixRegType_A_TempReg), A.v[4], A.v[5]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 1, WaveMatrixRegType_A_TempReg), A.v[6], A.v[7]);

    // B matrix
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 0, WaveMatrixRegType_B_TempReg), B.v[0], B.v[1]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 0, WaveMatrixRegType_B_TempReg), B.v[2], B.v[3]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 1, WaveMatrixRegType_B_TempReg), B.v[4], B.v[5]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 1, WaveMatrixRegType_B_TempReg), B.v[6], B.v[7]);

    // C matrix
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 0, WaveMatrixRegType_Accumulator_TempReg), C.v[0], C.v[1]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 0, WaveMatrixRegType_Accumulator_TempReg), C.v[2], C.v[3]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(0, 1, WaveMatrixRegType_Accumulator_TempReg), C.v[4], C.v[5]);
    AGSMagic(WaveMatrixMulAcc, 0, MatrixIO(1, 1, WaveMatrixRegType_Accumulator_TempReg), C.v[6], C.v[7]);

    // Configure type
    AGSMagic(WaveMatrixMulAcc, 1, int(op) << int(WaveMatrixOpcode_OpsShift), 0, 0);

    // Read output
    WMMA_Matrix ret;
    ret.v[0] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(0, 0, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[1] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(1, 0, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[2] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(2, 0, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[3] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(3, 0, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[4] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(0, 1, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[5] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(1, 1, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[6] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(2, 1, WaveMatrixRegType_RetVal_Reg), 0, 0);
    ret.v[7] = AGSMagic(WaveMatrixMulAcc, 2, MatrixIO(3, 1, WaveMatrixRegType_RetVal_Reg), 0, 0);
    return ret;
}
Fortunately, FSR 4 never tries to be clever with the WMMA_Matrix elements. It seems possible to pass in whatever uints you want, but the shaders follow a strict pattern, so dxil-spirv can do type promotion from uint to OpTypeCooperativeMatrixKHR on the first element and ignore the rest.
It took many days of agonizing trial and error, but eventually I managed to put together a test suite that exercises all the opcodes I found in the dumped DXIL files.
Tough corner cases
Ideally, there would be a straightforward mapping to KHR_cooperative_matrix, but that’s not really the case.
FP8 (E4M3)
FSR 4 is heavily reliant on FP8 WMMA, an RDNA4-exclusive feature. RDNA3 has WMMA, but only FP16. There is currently no SPIR-V support for Float8 either (but given that BFloat16 just shipped, it’s not a stretch to assume something will happen in this area at some point).
To make something that is actually compliant with Vulkan at this point, I implemented emulation of FP8.
Converting FP8 to FP16 is fairly straightforward. While we don’t have float8 yet, we do have uint8. Doing element-wise conversions like this is not strictly well defined, since the wave layout of different types can change, but it works fine in practice.
coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> CoopMatFP8toFP16(
    coopmat<uint8_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> coop_input)
{
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> coop_output;
    for (int i = 0; i < coop_input.length(); i++)
    {
        int16_t v = (int16_t(int8_t(coop_input[i])) << 7s) & (-16385s);
        coop_output[i] = int16BitsToFloat16(v);
    }

    // Handles denorm correctly.
    // Ignores NaN, but who cares.
    // There is no Inf in E4M3.
    return coop_output * float16_t(256.0);
}
I’m quite happy with this bit-hackery: a sign-extend, a shift, a bit-and, and an fmul.
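Why this works: shifting the sign-extended E4M3 bits left by 7 lines the exponent and mantissa fields up with FP16’s, and the & -16385 (i.e. & ~0x4000) clears the duplicate sign bit that the arithmetic shift leaves in the top exponent bit. E4M3 has exponent bias 7 while FP16 has bias 15, so the reinterpreted FP16 value is the original scaled by 2^(7 - 15) = 1/256, hence the final multiply by 256. The same scale factor maps E4M3 denormals onto FP16 denormals, which is why denorms fall out correctly for free.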
FSR 4 also relies on FP32 -> FP8 conversions to quantize accumulation results back to FP8 for storage between stages. This is … significantly more terrible to emulate. Doing accurate RTE (round-to-nearest-even) with denorm handling in soft-float is GPU sadism. It explodes driver compile times, and runtime performance is borderline unusable as a result.
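To give a sense of the pain, here is a minimal scalar sketch of what a compliant FP32 -> E4M3 conversion has to do (my own illustration, not the actual vkd3d-proton code), keeping in mind this runs for every matrix element:

#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

uint8_t FP32ToFP8_E4M3(float f)
{
    uint bits = floatBitsToUint(f);
    uint sign = (bits >> 24u) & 0x80u;
    uint exp_f = (bits >> 23u) & 0xffu;
    uint mant_f = bits & 0x7fffffu;

    // NaN maps to the only NaN encoding E4M3 has (S.1111.111).
    if (exp_f == 0xffu && mant_f != 0u)
        return uint8_t(sign | 0x7fu);

    int e = int(exp_f) - 127; // unbiased exponent
    // Inf / overflow: saturate to the max finite value, 448 (one possible convention).
    if (e > 8)
        return uint8_t(sign | 0x7eu);
    // Anything below half the minimum denormal (2^-10) rounds to +/- 0.
    if (exp_f == 0u || e < -10)
        return uint8_t(sign);

    uint v, s;
    if (e >= -6)
    {
        // Normal range. Compose biased exponent and mantissa so the rounding
        // carry propagates from mantissa into exponent for free.
        v = (uint(e + 7) << 23u) | mant_f;
        s = 20u; // 24-bit significand -> 4 bits (implicit 1 + 3-bit mantissa)
    }
    else
    {
        // Denormal range: keep the implicit 1 and shift further right.
        v = mant_f | 0x800000u;
        s = uint(20 - 6 - e);
    }

    // Round to nearest, ties to even: add (half - 1) plus the LSB of the
    // kept part, then shift out the discarded bits.
    uint r = (v + ((1u << (s - 1u)) - 1u) + ((v >> s) & 1u)) >> s;
    return uint8_t(sign | min(r, 0x7eu));
}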
8-bit Accumulation matrix conversions
In many places, we need to handle loading 8-bit matrices, which are converted to FP32 accumulators and back. Vulkan can support this, but it relies on drivers exposing it via the physical device query, and no driver I know of exposes 8-bit accumulator support in any operation, which means we’re forced to go out of spec. However, with some light tweaks to RADV, it works as expected. A driver should be able to expose 8-bit accumulator types and do the trunc/extend as needed. It’s somewhat awkward that there is no way to expose format support separately, but it is what it is.
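In GLSL terms, the out-of-spec usage looks something like this (a sketch with made-up names, assuming a driver that tolerates 8-bit accumulator types, which none officially advertise today):

#extension GL_KHR_cooperative_matrix : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

layout(set = 0, binding = 0, std430) buffer Buf8 { int8_t data8[]; };

void load_8bit_accumulator()
{
    // No driver reports this type/Use combination via
    // VkCooperativeMatrixPropertiesKHR, so loading it is out of spec.
    coopmat<int8_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> acc8;
    coopMatLoad(acc8, data8, 0u, 16u, gl_CooperativeMatrixLayoutRowMajor);

    // The element-wise widening itself (same shape and Use) is a
    // conversion the spec does define.
    coopmat<float, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> acc =
        coopmat<float, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator>(acc8);
}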
Converting types with Use conversion
In several places, the shaders need to convert e.g. Accumulator matrices to B matrices. This is a use case not covered by KHR_cooperative_matrix. The universal workaround is to roundtrip store/load via LDS, but that’s kind of horrible for perf. I ended up with some hacky code paths for now (see the sketch after this list):
- On RDNA4, the Accumulator layout is basically the same as the B layout, so I can abuse implementation-specific behavior by copying over elements one by one into a new coopmat with a different type.
- On RDNA3, this doesn’t work since len(B) != len(Accum).
- NV_cooperative_matrix2 actually has a feature bit to support this exact use case without hackery, so I can take advantage of that when RADV implements support for it.
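The element-copy hack looks roughly like this (a sketch that deliberately relies on RDNA4’s matching register layouts; this is implementation-defined behavior, not portable KHR_cooperative_matrix code):

coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB>
AccumToB(coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> acc)
{
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB> b;
    // Only valid when len(B) == len(Accum) and the per-lane register layouts
    // happen to agree, which holds on RDNA4 and breaks on RDNA3.
    for (int i = 0; i < b.length(); i++)
        b[i] = acc[i];
    return b;
}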
The debugging process
After implementing all the opcodes and making all my tests pass, it was time to throw this at the wall. (Screenshots from TkG, pardon the French.)
… fun. The first step was to figure out if there was something I may have missed about opcode behavior. For SSBOs and LDS, I simply assumed that 4-byte alignment would be fine, then I found:
coopMatLoad(_86, _39._m0, ((30976u * _76) + _79) + gl_WorkGroupID.z, 16u, gl_CooperativeMatrixLayoutColumnMajor);
A 1-byte aligned coopmat (8-bit matrix), yikes … Technically out of spec for cooperative_matrix, but nothing we can do about that, and AMD deals with it just fine. In the original code, I had used indexing into a u32 SSBO and just divided the byte offset by 4, but that obviously breaks. I added support for 8-bit SSBO aliases in dxil-spirv (sketched below), updated the test suite, and tried again.
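The idea of the fix in GLSL form (a simplified sketch of what dxil-spirv now emits, with made-up names):

#extension GL_KHR_cooperative_matrix : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

// The same raw buffer bound twice with different element types; the u8 alias
// makes arbitrary byte offsets expressible.
layout(set = 0, binding = 0, std430) buffer SSBO32 { uint data[]; };
layout(set = 0, binding = 0, std430) buffer SSBO8 { uint8_t data8[]; };

void load_matrix(uint byte_offset, uint byte_stride)
{
    coopmat<uint8_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> mat;
    // Before: coopMatLoad(mat, data, byte_offset / 4u, byte_stride / 4u, ...),
    // which silently breaks for 1-byte aligned matrices.
    coopMatLoad(mat, data8, byte_offset, byte_stride, gl_CooperativeMatrixLayoutColumnMajor);
}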
Still not right. Eventually I found some questionable-looking code around an LDS load-store: the strides couldn’t possibly be right. It turns out that offset/stride for LDS is in terms of u32 words, not bytes (?!). This detail wasn’t caught in the test suite because the bug cancels itself out across a matching store and load. Fortunately, this quirk made my life easier in dxil-spirv, since there’s no easy way of emitting 8-bit LDS aliases.
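In other words, where the SSBO path needs true byte offsets, the groupshared path takes word offsets straight from the AGS stream (my reading of the behavior, illustrated with hypothetical names):

shared uint lds_data[2048];

void store_matrix(coopmat<uint8_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> mat,
                  uint ags_offset, uint ags_stride)
{
    // ags_offset / ags_stride already count u32 words, so they index the
    // uint LDS array directly; no 8-bit alias needed even for 8-bit matrices.
    coopMatStore(mat, lds_data, ags_offset, ags_stride, gl_CooperativeMatrixLayoutRowMajor);
}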
Things were looking good, but it was far from done. There was still intense shimmering and ghosting, which couldn’t be observed from screenshots and I couldn’t figure it out by simply staring at code. It was time to bring out the big guns.
Build a tiny FSR 4 test bench from scratch
I needed a way to directly compare outputs between native and vkd3d-proton, to isolate exactly where the implementations diverged. Fortunately, since the driver no longer blocks us when trying to capture, I could take a RenderDoc frame capture of the 3.1 demo.
The inputs and outputs are luckily very simple:
- A pre-pass that reads textures, and does very light WMMA work at the end
- A bunch of raw passes that spam WMMA like no tomorrow, likely implementing the ML network
- A final post-pass that synthesizes the final image with more WMMA work of course
The middle passes were very simple from a resource binding standpoint:
- One big weight buffer
- One big scratch buffer
In RenderDoc, I dumped the buffer contents to disk and built a small D3D12 test app that invokes the shaders with the buffers in question. After every dispatch, it dumps the scratch buffer contents out to disk, and by running it against both the native D3D12 driver and vkd3d-proton, I could figure out where things diverge.
Surely the FSR 4 shaders cannot be bugged, right?
Turns out, yes, they can be 🙂 Many of the shaders only allocate 256 bytes of LDS, yet actually need 512. Classic undefined behavior. The reason this “happens” to work on native is that AMD allocates LDS space with 512-byte granularity. However, dxil-spirv also emits some LDS of its own to deal with matrix transpositions, and that ended up clobbering the AGS shader’s LDS space …
One disgusting workaround later …
// Workaround for bugged WMMA shaders.
// The shaders rely on AMD aligning LDS size to 512 bytes.
// This avoids overflow spilling into LDSTranspose area by mistake, which breaks some shaders.
if (address_space == DXIL::AddressSpace::GroupShared &&
    shader_analysis.require_wmma)
{
    // ... Pad groupshared array to 512 bytes.
}
and games were rendering correctly. The FP16 path on RDNA4 was conquered.
Performance?
Absolute garbage, as expected. At 1440p on a 9070 XT, native runs at about 0.85 ms, while my implementation took about 3 ms.
Going beyond
Can we implement FP8?
RADV obviously cannot ship FP8 before there is a Vulkan / SPIR-V spec for it, but with the power of open source and courage, why not experiment? Nothing is stopping us from just emitting:
OpTypeFloat 8
and seeing what happens. Georg Lehmann brought up FP8 support in NIR and ACO enough to support FP8 WMMA. Hacking FP8 support into dxil-spirv was quite straightforward and done in an hour. Getting the test suite to pass was smooth and easy, but … the real battle was yet to come.
vkd3d-proton bug or RADV?
Games were completely broken in the FP8 path. Fortunately, the difference reproduced in my test bench. The real issue now was bisecting the shader itself to figure out where the outputs diverge. These shaders are not small, and they’re full of incomprehensible ML gibberish, so the only solution I could come up with was capturing both FP16 and FP8 paths and debug-printing them side by side. Fortunately, RenderDoc makes this super easy.
First, I had to hack together FP8 support in SPIRV-Cross and glslang so that roundtripping could work.
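Since GLSL has no official float8 type, the hacked-up branches had to invent one. Purely as an illustration (this syntax is hypothetical, not a shipping extension), the roundtripped shaders end up looking something like:

// Entirely hypothetical syntax; GLSL defines no float8 type, so the branches
// had to invent something along these lines.
coopmat<floate4m3_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseA> a8;
coopmat<floate4m3_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB> b8;
coopmat<float, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> acc;
// FP8 inputs accumulating into FP32, matching RDNA4's FP8 WMMA.
acc = coopMatMulAdd(a8, b8, acc);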
Eventually, I found the divergent spot, and Georg narrowed it down to broken FP8 vectorization in ACO. Once this was fixed, FP8 was up and running. Runtime was now down to 1.3 ms.
Get rid of silly LDS roundtrips
FSR 4 really likes to convert Accumulator to B matrices, but on RDNA4 the layouts match (at least for 8-bit), so until we have an NV_cooperative_matrix2 implementation, I pretended it works by copying elements instead, and runtime went down to about 1 ms. RADV codegen for coopmat is currently very naive (the buffer loading code in particular is extremely inefficient), but despite that, we’re pretty close to native here. Now that there is a good use case for cooperative matrix, I’m sure it will get optimized eventually.
At this point, the FP8 path is fully functional and performant enough, but of course it requires building random Mesa branches and enabling hacked-up code paths in vkd3d-proton.
RDNA3?
RDNA3 is not officially supported at the moment, but given that I already went through the pain of emulating FP8, there’s no reason it cannot work on RDNA3. Given the terrible performance I got with FP16 emulation, I can understand why RDNA3 is not supported, though … FSR 4 requires a lot of WMMA brute force, and RDNA3’s lesser WMMA grunt is simply not strong enough. Maybe it would work better if a dedicated FP16 model were designed, but that’s not for me to figure out.
It took a good while to debug this, since once again the test suite was running fine …
LDS roundtrip required
Unfortunately, RDNA3 is quite strange when it comes to WMMA layouts. The Accumulator has 8 elements per lane, but A and B matrices have 16 for some reason. NV_cooperative_matrix2 will help here for sure.
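So on RDNA3, the Accumulator-to-B conversion has to take the portable LDS roundtrip after all. A minimal sketch, assuming a single subgroup per workgroup and a hypothetical scratch array:

#extension GL_KHR_cooperative_matrix : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

shared float16_t scratch[16 * 16];

coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB>
AccumToB_LDS(coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseAccumulator> acc)
{
    // Store in a well-defined layout, then reload with the target Use.
    coopMatStore(acc, scratch, 0u, 16u, gl_CooperativeMatrixLayoutRowMajor);
    barrier();
    coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u, gl_MatrixUseB> b;
    coopMatLoad(b, scratch, 0u, 16u, gl_CooperativeMatrixLayoutRowMajor);
    return b;
}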
Final shader bug
After fixing the LDS roundtrip, the test bench passed, but games still looked completely broken on RDNA3. This narrowed the problem down to either the pre-pass or the post-pass. Going dual-GPU and opening captures of the FSR 3.1 demo side by side on RDNA3 and RDNA4, I finally narrowed it down to questionable shader code that unnecessarily relies on implementation-defined behavior.
// Rewritten for clarity
float tmp[8];
tmp[0u] = uintBitsToFloat(_30._m0[384u].x);
tmp[1u] = uintBitsToFloat(_30._m0[384u].y);
tmp[2u] = uintBitsToFloat(_30._m0[384u].z);
tmp[3u] = uintBitsToFloat(_30._m0[384u].w);
tmp[4u] = uintBitsToFloat(_30._m0[385u].x);
tmp[5u] = uintBitsToFloat(_30._m0[385u].y);
tmp[6u] = uintBitsToFloat(_30._m0[385u].z);
tmp[7u] = uintBitsToFloat(_30._m0[385u].w);

for (int i = 0; i < mat.length(); i++)
    mat[i] += tmp[i];

storeToLDS(mat);

// Read LDS and write to image.
// LDS[8 * SubgroupInvocationID + {4, 5, 6, 7}]
This is not well-behaved cooperative matrix code, since it relies on the register layout, and RDNA3 and RDNA4 actually differ: the columns are interleaved in very different ways.
I found a workaround which can be applied to RDNA3 wave32:
for (int i = 0; i < mat.length(); i++)
    mat[i] += tmp[((i << 1u) & 7u) + (gl_SubgroupInvocationID >> 4u)];
That was the best I could do without resorting to full shader replacement. The actual fix would be for the shader to just perform this addition after loading from LDS, like:
storeToLDS(mat);

float tmp[8];
tmp[0u] = uintBitsToFloat(_30._m0[384u].x);
tmp[1u] = uintBitsToFloat(_30._m0[384u].y);
tmp[2u] = uintBitsToFloat(_30._m0[384u].z);
tmp[3u] = uintBitsToFloat(_30._m0[384u].w);
tmp[4u] = uintBitsToFloat(_30._m0[385u].x);
tmp[5u] = uintBitsToFloat(_30._m0[385u].y);
tmp[6u] = uintBitsToFloat(_30._m0[385u].z);
tmp[7u] = uintBitsToFloat(_30._m0[385u].w);

// Read LDS and write to image.
// LDS[8 * SubgroupInvocationID + {4, 5, 6, 7}] + tmp[{4, 5, 6, 7}]
This would at least be portable.
With this, RDNA3 can do FSR 4 on vkd3d-proton, if you’re willing to take a massive dump on performance.
Conclusion
I was fearing that getting FSR 4 up and running would take a year, but here we are. Lots of different people in the community ended up contributing in smaller ways to unblock the debugging process.
There probably won’t be a straightforward way to make use of this work until FSR 4 is released in an official SDK, FP8 actually lands in Vulkan, etc., etc., so I’ll leave the end-user side of things out of this blog.