PlayStation 2 GS emulation – the final frontier of Vulkan compute emulation

As you may or may not know, I wrote paraLLEl-RDP back in 2020. It aimed to implement the N64 RDP in Vulkan compute. Lightning fast, and extremely accurate, plus the added support of up-scaling on top. I’m quite happy with how it turned out. Of course, the extreme accuracy was due to Angrylion being used as reference, and I could aim for bit-exactness against that implementation.

Since then, there’s been the lingering idea of doing the same thing, but for PlayStation 2. Until now, there’s really only been one implementation in town, GSdx, which has remained the state-of-the-art for 20 years.

paraLLEl-GS is actually not the first compute implementation of the PS2 GS. An attempt was made back in 2014 for OpenCL as far as I recall, but it was never completed. At the very least, I cannot find it in the current upstream repo anymore.

The argument for doing compute shader raster on PS2 is certainly weaker than on N64. Angrylion was – and is – extremely slow, and N64 is extremely sensitive to accuracy, where hardware acceleration with graphics APIs is impossible without serious compromises. PCSX2 on the other hand has a well-optimized software renderer, and a pretty solid graphics-based renderer, but that doesn’t mean there aren’t issues. The software renderer does not support up-scaling for example, and there are myriad bugs and glitches with the graphics-based renderer, especially with up-scaling. As we’ll see, the PS2 GS is quite the nightmare to emulate in its own way.

My main motivation here is basically “because I can”. I already had a project lying around that did “generic” compute shader rasterization. I figured that maybe we could retro-fit this to support PS2 rendering.

I didn’t work on this project alone. My colleague, Runar Heyer, helped out a great deal in the beginning to get this started, doing all the leg-work to study the PS2 from various resources, doing the initial prototype implementation and fleshing out the Vulkan GLSL to emulate PS2 shading. Eventually, we hit some serious roadblocks in debugging various games, and the project was put on ice for a while since I was too drained dealing with horrible D3D12 game debugging day in and day out. The last months haven’t been a constant fire fight, so I’ve finally had the mental energy to finish it.

My understanding of the GS is mostly based on what Runar figured out, and what I’ve seen by debugging games. The GSdx software renderer does not seem like it’s hardware bit-accurate, so we were constantly second-guessing things when trying to compare output. This caused a major problem when we had the idea of writing detailed tests that directly compared against GSdx software renderer, and the test-driven approach fell flat very quickly. As a result, paraLLEl-GS isn’t really aiming for bit-accuracy against hardware, but it tries hard to avoid obvious accuracy issues at the very least.

Basic GS overview

Again, this is based on my understanding, and it might not be correct. 😀

GS is a pixel-pushing monster

The GS is infamous for its insane fill-rate and bandwidth. It could push over a billion pixels per second (in theory at least) back in 2000, which was nuts. While the VRAM is quite small (4 MiB), it was designed to be continuously streamed into using the various DMA engines.

Given the extreme fill-rate requirements, we have to design our renderer accordingly.

GS pixel pipeline is very basic, but quirky

In many ways, the GS is actually simpler than the N64 RDP. Single texture, and a single cycle combiner, where N64 had a two stage combiner + two stage blender. Whatever AA support is there is extremely basic as well, where N64 is delightfully esoteric. The parts of the pixel pipeline that are painful to implement with traditional graphics APIs are:

Blending goes beyond 1.0

Inherited from PS1, 0x80 is treated as 1.0, and it can go all the way up to 0xff (almost 2). Shifting by 7 is easier than dividing by 255 I suppose. I’ve seen some extremely ugly workarounds in PCSX2 before to try working around this since UNORM formats cannot support this as is. Textures are similar, where alpha > 1.0 is representable.

There is also wrapping logic that can be used for when colors or alpha goes above 0xFF.
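As a rough sketch (with my own names, not lifted from the actual implementation), the per-channel blend math ends up looking something like this, where A/B/D select between source color, destination color and zero, and C selects source alpha, destination alpha or a fixed value:

static uint32_t blend_channel(int32_t a, int32_t b, int32_t c, int32_t d, bool col_clamp)
{
    // 0x80 acts as 1.0, so the multiply is a shift by 7 rather than a divide by 255.
    int32_t v = (((a - b) * c) >> 7) + d;
    // COLCLAMP selects between clamping to [0, 0xff] and wrapping to the lower 8 bits.
    if (col_clamp)
        v = v < 0 ? 0 : (v > 0xff ? 0xff : v);
    return uint32_t(v) & 0xff;
}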

Destination alpha testing

The destination alpha can be used as a pseudo-stencil of sorts, and this is extremely painful without programmable blending. I suspect this was added as PS1 compatibility, since PS1 also had this strange feature.

Conditional blending

Based on the alpha, it’s possible to conditionally disable blending. Quite awkward without programmable blending … This is another PS1 compat feature. With PS1, it can be emulated by rendering every primitive twice with state changes in-between, but this quickly gets impractical with PS2.

Alpha correction

Before alpha is written out, it’s possible to OR in the MSB. Essentially forcing alpha to 1. It is not equivalent to alphaToOne however, since it’s a bit-wise OR of the MSB.

Alpha test can partially discard

A fun thing alpha tests can do is to partially discard. E.g. you can discard just color, but keep the depth write. Quite nutty.

AA1 – coverage-to-alpha – can control depth write per pixel

This is also kinda awkward. The only anti-alias PS2 has is AA1 which is a coverage-to-alpha feature. Supposedly, less than 100% coverage should disable depth writes (and blending is enabled), but the GSdx software renderer behavior here is extremely puzzling. I don’t really understand it yet.

32-bit fixed-point Z

I’ve yet to see any games actually using this, but technically, it has D32_UINT support. Fun! From what I could grasp, the GSdx software renderer implements this with FP64 (one of the many reasons I refuse to believe GSdx is bit-accurate), but FP64 is completely impractical on GPUs. When I have to, I’ll implement this with fixed-point math. 24-bit and 16-bit Z should be fine with FP32 interpolation, I think.

Pray you have programmable blending

If you’re on a pure TBDR GPU most of this is quite doable, but immediate mode desktop GPUs quickly degenerate into ROV or per-pixel barriers after every primitive to emulate programmable blending, both of which are horrifying for performance. Of course, with compute we can make our own TBDR to bypass all this. 🙂

D3D9-style raster rules

Primitives are fortunately provided in a plain form in clip-space. No awkward N64 edge equations here. The VU1 unit is supposed to do transforms and clipping, and emit various per-vertex attributes:

X/Y: 12.4 unsigned fixed-point
Z: 24-bit or 32-bit uint
FOG: 8-bit uint
RGBA: 8-bit, for per-vertex lighting
STQ: For perspective correct texturing with normalized coordinates. Q = 1 / w, S = s * Q, T = t * Q. Apparently the lower 8-bits of the mantissa are clipped away, so bfloat24? Q can be negative, which is always fun. No idea how this interacts with Inf and NaN …
UV: For non-perspective correct texturing. 12.4 fixed-point un-normalized.

  • Triangles are top-left raster, just like modern GPUs.
  • Pixel center is on integer coordinate, just like D3D9. (This is a common design mistake that D3D10+ and GL/Vulkan avoid).
  • Lines use Bresenham’s algorithm, which is not really feasible to upscale, so we have to fudge it with rect or parallelogram.
  • Points snap to nearest pixel. Unsure which rounding is used though … There is no interpolation ala gl_PointCoord.
  • Sprites are simple quads with two coordinates. STQ or UV can be interpolated and it seems to assume non-rotated coordinates. To support rotation, you’d need 3 coordinates to disambiguate.

All of this can be implemented fairly easily in normal graphics APIs, as long as we don’t consider upscaling. We have to rely on implementation details in GL and Vulkan, since these APIs don’t technically guarantee top-left raster rules.

Since X/Y is unsigned, there is an XY offset that can be applied to center the viewport where you want. This means the effective range of X/Y is +/- 4k pixels, a healthy guard band for 640×448 resolutions.
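A tiny sketch of that recentering, assuming OFX is the 12.4 offset from XYOFFSET:

static inline int32_t window_x_fp4(uint32_t vertex_x_fp4, uint32_t ofx_fp4)
{
    // Still 12.4 fixed-point, but now signed relative to the window origin.
    return int32_t(vertex_x_fp4) - int32_t(ofx_fp4);
}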

Vertex queue

The GS feels very much like old school OpenGL 1.0 with glVertex3f and friends. It even supports TRIANGLE_FAN! Amazing … RGBA, STQ and various registers are set, and every XYZ register write forms a vertex “kick” which latches vertex state and advances the queue. An XYZ register write may also be a drawing kick, which draws a primitive if the vertex queue is sufficiently filled. The vertex queue is managed differently depending on the topology. The semantics here seem to be pretty straightforward: strip primitives shift the queue by one, list primitives clear the queue, and triangle fans keep the first element in the queue.
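A hedged sketch of how I understand the queue semantics for the triangle topologies (placeholder types, not the actual implementation):

#include <vector>

enum class Topology { TriangleList, TriangleStrip, TriangleFan };
struct Vertex { /* latched RGBAQ, STQ/UV, FOG, XYZ ... */ };

struct VertexQueue
{
    Topology topology;
    std::vector<Vertex> queue;

    // Called on every XYZ register write, after latching the vertex state.
    void kick(const Vertex &v, bool drawing_kick)
    {
        queue.push_back(v);
        if (queue.size() < 3)
            return;

        if (drawing_kick)
            draw_triangle(queue[0], queue[1], queue[2]);

        switch (topology)
        {
        case Topology::TriangleList:  queue.clear(); break;                  // lists start over
        case Topology::TriangleStrip: queue.erase(queue.begin()); break;     // strips shift by one
        case Topology::TriangleFan:   queue.erase(queue.begin() + 1); break; // fans keep the first vertex
        }
    }

    void draw_triangle(const Vertex &, const Vertex &, const Vertex &) {}
};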

Fun swizzling formats

A clever idea is that while rendering to 24-bit color or 24-bit depth, there are 8 bits left unused in the MSBs. You can place textures there, because why not. The 8H, 4HL and 4HH formats support 8-bit and 4-bit palettes nicely.

Pixel coordinates on PS2 are arranged into “pages”, which are 8 KiB, then subdivided into 32 blocks, and then, the smaller blocks are swizzled into a layout that fits well with a DDA-style renderer. E.g. for 32-bit RGBA, a page is 64×32 pixels, and 32 8×8 blocks are Z-order swizzled into that page.
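A rough sketch of that addressing for 32-bit pixels. This only computes the block index, ignores the further pixel swizzle inside each 8×8 block, and isn’t claimed to be bit-exact against hardware:

static uint32_t block_index_psmct32(uint32_t x, uint32_t y, uint32_t fb_width_in_pages)
{
    // A page is 64x32 pixels (8 KiB), a block is 8x8 pixels (256 bytes).
    uint32_t page = (y / 32) * fb_width_in_pages + (x / 64);
    uint32_t bx = (x % 64) / 8; // 0..7
    uint32_t by = (y % 32) / 8; // 0..3
    // Z-order (Morton) interleave of the block coordinates within the page.
    uint32_t block_in_page = (bx & 1) | ((by & 1) << 1) | ((bx & 2) << 1) |
                             ((by & 2) << 2) | ((bx & 4) << 2);
    return page * 32 + block_in_page;
}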

Framebuffer cache and texture cache

There is a dedicated cache for framebuffer rendering and textures, one page’s worth. Games often abuse this to perform feedback loops, where they render on top of the pixels being sampled from. This is the root cause of extreme pain. N64 avoided this problem by having explicit copies into TMEM (and not really having the bandwidth to do elaborate feedback effects), and other consoles rendered to embedded SRAM (ala a tiler GPU), so these feedbacks aren’t as painful, but the GS is complete YOLO. Dealing with this gracefully is probably the biggest challenge. Combined with the PS2 being a bandwidth monster, developers knew how to take advantage of copious blending and blurring passes …

Texturing

Texturing on the GS is both very familiar, and arcane.

On the plus side, the texel-center is at half-pixel, just like modern APIs. It seems like it has 4-bit sub-texel precision instead of 8 however. This is easily solved with some rounding. It also seems to have floor-rounding instead of nearest-rounding for bi-linear.

The bi-linear filter is a normal bi-linear. No weird 3-point N64 filter here.

On the weirder side, there are two special addressing modes.

REGION_CLAMP supports an arbitrary clamp inside a texture atlas (wouldn’t this be nice in normal graphics APIs? :D). It also works with REPEAT, so you can have REPEAT semantics on border, but then clamp slightly into the next “wrap”. This is trivial to emulate.

REGION_REPEAT is … worse. Here we can have custom bit-wise computation per coordinate. So something like u’ = (u & MASK) | FIX. This is done per-coordinate in bi-linear filtering, which is … painful, but solvable. This is another weird PS1 feature that was likely inherited for compatibility. At least on PS1, there was no bi-linear filtering to complicate things 🙂
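A minimal sketch of what that looks like applied per bilinear tap (fetch_texel, filter_taps and the mask/fix names are placeholders here):

static uint32_t sample_region_repeat_bilinear(
    uint32_t base_u, uint32_t base_v,
    uint32_t mask_u, uint32_t fix_u,
    uint32_t mask_v, uint32_t fix_v)
{
    uint32_t taps[2][2];
    for (uint32_t dy = 0; dy < 2; dy++)
        for (uint32_t dx = 0; dx < 2; dx++)
        {
            // u' = (u & MASK) | FIX, applied to each of the four taps individually.
            uint32_t u = ((base_u + dx) & mask_u) | fix_u;
            uint32_t v = ((base_v + dy) & mask_v) | fix_v;
            taps[dy][dx] = fetch_texel(u, v);
        }
    return filter_taps(taps); // normal bilinear weighting of the four taps
}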

Mip-mapping is also somewhat curious. Rather than relying on derivatives, the log2 of interpolated Q factor, along with some scaling factors are used to compute the LOD. This is quite clever, but I haven’t really seen any games use it. The down-side is that triangle-setup becomes rather complex if you want to account for correct tri-linear filtering, and it cannot support e.g. anisotropic filtering, but this is 2000, who cares! Not relying on derivatives is a huge boon for the compute implementation.
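As a hedged sketch, with L and K being the scaling factors from TEX1, the LOD computation looks something like this:

#include <cmath>

static float compute_lod(float q, int L, float K)
{
    // log2(1 / |Q|), scaled by 2^L and biased by K, then clamped against the available mip levels.
    return std::log2(1.0f / std::fabs(q)) * float(1 << L) + K;
}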

Formats are always “normalized” to RGBA8_UNORM. 5551 format is expanded to 8888 without bit-replication. There is no RGBA4444 format.
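For example, the 5551 expansion looks roughly like this, where the single alpha bit expands through the TA0/TA1 values from TEXA (ignoring the AEM special case):

static uint32_t expand_5551(uint16_t c, uint8_t ta0, uint8_t ta1)
{
    uint32_t r = uint32_t(c & 0x1fu) << 3;          // 0x1f becomes 0xf8, not 0xff
    uint32_t g = uint32_t((c >> 5) & 0x1fu) << 3;
    uint32_t b = uint32_t((c >> 10) & 0x1fu) << 3;
    uint32_t a = (c & 0x8000u) ? ta1 : ta0;
    return r | (g << 8) | (b << 16) | (a << 24);
}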

It’s quite feasible to implement the texturing with plain bindless.

CLUT

This is a 1 KiB cache that holds the current palette. There is an explicit copy step from VRAM into that CLUT cache before it can be used. Why hello there, N64 TMEM!

The CLUT is organized such that it can hold one full 256 color palette in 32-bit colors. On the other end, it can hold 32 palettes of 16 colors at 16 bpp.

TEXFLUSH

There is an explicit command that functions like a “sync and invalidate texture cache”. In the beginning I was hoping to rely on this to guide the hazard tracking, but oh how naive I was. In the end, I simply had to ignore TEXFLUSH. Basically, there are two styles of caching we could take with GS.

With “maximal” caching, we can assume that frame buffer caches and texture caches are infinitely large. The only way a hazard needs to be considered is after an explicit flush. This … breaks hard. Either games forget to use TEXFLUSH (because it happened to work on real hardware), or they TEXFLUSH way too much.

With “minimal” caching, we assume there is no caching and hazards are tracked directly. Some edge case handling is considered for feedback loops.

I went with “minimal”, and I believe GSdx did too.

Poking registers with style – GIF

The way to interact with the GS hardware is through the GIF, which is basically a unit that reads data and pokes the correct hardware registers. At the start of a GIF packet, there is a header which configures which registers should be written to, and how many “loops” there are. This maps very well to mesh rendering. We can consider something like one “loop” being:

  • Write RGBA vertex color
  • Write texture coordinate
  • Write position with draw kick

And if we have 300 vertices to render, we’d use 300 loops. State registers can be poked through the Address + Data pair, which just encodes target register + 64-bit payload. It’s possible to render this way too of course, but it’s just inefficient.

Textures are uploaded through the same mechanism. Various state registers are written to set up transfer destinations, formats, etc, and a special register is nudged to transfer 64-bit at a time to VRAM.

Hello Trongle – GS

If you missed the brain-dead simplicity of OpenGL 1.0, this is the API for you! 😀

For testing purposes, I added a tool to generate a .gs dump format that PCSX2 can consume. This is handy for comparing implementation behavior.

First, we program the frame buffer and scissor:

TESTBits test = {};
test.ZTE = TESTBits::ZTE_ENABLED;
test.ZTST = TESTBits::ZTST_GREATER; // Inverse Z, LESS is not supported.
iface.write_register(RegisterAddr::TEST_1, test);

FRAMEBits frame = {};
frame.FBP = 0x0 / PAGE_ALIGNMENT_BYTES;
frame.PSM = PSMCT32;
frame.FBW = 640 / BUFFER_WIDTH_SCALE;
iface.write_register(RegisterAddr::FRAME_1, frame);

ZBUFBits zbuf = {};
zbuf.ZMSK = 0; // Enable Z-write
zbuf.ZBP = 0x118000 / PAGE_ALIGNMENT_BYTES;
iface.write_register(RegisterAddr::ZBUF_1, zbuf);

SCISSORBits scissor = {};
scissor.SCAX0 = 0;
scissor.SCAY0 = 0;
scissor.SCAX1 = 640 - 1;
scissor.SCAY1 = 448 - 1;
iface.write_register(RegisterAddr::SCISSOR_1, scissor);

Then we nudge some registers to draw:

struct Vertex
{
    PackedRGBAQBits rgbaq;
    PackedXYZBits xyz;
} vertices[3] = {};

for (auto &vert : vertices)
{
   vert.rgbaq.A = 0x80;
   vert.xyz.Z = 1;
}

vertices[0].rgbaq.R = 0xff;
vertices[1].rgbaq.G = 0xff;
vertices[2].rgbaq.B = 0xff;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[2].xyz.X = p2.x << SUBPIXEL_BITS;
vertices[2].xyz.Y = p2.y << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 0; // Turn off texturing.
prim.IIP = 1; // Interpolate RGBA (Gouraud shading)
prim.PRIM = int(PRIMType::TriangleList);

static const GIFAddr addr[] = { GIFAddr::RGBAQ, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

This draws a triangle. We provide coordinates directly in screen-space.

And finally, we need to program the CRTC. Most of this is just copy-pasta from whatever games tend to do.

auto &priv = iface.get_priv_register_state();

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_ALP;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0x0;
priv.bgcolor.G = 0x0;
priv.bgcolor.B = 0x0;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.pmode.ALP = 0xff;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

dump.write_vsync(0, iface);
dump.write_vsync(1, iface);

When the GS is dumped, we can load it up in PCSX2 and voila:

And here’s the same .gs dump played through parallel-gs-replayer with RenderDoc. For debugging, I’ve spent a lot of time making it reasonably convenient. The images are debug storage images where I can store before and after color, depth, debug values for interpolants, depth testing state, etc, etc. It’s super handy for narrowing down problem cases. The render pass can be split into 1 or more triangle chunks as needed.

To add some textures, and flex the capabilities of the CRTC a bit, we can try uploading a texture:

int chan;
auto *buf = stbi_load("/tmp/test.png", &w, &h, &chan, 4);
iface.write_image_upload(0x300000, PSMCT32, w, h, buf,
                         w * h * sizeof(uint32_t));
stbi_image_free(buf);

TEX0Bits tex0 = {};
tex0.PSM = PSMCT32;
tex0.TBP0 = 0x300000 / BLOCK_ALIGNMENT_BYTES;
tex0.TBW = (w + BUFFER_WIDTH_SCALE - 1) / BUFFER_WIDTH_SCALE;
tex0.TW = Util::floor_log2(w - 1) + 1;
tex0.TH = Util::floor_log2(h - 1) + 1;
tex0.TFX = COMBINER_DECAL;
tex0.TCC = 1; // Use texture alpha as blend alpha
iface.write_register(RegisterAddr::TEX0_1, tex0);

TEX1Bits tex1 = {};
tex1.MMIN = TEX1Bits::LINEAR;
tex1.MMAG = TEX1Bits::LINEAR;
iface.write_register(RegisterAddr::TEX1_1, tex1);

CLAMPBits clamp = {};
clamp.WMS = CLAMPBits::REGION_CLAMP;
clamp.WMT = CLAMPBits::REGION_CLAMP;
clamp.MINU = 0;
clamp.MAXU = w - 1;
clamp.MINV = 0;
clamp.MAXV = h - 1;
iface.write_register(RegisterAddr::CLAMP_1, clamp);

While PS2 requires POT sizes for textures, REGION_CLAMP is handy for NPOT. Super useful for texture atlases.

struct Vertex
{
    PackedUVBits uv;
    PackedXYZBits xyz;
} vertices[2] = {};

for (auto &vert : vertices)
    vert.xyz.Z = 1;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[1].uv.U = w << SUBPIXEL_BITS;
vertices[1].uv.V = h << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 1; // Turn on texturing.
prim.IIP = 0;
prim.FST = 1; // Use unnormalized coordinates.
prim.PRIM = int(PRIMType::Sprite);

static const GIFAddr addr[] = { GIFAddr::UV, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

Here we render a sprite with un-normalized coordinates.

Finally, we use the CRTC to do blending against white background.

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_CIRCUIT1;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0xff;
priv.bgcolor.G = 0xff;
priv.bgcolor.B = 0xff;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

Glorious 256×179 logo 😀

Implementation details

The rendering pipeline

Before we get into the page tracker, it’s useful to define a rendering pipeline where synchronization is implied between each stage.

  • Synchronize CPU copy of VRAM to GPU. This is mostly unused, but happens for save state load, or similar
  • Upload data to VRAM (or perform local-to-local copy)
  • Update CLUT cache from VRAM
  • Unswizzle VRAM into VkImages that can be sampled directly, and handle palettes as needed, sampling from CLUT cache
  • Perform rendering
  • Synchronize GPU copy of VRAM back to CPU. This will be useful for readbacks. Then CPU should be able to unswizzle directly from a HOST_CACHED_BIT buffer as needed

This pipeline matches what we expect a game to do over and over:

  • Upload texture to VRAM
  • Upload palette to VRAM
  • Update CLUT cache
  • Draw with texture
    • Trigger unswizzle from VRAM into VkImage if needed
    • Begins building a “render pass”, a batch of primitives

When there are no backwards hazards here, we can happily keep batching and defer any synchronization. This is critical to get any performance out of this style of renderer.

Some common hazards here include:

Copy to VRAM which was already written by copy

This is often a false positive, but we cannot track per-byte. This becomes a simple copy barrier and we move on.

Copy to VRAM where a texture was sampled from, or CLUT cache read from

Since the GS has a tiny 4 MiB VRAM, it’s very common that textures are continuously streamed in, sampled from, and thrown away. When this is detected, we have to submit all VRAM copy work and all texture unswizzle work, and then begin a new batch. Primitive batches are not disrupted.

This means we’ll often see:

  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Rendering

Sample texture that was rendered to

Similar, but here we need to flush out everything. This basically breaks the render pass and we start another one. Too many of these is problematic for performance obviously.

Copy to VRAM where rendering happened

Basically same as sampling textures, this is a full flush.

Other hazards are ignored, since they are implicitly handled by our pipeline.

Page tracker

Arguably, the hardest part of GS emulation is dealing with hazards. VRAM is read and written to with reckless abandon and any potential read-after-write or write-after-write hazard needs to be dealt with. We cannot rely on any game doing this for us, since PS2 GS just deals with sync in most cases, and TEXFLUSH is the only real command games will use (or forget to use).

Tracking per byte is ridiculous, so my solution is to first subdivide the 4 MiB VRAM into pages. A page is the unit for frame buffers and depth buffers, so it is the most meaningful place to start.

PageState

On page granularity, we track:

  • Pending frame buffer write?
  • Pending frame buffer read? (read-only depth)

Textures and VRAM copies have 256 byte alignment, and to avoid a ton of false positives, we need to track on a per-block basis. There are 32 blocks per page, so a u32 bit-mask is okay.

  • VRAM copy writes
  • VRAM copy reads
  • Pending read into CLUT cache or VkImage
  • Blocks which have been clobbered by any write; on the next texture cache invalidation, images that overlap are thrown away

As mentioned earlier, there are also cases where you can render to 24-bit color, while sampling from the upper 8-bits without hazard. We need to optimize for that case too, so there is also:

  • A write mask for framebuffers
  • A read mask for textures

In the example above, FB write mask is 0xffffff and texture cache mask is 0xff000000. No overlap, no invalidate 😀

For host access, there are also timeline semaphore values per page. These values state which sync point to wait for if the host desires mapped read or mapped write access. Mapped write access may require more sync than mapped read if there are pending reads on that page.
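Putting the above together, the per-page state looks roughly like this (field names are mine, not necessarily what the implementation uses):

struct PageState
{
    // Page-granular flags.
    bool pending_fb_write;
    bool pending_fb_read;          // e.g. read-only depth

    // Block-granular bit-masks, 32 blocks per page.
    uint32_t copy_write_blocks;
    uint32_t copy_read_blocks;
    uint32_t texture_read_blocks;  // pending reads into CLUT cache or VkImage
    uint32_t clobbered_blocks;     // written by anything; invalidates overlapping textures later

    // Channel masks, so 24-bit FB writes + 8H texture reads can co-exist without hazard.
    uint32_t fb_write_mask;
    uint32_t texture_read_mask;

    // Timeline semaphore values to wait on before mapped host access.
    uint64_t host_read_timeline;
    uint64_t host_write_timeline;
};

// The 24-bit color + 8H texture example above: no bit overlap, no invalidate.
static bool channel_masks_overlap(const PageState &page)
{
    return (page.fb_write_mask & page.texture_read_mask) != 0;
}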

Caching textures

Every page contains a list of VkImages which have been associated with it. When a page’s textures have been invalidated, the images are destroyed and have to be unswizzled again from VRAM.

There is a one-to-many relationship between textures and pages. A texture may span more than one page, and it’s enough for just one of those pages to be clobbered for the texture to be invalidated.

Overall, there are a lot of micro-details here, but the important thing to note is that conservative and simple tracking will not work for PS2 games. Tracking at a 256 byte block level and considering write/read masks is critical.

Special cases

There are various situations where we may have false positives due to how textures work. Since textures are POT sized, it’s fairly common for e.g. a 512×448 texture of a render target to be programmed as a 512×512 texture. The unused region should ideally be clamped out with REGION_CLAMP, but most games don’t. A render target might occupy those unused pages. As long as the game’s UV coordinates don’t extend into the unused red zone, there are no hazards, but this is very painful to track. We would have to analyze every single primitive to detect if it’s sampling into the red zone.

As a workaround, we ignore any potential hazard in that red zone, and just pray that a game isn’t somehow relying on ridiculous spooky-action-at-a-distance hazards to work in the game’s favor.

There are more spicy special cases, especially with texture sampling feedback, but that will be for later.

Updating CLUT in a batched way

Since we want to batch texture uploads, we have to batch CLUT uploads too. To make this work, we have 1024 copies of CLUT, a ring buffer of snapshots.

One workgroup loops through the updates and writes them to an SSBO. I did a similar thing for N64 RDP’s TMEM update, where TMEM was instanced. Fortunately, CLUT update is far simpler than TMEM update.

shared uint tmp_clut[512];

// ...

// Copy from previous instance to allow a
// CLUT entry to be partially overwritten and used later
uint read_index = registers.read_index * CLUT_SIZE_16;
tmp_clut[gl_LocalInvocationIndex] =
    uint(clut16.data[read_index]);
tmp_clut[gl_LocalInvocationIndex + 256u] =
    uint(clut16.data[read_index + 256u]);
barrier();

for (uint i = 0; i < registers.clut_count; i++)
{
  // ...
  if (active_lane)
  {
    // update tmp_clut. If 256 color, all threads participate.
    // 16 color update is a partial update.
  }

  // Flush current CLUT state to SSBO.
  barrier();
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex]);
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16 + 256u] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex + 256u]);
  barrier();
}

One potential optimization is that for 256 color / 32 bpp updates, we can parallelize the CLUT update, since nothing from previous iterations will be preserved, but the CLUT update time is tiny anyway.

Unswizzling textures from VRAM

Since this is Vulkan, we can just allocate a new VkImage, suballocate it from VkDeviceMemory and blast it with a compute shader.

Using Vulkan’s specialization constants, we specialize the texture format, and all the swizzling logic becomes straightforward code.

REGION_REPEAT shenanigans are also resolved here, so that the ubershader doesn’t have to consider that case and do manual bilinear filtering.

Even for render targets, we roundtrip through the VRAM SSBO. There is not really a point going to the length of trying to forward render targets into textures. Way too many bugs to squash and edge cases to think about.

Triangle setup and binning

Like paraLLEl-RDP, paraLLEl-GS is a tile-based renderer. Before binning can happen, we need triangle setup. As inputs, we provide attributes in three arrays.

Position
struct VertexPosition
{
  ivec2 pos;
  float z;     // TODO: Should be uint for 32-bit Z.
  int padding; // Free real-estate?
};
Per-Vertex attributes
struct VertexAttribute
{
  vec2 st;
  float q;
  uint rgba; // unpackUnorm4x8
  float fog; // overkill, but would be padding anyway
  u16vec2 uv;
};
Per-primitive attributes
struct PrimitiveAttribute
{
  i16vec4 bb; // Scissor
  // Index into state UBO, as well as misc state bits.
  uint state;
  // Texture state which should be scalarized. Affects code paths.
  // Also holds the texture index (for bindless).
  uint tex;
  // Texture state like lod scaling factors, etc.
  // Does not affect code paths.
  uint tex2;  
  uint alpha; // AFIX / AREF
  uint fbmsk;
  uint fogcol;
};

For rasterization, we have a straightforward barycentric-based rasterizer. It is heavily inspired by https://fgiesen.wordpress.com/2011/07/06/a-trip-through-the-graphics-pipeline-2011-part-6/, which in turn is based on A Parallel Algorithm for Polygon Rasterization (Pineda, 1988) and describes the “standard” way to write a rasterizer with parallel hardware. Of course, the PS2 GS is DDA, i.e. a scanline rasterizer, but in practice, this is just a question of nudging ULPs of precision, and since I’m not aware of a bit-exact description of the GS’s DDA, this is fine. paraLLEl-RDP implements the raw DDA form for example. It’s certainly possible if we have to.

As an extension to a straight-forward triangle rasterizer, I also need to support parallelograms. This is used to implement wide-lines and sprites. Especially wide-line is kinda questionable, but I’m not sure it’s possible to fully solve up-scaling + Bresenham in the general case. At least I haven’t run into a case where this really matters.

Evaluating coverage and barycentric I/J turns into something like this:

bool evaluate_coverage_single(PrimitiveSetup setup,
  bool parallelogram, 
  ivec2 parallelogram_offset,
  ivec2 coord, inout float i, inout float j)
{
  int a = idot3(setup.a, coord);
  int b = idot3(setup.b, coord);
  int c = idot3(setup.c, coord);

  precise float i_result = float(b) * setup.inv_area + setup.error_i;
  precise float j_result = float(c) * setup.inv_area + setup.error_j;
  i = i_result;
  j = j_result;

  if (parallelogram && a < 0)
  {
    b += a + parallelogram_offset.x;
    c += a + parallelogram_offset.y;
    a = 0;
  }

  return all(greaterThanEqual(ivec3(a, b, c), ivec3(0)));
}

inv_area is computed in a custom fixed-point RCP, which is ~24.0 bit accurate. Using the standard GPU RCP would be bad since it’s just ~22.5 bit accurate and not consistent across implementations. There is no reason to skimp on reproducibility and accuracy, since we’re not doing work per-pixel.

error_i and error_j terms are caused by the downsampling of the edge equations and tie-break rules. As a side effect of the GS’s [-4k, +4k] pixel range, the range of the cross-product requires 33-bit in signed integers. By downsampling a bit, we can get 32-bit integer math to work just fine with 8 sub-pixel accuracy for super-sampling / multi-sampling. Theoretically, this means our upper up-sampling limit is 8×8, but that’s ridiculous anyway, so we’re good here.

The parallelogram offsets are very small numbers meant to nudge the tie-break rules in our favor as needed. The exact details of the implementation escape me. I wrote that code years ago. It’s not very hard to derive however.

Every primitive gets a struct of transformed attributes as well. This is only read if we actually end up shading a primitive, so it’s important to keep this separate to avoid polluting caches with too much garbage.

struct TransformedAttributes
{
  vec4 stqf0;
  vec4 stqf1;
  vec4 stqf2;
  uint rgba0;
  uint rgba1;
  uint rgba2;
  uint padding;
  vec4 st_bb;
};

Using I/J like this will lead to small inaccuracies when interpolating primitives which expect to land exactly on the top-left corner of a texel with NEAREST filtering. To combat this, a tiny epsilon offset is used when snapping texture coordinates. Very YOLO, but what can you do. As far as I know, hardware behavior is sub-texel floor, not sub-texel round.

precise vec2 uv_1 = uv * scale_1;

// Want a soft-floor here, not round behavior.
const float UV_EPSILON_PRE_SNAP = 1.0 / 16.0;
// We need to bias less than 1 / 512th texel, so that linear filter will RTE to correct subpixel.
// This is a 1 / 1024th pixel bias to counter-act any non-POT inv_scale_1 causing a round-down event.
const float UV_EPSILON_POST_SNAP = 16.0 / 1024.0;

if (sampler_clamp_s)
  uv_1.x = texture_clamp(uv_1.x, region_coords.xz, LOD_1);
if (sampler_clamp_t)
  uv_1.y = texture_clamp(uv_1.y, region_coords.yw, LOD_1);

// Avoid micro-precision issues with UV and flooring + nearest.
// Exact rounding on hardware is somewhat unclear.
// SotC requires exact rounding precision and is hit particularly bad.
// If the epsilon is too high, then FF X save screen is screwed over,
// so ... uh, ye.
// We likely need a more principled approach that is actually HW accurate in fixed point.
uv_1 = (floor(uv_1 * 16.0 + UV_EPSILON_PRE_SNAP) + UV_EPSILON_POST_SNAP) *
       inv_scale_1 * 0.0625;

Binning

This is mostly uninteresting. Every NxN pixel block gets an array of u16 primitive indices to shade. This makes the maximum number of primitives per render pass 64k, but that’s enough for PS2 games. Most games I’ve seen so far tend to be between 10k and 30k primitives for the “main” render pass. I haven’t tested the real juggernauts of primitive grunt yet, but even so, having to do a little bit of incremental rendering isn’t a big deal.

NxN is usually 32×32, but it can be dynamically changed depending on how heavy the geometry load is. For large resolutions and high primitive counts, the binning and memory cost is unacceptable if the tile size is just 16×16, for example. One subgroup is responsible for iterating through all primitives in a block.

Since triangle setup and binning are state-less, the triangle setup and binning for back-to-back render passes are batched up nicely to avoid lots of silly barriers.

The ubershader

A key difference between N64 and PS2 is fill-rate and per-pixel complexity. For N64, the ideal approach is to specialize the rasterizing shader, write out per-pixel color + depth + coverage + etc, then merge that data in a much simpler ubershader that only needs to consider depth and blend state rather than full texturing state and combiner state. This is very bandwidth intensive on the GPU, but the alternative is the slowest ubershader written by man. We’re saved by the fact that N64 fill-rate is abysmal. Check out this video by Kaze to see how horrible it is.

The GS is a quite different beast. Fill-rate is very high, and per-pixel complexity is fairly low, so a pure ubershader is viable. We can also rely on bindless this time around too, so texturing complexity becomes a fraction of what I had to deal with on N64.

Fine-grained binning

Every tile is 4×4, 4×8 and 8×8 for subgroup sizes 16, 32 and 64 respectively. For super-sampling it’s even smaller (it’s 4×4 / 4×8 / 8×8 in the higher resolution domain instead).

In the outer loop, we pull in up to SubgroupSize’s worth of primitives, and bin them in parallel.

for (int i = 0; i < tile.coarse_primitive_count;
     i += int(gl_SubgroupSize))
{
  int prim_index = i + int(gl_SubgroupInvocationID);
  bool is_last_iteration = i + int(gl_SubgroupSize) >= 
                           tile.coarse_primitive_count;

  // Bin primitives to tile.
  bool binned_to_tile = false;
  uint bin_primitive_index;
  if (prim_index < tile.coarse_primitive_count)
  {
    bin_primitive_index = 
      uint(coarse_primitive_list.data[
           tile.coarse_primitive_list_offset + prim_index]);
    binned_to_tile = primitive_intersects_tile(bin_primitive_index);
  }

  // Iterate per binned primitive, do per pixel work now.
  // Scalar loop.
  uvec4 work_ballot = subgroupBallot(binned_to_tile);

In the inner loop, we can do a scalarized loop which checks coverage per-pixel, one primitive at a time.

// Scalar data
uint bit = subgroupBallotFindLSB(work_ballot);

if (gl_SubgroupSize == 64)
{
  if (bit >= 32)
    work_ballot.y &= work_ballot.y - 1;
  else
    work_ballot.x &= work_ballot.x - 1;
}
else
{
  work_ballot.x &= work_ballot.x - 1;
}

shade_primitive_index = subgroupShuffle(bin_primitive_index, bit);

Early Z

We can take advantage of early-Z testing of course, but we have to be careful if there are rasterized pixels we haven’t resolved yet, and there are Z-writes in flight. In this case we have to defer to late-Z to perform the test.

// We might have to remove opaque flag.
bool pending_z_write_can_affect_result =
  (pixel.request.z_test || !pixel.request.z_write) &&
  pending_shade_request.z_write;

if (pending_z_write_can_affect_result)
{
  // Demote the pixel to late-Z,
  // it's no longer opaque and we cannot discard earlier pixels.
  // We need to somehow observe the previous results.
  pixel.opaque = false;
}

Deferred on-tile shading

Since we’re an uber-shader, all pixels are “on-chip”, i.e. in registers, so we can take advantage of culling pixels that won’t be visible anyway. The basic idea here is that after rasterization, if a pixel is considered opaque, it will simply replace the shading request that exists for that framebuffer coordinate. It won’t be visible at all anyway.

Lazy pixel shading

We only need to perform shading when we really have to, i.e., we’re shading a pixel that depends on the previous pixel’s results. This can happen for e.g. alpha test (if test fails, we preserve existing data), color write masks, or of course, alpha blending.

If our pixel remains opaque, we can just kill the pending pixel shade request. Very nice indeed. The gain here wasn’t as amazing as I had hoped since PS2 games love blending, but it helps cull a lot of shading work.

if (pixel.request.coverage > 0)
{
  need_flush = !pixel.opaque && pending_shade_request.coverage > 0;

  // If there is no hazard, we can overwrite the pending pixel.
  // If not, defer the update until we run a loop iteration.
  if (!need_flush)
  {
    set_pending_shade_request(pixel.request, shade_primitive_index);
    pixel.request.coverage = 0;
    pixel.request.z_write = false;
  }
}

If we have flushes that need to happen, we do so if one pixel needs it. It’s just as fast to resolve all pixels anyway.

// Scalar branch
if (subgroupAny(need_flush))
{
  shade_resolve();
  if (has_work && pixel.request.coverage > 0)
    set_pending_shade_request(pixel.request, shade_primitive_index);
}

The resolve is a straightforward waterfall loop that stays in uniform control flow to be well defined on devices without maximal reconvergence support.

while (subgroupAny(has_work))
{
  if (has_work)
  {
    uint state_index =
      subgroupBroadcastFirst(pending_shade_request.state);
    uint tex = subgroupBroadcastFirst(prim_tex);
    if (state_index == pending_shade_request.state && prim_tex == tex)
    {
      has_work = false;
      shade_resolve(pending_primitive_index, state_index, tex);
    }
  }
}

This scalarization ensures that all branches on things like alpha test mode, blend modes, etc, are purely scalar, and GPUs like that. Scalarizing on the texture index is technically not that critical, but it means we end up hitting the same branches for filtering modes, UBOs for scaling factors are loaded uniformly, etc.

When everything is done, the resulting framebuffer color and depth is written out to SSBO. GPU bandwidth is kept to a minimum, just like a normal TBDR renderer.

Super-sampling

Just implementing single sampled rendering isn’t enough for this renderer to be really useful. The software renderer is certainly quite fast, but not fast enough to keep up with intense super-sampling. We can fix that now.

For e.g. 8x SSAA, we keep 10 versions of VRAM on the GPU.

  • 1 copy represents the single-sampled VRAM. It is super-sampled.
  • 1 copy represents the reference value for single-sampled VRAM. This allows us to track when we should discard the super-samples and splat the single sample to all. This can happen if someone copies to VRAM over a render target for whatever reason.
  • 8 copies which each represent the super-samples. Technically, we can reconstruct a higher resolution image from these samples if we really want to, but only the CRTC could easily do that.

When rendering super-sampled, we load the single-sampled VRAM and reference. If they match, we load the super-sampled version. This is important for cases where we’re doing incremental rendering.
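A hedged sketch of that per-pixel load rule (illustrative names, not the actual shader code):

// If the single-sampled value no longer matches its reference, something wrote to VRAM
// at single-sample rate, so the super-samples are stale and the single sample is splatted.
uint32_t load_color_sample(uint32_t word_addr, uint32_t sample_index)
{
    uint32_t single = vram_single[word_addr];
    if (single == vram_reference[word_addr])
        return vram_samples[sample_index][word_addr];
    else
        return single;
}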

On tile completion we use clustered subgroup ops to do multi-sample resolve, then write out the super-samples, and the two single-sampled copies.

uvec4 ballot_color = subgroupBallot(fb_color_dirty);
uvec4 ballot_depth = subgroupBallot(fb_depth_dirty);

// No need to mask, we only care about valid ballot for the
// first sample we write-back.
if (NUM_SAMPLES >= 16)
{
  ballot_color |= ballot_color >> 8u;
  ballot_depth |= ballot_depth >> 8u;
}

if (NUM_SAMPLES >= 8)
{
  ballot_color |= ballot_color >> 4u;
  ballot_depth |= ballot_depth >> 4u;
}

if (NUM_SAMPLES >= 4)
{
  ballot_color |= ballot_color >> 2u;
  ballot_depth |= ballot_depth >> 2u;
}

ballot_color |= ballot_color >> 1u;
ballot_depth |= ballot_depth >> 1u;

// GLSL does not accept cluster reduction as spec constant.
if (NUM_SAMPLES == 16)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 16) / 16.0);
else if (NUM_SAMPLES == 8)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 8) / 8.0);
else if (NUM_SAMPLES == 4)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 4) / 4.0);
else
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 2) / 2.0);

fb_color_dirty = subgroupInverseBallot(ballot_color);
fb_depth_dirty = subgroupInverseBallot(ballot_depth);

The main advantage of super-sampling over straight up-scaling is that up-scaling will still have jagged edges, and super-sampling retains a coherent visual look where 3D elements have similar resolution as UI elements. One of my pet peeves is when UI elements have a significantly different resolution from 3D objects and textures. HD texture packs can of course alleviate that, but that’s a very different beast.

Super-sampling also lends itself very well to CRT post-processing shading, which is also a nice bonus.

Dealing with super-sampling artifacts

It’s a fact of life that super-sampling always introduces horrible artifacts if not handled with utmost care. Mitigating this is arguably easier with software renderers than with traditional graphics APIs, since we’re not limited by the fixed function interpolators. These tricks won’t make it perfect by any means, but they greatly mitigate jank in my experience, and I already fixed many upscaling bugs that the GSdx Vulkan backend does not solve, as we shall see later.

Sprite primitives should always render at single-rate

Sprites are always UI elements or similar, and games do not expect us to up-scale them. Doing so either results in artifacts where we sample outside the intended rect, or we risk overblurring the image if bilinear filtering is used.

The trick here is just to force-snap the pixel coordinate we use when rasterizing and interpolating. This is very inefficient of course, but UI shouldn’t take up the entire screen. And if it does (like in a menu), the GPU load is tiny anyway.

const uint SNAP_RASTER_BIT = (1u << STATE_BIT_SNAP_RASTER);
const uint SNAP_ATTR_BIT = (1u << STATE_BIT_SNAP_ATTRIBUTE);

if (SUPER_SAMPLE && (prim_state & SNAP_RASTER_BIT) != 0)
  fb_pixel = tile.fb_pixel_single_rate;

res.request.coverage = evaluate_coverage(
  prim, fb_pixel, i, j,
  res.request.multisample, SAMPLING_RATE_DIM_LOG2);

Flat primitives should interpolate at single-pixel coordinate

Going further, we can demote SSAA interpolation to MSAA center interpolation dynamically. Many UI elements are unfortunately rendered with normal triangles, so we have to be a bit more careful. This snap only affects attribute interpolation, not Z of course.

res.request.st_bb = false;
if (SUPER_SAMPLE &&
    (prim_state & (SNAP_RASTER_BIT | SNAP_ATTR_BIT)) == SNAP_ATTR_BIT)
{
  vec2 snap_ij = evaluate_barycentric_ij(
    prim.b, prim.c, prim.inv_area,
    prim.error_i, prim.error_j, tile.fb_pixel_single_rate,
    SAMPLING_RATE_DIM_LOG2);

  i = snap_ij.x;
  j = snap_ij.y;
  res.request.st_bb = true;
}

Here, we snap interpolation to the top-left pixel. This fixes any artifacts for primitives which align their rendering to a pixel center, but some games are mis-aligned, so this snapping can cause texture coordinates to go outside the expected area. To clean this up, we compute a bounding box of final texture coordinates. Adding bounding boxes can technically cause notorious block-edge artifacts, but that was mostly a thing on PS1 since emulators like to convert nearest sampling to bilinear.

The heuristic for this is fairly simple. If perspective is used and all vertices in a triangle have the exact same Q, we assume it’s a flat UI primitive. The primitive’s Z coordinates must also match. This is done during triangle setup on the GPU. There can of course be false positives here, but it should be rare. In my experience this hack works well enough in the games I tried.
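A hedged sketch of that check, reusing the vertex structs from earlier:

static bool is_flat_ui_primitive(const VertexAttribute &a0, const VertexAttribute &a1,
                                 const VertexAttribute &a2,
                                 const VertexPosition &p0, const VertexPosition &p1,
                                 const VertexPosition &p2)
{
    bool same_q = a0.q == a1.q && a1.q == a2.q;
    bool same_z = p0.z == p1.z && p1.z == p2.z;
    return same_q && same_z;
}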

Results

Here’s a good example of up-sampling going awry in PCSX2. This is with Vulkan backend:

Notice the bloom on the glass being mis-aligned and a subtle (?) rectangular pattern being overlaid over the image. This is caused by a post-processing pass rendering in a page-like pattern, presumably to optimize for GS caching behavior.

 

With 8x SSAA in paraLLEl-GS it looks like this instead. There is FSR1 post-upscale in effect here which changes the look a bit, but the usual trappings of bad upscale cannot be observed here. This is another reason to super-sample: texture mis-alignment has a tendency to fix itself.

Also, if you’re staring at the perf numbers, this is RX 7600 in a low power state :’)

Typical UI issues can be seen in games as well. Here’s native resolution:

and 4x upscale, which … does not look acceptable.

This UI is tricky to render in upscaled mode, since it uses triangles, but the MSAA snap trick above works well and avoids all artifacts. With straight upscale, this is hard to achieve in normal graphics APIs since you’d need interpolateAtOffset beyond 0.5 pixels, which isn’t supported. Perhaps you could do custom interpolation with derivatives or something like that, but either way, this glitch can be avoided. The core message is basically to never upscale UI beyond plain nearest neighbor integer scale. It just looks bad.

There are cases where PCSX2 asks for high blending accuracy. One example is MGS2, and I found a spot where GPU perf is murdered. My desktop GPU cannot keep 60 FPS here at 4x upscale. PCSX2 asks you to turn up blend-accuracy for this game, but …

What happens here is we hit the programmable blending path with barrier between every primitive. Ouch! This wouldn’t be bad for the tiler mobile GPUs, but for a desktop GPU, it is where perf goes to die. The shader in question does subpassLoad and does programmable blending as expected. Barrier, tiny triangle, barrier, tiny triangle, hnnnnnnng.

paraLLEl-GS on the other hand always runs with 100% blend accuracy (assuming no bugs of course). Here’s 16xSSAA (equivalent to 4x upscale). This is just 25 W and 17% GPU utilization on RX 7600. Not bad.

Other difficult cases include texture sampling feedback. One particular case I found was in Valkyrie Profile 2.

This game has a case where it’s sampling its own pixel’s alpha as a palette index. Quirky as all hell, and similar to MGS2 there’s a barrier between every pixel.

In paraLLEl-GS, this case is detected, and we emit a magical texture index, which resolves to just looking at the in-register framebuffer color instead. Programmable blending go brr. These cases have to be checked per primitive, which is quite rough on CPU time, but it is what it is. If we don’t hit the good path, GPU performance completely tanks.

The trick here is to analyze the effective UV coordinates, and see if UV == framebuffer position. If we fall off this path, we have to go via texture uploads, which is bad.

ivec2 uv0_delta = uv0 - pos[0].pos;
ivec2 uv1_delta = uv1 - pos[1].pos;
ivec2 min_delta = min(uv0_delta, uv1_delta);
ivec2 max_delta = max(uv0_delta, uv1_delta);

if (!quad)
{
  ivec2 uv2_delta = uv2 - pos[2].pos;
  min_delta = min(min_delta, uv2_delta);
  max_delta = max(max_delta, uv2_delta);
}

int min_delta2 = min(min_delta.x, min_delta.y);
int max_delta2 = max(max_delta.x, max_delta.y);

// The UV offset must be in range of [0, 2^SUBPIXEL_BITS - 1].
// This guarantees snapping with NEAREST.
// 8 is ideal. That means pixel centers during interpolation
// will land exactly in the center of the texel.
// In theory we could allow LINEAR if uv delta was
// exactly 8 for all vertices.
if (min_delta2 < 0 || max_delta2 >= (1 << SUBPIXEL_BITS))
  return ColorFeedbackMode::Sliced;

// Perf go brrrrrrr.
return ColorFeedbackMode::Pixel;

// Then, when resolving the primitive's texture index:
if (feedback_mode == ColorFeedbackMode::Pixel)
{
  mark_render_pass_has_texture_feedback(ctx.tex0.desc);
  // Special index indicating on-tile feedback.
  // We could add a different sentinel for depth feedback.
  // 1024 CLUT instances and 32 sub-banks. Fits in 15 bits.
  // Use bit 15 MSB to mark feedback texture.
  return (1u << (TEX_TEXTURE_INDEX_BITS - 1u)) |
         (render_pass.clut_instance * 32 + uint32_t(ctx.tex0.desc.CSA));
}

It’s comfortably full-speed on PCSX2 here, despite the copious number of barriers, but paraLLEl-GS is reasonably close perf-wise, actually. At 8x SSAA.

Overall, we get away with 18 render pass barriers instead of 500+ which was the case without this optimization. You may notice the interlacing artifacts on the swirlies. Silly game has a progressive scan output, but downsamples it on its own to a field before hitting CRTC, hnnnnng 🙁 Redirecting framebuffer locations in CRTC might work as a per-game hack, but either way, I still need to consider a better de-interlacer. Some games actually render explicitly in fields (640×224), which is very annoying.

This scene in the MGS2 intro also exposes some funny edge cases with sampling.

To get the camo effect, it’s sampling its own framebuffer as a texture, with overlapping coordinates, but not pixel aligned, so this raises some serious questions about caching behavior. PCSX2 doesn’t seem to add any barriers here, and I kinda had to do the same thing. It looks fine to me compared to software renderer at least.

if (feedback_mode == ColorFeedbackMode::Sliced)
{
  // If game explicitly clamps the rect to a small region,
  // it's likely doing well-defined feedbacks.
  // E.g. Tales of Abyss main menu ping-pong blurs.
  // This code is quite flawed,
  // and I'm not sure what the correct solution is yet.
  if (desc.clamp.desc.WMS == CLAMPBits::REGION_CLAMP &&
      desc.clamp.desc.WMT == CLAMPBits::REGION_CLAMP)
  {
    ivec4 clamped_uv_bb(
      int(desc.clamp.desc.MINU),
      int(desc.clamp.desc.MINV),
      int(desc.clamp.desc.MAXU),
      int(desc.clamp.desc.MAXV));

    ivec4 hazard_bb(
      std::max<int>(clamped_uv_bb.x, bb.x),
      std::max<int>(clamped_uv_bb.y, bb.y),
      std::min<int>(clamped_uv_bb.z, bb.z),
      std::min<int>(clamped_uv_bb.w, bb.w));

    cache_texture = hazard_bb.x > hazard_bb.z ||
                    hazard_bb.y > hazard_bb.w;
  }
  else
  {
    // Questionable,
    // but it seems almost impossible to do this correctly and fast.
    // Need to emulate the PS2 texture cache exactly,
    // which is just insane.
    // This should be fine.
    cache_texture = false;
  }
}

If we’re in a mode where texture points directly to the frame buffer we should relax the hazard tracking a bit to avoid 2000+ barriers. This is clearly spooky since Tales of Abyss’s bloom effect as shown earlier depends on this to be well behaved, but in that case, at least it uses REGION_CLAMP to explicitly mark the ping-pong behavior. I’m not sure what the proper solution is here.

The only plausible solution to true bit-accuracy with real hardware is to emulate the caches directly, one pixel at a time. You can kiss performance good bye in that case.

One of the worst stress tests I’ve found so far has to be Shadow of the Colossus. Just in the intro, we can make the GPU kneel down to 24 FPS with maximum blend accuracy on PCSX2, at just 2x upscale! Even with normal blending accuracy, it is extremely heavy during the intro cinematic.

At 8x SSAA, perf is still looking pretty good for paraLLEl-GS, but it’s clearly sweating now.

We’re actually still CPU bound on the geometry processing. Optimizing the CPU code hasn’t been a huge priority yet. There’s unfortunately a lot of code that has to run per-primitive, where hazards lurk around every corner and have to be dealt with somehow. I do some obvious optimizations, but it’s not as well-oiled as PCSX2 in that regard.

Deck?

It seems fast enough to comfortably do 4x SSAA. Maybe not in SotC, but … hey. 😀

What now?

For now, the only real way to test this is through GS dumps. There’s a hack-patch for PCSX2 that lets you dump out a raw GS trace, which can be replayed. This works via mkfifo as a crude hack to test in real-time, but some kind of integration into an emulator needs to happen at some point if this is to turn into something that’s useful for end users.

There’s guaranteed to be a million bugs lurking since the PS2 library is ridiculously large and there’s only so much I can be arsed to test myself. At least, paraLLEl-GS has now become my preferred way to play PS2 games, so I can say mission complete.

A potential use case for this, due to its standalone library nature, is as a very old-school rendering API for the old greybeards around who still yearn for the days of PS2 programming for whatever reason :p

Real-time video streaming experiments with forward error correction

As I previously discussed in my PyroFling post about real-time video streaming, one challenge I mentioned was related to error correction. Using UDP, packet loss is inevitable, so there are two approaches to reduce streaming jank:

  • Error masking – hallucinate missed frames
  • Forward error correction (FEC) – add redundancy to avoid dropped packets

Re-sending packets is a waste of time in a low latency environment like this, so we can ignore that. If re-sending was okay, I’d just use TCP and forget about all of this anyway.

With intra-refresh, error masking is half decent, so I wanted to focus on FEC. Error correction is its own field of study, but I didn’t have the time to actually study the field. State-of-the-art error correction is extremely advanced, complex to implement and IP encumbered (*ahem*, RaptorQ, *ahem*), but I evaluated some less recent approaches.

Understanding the data

Which FEC mechanism we choose needs to take the input data into consideration.

N bytes split into 1024 byte sub-packets

A video packet is a variable number of bytes every frame, which I split up into sub-packets of 1024 bytes (+ header) each. A successful transmission only happens when all N bytes are received successfully. Sending partially valid data to the video decoder is likely going to result in horrible things happening, so if even one sub-packet is dropped, I have to drop the full frame.

Some error correction schemes rely on fixed block lengths, which isn’t ideal for our variable length input. The classic example everyone taking classes on the subject learns is the Hamming (7, 4) code, but this code is better suited for noisy analog channels where we don’t know if any bit was actually received correctly. What we really want is a method that takes extra knowledge about packet loss into account.

Erasure channel

Sending UDP packets over the internet functions like an erasure channel. At the receiver, we know if data was missed. Corrupt packets are dropped by the network (a random bit-error that still passes the CRC check is theoretically possible I suppose, but I don’t consider that).

Small block size

Since a packet is received all or nothing, we’re actually error correcting in a vectorized fashion. The message we’re looking at error correcting is byte n for all sub-packets P in [0, ceil(N / 1024)), where N is the video packet size in bytes. The error correction algorithm will perform the exact same operation for every byte n in a given packet.

Another way of looking at this is to consider every packet a single 8192-bit number, but that’s a very mathematical way of looking at it. Either way, given a typical 10 Mbit/s stream at 60 fps, we expect about 20 sub-packets per video frame. Some frames will be very small, and some will be larger.

Flexible redundancy ratios and block sizes

Block codes like Reed-Solomon are well known and very powerful, but they seemed a bit too rigid in their block structure. They also have block lengths that seem better suited to bit-streams.

Being able to adapt how much FEC is used is quite useful in a dynamic system such as streaming. A Compact Disc (which uses Reed-Solomon) has to bake in a fixed amount of error correction, but with streaming, a feedback channel can let us dynamically adjust the amount of error correction used as needed. I quickly rejected these codes.

The most basic FEC – YOLO XOR

I don’t think this code has a formal name, but understanding it is the foundation for the upcoming section. Given N packets, just take the XOR of all the packets and send that as one FEC block. If there is 1 packet loss, we can recover it by taking the XOR of all received packets and the FEC block. For small video frames with a small number of packets, this is actually the method I went with.
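
As a minimal sketch of the idea (hypothetical helper functions, not the actual PyroFling code), the encode and single-loss recovery boil down to:

// Minimal sketch of XOR parity over 1024-byte sub-packets.
// Hypothetical helpers, not the actual PyroFling implementation.
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using SubPacket = std::array<uint8_t, 1024>;

// Encoder: the FEC block is the XOR of all data sub-packets.
SubPacket make_xor_fec(const std::vector<SubPacket> &packets)
{
    SubPacket fec{};
    for (const auto &p : packets)
        for (size_t i = 0; i < p.size(); i++)
            fec[i] ^= p[i];
    return fec;
}

// Decoder: if exactly one sub-packet was lost, XOR the FEC block with
// every received sub-packet to reconstruct the missing one.
SubPacket recover_single_loss(const std::vector<SubPacket> &received, const SubPacket &fec)
{
    SubPacket missing = fec;
    for (const auto &p : received)
        for (size_t i = 0; i < p.size(); i++)
            missing[i] ^= p[i];
    return missing;
}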

The downside of course is that there is no obvious way to recover more than 1 packet. I found that spurious packet loss could have 2 or 3 drops in some cases, especially in very large video frames that span up to 100 sub-packets, so this approach was too naive for me.

Fountain codes

While looking around, I ran into a very clever scheme called a fountain code, in particular the Luby Transform. There is a nice YouTube video explaining it. I also dug up an old textbook from my university studies, which has a chapter (Chapter 50) dedicated to this with more mathematical rigor.

This method has some nice properties that are well suited for network transmission:

  • Designed for erasure channels (e.g. IP networks)
  • Flexible FEC ratios
  • Receiver can complete decode after receiving enough data packets (with some major caveats)

A fountain code is called so because the encoder can spit out an arbitrary number of packets. There is no fixed block structure, and the process is pseudo-random. As long as the encoder and decoder agree on a seed for the process, very little side channel data needs to be communicated.

The algorithm is essentially YOLO XOR, with a lot of statistical tweaks. First, we consider the degree d of a packet, which is the number of blocks we take the XOR of when generating a packet. A degree of 1 is the base case where we send a block as-is, and a degree of ceil(N / 1024) is taking the XOR of everything (i.e. the YOLO XOR case).

The packets chosen for XOR-ing are randomized. On the receiver end, we look at all our received packets, and if we find a case where we have a packet with degree d, and d – 1 of the packets have been recovered, we can recover the last one through … more XOR-ing. By recovering a new packet, this may cause other packets to reach this condition and the cycle continues. To kick-start this process, some packets with degree = 1 must be transmitted.
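
A sketch of that peeling decoder (layout and names are my own, and I’m assuming K ≤ 32 data blocks so a block set fits in a single bitmask):

// Sketch of the LT "peeling" decode loop, assuming K <= 32 data blocks so a
// block set fits in a uint32_t bitmask. Names and layout are hypothetical.
#include <array>
#include <bit>
#include <cstdint>
#include <vector>

struct FecBlock
{
    uint32_t mask;                     // which data blocks were XOR-ed together
    std::array<uint8_t, 1024> payload; // XOR of those blocks
};

static void xor_into(std::array<uint8_t, 1024> &dst, const std::array<uint8_t, 1024> &src)
{
    for (size_t i = 0; i < dst.size(); i++)
        dst[i] ^= src[i];
}

// blocks[] holds the data blocks; recovered_mask has a bit set for each block we already have.
void peel(std::vector<FecBlock> &fec, std::array<uint8_t, 1024> *blocks, uint32_t &recovered_mask)
{
    bool progress = true;
    while (progress)
    {
        progress = false;
        for (auto &f : fec)
        {
            // XOR out every block we already know and drop it from the mask.
            uint32_t known = f.mask & recovered_mask;
            while (known)
            {
                int b = std::countr_zero(known);
                known &= known - 1;
                xor_into(f.payload, blocks[b]);
                f.mask &= ~(1u << b);
            }

            // Effective degree 1: exactly one unknown block left, recover it.
            if (std::popcount(f.mask) == 1)
            {
                int b = std::countr_zero(f.mask);
                blocks[b] = f.payload;
                recovered_mask |= 1u << b;
                f.mask = 0;
                progress = true; // may unlock other FEC blocks, keep iterating
            }
        }
    }
}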

To make this work well, the literature describes a very particular distribution for d to minimize the expected redundancy. I implemented all of this, but I found some unfortunate practical problems. It is (very) possible I had bugs of course, but debugging completely random processes is not very fun and it’s not like I had a reference result to compare against.

Non-deterministic amount of packets needed to complete decode

Given the completely random process, it’s unbounded how many packets have to be encoded to actually be able to decode, even with no packet loss. Studying the literature, the examples I found seemed to assume a very large number of blocks K. K would be 10000 for example, and as K increases, the variance of redundancy ratios decreases. For my example of K = 20, the algorithm seemed to collapse. Occasionally, I needed 2 or 3x redundancy to complete the decode, which is obviously unacceptable.

Painful statistical modelling

The statistical distribution for the degree factor d depends on the number of blocks to send, K. This value changes every frame. K can be arbitrarily large, so computing LUTs got awkward.

Smol brain LT code

Following the enlightened example of grug brain, I massively simplified the LT code down to something that maybe isn’t as theoretically good, but it worked very well in practice for my particular needs, where packet loss ratios are fairly low and K is low.

This basically boils down to a heavily “rigged” LT, but otherwise the encoder and decoder do not really change.

Send all packets as-is first (d = 1)

This is an obvious thing to do. If there is no packet loss, we guarantee that we can start decoding immediately (good for latency). There is no randomness in this process.

Fixed degree factor

I found that a fixed d factor of K / 2 worked well. For odd K, alternate d between ceil(K / 2) and floor(K / 2). For large K, clamping the factor d to something reasonable like 64 worked well too.

Mirror selection of blocks

For every pair of blocks, we want to ensure that blocks selected for XOR are the complement of each other. This guarantees that by receiving a pair of FEC blocks, we will always be able to correct one missed packet. With 50% probability, we can recover two packet losses. The odd/even split above is designed to make sure that an odd/even pair always covers all K blocks. As the number of blocks increases, we’ll be able to recover more losses (with lower and lower probability).
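
A sketch of the rigged block selection (hypothetical helper; again assuming K ≤ 32 so a selection fits in one bitmask, and the d = 1 data blocks are sent separately before this):

// Sketch of the "rigged" selection: degree ~K / 2, generated in complementary
// pairs so any even/odd pair of FEC blocks together covers all K data blocks.
// Hypothetical helper; the seed must be shared between encoder and decoder.
#include <bit>
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint32_t> make_fec_masks(uint32_t K, uint32_t num_fec_blocks, uint64_t seed)
{
    std::mt19937_64 rng(seed);
    const uint32_t all = (K >= 32) ? ~0u : ((1u << K) - 1u);
    std::vector<uint32_t> masks;

    for (uint32_t i = 0; i < num_fec_blocks; i += 2)
    {
        // Pick ceil(K / 2) random blocks for the even FEC block.
        uint32_t mask = 0;
        while (uint32_t(std::popcount(mask)) < (K + 1) / 2)
            mask |= 1u << (rng() % K);
        masks.push_back(mask);

        // The odd FEC block gets the complement, i.e. the remaining floor(K / 2) blocks.
        if (i + 1 < num_fec_blocks)
            masks.push_back(all & ~mask);
    }
    return masks;
}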

Results

Given a fixed number of data packets, and N randomly lost packets, we can observe how well this FEC recovers data. The recovery rate for 1 lost packet is 100% due to our design, so that’s not interesting. XOR degree factor is 20.

With larger data blocks, and more FEC blocks to match the redundancy ratio, recovery ratio improves, but beyond 4 losses, the codec starts collapsing. That’s fine. I haven’t had too many issues with burst losses like these. XOR degree factor is 40. There’s some interesting stair-stepping here, which might be caused by the mirroring. This suggests we get the most bang for our bandwidth by using an odd number of FEC blocks.

It’s possible to tweak things however. Using a smaller degree is good when using a lot of FEC packets and more errors are expected. Maybe it’s possible to use a blend of high degree and low degree packets as well (basically the entire point of LT), but this kind of tweaking can be left to another time. PyroFling’s expectation is low number of losses per video frame and simplicity beats theoretical performance.

If we add e.g. 0.5% random, uncorrelated packet loss ratio, a degree factor of d = N / 2 seems much better.

For larger data sets, the number of expected losses starts increasing, so degree factors seem to prefer d = 100 over d = 200. For larger video frames, we’re far more likely to encounter at least one packet loss, so that’s why loss ratio for 0 FEC packets approaches 100%.

Is this even good?

Compared to the state of the art, this is likely far from optimal, but this was good enough for my uses and here’s the latest log from a 4-hour play session between Trondheim and Bergen:

2322932 complete, 54 dropped video, 9683 FEC recovered

~99.5% of dropped packets were avoided. This was at 25% FEC redundancy rate. Every dropped video packet is disruptive and lasts many frames, so this improvement was transformative.

This isn’t an academic project, so I don’t really care about comparing against a million different FEC algorithms. 🙂

UDP pacing to reduce bursty drops

A common technique in UDP streaming is to not send all packets immediately, but pace them over an interval. Sending over the full frame interval increases latency by quite a bit, but pacing a stream to a max instantaneous rate of e.g. 60 mbit/s worked alright. The added latency is only a few ms for a ~10-15 mbit/s stream, which is acceptable. Since I’m on Linux, and I was lazy, I rely on the kernel to do this automatically:

sudo tc qdisc add dev $iface root fq maxrate 60000000

Conclusion

Hopefully this demonstrates a simple FEC that is fairly accessible. The last piece of the PyroFling adventure will be to finally tackle Vulkan Video encode.

Modernizing Granite’s mesh rendering

Granite’s renderer is currently quite old school. It was written with 2017 mobile hardware in mind after all. Little to no indirect drawing, a bindful material system, etc. We’ve all seen that a hundred times before. Granite’s niche ended up exploring esoteric use cases, not high-end rendering, so it was never a big priority for me to fix that.

Now that mesh shading has started shipping and is somewhat proven in the wild with several games shipping UE5 Nanite, and Alan Wake II – which all rely on mesh shaders to not run horribly slow – it was time to make a more serious push towards rewriting the entire renderer in Granite. This has been a slow burn project that’s been haunting me for almost half a year at this point. I haven’t really had the energy to rewrite a ton of code like this in my spare time, but as always, holidays tend to give me some energy for these things. Video shenanigans have also kept me distracted this fall.

I’m still not done with this rewrite, but enough things have fallen into place, that I think it’s time to write down my findings so far.

Design requirements

Reasonable fallbacks

I had some goals for this new method. Unlike UE5 Nanite and Alan Wake II, I don’t want to hard-require actual VK_EXT_mesh_shader support to run acceptably. Just thinking in terms of meshlets should benefit us in plain multi-draw-indirect (MDI) as well. For various mobile hardware that doesn’t support MDI well (or at all …), I’d also like a fallback path that ends up using regular direct draws. That fallback path is necessary to evaluate performance uplift as well.

What Nanite does to fallback

This is something to avoid. Nanite relies heavily on rendering primitive IDs to a visibility buffer, where attributes are resolved later. In the primary compute software rasterizer, this becomes a 64-bit atomic, and in the mesh shader fallback, a single primitive ID is exported to fragment stage as a per-primitive varying, where fragment shader just does the atomic (no render targets, super fun to debug …). The problem here is that per-primitive varyings don’t exist in the classic vertex -> fragment pipeline. There are two obvious alternatives to work around this:

  • Geometry shaders. Pass-through mode can potentially be used if all the stars align on supported hardware, but using geometry shaders should revoke your graphics programmer’s license.
  • Unroll a meshlet into a non-indexed draw. Duplicate primitive ID into 3 vertices. Use flat shading to pull in the primitive ID.

From my spelunking in various shipped titles, Nanite does the latter, and fallback rendering performance is halved as a result (!). Depending on the game, meshlet fallbacks are either very common or very rare, so real world impact is scene and resolution dependent, but Immortals of Aveum lost 5-15% FPS when I tested it.

The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading suggests rendering out a visibility G-Buffer using InstanceID (fed through some mechanism) and SV_PrimitiveID, which might be worth exploring at some point. I’m not sure why Nanite did not go that route. It seems like it would have avoided the duplicated vertices.

Alan Wake II?

Mesh shaders are basically a hard requirement for this game. It will technically boot without mesh shader support, but the game gives you a stern warning about performance, and they are not kidding. I haven’t dug into what the fallback is doing, but I’ve seen people posting videos demonstrating sub-10 FPS on a 1080 Ti. Given the abysmal performance, I wouldn’t be surprised if they just disabled all culling and draw everything in the fallback.

A compressed runtime meshlet format

While studying https://github.com/zeux/meshoptimizer I found support for compressed meshes, a format that was turned into a glTF EXT. It seems to be designed for decompressing on CPU (completely serial algorithm), which was not all that exciting for me, but this sparked an idea. What if I could decompress meshlets on the GPU instead? There are two ways this can be useful:

  • Would it be fast enough to decompress inline inside the mesh shader? This can potentially save a lot of read bandwidth during rendering and save precious VRAM.
  • Bandwidth amplifier on asset loading time. Only the compressed meshlet format needs to go over PCI-e wire, and we decompress directly into VRAM. Similar idea to GDeflate and other compression formats, except I should be able to come up with something that is way faster than a general purpose algorithm and also give decent compression ratios.

I haven’t seen any ready-to-go implementation of this yet, so I figured this would be my starting point for the renderer. Always nice to have an excuse to write some cursed compute shaders.

Adapting to implementations

One annoying problem with mesh shading is that different vendors have very different fast paths through their hardware. There is no single implementation that fits all. I’ve spent some time testing various parameters and observing what makes NV and AMD go fast w.r.t. mesh shaders, with questionable results. I believe this is the number 1 reason mesh shaders are still considered a niche feature.

Since we’re baking meshlets offline, the format itself must be able to adapt to implementations that prefer 32/64/128/256 primitive meshlets. It must also adapt nicely to MultiDrawIndirect-style rendering.

Random-access

It should be efficient to decode meshlets in parallel, and in complete isolation.

The format

I went through some (read: way too many) design iterations before landing on this design.

256 vert/prim meshlets

Going wide means lower culling overhead, and emitting larger MDI calls avoids getting completely bottlenecked on command-stream frontend churn. I tried going lower than 256, but performance suffered greatly. 256 seemed like a good compromise. With 256 prim/verts, we can use 8-bit index buffers as well, which saves quite a lot of memory.

Sublets – 8×32 grouping

Considering the various hardware implementations, very few will be happy with full, fat 256-primitive meshlets. To remedy this, the encoding is grouped in units of 32 – a “sublet” – where we can shade the 8 groups independently, or have larger workgroups that shade multiple sublets together. Some flexibility here is key to being performance portable. At runtime we can specialize our shaders to fit whatever hardware we’re targeting.

Using grouping of 32 is core to the format as well, since we can exploit NV warps being 32-wide and force Wave32 on RDNA hardware to get subgroup accelerated mesh shading.

Format header

// Can point to mmap-ed file.
struct MeshView
{
    const FormatHeader *format_header;
    const Bound *bounds;
    const Bound *bounds_256; // Used to cull in units of 256 prims
    const Stream *streams;
    const uint32_t *payload;
    uint32_t total_primitives;
    uint32_t total_vertices;
    uint32_t num_bounds;
    uint32_t num_bounds_256;
};
struct FormatHeader
{
    MeshStyle style;
    uint32_t stream_count;
    uint32_t meshlet_count;
    uint32_t payload_size_words;
};

The style signals type of mesh. This is naturally engine specific.

  • Wireframe: A pure position + index buffer
  • Textured: Adds UV + Normal + Tangent
  • Skinned: Adds bone indices and weights on top

A stream is 32 values encoded in some way.

enum class StreamType
{
    Primitive = 0,
    Position,
    NormalTangentOct8,
    UV,
    BoneIndices,
    BoneWeights,
};

Each meshlet has stream_count number of Stream headers. The indexing is trivial:

streams[RuntimeHeader::stream_offset + int(StreamType)]

// 16 bytes
struct Stream
{
    union
    {
       uint32_t base_value[2];
       struct { uint32_t prim_count; uint32_t vert_count; } counts;
    } u;
    uint32_t bits;
    uint32_t offset_in_words;
};

This is where things get a bit more interesting. I ended up supporting some encoding styles that are tailored for the various attribute formats.

Encoding attributes

There are two parts to this problem. First is to decide on some N-bit fixed point values, and then find the most efficient way to pull those bits from a buffer. I went through several iterations on the actual bit-stuffing.

Base value + DELTA encoding

A base value is encoded in Stream::base_value, and the decoded bits are an offset from the base. To start approaching speed-of-light decoding, this is about as fancy as we can do it.

I went through various iterations of this model. The first idea had a predictive encoding between neighbor values, where subgroup scan operations were used to complete the decode, but it was too slow in practice, and didn’t really improve bit rates at all.

Index buffer

Since the sublet is just 32-wide, we can encode with 5-bit indices, i.e. 15 bits / primitive. There is no real reason to use delta encoding here, so instead of storing base values in the stream header, I opted to use those bits to encode vertex/index counts.

Position

This is decoded to 3×16-bit SINT. The shared exponent is stored in top 16 bits of Stream::bits.

vec3 position = ldexp(vec3(i16vec3(decoded)), exponent);

This facilitates arbitrary quantization as well.
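
As an encoder-side sketch of what that quantization could look like (my assumption of a plain shared-exponent scheme; the real encoder also delta-encodes against Stream::base_value):

// Sketch: pick a shared exponent so the meshlet's position extent fits in
// signed 16-bit fixed point, matching the ldexp() decode above.
// Not the actual Granite encoder.
#include <algorithm>
#include <cmath>
#include <cstdint>

int choose_shared_exponent(float max_abs_extent)
{
    int e;
    std::frexp(max_abs_extent, &e); // max_abs_extent ~= m * 2^e, m in [0.5, 1)
    return e - 15;                  // 15 magnitude bits + sign covers the extent
}

int16_t quantize(float v, int exponent)
{
    double scaled = std::ldexp(double(v), -exponent);
    return int16_t(std::lround(std::clamp(scaled, -32768.0, 32767.0)));
}

float dequantize(int16_t q, int exponent)
{
    return std::ldexp(float(q), exponent); // mirrors the GLSL decode
}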

UV

Similar idea as position, but 2×16-bit SINT. After decoding similar to position, a simple fixup is made to cater to typical UVs which lie in range of [0, +1], not [-1, +1].

vec2 uv = 0.5 * ldexp(vec2(i16vec2(decoded)), exponent) + 0.5;

Normal/Tangent

Encoded as 4×8-bit SNORM. Normal (XY) and Tangent (ZW) are encoded with Octahedral encoding from meshoptimizer library.

To encode the sign of tangent, Stream::bits stores 2 bits, which signals one of three modes:

  • Uniform W = -1
  • Uniform W = +1
  • LSB of decoded W encodes tangent W. Tangent’s second component loses 1 bit of precision.
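
A small decode sketch for the tangent sign; the concrete mode values below are an assumption on my part:

// Sketch of the tangent-sign decode. The actual bit values of the mode stored
// in Stream::bits are not documented here, so the enum values are assumptions.
#include <cstdint>

enum class TangentSignMode : uint32_t
{
    UniformNegative = 0, // every vertex has W = -1
    UniformPositive = 1, // every vertex has W = +1
    PerVertexLSB = 2     // LSB of the decoded W component holds the sign
};

float decode_tangent_sign(TangentSignMode mode, uint8_t decoded_w)
{
    switch (mode)
    {
    case TangentSignMode::UniformNegative: return -1.0f;
    case TangentSignMode::UniformPositive: return +1.0f;
    default:
        // The remaining bits of W still hold the tangent's second octahedral
        // component, just with one bit less precision.
        return (decoded_w & 1u) ? 1.0f : -1.0f;
    }
}
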
Bone index / weight

Basically same as Normal/Tangent, but ignore tangent sign handling.

First (failed?) idea – bitplane encoding

For a long time, I was pursuing bitplane encoding, which is one of the simplest ways to encode variable bitrates. We can encode 1 bit for 32 values by packing them in one u32. To speed up decoding further, I aimed to pack everything into 128-bit aligned loads. This avoids having to wait for tiny, dependent 32-bit loads.

For example, for index buffers:

uint meshlet_decode_index_buffer(
   uint stream_index, uint chunk_index,
   int lane_index)
{
    uint offset_in_b128 =
      meshlet_streams.data[stream_index].offset_in_b128;

    // Fixed 5-bit encoding.
    offset_in_b128 += 4 * chunk_index;

    // Scalar load. 64 bytes in one go.
    uvec4 p0 = payload.data[offset_in_b128 + 0];
    uvec4 p1 = payload.data[offset_in_b128 + 1];
    uvec4 p2 = payload.data[offset_in_b128 + 2];
    uvec4 p3 = payload.data[offset_in_b128 + 3];

    uint indices = 0;

    indices |= bitfieldExtract(p0.x, lane_index, 1) << 0u;
    indices |= bitfieldExtract(p0.y, lane_index, 1) << 1u;
    indices |= bitfieldExtract(p0.z, lane_index, 1) << 2u;
    indices |= bitfieldExtract(p0.w, lane_index, 1) << 3u;

    indices |= bitfieldExtract(p1.x, lane_index, 1) << 8u;
    indices |= bitfieldExtract(p1.y, lane_index, 1) << 9u;
    indices |= bitfieldExtract(p1.z, lane_index, 1) << 10u;
    indices |= bitfieldExtract(p1.w, lane_index, 1) << 11u;

    indices |= bitfieldExtract(p2.x, lane_index, 1) << 16u;
    indices |= bitfieldExtract(p2.y, lane_index, 1) << 17u;
    indices |= bitfieldExtract(p2.z, lane_index, 1) << 18u;
    indices |= bitfieldExtract(p2.w, lane_index, 1) << 19u;

    indices |= bitfieldExtract(p3.x, lane_index, 1) << 4u;
    indices |= bitfieldExtract(p3.y, lane_index, 1) << 12u;
    indices |= bitfieldExtract(p3.z, lane_index, 1) << 20u;

    return indices;
}

On Deck, this ends up looking like

s_buffer_load_dwordx4 x 4
v_bfe_u32 x 15
v_lshl_or_b32 x 15

Thinking about ALU and loads in terms of scalar and vectors can greatly help AMD performance when done right, so this approach felt natural.

For variable bit rates, I’d have code like:

if (bits & 4) { unroll_4_bits_bit_plane(); }
if (bits & 2) { unroll_2_bits_bit_plane(); }
if (bits & 1) { unroll_1_bit_bit_plane(); }

However, I abandoned this idea, since while it favors SMEM heavily, all the VALU bitfield ops weren’t exactly amazing for perf. I’m still just clocking out one bit per operation here. AMD performance was quite alright compared to what I ended up with in the end, but NVIDIA performance was abysmal, so I went back to the drawing board, and ended up with the absolute simplest solution that would work.

Tightly packed bits

This idea is to just literally pack bits together, clearly a revolutionary idea that no one has ever done before. A VMEM load or two per thread, then some shifts should be all that is needed to move the components into place.

E.g. for index buffers:

uvec3 meshlet_decode_index_buffer(uint stream_index,
   uint chunk_index,
   int lane_index)
{
  uint offset_in_words = 
    meshlet_streams.data[stream_index].offset_in_words;
  return meshlet_decode3(offset_in_words, lane_index, 5);
}

For the actual decode I figured it would be pretty fast if all the shifts could be done in 64-bit. At least AMD has native instructions for that.

uvec3 meshlet_decode3(uint offset_in_words,
   uint index,
   uint bit_count)
{
    const uint num_components = 3;
    uint start_bit = index * bit_count * num_components;
    uint start_word = offset_in_words + start_bit / 32u;
    start_bit &= 31u;
    uint word0 = payload.data[start_word];
    uint word1 = payload.data[start_word + 1u];
    uvec3 v;

    uint64_t word = packUint2x32(uvec2(word0, word1));
    v.x = uint(word >> start_bit);
    start_bit += bit_count;
    v.y = uint(word >> start_bit);
    start_bit += bit_count;
    v.z = uint(word >> start_bit);
    return bitfieldExtract(v, 0, int(bit_count));
}

There is one detail here. For 13, 14 and 15 bit components with uvec3 decode, more than two u32 words may be needed, so in this case the encoder must choose 16 bits. (16 bits works due to alignment.) This only comes up in position encoding, and the encoder can easily just ensure 12-bit deltas are enough, quantizing a bit more as necessary.

Mapping to MDI

Every 256-wide meshlet can turn into an indexed draw call with VK_INDEX_TYPE_UINT8_EXT, which is nice for saving VRAM. The “task shader” becomes a compute shader that dumps out a big multi-draw indirect buffer. The DrawIndex builtin in Vulkan ends up replacing WorkGroupID in mesh shader for pulling in per-meshlet data.
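
As a host-side sketch (the exact buffer layout and the use of the indirect-count path are assumptions of this sketch, not necessarily what Granite does):

// Host-side sketch of the MDI fallback. Buffer layout (draw count at offset 0,
// VkDrawIndexedIndirectCommands packed right after) is an assumption here.
#include <vulkan/vulkan.h>

void record_meshlet_mdi(VkCommandBuffer cmd, VkBuffer indirect, VkBuffer index_buffer,
                        uint32_t max_meshlets)
{
    // One big uint8 index buffer; each meshlet's draw uses firstIndex /
    // vertexOffset to point at its slice, with indexCount <= 256 * 3.
    vkCmdBindIndexBuffer(cmd, index_buffer, 0, VK_INDEX_TYPE_UINT8_EXT);

    // The culling compute pass wrote the visible-draw count and one
    // VkDrawIndexedIndirectCommand per surviving meshlet. Per-meshlet data
    // (transform, material, ...) is fetched via gl_DrawIDARB in the vertex shader.
    vkCmdDrawIndexedIndirectCount(cmd,
                                  indirect, sizeof(uint32_t), // commands after the count
                                  indirect, 0,                // draw count at offset 0
                                  max_meshlets,
                                  sizeof(VkDrawIndexedIndirectCommand));
}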

Performance sanity check

Before going further with mesh shading fun, it’s important to validate performance. I needed at least a ballpark idea of how many primitives could be pumped through the GPU with a good old vkCmdDrawIndexed and the MDI method where one draw call is one meshlet. This was then to be compared against a straightforward mesh shader.

Zeux’s Niagara renderer helpfully has a simple OBJ for us to play with.

When exported to the new meshlet format it looks like:

[INFO]: Stream 0: 54332 bytes. (Index) 15 bits / primitive
[INFO]: Stream 1: 75060 bytes. (Position) ~25 bits / pos
[INFO]: Stream 2: 70668 bytes. (Normal/Tangent) ~23.8 bits / N + T + sign
[INFO]: Total encoded vertices: 23738 // Vertex duplication :(
[INFO]: Average radius 0.037 (908 bounds) // 32-wide meshlet
[INFO]: Average cutoff 0.253 (908 bounds)
[INFO]: Average radius 0.114 (114 bounds) // 256-wide meshlet
[INFO]: Average cutoff 0.697 (114 bounds)
// Backface cone culling isn't amazing for larger meshlets.
[INFO]: Exported meshlet:
[INFO]: 908 meshlets
[INFO]: 200060 payload bytes
[INFO]: 86832 total indices
[INFO]: 14856 total attributes
[INFO]: 703872 uncompressed bytes

One annoying thing about meshlets is attribute duplication when one vertex is reused across meshlets, and using tiny 32-wide meshlets makes this way worse. Add padding on top for encode and the compression ratio isn’t that amazing anymore. The primitive to vertex ratio is ~1.95 here which is really solid, but turning things into meshlets tends to converge to ~1.0.

I tried different sublet sizes, but NVIDIA performance collapsed when I didn’t use 32-wide sublets, and going to 64 primitive / 32 vertex only marginally helped P/V ratios. AMD runtime performance did not like that in my testing (~30% throughput loss), so 32/32 it is!

After writing this section, AMD released a blog post suggesting that the 2N/N structure is actually good, but I couldn’t replicate that in my testing at least and I don’t have the energy anymore to rewrite everything (again) to test that.

Test scene

The classic “instance the same mesh a million times” strategy. This was tested on RTX 3070 (AMD numbers to follow, there are way more permutations to test there …). The mesh is instanced in a 13x13x13 grid. Here we’re throwing 63.59 million triangles at the GPU in one go.

Spam vkCmdDrawIndexed with no culling

5.5 ms

layout(location = 0) in vec3 POS;
layout(location = 1) in mediump vec3 NORMAL;
layout(location = 2) in mediump vec4 TANGENT;
layout(location = 3) in vec2 UV;

layout(location = 0) out mediump vec3 vNormal;
layout(location = 1) out mediump vec4 vTangent;
layout(location = 2) out vec2 vUV;

// The most basic vertex shader.
void main()
{
  vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
  vNormal = mat3(M) * NORMAL;
  vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
  vUV = UV;
  gl_Position = VP * vec4(world_pos, 1.0);
}

With per-object frustum culling

This is the most basic thing to do, so for reference.

4.3 ms

One massive MDI

Here we’re just doing basic frustum culling of meshlets as well as back-face cone culling and emitting one draw call per meshlet that passes test.

3.9 ms

Significantly more geometry is rejected now due to back-face cull and tighter frustum cull, but performance isn’t that much better. Once we start considering occlusion culling, this should turn into a major win over normal draw calls. In this path, we have a bit more indirection in the vertex shader, so that probably accounts for some loss as well.

void main()
{
    // Need to index now, but shouldn't be a problem on desktop hardware.
    mat4 M = transforms.data[draw_info.data[gl_DrawIDARB].node_offset];

    vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
    vNormal = mat3(M) * NORMAL;
    vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
    vUV = UV;

    // Need to pass down extra data to sample materials, etc.
    // Fragment shader cannot read gl_DrawIDARB.
    vMaterialID = draw_info.data[gl_DrawIDARB].material_index;

    gl_Position = VP * vec4(world_pos, 1.0);
}

Meshlet – Encoded payload

Here, the meshlet will read directly from the encoded payload, and decode inline in the shader. No per-primitive culling is performed.

4.1 ms

Meshlet – Decoded payload

4.0 ms

We’re at the point where we are bound on fixed function throughput. Encoded and Decoded paths are basically both hitting the limit of how much data we can pump to the rasterizer.

Per-primitive culling

To actually make good use of mesh shading, we need to consider per-primitive culling. For this section, I’ll be assuming a subgroup size of 32, and a meshlet size of 32. There are other code paths for larger workgroups, which require some use of groupshared memory, but that’s not very exciting for this discussion.

The gist of this idea was implemented in https://gpuopen.com/geometryfx/. Various AMD drivers adopted the idea as well to perform magic driver culling, but the code here isn’t based on any other code in particular.

Doing back-face culling correctly

This is tricky, but we only need to be conservative, not exact. We can only reject when we know for sure the primitive is not visible.

Perspective divide and clip codes

The first step is to do W divide per vertex and study how that vertex clips against the X, Y, and W planes. We don’t really care about Z. Near-plane clip is covered by negative W tests, and far plane should be covered by simple frustum test, assuming we have a far plane at all.

vec2 c = clip_pos.xy / clip_pos.w;

uint clip_code = clip_pos.w <= 0.0 ? CLIP_CODE_NEGATIVE_W : 0;
if (any(greaterThan(abs(c), vec2(4.0))))
    clip_code |= CLIP_CODE_INACCURATE;
if (c.x <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_X;
if (c.y <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_Y;
if (c.x >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_X;
if (c.y >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_Y;

vec2 window = roundEven(c * viewport.zw + viewport.xy);

There are things to unpack here. The INACCURATE clip code is used to denote a problem where we might start to run into accuracy issues when converting to fixed point, or GPUs might start doing clipping due to guard band exhaustion. I picked the value arbitrarily.

The window coordinate is then computed by simulating the fixed point window coordinate snapping done by real GPUs. Any GPU supporting DirectX will have a very precise way of doing this, so this should be okay in practice. Vulkan also exposes the number of sub-pixel bits in the viewport transform. On all GPUs I know of, this is 8. DirectX mandates exactly 8.

vec4 viewport =
    float(1 << 8 /* shader assumes 8 */) *
        vec4(cmd->get_viewport().x +
               0.5f * cmd->get_viewport().width - 0.5f,
             cmd->get_viewport().y +
               0.5f * cmd->get_viewport().height - 0.5f,
             0.5f * cmd->get_viewport().width,
             0.5f * cmd->get_viewport().height) -
             vec4(1.0f, 1.0f, 0.0f, 0.0f);

This particular way of doing it comes into play later when discussing micro-poly rejection. One thing to note here is that Vulkan clip-to-window coordinate transform does not flip Y-sign. D3D does however, so beware.

Shuffle clip codes and window coordinates

void meshlet_emit_primitive(uvec3 prim, vec4 clip_pos, vec4 viewport)
{
  // ...
  vec2 window = roundEven(c * viewport.zw + viewport.xy);

  // vertex ID maps to gl_SubgroupInvocationID
  // Fall back to groupshared as necessary
  vec2 window_a = subgroupShuffle(window, prim.x);
  vec2 window_b = subgroupShuffle(window, prim.y);
  vec2 window_c = subgroupShuffle(window, prim.z);
  uint code_a = subgroupShuffle(clip_code, prim.x);
  uint code_b = subgroupShuffle(clip_code, prim.y);
  uint code_c = subgroupShuffle(clip_code, prim.z);
}

Early reject or accept

Based on clip codes we can immediately accept or reject primitives.

uint or_code = code_a | code_b | code_c;
uint and_code = code_a & code_b & code_c;
bool culled_planes = (and_code & CLIP_CODE_PLANES) != 0;
bool is_active_prim = false;

if (!culled_planes)
{
    is_active_prim =
        (or_code & (CLIP_CODE_INACCURATE |
                    CLIP_CODE_NEGATIVE_W)) != 0;

    if (!is_active_prim)
        is_active_prim = cull_triangle(window_a,
                                       window_b,
                                       window_c);
}

  • If all three vertices are outside one of the clip planes, reject immediately
  • If any vertex is considered inaccurate, accept immediately
  • If one or two of the vertices have negative W, we have clipping. Our math won’t work, so accept immediately. (If all three vertices have negative W, the first test rejects).
  • Perform actual back-face cull.

Actual back-face cull

bool cull_triangle(vec2 a, vec2 b, vec2 c)
{
  precise vec2 ab = b - a;
  precise vec2 ac = c - a;

  // This is 100% accurate as long as the primitive
  // is no larger than ~4k subpixels, i.e. 16x16 pixels.
  // Normally, we'd be able to do GEQ test, but GE test is conservative,
  // even with FP error in play.

  // Depending on your engine and API conventions, swap these two.
  precise float pos_area = ab.y * ac.x;
  precise float neg_area = ab.x * ac.y;

  // If the pos value is (-2^24, +2^24),
  // the FP math is exact,
  // if not, we have to be conservative.
  // Less-than check is there to ensure that 1.0 delta
  // in neg_area *will* resolve to a different value.
  bool active_primitive;
  if (abs(pos_area) < 16777216.0)
    active_primitive = pos_area > neg_area;
  else
    active_primitive = pos_area >= neg_area;

  return active_primitive;
}

To compute winding, we need a 2D cross product. While noodling with this code, I noticed that we can still do it in FP32 instead of full 64-bit integer math. We’re working with integer-rounded values here, so based on the magnitudes involved we can pick the exact GEQ test. If we risk FP rounding error, we can use GE test. If the results don’t test equal, we know for sure area must be negative, otherwise, it’s possible it could have been positive, but the intermediate values rounded to same value in the end.

3.3 ms

Culling primitives helped as expected. Less pressure on the fixed function units.

Micro-poly rejection

Given how pathologically geometry dense this scene is, we expect that most primitives never trigger the rasterizer at all.

If we can prove that the bounding box of the primitive lands between two pixel grids, we can reject it since it will never have coverage.

if (active_primitive)
{
    // Micropoly test.
    const int SUBPIXEL_BITS = 8;
    vec2 lo = floor(ldexp(min(min(a, b), c), ivec2(-SUBPIXEL_BITS)));
    vec2 hi = floor(ldexp(max(max(a, b), c), ivec2(-SUBPIXEL_BITS)));
    active_primitive = all(notEqual(lo, hi));
}

There is a lot to unpack in this code. If we re-examine the viewport transform:

vec4 viewport = float(1 << 8 /* shader assumes 8 */) *
  vec4(cmd->get_viewport().x +
        0.5f * cmd->get_viewport().width - 0.5f,
      cmd->get_viewport().y +
        0.5f * cmd->get_viewport().height - 0.5f,
      0.5f * cmd->get_viewport().width,
      0.5f * cmd->get_viewport().height) -
      vec4(1.0f, 1.0f, 0.0f, 0.0f);

First, we need to shift by 0.5 pixels. The rasterization test happens at the center of a pixel, and it’s more convenient to sample at integer points. Then, due to top-left rasterization rules on all desktop GPUs (a DirectX requirement), we shift the result by one sub-pixel. This ensures that should a primitive have a bounding box of [1.0, 2.0], we will consider it for rasterization, but [1.0 + 1.0 / 256.0, 2.0] will not. Top-left rules are not technically guaranteed in Vulkan however (it just has to have some rule), so if you’re paranoid, increase the upper bound by one sub-pixel.

1.9 ms

Now we’re only submitting 1.2 M primitives to the rasterizer, which is pretty cool, given that we started with 31 M potential primitives. Of course, this is a contrived example with ridiculous micro-poly issues.

We’re actually at the point here where reporting the invocation stats (one atomic per workgroup) becomes a performance problem, so turning that off:

1.65 ms

With inline decoding there’s some extra overhead, but we’re still well ahead:

2.5 ms

Build active vertex / primitive masks

This is quite straightforward. Once we have the counts, SetMeshOutputCounts is called and we can compute the packed output indices with a mask and popcount.

uint vert_mask = 0u;
if (is_active_prim)
    vert_mask = (1u << prim.x) | (1u << prim.y) | (1u << prim.z);

uvec4 prim_ballot = subgroupBallot(is_active_prim);

shared_active_prim_offset = subgroupBallotExclusiveBitCount(prim_ballot);
shared_active_vert_mask = subgroupOr(vert_mask);

shared_active_prim_count_total = subgroupBallotBitCount(prim_ballot);
shared_active_vert_count_total = bitCount(shared_active_vert_mask);

Special magic NVIDIA optimization

Can we improve things from here? On NVIDIA, yes. NVIDIA seems to under-dimension the shader export buffers in their hardware compared to peak triangle throughput, and their developer documentation on the topic suggests:

  • Replace attributes with barycentrics and allowing the Pixel Shader to fetch and interpolate the attributes

Using VK_KHR_fragment_shader_barycentrics we can write code like:

// Mesh output
layout(location = 0) flat out uint vVertexID[];
layout(location = 1) perprimitiveEXT out uint vTransformIndex[];

// Fragment
layout(location = 0) pervertexEXT in uint vVertexID[];
layout(location = 1) perprimitiveEXT flat in uint vTransformIndex;

// Fetch vertex IDs
uint va = vVertexID[0];
uint vb = vVertexID[1];
uint vc = vVertexID[2];

// Load attributes from memory directly
uint na = attr.data[va].n;
uint nb = attr.data[vb].n;
uint nc = attr.data[vc].n;

// Interpolate by hand
mediump vec3 normal = gl_BaryCoordEXT.x * decode_rgb10a2(na) +
    gl_BaryCoordEXT.y * decode_rgb10a2(nb) +
    gl_BaryCoordEXT.z * decode_rgb10a2(nc);

// Have to transform normals and tangents as necessary.
// Need to pass down some way to load transforms.
normal = mat3(transforms.data[vTransformIndex]) * normal;
normal = normalize(normal);

1.0 ms

Quite the dramatic gain! Nsight Graphics suggests we’re finally SM bound at this point (> 80% utilization), where we used to be ISBE bound (primitive / varying allocation). An alternative that I assume would work just as well is to pass down a primitive ID to a G-buffer similar to Nanite.

There are a lot of caveats with this approach however, and I don’t think I will pursue it:

  • Moves a ton of extra work to fragment stage
    • I’m not aiming for Nanite-style micro-poly hell here, so doing work per-vertex seems better than per-fragment
    • This result isn’t representative of a real scene where fragment shader load would be far more significant
  • Incompatible with encoded meshlet scheme
    • It is possible to decode individual values, but it sure is a lot of dependent memory loads to extract a single value
  • Very awkward to write shader code like this at scale
    • Probably need some kind of meta compiler that can generate code, but that’s a rabbit hole I’m not going down
    • Need fallbacks, barycentrics is a very modern feature
  • Makes skinning even more annoying
    • Loading multiple matrices with fully dynamic index in fragment shader does not scream performance, then combine that with having to compute motion vectors on top …
  • Only seems to help throughput on NVIDIA
  • We’re already way ahead of MDI anyway

Either way, this result was useful to observe.

AMD

Steam Deck

Before running the numbers, we have to consider that the RADV driver already does some mesh shader optimizations for us automatically. The NGG geometry pipeline automatically converts vertex shading workloads into pseudo-meshlets, and RADV also does primitive culling in the driver-generated shader.

To get the raw baseline, we’ll first consider the tests without that path, so we can see how well RADV’s own culling is doing. The legacy vertex path is completely gone on RDNA3 as far as I know, so these tests have to be done on RDNA2.

No culling, plain vkCmdDrawIndexed, RADV_DEBUG=nongg

Even locked to 1600 MHz (peak), GPU is still just consuming 5.5 W. We’re 100% bound on fixed function logic here, the shader cores are sleeping.

44.3 ms

Basic frustum culling

As expected, performance scales as we cull. Still 5.5 W. 27.9 ms

NGG path, no primitive culling, RADV_DEBUG=nonggc

Not too much changed in performance here. We’re still bound on the same fixed function units pumping invisible primitives through. 28.4 ms

Actual RADV path

When we don’t cripple RADV, we get a lot of benefit from driver culling. GPU hits 12.1 W now. 9.6 ms

MDI

Slight win. 8.9 ms

Forcing Wave32 in mesh shaders

Using Vulkan 1.3’s subgroup size control feature, we can force RDNA2 to execute in Wave32 mode. This requires support in

 VkShaderStageFlags requiredSubgroupSizeStages;

The Deck drivers and upstream Mesa ship support for required subgroup sizes in task/mesh shaders now, which is very handy. AMD’s Windows drivers or AMDVLK/amdgpu-pro do not, however 🙁 It’s possible Wave32 isn’t the best idea for AMD mesh shaders in the first place, it’s just that the format favors Wave32, so I enable it if I can.
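
For reference, forcing a subgroup size of 32 through the core Vulkan 1.3 API looks roughly like this (a sketch; the stage must be covered by requiredSubgroupSizeStages):

// Sketch: forcing a required subgroup size of 32 for a mesh shader stage
// with Vulkan 1.3 subgroup size control. The stage must be included in
// VkPhysicalDeviceVulkan13Properties::requiredSubgroupSizeStages.
#include <vulkan/vulkan.h>

VkPipelineShaderStageRequiredSubgroupSizeCreateInfo required_size = {
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO,
    nullptr,
    32, // requiredSubgroupSize
};

VkPipelineShaderStageCreateInfo stage = {
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
    &required_size, // chained into the stage create info
    0,
    VK_SHADER_STAGE_MESH_BIT_EXT,
    VK_NULL_HANDLE, // shader module goes here
    "main",
    nullptr,
};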

Testing various parameters

While NVIDIA really likes 32/32 (anything else I tried fell off the perf cliff), AMD should in theory favor larger workgroups. However, it’s not that easy in practice, as I found.

Decoded meshlet – Wave32 – N/N prim/vert

  • 32/32: 9.3 ms
  • 64/64: 10.5 ms
  • 128/128: 11.2 ms
  • 256/256: 12.8 ms

These results are … surprising.

Encoded meshlet – Wave32 N/N prim/vert

  • 32/32: 10.7 ms
  • 64/64: 11.8 ms
  • 128/128: 12.7 ms
  • 256/256: 14.7 ms

Apparently Deck (or RDNA2 in general) likes small meshlets?

Wave64?

No meaningful difference in performance on Deck.

VertexID passthrough?

No meaningful difference either. This is a very NVIDIA-centric optimization I think.

A note on LocalInvocation output

In Vulkan, there are some properties that AMD sets for mesh shaders.

VkBool32 prefersLocalInvocationVertexOutput;
VkBool32 prefersLocalInvocationPrimitiveOutput;

This means that we should write outputs using LocalInvocationIndex, which corresponds to how RDNA hardware works. Each thread can export one primitive and one vertex and the thread index corresponds to primitive index / vertex index. Due to culling and compaction, we will have to roundtrip through groupshared memory somehow to satisfy this.

For the encoded representation, I found that it’s actually faster to ignore this suggestion, but for the decoded representation, we can just send the vertex IDs through groupshared, and do split vertex / attribute shading. E.g.:

if (meshlet_lane_has_active_vert())
{
    uint out_vert_index = meshlet_compacted_vertex_output();
    uint vert_id = meshlet.vertex_offset + linear_index;

    shared_clip_pos[out_vert_index] = clip_pos;
    shared_attr_index[out_vert_index] = vert_id;
}

barrier();

if (gl_LocalInvocationIndex < shared_active_vert_count_total)
{
    TexturedAttr a =
      attr.data[shared_attr_index[gl_LocalInvocationIndex]];
    mediump vec3 n = unpack_bgr10a2(a.n).xyz;
    mediump vec4 t = unpack_bgr10a2(a.t);
    gl_MeshVerticesEXT[gl_LocalInvocationIndex].gl_Position =
      shared_clip_pos[gl_LocalInvocationIndex];
    vUV[gl_LocalInvocationIndex] = a.uv;
    vNormal[gl_LocalInvocationIndex] = mat3(M) * n;
    vTangent[gl_LocalInvocationIndex] = vec4(mat3(M) * t.xyz, t.w);
}

Only computing visible attributes is a very common optimization in GPUs in general and RADV’s NGG implementation does it roughly like this.

Either way, we’re not actually beating the driver-based meshlet culling on Deck. It’s more or less already doing this work for us. Given how close the results are, it’s possible we’re still bound on something that’s not raw compute. On the positive side, the cost of using encoded representation is very small here, and saving RAM for meshes is always nice.

Already, the permutation hell is starting to become a problem. It’s getting quite obvious why mesh shaders haven’t taken off yet 🙂

RX 7600 numbers

Data dump section incoming …

NGG culling seems obsolete now?

By default RADV disables NGG culling on RDNA3, because apparently it has a much stronger fixed function culling in hardware now. I tried forcing it on with RADV_DEBUG=nggc, but found no uplift in performance for normal vertex shaders. Curious. Here’s with no culling, where the shader is completely export bound.

But, force NGG on, and it still doesn’t help much. Culling path takes as much time as the other, the instruction latencies are just spread around more.

RADV

  • vkCmdDrawIndexed, no frustum culling: 5.9 ms
  • With frustum cull: 3.7 ms
  • MDI: 5.0 ms

Wave32 – Meshlet

  • Encoded – 32/32: 3.3 ms
  • Encoded – 64/64: 2.5 ms
  • Encoded – 128/128: 2.7 ms
  • Encoded – 256/256: 2.9 ms
  • Decoded – 32/32: 3.3 ms
  • Decoded – 64/64: 2.4 ms
  • Decoded – 128/128: 2.6 ms
  • Decoded – 256/256: 2.7 ms

Wave64 – Meshlet

  • Encoded – 64/64: 2.4 ms
  • Encoded – 128/128: 2.6 ms
  • Encoded – 256/256: 2.7 ms
  • Decoded – 64/64: 2.2 ms
  • Decoded – 128/128: 2.5 ms
  • Decoded – 256/256: 2.7 ms

Wave64 mode is doing quite well here. From what I understand, RADV hasn’t fully taken advantage of the dual-issue instructions in RDNA3 specifically yet, which is important for Wave32 performance, so that might be a plausible explanation.

There was also no meaningful difference in doing VertexID passthrough.

It’s not exactly easy to deduce anything meaningful out of these numbers, other than 32/32 being bad on RDNA3, while good on RDNA2 (Deck)?

AMD doesn’t seem to like the smaller 256-primitive draws on the larger desktop GPUs. I tried 512 and 1024 as a quick test and that improved throughput considerably. Still, with finer-grained culling in place, it should be a significant win.

amdgpu-pro / proprietary (Linux)

Since we cannot request specific subgroup size, the driver is free to pick Wave32 or Wave64 as it pleases, so I cannot test the difference. It won’t hit the subgroup optimized paths however.

  • vkCmdDrawIndexed, no culling : 6.2 ms
  • With frustum cull: 4.0 ms
  • MDI: 5.3 ms
  • Meshlet – Encoded – 32/32: 2.5 ms
  • Meshlet – Encoded – 64/64 : 2.6 ms
  • Meshlet – Encoded – 128/128: 2.7 ms
  • Meshlet – Encoded – 256/256: 2.6 ms
  • Meshlet – Decoded – 32/32: 2.1 ms
  • Meshlet – Decoded – 64/64: 2.1 ms
  • Meshlet – Decoded – 128/128: 2.1 ms
  • Meshlet – Decoded – 256/256: 2.1 ms

I also did some quick spot checks on AMDVLK, and the numbers are very similar.

The proprietary driver is doing quite well here in mesh shaders. On desktop, we can get significant wins on both RADV and proprietary with mesh shaders, which is nice to see.

It seems like the AMD Windows driver skipped NGG culling on RDNA3 as well. Performance is basically the same.

Task shader woes

The job of task shaders is to generate mesh shader work on the fly. In principle this is nicer than indirect rendering with mesh shaders for two reasons:

  • No need to allocate temporary memory to hold the indirect draws
  • No need to add extra compute passes with barriers

However, it turns out that this shader stage is even more vendor specific when it comes to tuning for performance. So far, no game I know of has actually shipped with task shaders (or the D3D12 equivalent amplification shader), and I think I now understand why.

The basic task unit I settled on was:

struct TaskInfo
{
    uint32_t aabb_instance;  // AABB, for top-level culling
    uint32_t node_instance;  // Affine transform
    uint32_t material_index; // To be eventually forwarded to fragment
    uint32_t mesh_index_count;
    // Encodes count [1, 32] in lower bits.
    // Mesh index is aligned to 32.
    uint32_t occluder_state_offset;
    // For two-phase occlusion state (for later)
};

An array of these is prepared on CPU. Each scene entity translates to one or more TaskInfos. Those are batched up into one big buffer, and off we go.

The logical task shader for me was to have N = 32 threads which test the AABBs of N tasks in parallel. For the tasks that pass the test, test 32 meshlets in parallel. This makes it so the task workgroup can emit up to 1024 meshlets.

When I tried this on NVIDIA however …

18.8 ms

10x slowdown … The NVIDIA docs do mention that large outputs are bad, but I didn’t expect it to be this bad:

Avoid large outputs from the amplification shader, as this can incur a significant performance penalty. Generally, we encourage a flexible implementation that allows for fine-tuning. With that in mind, there are a number of generic factors that impact performance:

  • Size of the payloads. The AS payload should preferably stay below 108 bytes, but if that is not possible, then keep it at least under 236 bytes.

If we remove all support for hierarchical culling, the task shader runs alright again. 1 thread emits 0 or 1 meshlet. However, this means a lot of threads are dedicated to just culling, and it ends up similar in performance to plain indirect mesh shading.

AMD however, is a completely different story. Task shaders are implemented by essentially emitting a bunch of tiny indirect mesh shader dispatches anyway, so the usefulness of task shaders on AMD is questionable from a performance point of view. While writing this blog, AMD released a new blog on the topic, how convenient!

When I tried NV-style task shader on AMD, performance suffered quite a lot.

However, the only thing that gets us to max perf on both AMD and NV is to forget about task shaders and go with vkCmdDrawMeshTasksIndirectCountEXT instead. While the optimal task shader path for each vendor gets close to indirect mesh shading, having a universal fast path is good for my sanity. The task shader loss was about 10% for me even in ideal situations on both vendors, which isn’t great. As rejection ratios increase, this loss grows even more. This kind of occupancy looks way better 🙂

The reason for using multi-indirect-count is to deal with the limitation that we can only submit about 64k workgroups in any dimension, similar to compute. This makes 1D atomic increments awkward, since we’ll easily blow past the 64k limit. One alternative is to take another tiny compute pass that prepares a multi-indirect draw, but that’s not really needed. Compute shader code like this works too:

// global_offset = atomicAdd() in thread 0

if (gl_LocalInvocationIndex == 0 && draw_count != 0)
{
  uint max_global_offset = global_offset + draw_count - 1;
  // Meshlet style.
  // Only guaranteed to get 0xffff meshlets,
  // so use 32k as cutoff for easy math.
  // Allocate the 2D draws in-place, avoiding an extra barrier.
  uint multi_draw_index = max_global_offset / 0x8000u;
  uint local_draw_index = max_global_offset & 0x7fffu;
  const int INC_OFFSET = NUM_CHUNK_WORKGROUPS == 1 ? 0 : 1;
  atomicMax(output_draws.count[1], multi_draw_index + 1);
  atomicMax(output_draws.count[
    2 + 3 * multi_draw_index + INC_OFFSET],
    local_draw_index + 1);

  if (local_draw_index <= draw_count)
  {
    // This is the thread that takes us over the threshold.
    output_draws.count[
      2 + 3 * multi_draw_index + 1 - INC_OFFSET] =
      NUM_CHUNK_WORKGROUPS;
    output_draws.count[2 + 3 * multi_draw_index + 2] = 1;
  }

  // Wrapped around, make sure last bucket sees 32k meshlets.
  if (multi_draw_index != 0 && local_draw_index < draw_count)
  {
    atomicMax(output_draws.count[
      2 + 3 * (multi_draw_index - 1) +
      INC_OFFSET], 0x8000u);
  }
}

This prepares a bunch of (8, 32k, 1) dispatches that are processed in one go. No chance to observe a bunch of dead dispatches back-to-back like task shaders can cause. In the mesh shader, we can use DrawIndex to offset the WorkGroupID by the appropriate amount (yay, Vulkan). A dispatchX count of 8 is to shade the full 256-wide meshlet through 8x 32-wide workgroups. As the workgroup size increases to handle more sublets per group, dispatchX count decreases similarly.
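
The recording side then ends up as something like this (a sketch, assuming the layout from the snippet above: draw count at word 1, the (x, y, z) dispatch commands from word 2 with a 3-word stride):

// Sketch of the recording side, assuming the buffer layout written by the
// compute shader above: count at word 1, VkDrawMeshTasksIndirectCommandEXT
// entries starting at word 2, 3 words per draw.
#include <vulkan/vulkan.h>

void record_meshlet_draws(VkCommandBuffer cmd, VkBuffer draws, uint32_t max_draws)
{
    vkCmdDrawMeshTasksIndirectCountEXT(
        cmd,
        draws, 2 * sizeof(uint32_t), // first (x, y, z) command
        draws, 1 * sizeof(uint32_t), // draw count
        max_draws,
        sizeof(VkDrawMeshTasksIndirectCommandEXT)); // 3 words per draw
}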

Occlusion culling

To complete the meshlet renderer, we need to consider occlusion culling. The go-to technique for this these days is two-phase occlusion culling with a HiZ depth buffer.

Basic gist is to keep track of which meshlets are considered visible. This requires persistent storage of 1 bit per unit of visibility. Each pass in the renderer needs to keep track of its own bit-array. E.g. shadow passes have different visibility compared to main scene rendering.

For Granite, I went with an approach where 1 TaskInfo points to one uint32_t bitmask. Each of the 32 meshlets within the TaskInfo gets 1 bit. This makes the hierarchical culling nice too, since we can just test for visibility != 0 on the entire word. Nifty!
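
In pseudo-C++ (illustration of the layout; in practice this lives in an SSBO and is tested and updated from the culling shaders):

// Sketch of the visibility layout: each TaskInfo owns one uint32_t, one bit
// per meshlet, indexed by occluder_state_offset.
#include <cstdint>

struct VisibilityBuffer
{
    uint32_t *words; // one word per TaskInfo
};

// Hierarchical test: if no meshlet in the task was visible last frame,
// phase 1 can skip the whole TaskInfo in one go.
bool task_was_visible(const VisibilityBuffer &vis, uint32_t occluder_state_offset)
{
    return vis.words[occluder_state_offset] != 0;
}

bool meshlet_was_visible(const VisibilityBuffer &vis, uint32_t occluder_state_offset,
                         uint32_t meshlet_index) // [0, 32)
{
    return (vis.words[occluder_state_offset] & (1u << meshlet_index)) != 0;
}

void mark_meshlet_visible(VisibilityBuffer &vis, uint32_t occluder_state_offset,
                          uint32_t meshlet_index)
{
    vis.words[occluder_state_offset] |= 1u << meshlet_index; // atomicOr in the shader
}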

First phase

Here we render all objects which were considered visible last frame. It’s extremely likely that whatever was visible last frame is visible this frame, unless there was a full camera cut or similar. It’s important that we’re actually rendering to the framebuffer now. In theory, we’d be done rendering now if there were no changes to camera or objects in the scene.

HiZ pass

Based on the objects we drew in phase 1, build a HiZ depth map. This topic is actually kinda tricky. Building the mip-chain in one pass is great for performance, but causes some problems. With NPOT textures and single pass, there is no obvious way to create a functional HiZ, and the go-to shader for this, FidelityFX SPD, doesn’t support that use case.

The problem is that mip-map sizes round down, so if we have a 7×7 texture, LOD 1 is 3×3 and LOD 2 is 1×1. In LOD 2, we will be able to query a 4×4 depth region, but the edge pixels are forgotten.

The “obvious” workaround is to pad the texture to POT, but that is a horrible waste of VRAM. The solution I went with instead was to fold in the neighbors as the mips are reduced. This makes it so that the edge pixels in each LOD also remembers depth information for pixels which were truncated away due to NPOT rounding.
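
A CPU model of that fold-in reduction, as an illustration only; the real implementation is the SPD-like compute shader mentioned below, and the max-reduction here assumes the conservative depth test used for culling later:

// CPU model of the NPOT fold-in reduction. Each parent texel takes the max of
// its 2x2 footprint; on odd dimensions, the last parent row/column also folds
// in the child texels that floor() mip sizing would otherwise truncate away.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<float> reduce_hiz_level(const std::vector<float> &src, int w, int h,
                                    int &out_w, int &out_h)
{
    out_w = std::max(w >> 1, 1);
    out_h = std::max(h >> 1, 1);
    std::vector<float> dst(size_t(out_w) * out_h);

    auto S = [&](int x, int y) {
        return src[size_t(std::min(y, h - 1)) * w + std::min(x, w - 1)];
    };

    for (int y = 0; y < out_h; y++)
    {
        for (int x = 0; x < out_w; x++)
        {
            float d = std::max(std::max(S(2 * x, 2 * y), S(2 * x + 1, 2 * y)),
                               std::max(S(2 * x, 2 * y + 1), S(2 * x + 1, 2 * y + 1)));

            // Fold in the truncated neighbors on the edge texels.
            if (x == out_w - 1 && (w & 1))
                d = std::max({ d, S(2 * x + 2, 2 * y), S(2 * x + 2, 2 * y + 1) });
            if (y == out_h - 1 && (h & 1))
                d = std::max({ d, S(2 * x, 2 * y + 2), S(2 * x + 1, 2 * y + 2) });
            if (x == out_w - 1 && y == out_h - 1 && (w & 1) && (h & 1))
                d = std::max(d, S(2 * x + 2, 2 * y + 2));

            dst[size_t(y) * out_w + x] = d;
        }
    }
    return dst;
}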

I rolled a custom HiZ shader similar to SPD with some extra subgroup shenanigans because why not (SubgroupShuffleXor with 4 and 8).

Second phase

In this pass we submit for rendering any object which became visible this frame, i.e. the visibility bit was not set, but it passed occlusion test now. Again, if camera did not change, and objects did not move, then nothing should be rendered here.

However, we still have to test every object, in order to update the visibility buffer for next frame. We don’t want visibility to remain sticky, unless we have dedicated proxy geometry to serve as occluders (might still be a thing if game needs to handle camera cuts without large jumps in rendering time).

In this pass we can cull meshlet bounds against the HiZ.

Because I cannot be arsed to make a fancy SVG for this, the math to compute a tight AABB bound for a sphere is straightforward once the geometry is understood.

The gist is to figure out the angle, then rotate the (X, W) vector with positive and negative angles. X / W becomes the projected lower or upper bound. Y bounds are computed separately.

vec2 project_sphere_flat(float view_xy, float view_z, float radius)
{
    float len = length(vec2(view_xy, view_z));
    float sin_xy = radius / len;

    float cos_xy = sqrt(1.0 - sin_xy * sin_xy);
    vec2 rot_lo = mat2(cos_xy, sin_xy, -sin_xy, cos_xy) *
      vec2(view_xy, view_z);
    vec2 rot_hi = mat2(cos_xy, -sin_xy, +sin_xy, cos_xy) *
      vec2(view_xy, view_z);

    return vec2(rot_lo.x / rot_lo.y, rot_hi.x / rot_hi.y);
}

The math is done in view space where the sphere is still a sphere, which is then scaled to window coordinates afterwards. To make the math easier to work with, I use a modified view space in this code where +Y is down and +Z is in view direction.

bool hiz_cull(vec2 view_range_x, vec2 view_range_y, float closest_z)
// view_range_x: .x -> lower bound, .y -> upper bound
// view_range_y: same
// closest_z: linear depth. ViewZ - Radius for a sphere

First, convert to integer coordinates.

// Viewport scale first applies any projection scale in X/Y
// (without Y flip).
// The scale also does viewport size / 2 and then
// offsets into integer window coordinates.

vec2 range_x = view_range_x *
  frustum.viewport_scale_bias.x +
  frustum.viewport_scale_bias.z;
vec2 range_y = view_range_y *
  frustum.viewport_scale_bias.y +
  frustum.viewport_scale_bias.w;

ivec2 ix = ivec2(range_x);
ivec2 iy = ivec2(range_y);

ix.x = clamp(ix.x, 0, frustum.hiz_resolution.x - 1);
ix.y = clamp(ix.y, ix.x, frustum.hiz_resolution.x - 1);
iy.x = clamp(iy.x, 0, frustum.hiz_resolution.y - 1);
iy.y = clamp(iy.y, iy.x, frustum.hiz_resolution.y - 1);

Figure out a LOD where we only have to sample a 2×2 footprint. findMSB to the rescue.

int max_delta = max(ix.y - ix.x, iy.y - iy.x);
int lod = min(findMSB(max_delta - 1) + 1, frustum.hiz_max_lod);
ivec2 lod_max_coord = max(frustum.hiz_resolution >> lod, ivec2(1)) - 1;

// Clamp to size of the actual LOD.
ix = min(ix >> lod, lod_max_coord.xx);
iy = min(iy >> lod, lod_max_coord.yy);

And finally, sample:

ivec2 hiz_coord = ivec2(ix.x, iy.x);

float d = texelFetch(uHiZDepth, hiz_coord, lod).x;
bool nx = ix.y != ix.x;
bool ny = iy.y != iy.x;

if (nx)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 0)).x);

if (ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(0, 1)).x);

if (nx && ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 1)).x);

return closest_z < d;

Trying to get up-close, it’s quite effective.

Without culling:

With two-phase:

As the culling becomes more extreme, GPU go brrrrr. Mostly just bound on HiZ pass and culling passes now which can probably be tuned a lot more.

Conclusion

I’ve spent way too much time on this now, and I just need to stop endlessly tuning various parameters. This is the true curse of mesh shaders, there’s always something to tweak. Given the performance I’m getting, I can call this a success, even if there might be some wins left on the table by tweaking some more. Now I just need to take a long break from mesh shaders before I actually rewrite the renderer to use this new code … And maybe one day I can even think about how to deal with LODs, then I would truly have Nanite at home!

The “compression” format ended up being something that can barely be called a compression format. To chase decode performance of tens of billions of primitives per second though, I suppose that’s just how it is.

My scuffed game streaming adventure – PyroFling

My side projects have a tendency to evolve from a tiny weekend experiment into something that ends up satisfying a very specific niche use case after multiple weekends of nerdsniping myself. This is one of those projects where I started experimenting with how to use external memory in Vulkan and file descriptor flinging on Linux, and it just … grew from there.

This is a wild braindump ride with some of the topics being:

  • Basic Unix IPC
    • Fling those file descriptors like a champ
  • Writing a Vulkan layer that captures a swapchain
    • Knowing how to write a layer is pretty useful for any hardcore Vulkan programmer
  • A deeper understanding of how Vulkan WSI can be implemented under the hood
    • Acquire elite, arcane knowledge
  • Techniques for audio/video sync with low latency
    • A million MPV flags will not save you
    • Bespoke hacks will, however
  • How to coax FFmpeg into encoding video with very low latency
    • All the AVOptions, oh my!
  • Using /dev/uinput to create virtual gamepads
    • Tie it all together

Making a custom WSI implementation

The first part of this project was to make my own custom WSI implementation and a “server” that could act as a compositor of some sorts. The main difference was that rather than putting the swapchain on screen – which is a rabbit hole I’m not getting myself into – I just wanted to dump the results to a video file. At the end of last year, I was fiddling around with Vulkan video + FFmpeg, and this was the perfect excuse to start considering encoding as well. It would be pretty neat to get a swapchain to stay in VRAM, be encoded directly in Vulkan video and then get back H.264/H.265/AV1 packets.

Rather than redirecting WSI to a different “surface” which can get very tricky, this approach is very simple. This is implemented in a Vulkan layer where we hook the swapchain.

The basic gist is that we copy any presented swapchain image to an image owned by a layer, which is then sent over to the “compositor” with external memory. Synchronization is all explicit using external semaphores, because of course!

The protocol needed for a Vulkan swapchain is pretty simple. In Linux, we can use a Unix domain socket with SOCK_SEQPACKET. This is kinda like a reliable datagram that can also send and receive file descriptors as side band information.

// name is a path, e.g. /tmp/pyrofling-server

int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);

struct stat s = {};
if (stat(name, &s) >= 0)
{
    if ((s.st_mode & S_IFMT) == S_IFSOCK)
    {
       fprintf(stderr, "Rebinding socket.\n");
       unlink(name);
    }
}

struct sockaddr_un addr_unix = {};
addr_unix.sun_family = AF_UNIX;
strncpy(addr_unix.sun_path, name, sizeof(addr_unix.sun_path) - 1);
bind(fd, reinterpret_cast<const sockaddr *>(&addr_unix),
     sizeof(addr_unix));

From here, clients can connect to the server using e.g. connect() and the server can listen() / accept() clients, just like normal TCP. The main difference is that SEQPACKET is not stream based, so we can send individual messages instead, à la UDP, using sendmsg() rather than plain send().
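
The client side is just as small. A minimal sketch, assuming the same socket path the server bound above (error handling omitted, same headers as the server side):

// Client side. The path must match the server's bind().
int fd = socket(AF_UNIX, SOCK_SEQPACKET, 0);

struct sockaddr_un addr_unix = {};
addr_unix.sun_family = AF_UNIX;
strncpy(addr_unix.sun_path, "/tmp/pyrofling-server",
        sizeof(addr_unix.sun_path) - 1);

if (connect(fd, reinterpret_cast<const sockaddr *>(&addr_unix),
            sizeof(addr_unix)) < 0)
{
    // Server isn't up yet, retry later.
}

Flinging a message with FDs attached then looks like this: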

msghdr msg = {};

// data
msg.msg_iov = iovs;
msg.msg_iovlen = iov_count;

if (fling_fds_count)
{
    // control payload
    msg.msg_control = cmsg_buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int) * fling_fds_count);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int) * fling_fds_count);
    auto *fds = reinterpret_cast<int *>(CMSG_DATA(cmsg));
    for (size_t i = 0; i < fling_fds_count; i++)
       fds[i] = fling_fds[i].get_native_handle();
}

ssize_t ret = sendmsg(fd.get_native_handle(), &msg, MSG_NOSIGNAL);

On the receiving end:

msghdr msg = {};
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = cmsg_buf;
msg.msg_controllen = sizeof(cmsg_buf);

ssize_t ret = recvmsg(fd.get_native_handle(), &msg, 0);

and then we grab the FDs. These FDs are tied to the message, so we know if this is an image, a semaphore, etc.

// Capture any FDs we receive.
std::vector<FileHandle> received_fds;

for (auto *cmsg = CMSG_FIRSTHDR(&msg); cmsg;
     cmsg = CMSG_NXTHDR(&msg, cmsg))
{
    if (cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS &&
        cmsg->cmsg_len > CMSG_LEN(0))
    {
       size_t data_len = cmsg->cmsg_len - CMSG_LEN(0);
       size_t num_fds = data_len / sizeof(int);
       auto *fds = reinterpret_cast<const int *>(CMSG_DATA(cmsg));
       for (size_t i = 0; i < num_fds; i++)
          received_fds.emplace_back(fds[i]);
    }
}

The protocol from here is pretty simple. Most WSI implementations would be some kind of variant of this under the hood I think.

Physical device (client -> server)

To use external memory in Vulkan we must be sure that the devices are compatible. We can get compatibility information in VkPhysicalDeviceIDProperties.

struct Device
{
    uint8_t device_uuid[16];
    uint8_t driver_uuid[16];
    uint8_t luid[8];
    uint32_t luid_valid;
};

For OPAQUE_FD external types in Vulkan, these must match. There is no particular need to be fancy and use DRM modifiers here. Client sends this information over once. Each VkSurfaceKHR has one connection associated with it. In Vulkan, there can only be one active non-retired swapchain assigned to a surface, so this model works well.
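
Filling in that struct on the client is just a vkGetPhysicalDeviceProperties2 query. A minimal sketch, assuming gpu is the VkPhysicalDevice the swapchain lives on:

VkPhysicalDeviceIDProperties id_props =
    { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES };
VkPhysicalDeviceProperties2 props2 =
    { VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2 };
props2.pNext = &id_props;
vkGetPhysicalDeviceProperties2(gpu, &props2);

Device info = {};
memcpy(info.device_uuid, id_props.deviceUUID, VK_UUID_SIZE);
memcpy(info.driver_uuid, id_props.driverUUID, VK_UUID_SIZE);
memcpy(info.luid, id_props.deviceLUID, VK_LUID_SIZE);
info.luid_valid = id_props.deviceLUIDValid;
// ... serialize and send over the socket.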

Create a group of images (client -> server)

When using external memory in Vulkan, the creator and consumer of the external memory must agree on VkImageCreateInfo parameters, so we just fling that information over as-is. If this were a more normal WSI, like X or Wayland, this is where DRM modifiers become important, because the consumer is probably not Vulkan, but I only really care about OPAQUE_FD for my use case since I know the image is being consumed in Vulkan.

struct ImageGroup
{
    // Assumptions made:
    // Layers = 1
    // Type = 2D
    // Levels = 1
    uint32_t num_images;
    uint32_t width;
    uint32_t height;

    // VkSurfaceFormatKHR
    // Assume that server can deal with anything reasonable.
    // We don't have to actually flip on display.
    uint32_t vk_format;
    uint32_t vk_color_space; // sRGB or HDR10? :3

    uint32_t vk_image_usage;
    uint32_t vk_image_flags;
    uint32_t vk_external_memory_type; // OPAQUE or DRM modifier.
    uint32_t vk_num_view_formats;
    uint32_t vk_view_formats[15]; // If MUTABLE and vk_num_view_formats != 0.
    uint64_t drm_modifier; // Unused atm.
};

Along with this message, num_images FDs are expected. The server will then import the memory, create images and bind.

If the server’s Vulkan device differs from the client’s, we can round-trip through system memory with VK_EXT_external_memory_host. Two separate GPUs can import the same system memory. This is very useful to me since I have two GPUs installed, and being able to render on one GPU and encode on another GPU is pretty nifty. It can also be very nice to let an iGPU do hardware accelerated encode down the line.
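
For the curious, importing the host allocation looks roughly like this on each device. A sketch only: the memory type must come from vkGetMemoryHostPointerPropertiesEXT, and host_ptr / aligned_size must respect minImportedHostPointerAlignment.

VkImportMemoryHostPointerInfoEXT import_info =
    { VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT };
import_info.handleType =
    VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
import_info.pHostPointer = host_ptr; // Page-aligned system memory.

VkMemoryAllocateInfo alloc_info =
    { VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO };
alloc_info.pNext = &import_info;
alloc_info.allocationSize = aligned_size;
alloc_info.memoryTypeIndex = host_pointer_memory_type;
vkAllocateMemory(device, &alloc_info, NULL, &memory);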

Present (client -> server)

struct PresentImage
{
    // Serial from image group.
    uint64_t image_group_serial;

    // If period > 0, FIFO semantics.
    // If period == 0, MAILBOX semantics.
    uint16_t period;

    // Must be [0, VulkanImageGroup::num_images).
    uint16_t index;

    // OPAQUE or something special. Binary semaphores only.
    uint32_t vk_external_semaphore_type;

    // Represents the release barrier that client performs.
    uint32_t vk_old_layout;
    uint32_t vk_new_layout;

    // An ID which is passed back in FrameComplete.
    uint64_t id;
};

One binary semaphore is expected as FD here. Explicit sync, yay. I could of course have used timeline semaphores here, but I really didn’t need anything more fancy than binary semaphores and Vulkan WSI requires binary semaphores anyway. If I ever want to port this to Windows, I’ll run into the problem that AMD does not support external timeline OPAQUE_WIN32, so … there’s that 🙁

The client needs to perform an image barrier to VK_QUEUE_FAMILY_EXTERNAL. The server side completes the transition with an acquire barrier from EXTERNAL into whatever queue family it uses.
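
The release half on the client looks something like this. A sketch under assumptions: the client copied into the shared image as a transfer destination, the server will sample it, and vk_old_layout / vk_new_layout in the message map to oldLayout / newLayout. The server mirrors this barrier with the queue families swapped.

VkImageMemoryBarrier release = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
release.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
release.dstAccessMask = 0; // Ignored for a release barrier.
release.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
release.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
release.srcQueueFamilyIndex = client_queue_family;
release.dstQueueFamilyIndex = VK_QUEUE_FAMILY_EXTERNAL;
release.image = shared_image;
release.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0,
                     0, NULL, 0, NULL, 1, &release);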

The present ID is used later so we can implement KHR_present_wait properly.

Acquire (server -> client)

struct AcquireImage
{
    // Serial from image group.
    uint64_t image_group_serial;

    // Must be [0, VulkanImageGroup::num_images).
    uint32_t index;

    // OPAQUE or something special. Binary semaphores only.
    // If type is 0, it is an eventfd handle on host timeline.
    uint32_t vk_external_semaphore_type;
};

Acquire is async as intended. Typically, the server side does RGB -> YUV conversion and once that “blit” is done, we can release the image to client as long as there are new pending presents that are done.

Fortunately, we don’t have to hook vkAcquireNextImageKHR in this implementation since we’re still rendering to the display as normal. In QueuePresentKHR, we’ll do:

  • Wait for QueuePresentKHR semaphores
  • Acquire image from server (in the common case, this never blocks)
  • Queue wait for our acquire semaphore
  • Copy WSI image to internal image (transition image layouts as necessary)
  • Resignal QueuePresentKHR semaphores + signal external OPAQUE_FD semaphore
  • Send present message to server
  • Call QueuePresentKHR as normal

However, if we were redirecting the WSI completely, implementing the semaphore and fence parameters in vkAcquireNextImageKHR is actually quite awkward, since there is sadly no way to signal a binary semaphore or fence from the host in Vulkan. Some bonus tips on how to do it properly, for the curious:

Acquire has temporary import semantics

The semaphore you give to vkAcquireNextImageKHR isn’t really signaled as you’d expect, rather, it has temporary import semantics with a magic payload, i.e. the semaphore is replaced with a temporary payload of unknown type. When you subsequently wait on that semaphore, the temporary payload is consumed and freed and the semaphore is reverted to its original state. This is very useful, since we should implement AcquireNextImageKHR with vkImportSemaphoreFd and vkImportFenceFd.

From spec:

Passing a semaphore to vkAcquireNextImageKHR is equivalent to temporarily importing a semaphore payload to that semaphore.

NOTE:

Because the exportable handle types of an imported semaphore correspond to its current imported payload, and vkAcquireNextImageKHR behaves the same as a temporary import operation for which the source semaphore is opaque to the application, applications have no way of determining whether any external handle types can be exported from a semaphore in this state. Therefore, applications must not attempt to export external handles from semaphores using a temporarily imported payload from vkAcquireNextImageKHR.

As long as we can import a payload, we can do whatever we want, neat!

Async acquire (easy)

This is trivial, just import the binary semaphore we got from AcquireImage message.
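
Roughly like so; the temporary import flag gives us exactly the consume-on-wait semantics described above (acquire_msg_fd is the FD received alongside the AcquireImage message):

VkImportSemaphoreFdInfoKHR import =
    { VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR };
import.semaphore = app_acquire_semaphore; // From vkAcquireNextImageKHR.
import.handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT;
import.flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT;
import.fd = acquire_msg_fd;
vkImportSemaphoreFdKHR(device, &import);
// The FD is owned by the implementation after a successful import.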

Sync acquire

If the server gives us back a CPU-side eventfd or something similar, this is more awkward. On Linux, we can import SYNC_FD with fd -1. This means it’s already signaled, and it’s a way to signal a binary semaphore from CPU. However, not all implementations support SYNC_FD, although I believe the last holdout (NVIDIA) added support for it in a recent beta, so maybe relying on SYNC_FD on Linux is feasible soon.

If that isn’t available we have to go into really nasty hackery, having a pool of already signaled binary OPAQUE_FD semaphores for example. On present, we can signal a new payload on the side, place that in our “pool” of binary semaphores that we can import into an acquire later. Supremely scuffed, but hey, gotta do what you gotta do.

Retire (server -> client)

I don’t think it was a good idea in the end, but I tried splitting the acquire process in two. The basic idea was that I could aggressively signal acquire early, letting the CPU start recording commands, but before you’d actually submit rendering, you’d have to block until the retire event came through. Alternatively, you could wait for acquire + retire events to come through before considering an acquire complete.

In practice, this ended up being a vestigial feature and I should probably just get rid of it. It maps rather poorly to Vulkan WSI.

Completion (server -> client)

struct FrameComplete
{
    // Serial from image group.
    uint64_t image_group_serial;

    // All processing for timestamp is committed and submitted.
    // Will increase by 1 for every refresh cycle of the server.
    // There may be gaps in the reported timestamp.
    uint64_t timestamp;

    // The current period for frame latches.
    // A new frame complete event is expected after period_ns.
    uint64_t period_ns;

    // When an image is consumed for the first time,
    // it is considered complete.
    uint64_t presented_id;

    FrameCompleteFlags flags;

    // Number of refresh cycles that frame complete
    // was delayed compared to its target timestamp.
    // If this is consistently not zero, the client is too slow.
    uint32_t delayed_count;

    uint64_t headroom_ns;
};

This event represents a “vblank” event. A completion event is fired when an image was done rendering and was consumed by a “vblank” (i.e. encoding) event.

This can be used to implement KHR_present_wait, proper frame pacing, latency control, etc. I didn’t implement all of the fields here fully, but when you control the protocol you can do whatever you want 🙂

Overall, this protocol ended up looking vaguely similar to X11 DRI3 Present protocol with the improvement of being explicit sync, async acquire by default, and a better FIFO queue model that does not require insane hackery to accomplish. Implementing FIFO well on X11 is actually a nightmare requiring worker threads juggling scissors to ensure forward progress. Don’t look at wsi_common_x11.c in Mesa if you value your sanity, just saying.

Frame pacing

A common concern I have with typical screen recording software is that the video output is not well-paced at all. If I record at 60 fps and I’m playing at 144 fps, there’s no way the output will be well paced if it’s just doing the equivalent of taking a snapshot every 16.6 ms. To fix this, I added some modes to optimize for various use cases:

Prefer server sync

The client becomes locked to the server refresh rate. Frame limiting happens either in QueuePresentKHR or WaitForPresentKHR. If the application is using presentIds, we can just redirect WaitForPresentKHR to wait for completion events from our server instead of the actual swapchain. If it does not use present_wait, we can fall back to frame limiting in QueuePresentKHR. (Frame limiting in AcquireNextImageKHR is broken, since multiple images can be acquired in Vulkan and acquisition may happen at arbitrary times.)

Depending on the use case it can be useful to force MAILBOX present mode on the swapchain to avoid a scenario where we’re blocking on two separate clocks at the same time. If I’m playing on a 144 Hz VRR monitor while being frame limited to 60 fps, that’s not a problem, but recording at 60 fps with a 60 Hz monitor could be a problem. If frame pacing of recording is more important than frame pacing of local monitor, the swapchain that goes on screen should have MAILBOX or IMMEDIATE.

Prefer client sync

Client renders unlocked and server will use whatever latest ready image is. Basically MAILBOX.

Adaptive

Choose between above modes depending if application is using FIFO or non-FIFO presentation modes.

Server side swapchain implementation

Since we’re not tied to a particular display, we can pretend that every N milliseconds, we’re woken up to encode a video frame. At this point, we pick the last ready image whose requested earliest present time has been reached, simple enough.

We can implement present interval quite neatly as well. When a present request is received, we compute the earliest timestamp we should present based on existing images in the queue.

The timestamp_completed here is in number of frames.

uint64_t compute_next_target_timestamp() const
{
    uint64_t ts = 0;
    for (auto &img : images)
    {
       if (img.state != State::ClientOwned)
       {
          uint64_t target_ts = img.target_timestamp + img.target_period;
          if (target_ts > ts)
             ts = target_ts;
       }
    }

    // If there are no pending presentations in flight,
    // lock-in for the next cycle.
    // Move any target forward to next pending timestamp.
    uint64_t next_ts = timestamp_completed + 1;
    if (ts < next_ts)
       ts = next_ts;

    return ts;
}

This is pretty simple and handles any presentation interval. If the period is 0, we can have multiple presentations in flight where they all have target_ts being equal. In that case we use the largest presentation ID to make sure we’re picking the last image.

Now the image is queued, but it is still in-flight on GPU. Now we kick off a background task which waits for the presentation to complete. At that point we transition the state from Queued to Ready.

Once an image becomes Ready, we can retire old images since we know that they will never be used again as an encode source. If this were a normal fullscreen FLIP-style swapchain, we’d have to be careful not to signal acquire semaphores until the newly Ready image was actually flipped on screen. We’re doing a BLIT-style swapchain due to our encoding however, so we can retire right away.

bool retire_obsolete_images(uint64_t current_present_id)
{
    for (size_t i = 0, n = images.size(); i < n; i++)
    {
       auto &img = images[i];
       if ((img.state == State::PresentReady ||
            img.state == State::PresentComplete) &&
           img.present_id < current_present_id)
       {
          img.state = State::ClientOwned;
          if (!send_acquire_and_retire_with_semaphores(i))
              return false;
       }
    }

    return true;
}

At vblank time, we’ll pick the appropriate image to encode.

int get_target_image_index_for_timestamp(uint64_t ts)
{
    int target_index = -1;
    for (size_t i = 0, n = images.size(); i < n; i++)
    {
        auto &img = images[i];
        if (img.state != State::PresentReady &&
            img.state != State::PresentComplete)
            continue;

        if (img.target_timestamp > ts)
            continue;

        // Among candidates, pick the one with largest present ID.
        if (target_index < 0 ||
            img.present_id > images[target_index].present_id)
            target_index = int(i);
    }

    return target_index;
}

If this image is in the Ready state, this is the time to transition it to Complete and send a complete event.

There are some quirks compared to a normal FIFO swapchain however. If the server is being very slow to encode, it’s possible that it misses its own vblank intervals. In this case, we might end up “skipping” ahead in the FIFO queue. E.g. an application might have queued up images to be encoded at frame 1000, 1001 and 1002, but server might end up doing 1000, drop, 1002 where frame 1001 is just skipped. This is technically out of spec for Vulkan FIFO, but I don’t care 😀

I considered keeping up the pace more important than slowing down client progress just because the encoder was too slow for a split second.

From here, video and audio can be encoded in a fairly straightforward way with FFmpeg.

For video:

  • Async compute shader that rescales and converts RGB to YUV420
  • Ideally, we’d pass that on to Vulkan video encode directly, but for now, just read back YUV image to system memory
  • Copy into an AVFrame
    • If using hwaccel, av_hwframe_transfer (so many copies …)
  • Send AVFrame to codec
  • Get AVPacket out
  • Send to muxer (e.g. MKV)

For audio:

  • Create a recording stream
    • Either monitor the soundcard output as an input device
    • … or use pipewire patch bay to record specific audio streams
      • Automating this process would be cool, but … eh
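
The video steps above boil down to FFmpeg’s standard send/receive loop. A rough sketch, assuming the codec context (video.av_ctx), the output stream (video_stream), the muxer context (mux_ctx) and the filled YUV frame are already set up:

// yuv_frame was filled from the readback buffer (or hwframe transfer).
yuv_frame->pts = next_video_pts++;
avcodec_send_frame(video.av_ctx, yuv_frame);

AVPacket *pkt = av_packet_alloc();
while (avcodec_receive_packet(video.av_ctx, pkt) == 0)
{
    pkt->stream_index = video_stream->index;
    av_packet_rescale_ts(pkt, video.av_ctx->time_base,
                         video_stream->time_base);
    av_interleaved_write_frame(mux_ctx, pkt);
    av_packet_unref(pkt);
}
av_packet_free(&pkt);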

After all this, I felt the side project had kind of come to an end for the time being. I removed some old cobwebs in the IPC parts of my brain, got a deeper understanding of WSI on Linux, and got basic hwaccel encoding working with NVENC and VAAPI. Mission complete.

Now I could do:

// Client. use explicit_environment in layer JSON to make it opt-in.
// We all know how broken implicit layers that hook WSI by default can be,
// don't we, RTSS :')
$ PYROFLING=1 ./some/vulkan/app

// Server
./pyrofling /tmp/dump.mkv --encoder blah --bitrate blah ...

The pyrofling layer automatically connects to the server if it’s spawned after the game starts, and you can restart the server and it reconnects seamlessly. Neat!

The plan at this point was to wait until Vulkan video encode matured and then hook up the encode path properly, but … things happened, as they usually do.

Scratching an itch – playing a single-player game together over the web

Replaying a classic game with friends and family during the holidays tends to be quite enjoyable, and at some point we ended up trying to recreate the experience remotely. The ideal situation was that one of us would host the game and play it while the other would watch the stream and we could banter.

The first attempt was to do this over Discord screen sharing, but the experience here was abysmal. Horrible video quality, stutter, performance, and no good solution for piping through high quality game audio. This testing included Windows. Maybe there’s a way, but I couldn’t find it. I don’t think Discord is designed for this use case.

Bad frame pacing completely breaks immersion, simply unacceptable.

At this point, I figured OBS might be a solution. Just stream to Twitch or something and people could watch that stream while talking over Discord.

While this “worked” in the sense that video was smooth and audio quality good, there were some major drawbacks:

  • Twitch’s idea of “low latency” mode is misleading at best. Expect between 1 and 2 seconds of delay, and as much as 3 in some cases. This was completely useless in practice. It might be barely okay for a streamer watching comments and interacting with an audience. When communicating with “the audience” over voice, and hearing reactions delayed by seconds, it was unusable.
  • Horrible video quality. Twitch caps you to about 6 Mbit/s + 8-bit H.264, which has very questionable video quality for game content even with a competent encoder. (Popular streamers get more bandwidth, or so I hear.) This basically forced me into 720p. Surely we can do better than this in 2023 …
  • OBS did not like my multi-GPU setup on Linux and trying to hardware encode on top of that was … not fun 😀

At this point, I wanted to test if OBS was adding more buffering than expected, so I dusted off pyrofling, added an option to mux to RTMP / FLV which Twitch expects, and that’s about all you need to stream to Twitch, really. It worked just fine, but latency did not improve.

Targeting decent latency – 100-200 ms range

For just watching a stream and talking / commenting alongside it, I needed to find a way to get it down to about 100-200 ms, which is the middle ground of latency. I figured most of the delay was due to buffering on Twitch’s end, so I wondered if it’d be possible to host something similar locally. I’d only need to serve one client after all, so bandwidth was not a concern.

This venture quickly failed. The closest I found was https://github.com/ossrs/srs, but I couldn’t get it to work reliably and I couldn’t be arsed to troubleshoot some random GitHub project.

Hacking PyroFling encoder – MPEG-TS over TCP

The first idea I came up with was to use MPEG-TS as a muxer, add an IO callback, so that instead of writing the MPEG-TS to file I’d beam the data over a socket to any TCP client that connected. FFmpeg can do something similar for you by using “tcp://local-ip:port?listen=1” as the output path, but this is blocking and not practical to use with the FFmpeg API in a multiplexed server.
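
Hooking the muxer output is just a custom AVIOContext with a write callback. A sketch only: Server / broadcast_to_clients are stand-ins for whatever forwards the bytes to connected sockets, and the exact write callback signature varies slightly between FFmpeg versions.

static int write_packet_cb(void *opaque, uint8_t *buf, int buf_size)
{
    // Forward the muxed MPEG-TS bytes to every connected TCP client.
    static_cast<Server *>(opaque)->broadcast_to_clients(buf, buf_size);
    return buf_size;
}

// When setting up the muxer:
int io_buffer_size = 32 * 1024;
uint8_t *io_buffer = static_cast<uint8_t *>(av_malloc(io_buffer_size));
AVIOContext *avio = avio_alloc_context(io_buffer, io_buffer_size,
                                       1 /* write */, server,
                                       nullptr, write_packet_cb, nullptr);
mux_ctx->pb = avio;
mux_ctx->flags |= AVFMT_FLAG_CUSTOM_IO;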

Video players like MPV and VLC can easily open a raw stream over TCP with e.g. tcp://ip:port. It’ll figure out the stream is MPEG-TS and start playing it just fine.

This actually worked! But again, I ran into issues.

Even in low-latency / no-buffer modes in MPV / VLC, the latency was questionable at best. At the start of the stream, I could observe usable levels of latency, but there seemed to be no internal system to keep latency levels stable over time, especially when audio was also part of the stream. The buffer sizes would randomly grow and eventually I’d sit at over half a second latency. I spent some time searching for random comments from people having the same problems and trying a million different CLI commands that “supposedly” fix the problem, but none of them satisfied me.

At this point, I was too deep in, so … Time to write a no-frills custom video player designed for stable low latency streaming.

Presentation Time Stamp – PTS

FFmpeg and most/all container formats have a concept of a PTS: when to display a video frame or play back audio. This is used to guide A/V synchronization.

Off-line media scenario – sync-on-audio

I already had this path implemented in Granite. Audio playback is continuous, and we can constantly measure the playback cursor of the audio. If we’re a typical media player with a long latency audio buffer to eliminate any chance of audio hick-ups, we take the audio buffer latency into account as well, so at any instantaneous time point, we can estimate current audio PTS as:

estimated_pts = pts_of_current_audio_packet +
                frame_offset_in_audio_packet / sample_rate -
                audio_latency

This raw form of PTS cannot be used as is, since it’s too noisy. Audio is processed in chunks of about 10 ms in most cases, so the estimate will be erratic.

The solution is to smooth this out. We expect the audio PTS to increase linearly with time (duh), so the way I went about it was to fuse wall clock with audio PTS to stay in sync and avoid drift.

// elapsed_time is current wall time
// yes, I use double for time here, sue me.

// Unsmoothed PTS.
double pts = get_estimated_audio_playback_timestamp_raw();

if (pts == 0.0 || smooth_elapsed == 0.0)
{
    // Latch the PTS.
    smooth_elapsed = elapsed_time;
    smooth_pts = pts;
}
else
{
    // Smooth out the reported PTS.
    // The reported PTS should be tied to the host timer,
    // but we need to gradually adjust the timer based on the
    // reported audio PTS to be accurate over time.

    // This is the value we should get in principle if everything
    // is steady.
    smooth_pts += elapsed_time - smooth_elapsed;
    smooth_elapsed = elapsed_time;

    // Basically TAA history accumulation + rejection :D
    if (muglm::abs(smooth_pts - pts) > 0.25)
    {
       // Massive spike somewhere, cannot smooth.
       // Reset the PTS.
       smooth_elapsed = elapsed_time;
       smooth_pts = pts;
    }
    else
    {
       // Bias slightly towards the true estimated PTS.
       // Arbitrary scaling factor.
       smooth_pts += 0.005 * (pts - smooth_pts);
    }
}

Now that we have a smooth estimate of PTS, video sync is implemented by simply displaying the frame that has the PTS closest to our estimate. If you have the luxury of present timing, you could queue up a present at some future time where you know audio PTS will match for perfect sync. In my experience you can be off by about 40 ms (don’t quote me on that) before you start noticing something’s off for non-interactive content.

Fixed video latency – sync-on-video

While sync-on-audio is great for normal video content, it is kinda iffy for latency. At no point in the algorithm above did we consider video latency, and that is kinda the point here. Video latency is the more important target. Correct audio sync becomes less important when chasing low latency I think.

In a real-time decoding scenario, we’re going to be continuously receiving packets, and we have to decode them as soon as they are sent over the wire. This means that at any point, we can query what the last decoded video PTS is.

Based on that, we can set our ideal target video PTS as:

target_video_pts = last_decoded_video_pts - target_latency

Again, this estimate will be very noisy, so we smooth it out as before using wall time as the fused timer:

double target_pts = get_last_video_buffering_pts() - target_latency;
if (target_pts < 0.0)
    target_pts = 0.0;

smooth_pts += elapsed_time - smooth_elapsed;
smooth_elapsed = elapsed_time;

if (muglm::abs(smooth_pts - target_pts) > 0.25)
{
    smooth_elapsed = elapsed_time;
    smooth_pts = target_pts;
}
else
{
    smooth_pts += 0.002 * (target_pts - smooth_pts);
}

Now we have another problem. What to do about audio? Frame skipping or frame duplication is not possible with audio, even a single sample of gap in the audio has disastrous results.

The solution is to dynamically speed audio up and down very slightly in order to tune ourselves to the correct latency target. The idea is basically to sample our estimated audio PTS regularly and adjust the resampling ratio.

latch_estimated_audio_playback_timestamp(smooth_pts);
auto delta = smooth_pts - get_estimated_audio_playback_timestamp_raw();
// Positive value, speed up. Negative value, slow down.
delta = clamp(delta, -0.1, 0.1);

// This is inaudible in practice.
// Practical distortion will be much lower than outer limits.
// And should be less than 1 cent on average.
stream->set_rate_factor(1.0 + delta * 0.05);

This of course requires you to have a high quality audio resampler that can do dynamic adjustment of the resampling ratio, which I wrote myself way back in the day for retro emulation purposes. While this technically distorts the audio a bit by altering the pitch, this level of fudging is inaudible. One cent (a hundredth of a semitone, about 0.05%) is nothing.

I believe this is also how MPV’s sync-on-video works. It’s a useful technique for displaying buttery smooth 60 fps video on a 60 Hz monitor.

Success! Kinda …

By targeting a reasonably low latency in the new player, we were able to get an acceptable stream going over the internet. We did some basic comparisons and Discord voice came through at the same time as the video feed according to my testers, so mission accomplished I guess!

The main drawback now was stream robustness. TCP for live streaming is not a great idea. The second there are hick-ups in the network, the stream collapses for a hot minute since TCP does not accept any loss. When we were all on ethernet instead of Wi-Fi, the experience was generally fine due to near-zero packet loss rates, but right away, a new use case arose:

Wouldn’t it be really cool if we could switch who controls the game?

This is basically the idea of Steam Remote Play Together, which we have used in the past, but it is not really an option for us based on past experience:

  • Latency too high
  • Video quality not great
  • Only supported by specific games
    • And usually only multi-player co-op games
  • Won’t help us playing non-Steam stuff

At this point I knew I had my work cut out for me. Latency had to drop by an order of magnitude to make it acceptable for interactive use.

Chasing low latency

The first step in the process was to investigate the actual latency by the encoder and decoder chains, and the results were … kinda depressing.

On the right, my test app was running, while the left was the video feedback over localhost on the same display. The video player was hacked to always present the last available image. 100 ms latency, yikes …

I eventually narrowed part of this down to MPEG-TS muxer in FFmpeg adding a lot of latency, over 50 ms just on its own. It was pretty clear I had to get rid of MPEG-TS somehow. Around this point I started to investigate RTP, but I quickly rejected it.

RTP does not support multiple streams. It cannot mux audio and video, which is mildly baffling. Apparently you’re supposed to use two completely different RTP streams on different ports. Some kind of external protocol is used to provide this as side band information, and when trying to play an RTP stream in FFmpeg you get hit with:

[rtp @ 0x55b3245fd780] Unable to receive RTP payload type 96 without an SDP file describing it

Apparently this is https://en.wikipedia.org/wiki/Session_Description_Protocol, and the whole affair was so cursed I just rejected the entire idea, and rolled my own protocol. I just need to bang over some UDP packets with some sequence counters, payloads and metadata after all, how hard can it be, right?

Turns out it wasn’t hard at all. The rest of the latency issues were removed by:

  • Disabling frame queue in NVENC
  • Disabling encoding FIFO in server
    • Just encode as soon as possible and blast the packet over UDP
    • Pacing be damned
    • We’ll solve frame pacing later
  • Remove B-frames and look-aheads
    • Well, duh :p
    • “zerolatency” tune in libx264

For example, here are some options for NVENC.

// This is critical !!!
av_dict_set_int(&opts, "delay", 0, 0);

av_dict_set_int(&opts, "zerolatency", 1, 0);
av_dict_set(&opts, "rc", "cbr", 0);
av_dict_set(&opts, "preset", "p1", 0);
av_dict_set(&opts, "tune", "ll", 0);
// There's an ull for ultra-low latency,
// but video quality seemed to completely die
// with no obvious difference in latency.

Some local results with all this hackery in libx264.

On my 144 Hz monitor I could sometimes hit a scenario where the video stream and application hit the same vblank interval, which means we just achieved < 7 ms latency, very nice!

NVENC also hits this target, but more consistently, here with HEVC encode.

AMD with VAAPI HEVC on RX 6800 isn’t quite as snappy though … Hoping Vulkan encode can help here. There might be some weird buffering going on in the VAAPI backends that I have no control over, but still. We’re still in the ~10 ms ballpark. I had better results with AV1 encode on my RX 7600, but I couldn’t be arsed to swap out GPUs just to get some screenshots.

Of course, we’re working with the most trivial video footage possible. The true test is real games where I expect encode/decode latency to be more obvious.

Intra-refresh – the magic trick

When doing very low-latency streaming like this, the traditional GOP structure completely breaks down. Intra frames (or I-frames) are encoded as still images and tend to be fairly large. P- and B-frames on the other hand consume far fewer bits. Low latency streaming also requires a lot more bitrate than normal high-latency encoding since we’re making life very difficult for the poor encoder.

In a constant bit-rate world where we’re streaming over a link with limited bandwidth, the common solution to this bitrate fluctuation is to just buffer. It’s okay if an I-frame takes 100ms+ to transmit as long as the decode buffer is large enough to absorb the bandwidth difference, but we cannot rely on this for low latency streaming.

Here’s a link to the 2010 post by x264 legend Dark Shikari. The solution is intra-refresh, where parts of the image are continuously refreshed with I-blocks. Effectively, we have no GOP structure anymore at this point. This allows us to avoid the huge spikes in bandwidth.

libx264 and NVENC support this, but sadly, VAAPI implementations do not 🙁 Hoping we can get this working in Vulkan video encode somehow …

av_dict_set_int(&opts, "intra-refresh", 1, 0);
av_dict_set_int(&opts, "forced-idr", 1, 0);
video.av_ctx->refs = 1; // Otherwise libx264 gets grumpy.

The forced-idr option is used so that we can still force normal I-frames at will. Clients can request “pure” I-frames when connecting to the server to kick-start the decode process. Technically, with intra-refresh you can just keep decoding until the image has been fully refreshed at least once, but I had various issues with FFmpeg decoding errors when trying to decode raw intra-refresh without ever seeing a keyframe first, especially with HEVC, so I went with the dumb solution 🙂 It worked fine.

Fixing frame pacing

When I try to just display the frames as they come in over the network, the results are … less than ideal. The pacing is completely butchered due to variability in:

  • GPU time for game to render
  • Encoding time (scene dependent)
  • Network jitter
  • Decoding time

Under ideal conditions over a local network, network jitter is mostly mitigated, but the variability in encode/decode time is still noticeable on a fixed rate display, causing constant frame drops or dupes. My solution here was to re-introduce a little bit of latency to smooth over this variability.

Step 1 – Set up a low latency swapchain on client

VK_KHR_present_wait is critical to ensure we get the lowest possible latency. On a 60 Hz monitor, we want this frame loop:

while (gaming)
{
    wait_for_last_present_to_flip(); // vkWaitForPresentKHR(id);
    // 16.6ms until next flip.

    wait_for_next_video_frame_in_decode_queue(timeout = N ms);

    // Should just take a few microseconds on GPU.
    yuv_to_rgb_render_pass();

    // Ideally the GPU is done just before the deadline of compositor.
    present(++id);
}

Just in case we barely miss deadline due to shenanigans, FIFO_RELAXED is useful as well.

Step 2 – Send feedback to server

This is fairly magical, and I don’t think any generic “screen capturing” software can or will do this. The idea is that there is an ideal time target when new video frames should be done. If they arrive too early, we can ask the game to slow down slightly, and if they arrive too late, speed up a bit.

Basically, this is a phase locked loop over the network. One nice property of this is that we don’t really need to know the network latency at all; it’s self-stabilizing.

while (gaming)
{
     wait_for_last_present_to_flip(); // vkWaitForPresentKHR(id);
     // 16.6ms until next flip.

     auto phase = get_current_time();
     if (wait_for_next_video_frame_in_decode_queue(deadline))
     {
         auto arrive_time = video_frame.decode_done_time;

         // If 0, the packet arrived in sync with WaitForPresentKHR.
         // If negative, it was completed before we waited for it.
         auto phase_offset = arrive_time - phase;

         phase_offset -= target_phase_offset;
         // Continuously notify server by feedback.
         client.send_phase_offset(phase_offset);
     }
     else
     {
        // Could be used to adapt the target_phase_offset maybe?
        missed_frames++;
     }

     // ... render
}

Since the server controls when to send Complete events to the game running, we have full control over whether to render at 60.0 FPS, 60.01 FPS or 59.99 FPS. Tiny adjustments like these are enough to keep the system stable over time. It can also handle scenarios where the refresh rates are a bit more off, for example 59.95 Hz.

Of course, large network spikes, lost packets or just good old game stutter breaks the smooth video, but it will recover nicely. With target_phase_offset = -8ms and deadline of 8ms, I have a very enjoyable gaming experience with smooth frame pacing and low latency over local network.
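
How the server consumes that feedback isn’t shown here; the gist is just a small, heavily clamped correction to the heartbeat period. A crude sketch of the concept, not the actual implementation, with all names made up:

// Averaged over many feedback messages so we don't chase jitter.
smoothed_phase_error_ns +=
    0.05 * (reported_phase_offset_ns - smoothed_phase_error_ns);

// Frames arriving later than the target (positive error) means the server
// heartbeat should fire a touch earlier, so shorten the period slightly.
// Clamp hard so the rate never drifts more than a tiny fraction of a percent.
double correction = clamp(smoothed_phase_error_ns / double(nominal_period_ns),
                          -0.0005, 0.0005);
uint64_t effective_period_ns =
    uint64_t(double(nominal_period_ns) * (1.0 - correction));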

Step 3 – Adjust audio buffering algorithm

At this point, we don’t really care about A/V sync by PTS. The assumption is that audio and video packets arrive together with very similar PTS values. Rather than trying to target a specific PTS, we just want to ensure there is a consistent amount of audio buffering to safely avoid underrun while keeping latency minimal. This strategy is good enough in practice.

double current_time = get_audio_buffering_duration();
double delta = current_time - target_buffer_time; // 30ms is good tradeoff
set_audio_delta_rate_factor(delta);

Adding a virtual gamepad

As a cherry on top, we just need to let the client send gamepad events. Using /dev/uinput on Linux, it’s very simple to create a virtual gamepad that Steam can pick up, and it “just werks” in all games I tested. It works fine in other programs too, of course. It’s trivial to hook this up.
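
A bare-bones sketch of the uinput dance (buttons only; absolute axes additionally need UI_ABS_SETUP, and the vendor/product IDs here are just illustrative):

#include <linux/uinput.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>

int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);

// Register the event types and button codes we intend to emit.
ioctl(fd, UI_SET_EVBIT, EV_KEY);
ioctl(fd, UI_SET_KEYBIT, BTN_SOUTH);
ioctl(fd, UI_SET_KEYBIT, BTN_EAST);
// ... rest of the gamepad buttons.

struct uinput_setup setup = {};
setup.id.bustype = BUS_USB;
setup.id.vendor = 0x045e;  // Pretend to be an Xbox-ish pad.
setup.id.product = 0x028e;
strcpy(setup.name, "PyroFling Virtual Gamepad");
ioctl(fd, UI_DEV_SETUP, &setup);
ioctl(fd, UI_DEV_CREATE);

// Emitting a button press is just writing input_events,
// followed by a SYN_REPORT to latch the state.
struct input_event ev = {};
ev.type = EV_KEY;
ev.code = BTN_SOUTH;
ev.value = 1;
write(fd, &ev, sizeof(ev));
ev.type = EV_SYN;
ev.code = SYN_REPORT;
ev.value = 0;
write(fd, &ev, sizeof(ev));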

Cranking the quality

For game content in darker regions, I noticed that 10-bit HEVC looked dramatically better than 8-bit, so I went with that. 10-bit streams at >30 Mbit/s with HEVC or AV1 look basically flawless to me on a TV, even with really difficult game content that tends to obliterate most streams. Good luck getting game streaming services to provide that any time soon!

Putting it all together

Remaining problems

The main problem left is that packet loss recovery isn’t really there. I’m not sure if FFmpeg has facilities to recover gracefully from dropped packets other than freaking out about missing reference frames in the logs, and this is a bit outside my wheelhouse. Intra refresh does a decent job of quickly recovering however.

I have some hopes that using Vulkan video decode directly will allow me to fake the presence of missed reference frames to mask most of the remaining issues, but that’s a lot of work for questionable gain.

Audio is a bit more YOLO, I just ignore it. That seems to be the general strategy of VoIP anyways.

There’s also zero security / encryption. I don’t really care.

Sadly, I haven’t had much luck getting the work-in-progress Vulkan encode support to run yet. Hooking up a fully Vulkan encode -> decode chain will be nice when that matures. The decode path is already working.

Conclusion

If you actually made it this far, congratulations. I mostly aimed to make this post a braindump of the techniques I went through to make this and I achieved what I set out to do, useful low-latency game streaming tailored exactly for my needs.

Hardcore Vulkan debugging – Digging deep on Linux + AMDGPU

Everyone battle hardened in the world of Vulkan and D3D12 knows that debugging is ridiculously hard once we enter the domain of crashes and hangs. No one wants to do it, and seeing a random GPU crash show up is enough to want to quit graphics programming and take up farming on a remote island. Someone has to do it though, and given how questionable a lot of D3D12 content is w.r.t. correctness, this comes up a lot more often than we’d like in vkd3d-proton land.

The end goal of this blog is to demonstrate the magical UMR tool on Linux, which I would argue is the only reasonable post-mortem debugging method currently available on PC, but before we go that deep, we need to look at the current state of crash debugging on PC and the bespoke tooling we have in vkd3d-proton to deal with crashes.

Eating just crumbs makes for a miserable meal

Breadcrumbs is a common technique that most modern APIs have some kind of implementation of. The goal of breadcrumbs is simply to narrow down which draws or dispatches caused the failure. This information is extremely limited, but can sometimes be enough to figure out a crash if you’re very lucky and you have knowledge about the application’s intentions with every shader (which, from vkd3d-proton’s point of view, we obviously don’t).

Depending on the competency of the breadcrumb tool, you’d get this information:

  • A range of draws or dispatches which could potentially have caused failure.
    • Ideally, exactly the draw or dispatch which caused failure.
  • If page fault, which address caused failure?
    • Which resource corresponds to that failure? It is also possible that the address does not correspond to any resource. Causing true OOB on D3D12 and Vulkan is very easy.

As far as I know, this is where D3D12 on Windows ends, with two standard alternatives:

  • WriteBufferImmediate (Basically VK_AMD_buffer_marker)
  • DRED

There are vendor tools at least from NVIDIA and AMD which should make this neater, but I don’t have direct experience with any of these tools in D3D12, so let’s move on to the Vulkan side of things.

VK_AMD_buffer_marker breadcrumbs

Buffer markers is the simplest possible solution for implementing breadcrumbs. The basic idea is that a value is written to memory either before the GPU processes a command, or after work is done. On a device lost, counters can be inspected. The user will have to instrument the code somehow, either through a layer or directly. In vkd3d-proton, we can enable debug code which automatically does this for all D3D12 commands with VKD3D_CONFIG=breadcrumbs (not available in release builds).

For example, from our dispatch implementation:

VK_CALL(vkCmdDispatch(list->vk_command_buffer, x, y, z));
VKD3D_BREADCRUMB_AUX32(x);
VKD3D_BREADCRUMB_AUX32(y);
VKD3D_BREADCRUMB_AUX32(z);
VKD3D_BREADCRUMB_COMMAND(DISPATCH);

Then it’s a matter of writing the breadcrumb somewhere:

cmd.type = VKD3D_BREADCRUMB_COMMAND_SET_BOTTOM_MARKER;
cmd.count = trace->counter;
vkd3d_breadcrumb_tracer_add_command(list, &cmd);

VK_CALL(vkCmdWriteBufferMarkerAMD(list->vk_command_buffer,
  VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
  host_buffer, ...);

trace->counter++;

cmd.type = VKD3D_BREADCRUMB_COMMAND_SET_TOP_MARKER;
cmd.count = trace->counter;
vkd3d_breadcrumb_tracer_add_command(list, &cmd);

VK_CALL(vkCmdWriteBufferMarkerAMD(list->vk_command_buffer,
  VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
  host_buffer, ...);

We’ll also record commands and the parameters used in a side band buffer so that we can display the faulting command buffers.

Another thing to consider is that the buffer we write to must be coherent with the host. On a device lost happening randomly inside a command we won’t have the opportunity to perform host memory barriers and signal a fence properly, so we must make sure the memory punches straight through to VRAM. On AMD, we can do this with

memory_props = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
               VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
               VK_MEMORY_PROPERTY_HOST_CACHED_BIT;

if (is_supported(VK_AMD_device_coherent_memory))
{
   memory_props |= VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD |
                   VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD;
}

On fault, we scan through the host buffer, and if we observe that TOP and BOTTOM markers are not 0 (never executed) or UINT32_MAX (done), scan through and report the range of failing commands.
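
In pseudo-ish form, the analysis is roughly this (host_markers and report_breadcrumb_range are stand-ins here; the real vkd3d-proton code tracks more per-context state):

// Per command list context, written by vkCmdWriteBufferMarkerAMD.
uint32_t top = host_markers[ctx].top_marker;
uint32_t bottom = host_markers[ctx].bottom_marker;

if (top != 0 && bottom != UINT32_MAX)
{
    // Commands in (bottom, top] started on the command processor but
    // never signalled completion; dump the recorded side band data
    // for that range.
    report_breadcrumb_range(ctx, bottom + 1, top);
}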

RADV speciality, making buffer markers actually useful

GPUs execute commands concurrently unless barriers are emitted. This means that there is a large range of potential draws or dispatches in flight at any one time. RADV_DEBUG=syncshaders adds barriers in between every command so that we’re guaranteed a hang will narrow down to a single command. No other Vulkan driver supports this, and it makes RADV the only practical driver for breadcrumb techniques, at least on Vulkan. Sure, it is possible to add barriers yourself between every command to emulate this, but for render passes, this becomes extremely annoying since you have to consider restarting the render pass for every draw call …

As a simple example, I’ve hacked up one of the vkd3d-proton tests to write a bogus root descriptor address, which is a great way to crash GPUs in D3D12 and Vulkan.

When running with just breadcrumbs, it’s useless:

Device lost observed, analyzing breadcrumbs ...
Found pending command list context 1 in executable state, TOP_OF_PIPE marker 44, BOTTOM_OF_PIPE marker 0.
===== Potential crash region BEGIN (make sure RADV_DEBUG=syncshaders is used for maximum accuracy) =====
Command: top_marker
marker: 1
Command: set_shader_hash
hash: db5d68a6143611ad, stage: 20
Set arg: 0 (#0)
Set arg: 18446603340520357888 (#ffff800100400000)
Command: root_desc
Set arg: 0 (#0)
Set arg: 18446603340788793344 (#ffff800110400000)
Command: root_desc
Tag: ExecuteIndirect [MaxCommandCount, ArgBuffer cookie, ArgBuffer offset, Count cookie, Count offset]
Set arg: 1 (#1)
Set arg: 7 (#7)
Set arg: 16 (#10)
Set arg: 0 (#0)
Set arg: 0 (#0)
Set arg: 0 (#0)
Command: execute_indirect_unroll_compute
Command: bottom_marker
marker: 1
Command: top_marker
marker: 2
Command: execute_indirect
Command: bottom_marker

... A ton of commands

Command: barrier
Command: bottom_marker
marker: 44
===== Potential crash region END =====

Instead, with syncshaders, it becomes:

===== Potential crash region BEGIN (make sure RADV_DEBUG=syncshaders is used for maximum accuracy) =====
Command: top_marker
marker: 1
Command: set_shader_hash
hash: db5d68a6143611ad, stage: 20
Set arg: 0 (#0)
Set arg: 18446603340520357888 (#ffff800100400000)
Command: root_desc
Set arg: 0 (#0)
Set arg: 18446603340788793344 (#ffff800110400000) <-- bogus pointer
Command: root_desc
Tag: ExecuteIndirect [MaxCommandCount, ArgBuffer cookie, ArgBuffer offset, Count cookie, Count offset]
Set arg: 1 (#1)
Set arg: 7 (#7)
Set arg: 16 (#10)
Set arg: 0 (#0)
Set arg: 0 (#0)
Set arg: 0 (#0)
Command: execute_indirect_unroll_compute
Command: bottom_marker
marker: 1
===== Potential crash region END =====

That’s actionable.

It’s more widely supported than you’d expect

A lot of drivers actually support the buffer marker vendor extension, at least in Mesa land, and even NVIDIA (although, on NVIDIA we use another extension for breadcrumb purposes …)

Narrow down multiple queues

With async compute, it’s possible that multiple command streams are in flight, and with breadcrumbs, it’s not possible to determine which queue actually faulted. To aid this, we have VKD3D_CONFIG=single_queue which serializes everything into one VkQueue.

VK_NV_device_diagnostic_checkpoints

The NV vendor extension simplifies things a fair bit. Rather than allocating memory in sysmem and manually writing out markers, one call is made after every command:

VK_CALL(vkCmdSetCheckpointNV(list->vk_command_buffer,
        NV_ENCODE_CHECKPOINT(context, trace->counter)));

The argument is a void pointer where you can place whatever you want, so we encode command list index and a counter there. On device fault, you can then query checkpoints from your queues.

VK_CALL(vkGetQueueCheckpointDataNV(vk_queue, &checkpoint_count, NULL));
if (checkpoint_count == 0)
  return;

checkpoints = vkd3d_calloc(checkpoint_count, sizeof(VkCheckpointDataNV));
for (i = 0; i < checkpoint_count; i++)
  checkpoints[i].sType = VK_STRUCTURE_TYPE_CHECKPOINT_DATA_NV;
VK_CALL(vkGetQueueCheckpointDataNV(vk_queue,
        &checkpoint_count, checkpoints));

From there, start looking for TOP_OF_PIPE and BOTTOM_OF_PIPE pipeline stages to get a potential range of commands. BOTTOM_OF_PIPE means we know for sure all commands before completed execution, and TOP_OF_PIPE means the command processor might have started executing all commands up to that point.
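
In rough pseudo-code (NV_DECODE_CHECKPOINT here is just a hypothetical inverse of the encoding macro above):

uint32_t last_complete = 0, last_started = 0;
for (i = 0; i < checkpoint_count; i++)
{
    uint32_t marker = NV_DECODE_CHECKPOINT(checkpoints[i].pCheckpointMarker);
    if (checkpoints[i].stage == VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT &&
        marker > last_complete)
        last_complete = marker;
    else if (checkpoints[i].stage == VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT &&
             marker > last_started)
        last_started = marker;
}

// The crash lies somewhere in the (last_complete, last_started] range.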

The main flaw with this extension is there is no easy way to narrow down the range of commands. With RADV we can enforce sync with syncshaders as a (very useful) hack, but there is no such method on NV unless we do it ourselves 🙁

Drilling into hard rock

If we can narrow down a breadcrumb to a specific shader – and it’s reproducible – it might be the time to perform the dark art of shader replacement and GPU debug printing. We know where it crashes, but why is still a mystery.

Shader replacement is a special kind of magic that vkd3d-proton needs to consider since we have no means of modifying the original game shaders directly. We get (incomprehensible) DXIL which we then translate into SPIR-V. vkd3d-proton supports bypassing the translation step, and using our own SPIR-V instead. This SPIR-V can be instrumented with debug code which lets us drill down. This is a horribly slow and tedious process, but it’s the only thing we have to inspect shader execution in real time.

Dumping DXIL and SPIR-V

First, we have to dump all shaders used by a game.

mkdir /tmp/shaders
VKD3D_SHADER_DUMP_PATH=/tmp/shaders %command%

From a breadcrumb trace, we’ll hopefully know the shader hashes which are relevant to look at, and we can round-trip them to Vulkan GLSL using SPIRV-Cross.

mkdir /tmp/override
cp /tmp/shaders/$hash.* /tmp/override
cd /tmp/override
python3 ~/git/dxil-spirv/roundtrip_shaders.py --input . --output . --spirv-cross $PATH_TO_SPIRV_CROSS

From here, we can modify the shader as we please.

$ cat /tmp/override/db5d68a6143611ad.comp
#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout(buffer_reference) buffer _7;
layout(buffer_reference, buffer_reference_align = 4, std430) buffer _7
{
   uint _m0[];
};

layout(push_constant, std430) uniform push_cb
{
   uvec2 va0;
} _13;

void main()
{
  uint _22 = uint(0) * 1u;
  uint _28 = atomicAdd(_7(_13.va0)._m0[_22 + (uint(0) >> 2u)], 1u);
}

It’s pretty obvious where it crashes here, but to demonstrate …

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

#include "debug_channel.h"

layout(buffer_reference) buffer _7;
layout(buffer_reference, buffer_reference_align = 4, std430) buffer _7
{
  uint _m0[];
};

layout(push_constant, std430) uniform push_cb
{
  uvec2 va0; // Root descriptor raw address.
} _13;

void main()
{
  DEBUG_CHANNEL_INIT(gl_GlobalInvocationID);
  uint _22 = uint(0) * 1u;

  DEBUG_CHANNEL_MSG(1);
  DEBUG_CHANNEL_MSG(_13.va0.x, _13.va0.y);
  uint _28 = atomicAdd(_7(_13.va0)._m0[_22 + (uint(0) >> 2u)], 1u);
  DEBUG_CHANNEL_MSG(2);
}

cd /tmp/override
make M=$(pwd) -C ~/git/vkd3d-proton/include/shader-debug
make: Entering directory '/home/maister/git/vkd3d-proton/include/shader-debug'
glslc -o /tmp/override/db5d68a6143611ad.spv /tmp/override/db5d68a6143611ad.comp -I/home/maister/git/vkd3d-proton/include/shader-debug --target-env=vulkan1.1 --target-spv=spv1.4 
make: Leaving directory '/home/maister/git/vkd3d-proton/include/shader-debug'

Now we can run again with:

VKD3D_SHADER_OVERRIDE=/tmp/override VKD3D_SHADER_DEBUG_RING_SIZE_LOG2=30 VKD3D_CONFIG=breadcrumbs %command%
25af:info:vkd3d_shader_debug_ring_print_message: Shader: db5d68a6143611ad: Instance 0000000000, ID (0, 0, 0): 1
25af:info:vkd3d_shader_debug_ring_print_message: Shader: db5d68a6143611ad: Instance 0000000000, ID (0, 0, 0): #50400000, #ffff8001

As expected, the shader did not reach 2, because it crashed. The address also correlates with dmesg:

[ 2551.614976] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:56 vmid:1 pasid:32771, for process d3d12 pid 9633 thread d3d12 pid 9633)
[ 2551.614985] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800150400000 from client 0x1b (UTCL2)
[ 2551.614988] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x001C1070
[ 2551.614990] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 2551.614992] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 2551.614994] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 2551.614996] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x7
[ 2551.614998] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 2551.614999] amdgpu 0000:0b:00.0: amdgpu: RW: 0x1
[ 2611.649035] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.2.0 timeout, signaled seq=35, emitted seq=36

As you can imagine, this is not a fun process when debugging games with 3 ksloc+ long shaders with tons of resource access. To aid this process, we really need UMR …

To make debug print work in crash scenarios, we need to use the same trick as buffer markers, i.e., make the print ring buffer device coherent and uncached.

The holy grail – UMR wave dumps

If all else fails, we have a trump card. This tool is unique to AMD + Linux as far as I know and it lets us inspect all waves which were executing on the GPU at the time of crash. It’s developed by AMD on freedesktop. Alternatively, just install umr-git from AUR if you’re on Arch Linux.

  • ISA dumps
  • Corresponding SPIR-V dump
  • SGPR / VGPR register dumps
  • Which instruction was being executed in every wave
  • GPU disassembly around the crash site

Now this is the real deal. RADV can invoke UMR on crashes and dump out a bunch of useful information. The UMR tool is standalone and should work with AMDVLK or amdgpu-pro as well. Nothing stops you from invoking the same CLI commands that RADV does while the device is hung.

Some preparation

UMR needs to do pretty deep kernel poking, so permissions will be required. First, we have to expose the debug interfaces. Run this as super user after every boot:

#!/bin/bash

chmod 777 /sys/kernel/debug
chmod -R 777 /sys/kernel/debug/dri

If you’re on a multi-GPU system (a good idea if you’re debugging hangs all day every day), it’s possible that the AMD GPU won’t be DRI instance 0.

ls /sys/kernel/debug/dri/*/amdgpu_gfxoff
/sys/kernel/debug/dri/1/amdgpu_gfxoff

If the instance is not 0, RADV currently does not forward the proper arguments to UMR, so check out this Mesa MR for now and add

RADV_UMR_EXTRA_ARGS="--instance 1"

Hopefully a more automatic solution will present itself soon.

A note about Steam Linux Runtime

If you’re trying to debug games from within Steam, Proton or native, the pressure-vessel container will sandbox away /usr/bin/umr, so you’ll need to bypass the sandbox somehow. The details are out of scope here.

Graphics queue only

Currently, RADV only knows how to dump the GFX ring, so we need to ensure only that queue is used if crashes happen in async compute. In vkd3d-proton, we have VKD3D_CONFIG=single_queue for that purpose.

Bracing for impact

RADV_DEBUG=hang VKD3D_CONFIG=single_queue %command%

In RADV, this does a few things:

  • syncshaders is implied
  • After every queue submission, RADV waits for idle. If it times out, it is considered a hang
  • Dump a bunch of internal state, disassemblies, etc
  • Invoke UMR to provide wave dumps

It’s also possible to add this debug option to make page faults a little nicer to debug, but usually not needed:

ACO_DEBUG=force-waitcnt

Note that while in hang debug mode, games will usually run at less than half performance due to the aggressive synchronization.

****************************************************************
* WARNING: RADV_DEBUG=hang is costly and should only be used for debugging! *
****************************************************************
radv: GPU hang detected...
radv: GPU hang report will be saved to '/home/maister/radv_dumps_22769_2023.08.20_12.09.28'!
dmesg: read kernel buffer failed: Operation not permitted
radv: GPU hang report saved successfully!

RADV will try to dump dmesg too, but you probably won’t have the permissions; it’s not a big deal.

There’s a lot of useful information here, time to bring out your text editor.

-rw-r--r-- 1 maister maister 780 Aug 20 12:09 1919677f221dcd2d53a149d71db43d179e35dac3.spv
-rw-r--r-- 1 maister maister 152 Aug 20 12:09 app_info.log
-rw-r--r-- 1 maister maister 5429 Aug 20 12:09 bo_history.log
-rw-r--r-- 1 maister maister 56 Aug 20 12:09 bo_ranges.log
-rw-r--r-- 1 maister maister 27 Aug 20 12:09 dmesg.log
-rw-r--r-- 1 maister maister 4927 Aug 20 12:09 gpu_info.log
-rw-r--r-- 1 maister maister 2403 Aug 20 12:09 pipeline.log
-rw-r--r-- 1 maister maister 11231 Aug 20 12:09 registers.log
-rw-r--r-- 1 maister maister 106478 Aug 20 12:09 trace.log
-rw-r--r-- 1 maister maister 826454 Aug 20 12:09 umr_ring.log
-rw-r--r-- 1 maister maister 11388 Aug 20 12:09 umr_waves.log

First, the SPIR-V relevant to the faulting PSO is dumped. In vkd3d-proton, we emit the DXBC/DXIL hash inside the SPIR-V to correlate back to breadcrumbs or shader dumps.

 %29 = OpString "db5d68a6143611ad.dxbc"

pipeline.log

This is a straight up ISA dump of NIR -> ACO -> AMD ISA.

DISASM:
BB0:
v_mov_b32_e32 v0, 1 ; 7e000281
v_mov_b32_e32 v1, 0 ; 7e020280
global_atomic_add v1, v0, s[2:3] ; dcc88000 00020001
s_endpgm ; bf810000

bo_history.log

This logs all allocations and frees. It’s intended to be parsed with e.g.

python ~/git/mesa/src/amd/vulkan/radv_check_va.py bo_history.log 0x0000800110400000

This is mostly useful to prove application use-after-free or similar.

umr_waves.log

This is the most important dump of all. An entry is made for every active wave. It’s very verbose, but the important parts I think are:

Main Registers:
pc_hi: 00008000 | pc_lo: 00071d10 | wave_inst_dw0: bf810000 | exec_hi: 00000000 | 
exec_lo: 00000001 | m0: f97b2781 | ib_dbg1: 01000000 |
SGPRS:
[ 0.. 3] = { f34d388c, a09e6289, 10400000, ffff8001 }
[ 4.. 7] = { 0000fff0, 00000000, 0000003b, 00000000 }
[ 8.. 11] = { b5cc397c, 650307c8, 138f6ec2, e0c2bfd0 }
[ 12.. 15] = { 3a19d354, 491b2107, 9b54b044, 40a40138 }
...
[ 104.. 107] = { 29bea6e2, ba67090a, 577b6776, 2b42adc6 }
VGPRS: t00 (t01) (t02) (t03) (t04) (t05) (t06) (t07) (t08) (t09) (t10) (t11) (t12) (t13) (t14) (t15) (t16) (t17) (t18) (t19) (t20) (t21) (t22) (t23) (t24) (t25) (t26) (t27) (t28) (t29) (t30) (t31) 
[ 0] = { 00000001 2a50a935 abb6dbce 7ff6fd9b 44024271 3182e80b afb57f2e 156403ff fc122e1a fe904906 13d03a6d 03dc3aa9 834eac40 044c0cf3 467e8798 15aec7d3 a4fd12c5 d1018c88 e77eea3b ce77e044 8e0af2f4 308960e9 7551b5b5 c97c7bee 3b030401 e81d4060 0b52c8cc d6377322 3bc1ac00 0901eb09 b4256737 b1fe11f2 }
[ 1] = { 00000000 09434004 46793b03 67fc758a 0000cd69 a1080140 bdd45dd9 252155aa 082c15d0 00475606 76027c6d 54ba72a2 92741152 06345667 bb634172 bd9c7f6a 8a701309 b3340805 aa2772f1 2f05cdee 7b478c94 80020493 93da61e0 f9052bce 6414435f 80971172 48c70d29 8033f3d8 928489a9 25521422 52dd60ac 4ee699be }
...
 [ 15] = { 6c84febd fff7d1f8 0036b433 922dddef d3d7197e ddfffb89 62247e7e 61af41c7 4ff21fb3 6d1d41b4 880a4918 8cde3ed6 3c72df4f ff76fef5 abe16b69 000b7ac1 d4b325ae 37fedc5d ed3f9301 870dfa3b abfa3757 d30dede1 42016b8e 785bb894 d9e3ba1c 9b7bbdad 6220918b 35729313 bcfffbdc b84f64ed 20d775d6 52135245 }
PGM_MEM:

pgm[7@0x800000071cf0 + 0x0 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x4 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x8 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0xc ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x10 ] = 0x7e000281 v_mov_b32_e32 v0, 1 
pgm[7@0x800000071cf0 + 0x14 ] = 0x7e020280 v_mov_b32_e32 v1, 0 
pgm[7@0x800000071cf0 + 0x18 ] = 0xdcc88000 global_atomic_add v1, v0, s[2:3] 
pgm[7@0x800000071cf0 + 0x1c ] = 0x00020001 ;; 
* pgm[7@0x800000071cf0 + 0x20 ] = 0xbf810000 s_endpgm 
pgm[7@0x800000071cf0 + 0x24 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x28 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x2c ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x30 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x34 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x38 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x3c ] = 0xbf9f0000 s_code_end 
End of disassembly.
TRAPSTS[50000100]:
excp: 256 | illegal_inst: 0 | buffer_oob: 0 | excp_cycle: 0 | 
excp_wave64hi: 0 | dp_rate: 2 | excp_group_mask: 0 | utc_error: 1 |

Here we can see the faulting instruction is the global_atomic_add. It’s using the address in SGPR 2/3, which we can see is … 10400000, ffff8001, which in little endian is 8001’10400000. Only the lower 48 bits are relevant, and if we look at the page fault, the address matches.
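
To spell out that little-endian reconstruction (the values are the ones from the wave dump and the bo_history query above; only the address math matters here):

uint32_t sgpr2 = 0x10400000u;              // low dword
uint32_t sgpr3 = 0xffff8001u;              // high dword
uint64_t va = (uint64_t(sgpr3) << 32) | sgpr2;
va &= (uint64_t(1) << 48) - 1;             // only the low 48 bits are a VA
// va == 0x0000800110400000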

If the fault happened in a descriptor-based instruction, we can inspect the descriptor as well, since on AMD, descriptors are consumed in the instruction itself. It’s really convenient in situations like these. 🙂

Correlating fault site back to HLSL or GLSL needs to be done manually, but it is not particularly difficult.

trace.log

This can be used to inspect the PM4 stream. I rarely find it actionable, but it’s nice to have regardless. It adds breadcrumbs, but at the lowest level possible. The most useful thing (for me at least) is that we can inspect the user registers being set.

 Trace point ID: 2
!!!!! This is the last trace point that was reached by the CP !!!!!
... blablabla
10400000 COMPUTE_USER_DATA_2 <- 0x10400000
ffff8001 COMPUTE_USER_DATA_3 <- 0xffff8001
... blablabla
!!!!! This trace point was NOT reached by the CP !!!!!

Real-world debug scenario – Street Fighter 6 GPU hangs

After the last Rashid update (of course this dropped while I was on vacation, hnnnnng), users were observing random GPU hangs during online play, which is the worst kind of GPU hang. After some online play in custom rooms with people on Discord ready to join the breadcrumb crusade, I was able to reproduce the hang at a ~10% rate. What followed was quite the goose chase.

Identify a faulting shader

The breadcrumb trace always pointed to the same shader, which is a good start. We observed a page fault, so it could have been anything; use-after-free or OOB access is the usual cause here. The address did not correspond to any resource however, which made things extra fun.

Shader replacement?

Replacing shaders seemed to mask the bug, which was rather puzzling. Usually this points to a Vulkan driver bug, but that lead got us nowhere. When dealing with low repro rate random GPU hangs, this is always the worst, since we don’t know if we were just very unlucky with repros, or if the change actually made a difference …

UMR to the rescue

(I didn’t have the old crash dumps lying around, so please excuse the lack of ISA spam. I didn’t feel like spending my Sunday reproducing a bug that I already fixed :v)

Sometimes RADV_DEBUG=hang masks bugs as well due to extra sync, but fortunately we got a wave dump eventually. The failure was in a scalar load from a raw pointer. Normally, this means an out-of-bounds root CBV descriptor access.

First hint was that this was loading 8 dwords in one go, i.e. an image descriptor. I correlated the ISA with the Vulkan GLSL disassembly and it pointed to this code:

vec4 _1132 = textureLod(nonuniformEXT(samplerCube(_32[_1129], _56[_146])), vec3(_1097, _1098, _1099), _1123 * 4.0);

It was also bindless. Normally, my spider senses would immediately think that this was an out of bounds descriptor heap access.

The descriptor index was computed as root table offset + dynamic offset. Studying the ISA I realized that it was not actually the dynamic offset that was the culprit, but rather the root table offset. Figuring this out would have taken an eternity without SGPR dumps.

From the PM4 trace, I was then able to confirm that the SGPR root table offset correlated with vkCmdPushConstants on our end.

This was rather puzzling … I had a theory that maybe our root parameter flushing code had a bug, so I added extra instrumentation to our breadcrumbs …

Another crash later, I was able to prove that on a GPU fault:

  • Root table #10 was never set in the command list
  • The shader in question accessed a descriptor array which maps to root table #10
  • The root table #10 read an undefined offset as a result

Game bug, oops! I wrote some tests to study this UB scenario, and it turns out it does not even trigger an error in the D3D12 validation layers (yaaaay <_<). It’s possible to trigger GPU hangs on the native AMD D3D12 driver this way. Maybe they app-opt it on their end for RE Engine, we’ll never know 🙂

Our workaround was to emit offset 0 for every unset root table access and the crash went away. …

Conclusion

For hardcore GPU debugging on PC, I think RADV currently provides the best experience by far. If you’re stuck debugging hangs in D3D12 on Windows, maybe give RADV + vkd3d-proton a shot. GPU hang recovery on amdgpu is sadly still questionable on Linux, but I have a good time as long as the AMD GPU is not driving the desktop session. I suggest multi-GPU here for an enjoyable experience. I’m also hoping this functionality can be added to the newly released RGD by AMD.

Vulkan video shenanigans – FFmpeg + RADV integration experiments

Vulkan video is finally here and it’s a fierce battle to get things working fully. The leaders of the pack right now with the full release are RADV (Dave Airlie) and FFmpeg (Lynne).

In Granite, I’ve been wanting a solid GPU video decoding solution and I figured I’d work on a Vulkan video implementation over the holidays to try helping iron out any kinks with real-world application integration. The goal was achieving everything a 3D engine could potentially want out of video decode.

  • Hardware accelerated
  • GPU decode to RGB without round-trip through system memory (with optional mip generation when placed in a 3D world)
  • Audio decode
  • A/V synchronization

This blog is mostly here to demonstrate the progress in FFmpeg + RADV. I made a neat little sample app that fully uses Vulkan video to do a simple Sponza cinema. It supports A/V sync and seeking, which covers most of what a real media player would need. Ideally, this can be used as a test bench.

Place a video feed as a 3D object inside Sponza, why not?

Introduction blog post – read this first

This blog post by Lynne summarizes the state of Vulkan video at the time it was written. Note that none of this is merged upstream as of writing and APIs are changing rapidly.

Building FFmpeg + RADV + Granite

FFmpeg

Make sure to install the very latest Vulkan headers. On Arch Linux, install vulkan-headers-git from AUR for example.

Check out the branch in the blog and build. Make sure to install it in some throwaway prefix, e.g.

./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan --prefix=$HOME/ffmpeg-vulkan

Mesa

Check out https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode. Then build with:

mkdir build
cd build
meson setup .. -Dvideo-codecs=h264dec,h265dec --buildtype release
ninja

Granite

git clone https://github.com/Themaister/Granite
cd Granite
git submodule update --init
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DGRANITE_FFMPEG=ON -DGRANITE_AUDIO=ON -DGRANITE_FFMPEG_VULKAN=ON -G Ninja -DCMAKE_PREFIX_PATH=$HOME/ffmpeg-vulkan
ninja video-player

Running test app

Basic operation: a weird video player where the image is a flat 3D object floating in space. For fun, the video is also mip-mapped and the plane is anisotropically filtered, because why not.

RADV_PERFTEST=video_decode GRANITE_FFMPEG_VULKAN=1 ./tests/video-player /tmp/test.mkv

Controls
  • WASD: move camera
  • Arrow keys: rotate camera
  • Space: Toggle pause
  • HJKL: Vim style for seeking

If you have https://github.com/KhronosGroup/glTF-Sample-Models checked out you can add a glTF scene as well for fun. I hacked it together with Sponza in mind, so:

RADV_PERFTEST=video_decode GRANITE_FFMPEG_VULKAN=1 ./tests/video-player $HOME/git/glTF-Sample-Models/2.0/Sponza/glTF/Sponza.gltf /tmp/test.mkv

and then you get the screenshot above with whatever video you’re using 🙂

Integration API

The Granite implementation can be found in https://github.com/Themaister/Granite/blob/master/video/ffmpeg_decode.cpp. It will probably be different in the final upstreamed version, so beware. I’m not an FFmpeg developer either FWIW, so take this implementation with a few grains of salt.

To integrate with Vulkan video, there are some steps we need to take. This assumes some familiarity with FFmpeg APIs. This is mostly interesting for non-FFmpeg developers. I had to figure this out with help from Lynne, spelunking in mpv and looking over the hardware decode samples in FFmpeg upstream.

Creating shared device

Before opening the decode context with:

avcodec_open2(ctx, codec, nullptr)

we will provide libavcodec with a hardware device context. Use

avcodec_get_hw_config(codec, index)

to scan through until you find a Vulkan configuration.
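
In practice that scan is just a small loop over the codec’s hardware configs. A minimal sketch assuming the stock libavcodec API, not Granite’s exact code:

const AVCodecHWConfig *config = nullptr;
for (int i = 0; ; i++)
{
    const AVCodecHWConfig *candidate = avcodec_get_hw_config(codec, i);
    if (!candidate)
        break; // Codec has no Vulkan support.
    if (candidate->device_type == AV_HWDEVICE_TYPE_VULKAN &&
        (candidate->methods & AV_CODEC_HW_CONFIG_METHOD_HW_DEVICE_CTX) != 0)
    {
        config = candidate;
        break;
    }
}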

AVBufferRef *hw_dev = av_hwdevice_ctx_alloc(config->device_type);
auto *hwctx = reinterpret_cast<AVHWDeviceContext *>(hw_dev->data);
auto *vk = static_cast<AVVulkanDeviceContext *>(hwctx->hwctx);

hwctx->user_opaque = this; // For callbacks later.

To interoperate with FFmpeg, we have to provide it our own Vulkan device and lots of information about how we created the device.

vk->get_proc_addr = Vulkan::Context::get_instance_proc_addr();
vk->inst = device->get_instance();
vk->act_dev = device->get_device();
vk->phys_dev = device->get_physical_device();
vk->device_features = *device->get_device_features().pdf2;
vk->enabled_inst_extensions =
  device->get_device_features().instance_extensions;
vk->nb_enabled_inst_extensions =
  int(device->get_device_features().num_instance_extensions);
vk->enabled_dev_extensions =
  device->get_device_features().device_extensions;
vk->nb_enabled_dev_extensions =
  int(device->get_device_features().num_device_extensions);

Fortunately, I had most of this query scaffolding in place for Fossilize integration already. Vulkan 1.3 core is required here as well, so I had to bump that too when Vulkan video is enabled.

auto &q = device->get_queue_info();

vk->queue_family_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_GRAPHICS]);
vk->queue_family_comp_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_COMPUTE]);
vk->queue_family_tx_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_TRANSFER]);
vk->queue_family_decode_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_VIDEO_DECODE]);

vk->nb_graphics_queues = int(q.counts[Vulkan::QUEUE_INDEX_GRAPHICS]);
vk->nb_comp_queues = int(q.counts[Vulkan::QUEUE_INDEX_COMPUTE]);
vk->nb_tx_queues = int(q.counts[Vulkan::QUEUE_INDEX_TRANSFER]);
vk->nb_decode_queues = int(q.counts[Vulkan::QUEUE_INDEX_VIDEO_DECODE]);

vk->queue_family_encode_index = -1;
vk->nb_encode_queues = 0;

We need to let FFmpeg know how it can query queues. This is a close match with Granite’s own queue setup, but I had to add some extra APIs to make it work.

We also need a way to lock Vulkan queues:

vk->lock_queue = [](AVHWDeviceContext *ctx, int, int) {
   auto *self = static_cast<Impl *>(ctx->user_opaque);
   self->device->external_queue_lock();
};

vk->unlock_queue = [](AVHWDeviceContext *ctx, int, int) {
   auto *self = static_cast<Impl *>(ctx->user_opaque);
   self->device->external_queue_unlock();
};

For integration purposes, not making vkQueueSubmit internally synchronized in Vulkan was a mistake I think, oh well.

Once we’ve created a hardware context, we can let the codec context borrow it:

hw.device = av_hwdevice_ctx_init(hw_dev); // Unref later.
ctx->hw_device_ctx = av_buffer_ref(hw.device);

We also have to override get_format() and return the hardware pixel format.

ctx->opaque = this;
ctx->get_format = [](
    AVCodecContext *ctx,
    const enum AVPixelFormat *pix_fmts) -> AVPixelFormat {
  auto *self = static_cast<Impl *>(ctx->opaque);
  while (*pix_fmts != AV_PIX_FMT_NONE)
  {
    if (*pix_fmts == self->hw.config->pix_fmt)
      return *pix_fmts;
    pix_fmts++;
  }

  return AV_PIX_FMT_NONE;
};

This will work, but we’re also supposed to create a frames context before returning from get_format(). This also lets us configure how Vulkan images are created.

int ret = avcodec_get_hw_frames_parameters(
      ctx, ctx->hw_device_ctx,
      AV_PIX_FMT_VULKAN, &ctx->hw_frames_ctx);
// Check error.

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
auto *vk = static_cast<AVVulkanFramesContext *>(frames->hwctx);

vk->img_flags |= VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT;

ret = av_hwframe_ctx_init(ctx->hw_frames_ctx);
// Check error.

The primary motivation for overriding image creation was that I wanted to do YCbCr to RGB conversion in a more unified way, i.e. using individual planes. That would be compatible with non-Vulkan video as well, but taking plane views of an image requires VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT.

Using per-plane views is important, as we’ll see later. YCbCr samplers fall flat when dealing with practical video use cases.

Processing AVFrames

In FFmpeg, decoding works by sending AVPackets to a codec and it spits out AVFrame objects. If these frames are emitted by a software codec, we just poke at AVFrame::data[] directly, but with hardware decoders, AVFrame::pix_fmt is an opaque type.
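
The pump itself is the canonical libavcodec send/receive loop; this is just for context and is not Granite’s exact code:

// Feed demuxed packets to the codec ...
avcodec_send_packet(ctx, pkt);

// ... and drain any frames that are ready.
for (;;)
{
    AVFrame *av_frame = av_frame_alloc();
    if (avcodec_receive_frame(ctx, av_frame) < 0)
    {
        // AVERROR(EAGAIN) means "feed more packets", AVERROR_EOF means fully drained.
        av_frame_free(&av_frame);
        break;
    }
    // Hand av_frame over for upload / processing.
}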

There are two ways we can deal with this. For non-Vulkan hardware decoders, just read-back and upload planes to a VkBuffer staging buffer later, ewwww.

AVFrame *sw_frame = av_frame_alloc();

if (av_hwframe_transfer_data(sw_frame, av_frame, 0) < 0)
{
   LOGE("Failed to transfer HW frame.\n");
   av_frame_free(&sw_frame);
   av_frame_free(&av_frame);
}
else
{
   sw_frame->pts = av_frame->pts;
   av_frame_free(&av_frame);
   av_frame = sw_frame;
}

Each hardware pixel format lets you reinterpret AVFrame::data[] in a “magical” way if you’re willing to poke into low-level data structures. For VAAPI, VDPAU and APIs like that there are ways to use buffer sharing somehow, but the details are extremely hairy and best left to experts. For Vulkan, we don’t even need external memory!

First, we need to extract the decode format:

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
active_upload_pix_fmt = frames->sw_format;

Then we can query the VkFormat if we want to stay multi-plane.

auto *hwdev =
  reinterpret_cast<AVHWDeviceContext *>(hw.device->data);
const VkFormat *fmts = nullptr;
VkImageAspectFlags aspects;
VkImageUsageFlags usage;
int nb_images;

int ret = av_vkfmt_from_pixfmt2(hwdev, active_upload_pix_fmt,
                                VK_IMAGE_USAGE_SAMPLED_BIT, &fmts,
                                &nb_images, &aspects, &usage);

However, this has some pitfalls in practice. Video frames tend to be aligned to a macro-block size or similar, meaning that the VkImage dimension might not be equal to the actual size we’re supposed to display. Even 1080p falls in this category for example since 1080 does not cleanly divide into 16×16 macro blocks. The only way to resolve this without extra copies is to view planes separately with VK_IMAGE_ASPECT_PLANE_n_BIT and do texture coordinate clamping manually. This way we avoid sampling garbage when converting to RGB. av_vkfmt_from_pixfmt can help here to deduce the per-plane Vulkan formats, but I just did it manually either way.

// Real output size.
ubo.resolution = uvec2(video.av_ctx->width, video.av_ctx->height);

if (video.av_ctx->hw_frames_ctx && hw.config &&
    hw.config->device_type == AV_HWDEVICE_TYPE_VULKAN)
{
   // Frames (VkImages) may be padded.
   auto *frames = reinterpret_cast<AVHWFramesContext *>(
       video.av_ctx->hw_frames_ctx->data);
   ubo.inv_resolution = vec2(
       1.0f / float(frames->width),
       1.0f / float(frames->height));
}
else
{
   ubo.inv_resolution = vec2(1.0f / float(video.av_ctx->width),
                             1.0f / float(video.av_ctx->height));
}

// Have to emulate CLAMP_TO_EDGE to avoid filtering against garbage.
ubo.chroma_clamp =
  (vec2(ubo.resolution) - 0.5f * float(1u << plane_subsample_log2[1])) *
  ubo.inv_resolution;

Processing the frame itself starts with magic casts:

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
auto *vk = static_cast<AVVulkanFramesContext *>(frames->hwctx);
auto *vk_frame = reinterpret_cast<AVVkFrame *>(av_frame->data[0]);

We have to lock the frame while accessing it, since FFmpeg is threaded.

vk->lock_frame(frames, vk_frame);
// Do stuff
vk->unlock_frame(frames, vk_frame);

Now, we have to wait on the timeline semaphore (note that Vulkan 1.3 is required, so this is guaranteed to be supported).

// Acquire the image from FFmpeg.
if (vk_frame->sem[0] != VK_NULL_HANDLE && vk_frame->sem_value[0])
{
   // vkQueueSubmit(wait = sem[0], value = sem_value[0])
}
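
In raw Vulkan terms, that wait submit looks roughly like this. Granite wraps it in its own submission machinery, and queue here is whatever queue does the conversion, so treat this as an illustrative sketch:

VkTimelineSemaphoreSubmitInfo timeline_info = { VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO };
timeline_info.waitSemaphoreValueCount = 1;
timeline_info.pWaitSemaphoreValues = &vk_frame->sem_value[0];

VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
submit.pNext = &timeline_info;
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &vk_frame->sem[0];
submit.pWaitDstStageMask = &wait_stage;
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

Releasing the image later is the mirror image: bump sem_value[0] and signal the same semaphore through pSignalSemaphoreValues.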

Create a VkImageView from the provided image. Based on av_vkfmt_from_pixfmt2 or per-plane formats from earlier, we know the appropriate Vulkan format to use when creating a view.
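
A hedged sketch for an NV12-style frame (VK_FORMAT_G8_B8R8_2PLANE_420_UNORM), viewing the chroma plane as an R8G8 texture; the per-plane format comes from av_vkfmt_from_pixfmt2 or manual deduction as described above:

VkImageViewCreateInfo view_info = { VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO };
view_info.image = vk_frame->img[0];
view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
view_info.format = VK_FORMAT_R8G8_UNORM; // Chroma plane.
view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_PLANE_1_BIT;
view_info.subresourceRange.levelCount = 1;
view_info.subresourceRange.layerCount = 1;

VkImageView chroma_view;
vkCreateImageView(device, &view_info, nullptr, &chroma_view);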

Queue family ownership transfer is not needed. FFmpeg uses CONCURRENT for the sake of our sanity.

Transition the layout:

cmd->image_barrier(
    *wrapped_image,
    vk_frame->layout[0],
    VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT /* sem wait stage */, 0,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_ACCESS_2_SHADER_SAMPLED_READ_BIT);

vk_frame->layout[0] = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

Now, we can convert this to RGB as we desire. I went with an async compute formulation. If this were a pure video player we could probably blit this directly to screen with some fancy scaling filters.

When we’re done, we have to “release” the image back to FFmpeg.

// Release the image back to FFmpeg.
if (vk_frame->sem[0] != VK_NULL_HANDLE)
{
   vk_frame->sem_value[0] += 1;
   // vkQueueSubmit(signal = sem[0], value = sem_value[0]);
}

And that’s it!

Test results

I tried various codec configurations to see the state of things.

RADV
  • H.264 – 8bit: Works
  • H.264 – 10bit: Not supported by hardware
  • H.265 – 8bit: Works
  • H.265 – 10bit: Works
nvidia
  • H.264: Broken
  • H.265: Seems to work
ANV

There’s a preliminary branch by Airlie again, but it doesn’t seem to have been updated for final spec yet.

Conclusion

Exciting times for Vulkan video. The API is ridiculously low level and way too complicated for mere graphics programming mortals, which is why having first class support in FFmpeg and friends will be so important to make the API usable.

My personal hell of translating DXIL to SPIR-V – finale

It has been quite a while since the last post in this series, for good reason. Finding the energy to write this down simply does not happen overnight. It is not a coincidence that these kinds of blog posts tend to appear at the end of a long summer vacation!

As teased in the previous post, it is time to finally tackle the real meat of structured control flow. Reading that introductory post is mandatory to have any hope of understanding what is going on here, and to be fair, even with that knowledge, I’m not confident I can convey all the bizarre algorithms I made up to make this somewhat work.

Most of what I’m trying to explain here are ideas that have grown organically over time with little to no theoretical backing, and with hindsight there should be better approaches, but taking advantage of that likely needs a complete rewrite. Take this post with a grain (or three) of salt if you intend to write your own structurizer.

Ideally, I would have used an off-the-shelf implementation, but I never found one that satisfies:

  • Standalone
  • Convinces me that it will generate fast code, i.e. does not confuse the hell out of existing compilers with ridiculous branching patterns
  • Does not collapse into pure gibberish on mildly weird code (and DXC sure loves to emit weird code)
  • Understands the SPIR-V flavor of structured control flow
  • Has a concept of convergence
  • An algorithm I can actually understand and debug

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
— Brian W. Kernighan

My little DXIL hell blog series – previous entries

What the algorithm needs to do

The two major requirements of the structurizer are rewriting control flow, and marking the required structured information in SPIR-V, such as:

    • Selection merge: For any block that does conditional branch or switch
    • Loop merge + continue block: For any block that has a back-edge

If we can insert these constructs and make the SPIR-V validator pass, we’ve done our job and the code tends to “just werk”. In terms of quality of implementation, it is imperative that the control flow is as natural as possible. The output control flow (passed through SPIRV-Cross) is the only help we have in vkd3d-proton when debugging shaders, so anything that can help make the CFG more readable is extremely important. In more concrete terms, emitting the simplest possible structured construct is the goal.

Unfortunately, it is impossible to simply take the DXIL control flow, analyze it and insert structured information as necessary. Instead, we have to (heavily) modify the control flow graph first, so that it is possible to insert structured information while satisfying SPIR-V requirements later. Modifying a control flow graph is a massive topic, but before falling into that rabbit hole, I’ll try to outline some of the concepts at a higher level to help ramp up the brain clock frequency. In the sections after this, I’ll aim to provide concrete examples with graphviz illustrations.

Dummy basic blocks aiding merge constructs

I refer to these blocks as ladders. A requirement of SPIR-V is that a basic block can only be the merge target of one header. If multiple scopes want to merge to a specific block, we need ladders which can merge them one at a time.

There are exceptions to this rule. A loop merge can serve as a merge block for any selection construct contained within it, in the sense that selection constructs can just OpBranch to a loop merge as they please. It might be tempting to cheese the SPIR-V requirements by turning every complex selection construct into a block-like loop, but a significant hurdle is that there is no multi-break support. This is consistent with all higher level shading languages, since they don’t support goto. Even if we could, it would lead to unreadable garbage code which we need to avoid at all costs. Some structurization algorithms rely on this feature (WebAssembly), but we’ll need to tackle this on hard mode.

Remove fake merge blocks

Any block which has two or more predecessors is potentially hazardous and carries some semantic meaning. The implication is that we need to merge control flow. Of course, DXIL is completely unstructured and it is not well defined when control flow actually converges, but maximal convergence is a good initial assumption, where every block that has two or more predecessors must be a merge block. Unfortunately, this assumption will break down in certain edge cases. When we run into “impossible” scenarios that make no sense from a high level language point of view, we can take the foot off the pedal and reconsider if we really need to merge.

DXC tends to “merge” control flow in bizarre ways that do not necessarily reflect the intent of the original code. In the previous post, I covered the issue of subgroup operations in loops, but the issue can manifest in scenarios unrelated to subgroups as well.

Fortunately, the scenarios that I’ve seen are limited to basic blocks that exist only to emit a PHI opcode or two, and then actually perform the break. The solution is to simply split the block. It is generally safe to do this as long as the block is not control dependent, i.e. it has no derivative, barrier or subgroup operations.

Promote to loop blocks

A fairly common pattern is:

do
{
  blah();
  blah();
  if (foo) break;
  blah();
  blah();
} while(false);

There is no loop here, but in some situations we are kind of forced to insert a “block” construct. A block construct is a loop with a continue block that is not reachable in the CFG. This code is inevitable with inlined functions that return early.

A cute trick spirv-opt uses is an OpSwitch with one default case. It is also possible to break out of the switch this way. dxil-spirv did not go this route since we have to handle multi-break anyways, and a switch does not help with that.
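
In C-like pseudocode, the spirv-opt style block looks something like this; a break simply exits the construct without any loop being involved:

switch (0u)
{
default:
    blah();
    blah();
    if (foo) break;
    blah();
    blah();
}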

Rewriting multiple breaks

This gets extremely hairy, and dealing with it is just another Tuesday.

This is also a problem spirv-opt deals with. When inlining, it is possible to return from two inner loops like such:

void func()
{
  while (cond1)
  {
    a();
    while (cond2)
    {
      b();
      if (foo) return; // y u do dis ;____;
      c();
    }
    d();
  }
}

I have seriously contemplated “un-inlining” code to support shenanigans like this, but that is a level of insanity I am not willing to engage with.

What is a break? A miserable pile of heuristics

In selection constructs, we’ll end up in a situation where there is no clear merge block, i.e. the header does not dominate the immediate post-dominator, yet we have to emit a merge block for any selection. If our block branches to A and B, we might have to select one of three situations:

  • The branch to A is a “breaking construct”, B is not. Merge to B. Path in A will leave the selection construct through an outer switch or loop break.
  • Vice-versa above.
  • Both paths can be considered to be breaking constructs. Merge to an OpUnreachable dummy node.

This is harder than it sounds. Tie-breaking between A and B paths can be quite the ordeal.

Transposing control flow

There are some extremely rare scenarios where we need to transpose code. For example, code that lives inside a loop must be moved outside the loop for structurization to be possible. These shaders cause serious brain damage, be warned!

Note that transposing control flow is a term I pulled out of the aether. Such is life when you don’t really know what you’re doing.

If we hit this scenario, we must give up any hope of debugging the code, it will look incomprehensible at this point.

Switch blocks with free-flowing fall-through

Switch blocks in unstructured control flow are truly cursed. They can fall-through case labels, but we also find scenarios where multiple blocks fall-through to the same case label. Spicy!

Rewrite PHI nodes

When rewriting control flow, we will break any PHI nodes. One solution would have been to de-SSA the shader, but I went with the more principled approach and retained SSA form throughout. I’m not sure in hindsight that it was the correct solution due to its complexity, but de-SSA is not trivial either, especially with switch blocks that fall-through. SPIRV-Cross has some truly cursed code to deal with that, and it’s horrible enough that I did not want to replicate that here.

Infinite loop shenanigans

Infinite loops can be quite tricky to analyze since it is somewhat vague what it means to exit such a loop. Typically, it would be a return inside the loop. There are various horrible scenarios like these.

The transform passes

The structurizer is organized as a series of transform passes more or less. Once we have computed the CFG, we can perform a pass, at which point we need to recompute the CFG before the next pass. When we rewrite the CFG in any way, we technically have to recompute the CFG, but this is quite troublesome, since this kind of code would break:

for (auto &node : post_visit_order)
{
  if (something(node))
  {
    modify_cfg();
    recompute_cfg(); // Would affect post_visit_order
  }
}

Recomputing the CFG every time a branch changes would just be too slow, so the implementation tries to avoid recomputing the CFG until we have traversed through the entire CFG. To make this work, we have to recompute what we need in-place while modifying the CFG. For example, we have to patch in which blocks are immediate dominators and post-dominators. Reachability information can be inherited from other blocks by copying the post-visit order index. This can indeed lead to many subtle bugs, but I don’t know what a better solution would look like. It seems to work fine for simple scenarios where we don’t significantly rewrite the control flow and dominance and post-dominance relationships aren’t affected globally.

Some algorithms have extremely awkward transforms that must be implemented as a loop that terminates once there are no special cases left to handle, but these scenarios are also extremely rare, e.g.:

do
{
  did_work = false;
  for (auto &node : post_visit_order)
  {
    if (need_complex_weird_thing(node))
    {
      rewrite_branches_in_a_horrible_way();
      recompute_cfg();
      did_work = true;
      break;
    }
  }
} while (did_work);

As you’ll quickly figure out reading this, this is really just a series of jank heuristics layered on top of each other. There is no strong unifying algorithm to speak of here, but the advantage is that we slowly and surely chip away at the insanity LLVM emits into something that ends up looking acceptable. If we throw generic algorithms at it, I think the result will end up looking nothing like the input whatsoever and collapse into noise.

Stage 1 – Cleanup breaking PHI constructs

void CFGStructurizer::cleanup_breaking_phi_constructs()

As the very first thing we do, we aim to remove merge blocks that should not be considered to be a merge as these blocks will create problems for us later. Let’s consider this shader:

Block 3 here is problematic. Both 2 and 4 break, but it is illogical for there to be a block 3 here that merges execution. There is no way to express this merge in high level code without rewriting everything, so what we want to do instead is to split up the node.

The heuristic here is fairly simple. We’re looking for a block that has no operations associated with it, only PHI nodes. It must have exactly one successor. We also avoid removing blocks that are good loop merge candidates. If a back edge branches to the candidate block, we should not remove it, since it is a clear loop merge candidate. With these checks in mind, we can rewrite the PHI in 3 where the inputs from 2 and 4 are added directly to 5. The merge is now gone and we don’t have to consider it for later.
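
Restating the heuristic as a sketch (the field names and helper are illustrative, not necessarily the exact dxil-spirv API):

static bool is_breaking_phi_candidate(const CFGNode *node)
{
    // Only PHI nodes, no real operations.
    if (!node->ir.operations.empty())
        return false;
    // Exactly one successor to forward the PHI inputs to.
    if (node->succ.size() != 1)
        return false;
    // Keep good loop merge candidates: if a back edge branches here,
    // this block is likely the natural loop merge.
    for (auto *p : node->pred)
        if (branch_is_back_edge(p, node)) // hypothetical helper
            return false;
    return true;
}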

Stage 2 – Create continue block ladders

void CFGStructurizer::create_continue_block_ladders()

It is often the case that continue blocks (i.e. back edges) have multiple predecessors. Making the continue block a merge target does not seem to be valid, so we need to create a dummy predecessor instead.

Using this shader as an example:

6 is the back-edge and the continue block (they both mean the same thing really). The 4 to 6 branch is a typical continue statement. We’d also like to merge execution in 6 since it has two preds, but to do this we’ll need a helper pred block of 6, as otherwise spirv-val complains. I’m not 100% sure if it’s illegal to merge to a continue construct, but it is very awkward for sure. Two quotes from SPIR-V 2.11.2 jump out that might explain this:

  • a loop construct: the blocks structurally dominated by a loop header, excluding both the loop header’s continue construct and the blocks structurally dominated by the loop header’s merge block
  • if a structured control-flow construct S contains the header block for a selection, loop or switch construct different from S, then S must also contain that construct’s merge block

Since the continue block is excluded from the loop construct, the inner selection construct cannot merge to it, since all merge targets must be contained within the enclosing loop construct.

In any case, the final shader ends up looking like:

As we can see here, 6.pred takes on the role of the merge block before we continue. 3.succ takes on the role of a loop block where continue is translated to inner break.

(Circles are loop headers, triangles are selection headers, dashed lines are merge blocks and dotted lines are continue blocks.)

There are also scenarios where we don’t do this. We abandon the idea of having merged execution in the continue construct, since we really need to use continue as a way to break out of weird control flow.

bool CFGStructurizer::continue_block_can_merge(CFGNode *node) const

The heuristics here are … questionable at best. The comments say it all:

// This algorithm is very arbitrary and should be seen as a nasty heuristic which solves real shaders
// we see in the wild. It's probably safe to block continue merge in far more cases than this, but we
// want to be maximally convergent as often as we can.

Here we see a real-world shader, where 17 is a continue block we probably shouldn’t consider to be a merge block as well. 24 is deduced to be a far more reasonable candidate for merging and we’d have to emit extremely bizarre code to make the merge-to-continue style work. Note that 24 can break out of the loop (27) as well as continue to 17. If we enclose 17 in a loop block, we would not be allowed to break anymore without some horrible rewrites due to the double break, and you probably see why the heuristic is in place. Generally, it’s very hard to say if a continue block should be maximally convergent or not (it’s not in SPIR-V as-is). We’re okay with hand-wavy heuristics here I think.

Final output:

Stage 3 – De-interleave unrolled loop breaks

This is a recent addition and I’ll defer explaining it for later due to its high complexity. I don’t want to cause brain damage just yet 🙂

Stage 4 – Recover nested selection constructs

void CFGStructurizer::split_merge_scopes()

This is where things start to get kinda difficult, but I’ll stick to simple examples. The gist of this pass is to study all blocks which are merge block candidates, and introduce intermediate merge blocks for nested selection constructs. Consider a basic nested if-else pattern:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  [branch]
  if (thr & 4)
  {
    [branch]
    if (thr & 6)
      RW[0] = 1;
    else
      RW[0] = 2;
  }
  else
  {
    [branch]
    if (thr & 12)
      RW[0] = 3;
    else
      RW[0] = 4;
  }
  RW[1] = 40;
}

void CFGStructurizer::rewrite_selection_breaks(
    CFGNode *header, CFGNode *ladder_to)

We could just give up and treat .entry as a loop header, where all branches to 7 are treated as break, but that won’t do. This is where we study any potential merge block. In this case, we find block 7. We identify .entry as its immediate dominator. From here we carve out a subset of the CFG that is dominated by .entry and can reach 7.

template <typename Op>
void CFGNode::traverse_dominated_blocks(const Op &op) const

This utility is used quite a bit, and we’re interested in finding selection headers that have a post dominance frontier of .entry. There are some other checks to avoid wild edge cases, but see code for details.

if (node->succ.size() >= 2 && !branch_is_loop_or_switch)
{
   auto *outer_header =
     get_post_dominance_frontier_with_cfg_subset_that_reaches(
       node, ladder_to, nullptr);

   if (outer_header == header)
      construct.insert(node);
}

We’re only interested in the next scope level here. The candidates we find are 1 and 2. For 1, we create a ladder block, and rewrite all branches to 7 which are dominated by 1, similar for 2.

// Stop rewriting once we hit a merge block.
traverse_dominated_blocks_and_rewrite_branch(inner_block,
    ladder_to, ladder, [inner_block](CFGNode *node) -> bool {
       return inner_block->selection_merge_block != node;
    });

This looks like:

There are some edge cases here. For example, if our input CFG was this one, we would end up creating redundant ladder blocks. To avoid this, we stop the traversal early if the selections in 1 and 2 end up merging their scopes. Ahead of time, we have done some basic analysis to mark selection merge blocks candidates which is used for this purpose.

// Do we keep going?
return inner_block->selection_merge_block != node;

This analysis then recurses into the inner selection constructs to resolve multiple nested scopes.

In hindsight, this approach of recursive traversal is not efficient at all, but there are also a lot of particular edge cases to consider where we skip rewriting branches. The details are a bit too esoteric for this kind of post, so again I’ll refer to the code instead.

Another common case is if-else ladders:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
   uint val;
   [branch]
   if (thr & 4)
      val = 1;
   else if (thr & 3)
      val = 2;
   else if (thr & 8)
      val = 5;
   else if (thr & 16)
      val = 21;
   else if (thr & 32)
      val = 20;
   else
      val = 50;
   RW[0] = val;
   RW[1] = 40;
}

Both of these scenarios are perfectly recovered, which is nice. When possible, we always want to avoid adding loop blocks, for our own sanity.

Stage 5 – Remove degenerate blocks

void CFGStructurizer::eliminate_degenerate_blocks()

This stage isn’t very interesting, but when adding ladders in stage 4, it’s very easy to end up with ladders that introduce merges that we did not intend to exist. Very similar ideas to stage 1. We might also introduce blocks which don’t do anything useful, i.e. they might have one predecessor and one successor with no opcodes and no PHI nodes. For sanity’s sake, we nuke those blocks as well.

if (noop && single_pred && single_succ)
   easy_choice();
else if (noop && merge_candidate_is_on_breaking_path(node))
   danger_zone();

A random example from a real world shader is 10.16.ladder.19.ladder.21.ladder which was introduced by stage 4. This node is on a breaking path, and we really should not merge execution here. There’s also a bunch of useless single pred/succ blocks we can nuke as well.

We have to be careful though. Not all no-op blocks are redundant, and we can only remove degenerate merge blocks if they are on a breaking path. I’ll cover it in more detail in the intermission.

Stage 6 – Duplicate impossible merge constructs

void CFGStructurizer::duplicate_impossible_merge_constructs()

Breaking constructs are a blight on my soul, and we’re still not done dealing with this. We’re now entering the kind of edge case where CFG attempts to merge execution in breaks, but this block contains actual code … what to do!?

Here’s another random example I found:

228 is a loop header, and 319 attempts to break out with a double break (._.). As discussed already, merging execution on a breaking path is questionable at best, and it is impossible to express this without completely rewriting the CFG. This time, there is actual code in 319, so we have to be extremely careful.

To solve this, we bring out the big guns and just duplicate the block. Fortunately, I have not run into a scenario where this kind of block is control dependent …

If the words here did not cause concern, the code comments probably do:

// WARNING: This check is EXTREMELY sensitive and microscopic changes to the implementation
// will dramatically affect codegen.
bool breaking = merge_candidate_is_on_breaking_path(node);

if (breaking && !node->ir.operations.empty() &&
    !block_is_control_dependent(node))
   duplicate_queue.push_back(node);

Intermission – what is a breaking path anyways?

So far, we have done a ton of work to consider breaking paths, trying to classify code into a “non-breaking” path where control flow can merge with maximal convergence, and breaking paths where we can avoid merging until we hit the outer merge block. As hinted at earlier, it can get very tricky to identify what a breaking path even is. For loops, this is somewhat easy, but for selection constructs, it gets pretty complicated.

bool CFGStructurizer::control_flow_is_escaping(
    const CFGNode *node, const CFGNode *merge) const

This function is load bearing for a lot of the code in the structurizer, so I should probably try to explain the rationale behind it. The merge node in this context is a post-dominator of node and the construct we’re analyzing. Node does not dominate merge.

Loops

If we recall what an unstructured loop is, the loop body is where we are dominated by a loop header, and we can still reach the back edge. The moment a branch enters a path of no return, we can consider it a break. The rules I came up with are:

  • Node has a loop header which dominates it (duh)
  • Node cannot reach the back-edge (otherwise it would be in the loop)
  • Back-edge cannot reach node (we have not merged the loop yet)
  • All post-domination frontiers of node can reach back-edge (a breaking construct can have control flow inside of it, and we only care about the break itself)
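
A sketch of those rules in code; dominates, query_reachability and post_dominance_frontier are assumed helper names here, the real logic lives in control_flow_is_escaping:

bool is_loop_break(const CFGNode *node, const CFGNode *header, const CFGNode *back_edge)
{
    if (!header->dominates(node))
        return false;
    if (query_reachability(node, back_edge)) // Still part of the loop body.
        return false;
    if (query_reachability(back_edge, node)) // The loop has not merged yet.
        return false;
    // Only the break itself matters; inner control flow is fine as long as
    // its post-domination frontiers stay inside the loop.
    for (auto *pdf : node->post_dominance_frontier)
        if (!query_reachability(pdf, back_edge))
            return false;
    return true;
}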

Using these rules is fairly straightforward, but it gets trickier!

Selections

In a selection, there is no guiding star like the loop header and back edge to analyze. Rather, we need heuristics to deduce if a branch looks and smells like a break. This is the kind of cruft that accumulates over time as games ship more and more difficult shaders …

Load bearing blocks – are we important?

For a potential break block with two or more predecessors, we check whether CFG reachability changes in meaningful ways if we remove the block. In particular, we look at the immediate dominator and the merge block. We then perform a query: given the reachability chain A -> B -> C, does removing B change reachability for A -> C? If we’re load bearing like this, it’s not a break.

bool CFGStructurizer::exists_path_in_cfg_without_intermediate_node(
   const CFGNode *start_block,
   const CFGNode *end_block,
   const CFGNode *stop_block) const
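
Hedged usage sketch: idom and merge come from the surrounding analysis; if no path from idom to merge survives once node is taken out, the node is load bearing and is not treated as a break.

bool load_bearing =
    !exists_path_in_cfg_without_intermediate_node(idom, merge, node);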

Analyze domination frontiers

Next heuristic is to see if node can reach the merge target without encountering any unrelated domination frontiers along the way. If there are domination frontiers other than merge, it probably means we’re still on a structured path and we end the heuristic.

If we get to this point, we need further checks to be sure we make the right decision. Heuristics are so much fun!

Did we just branch to a continue block?

If we didn’t emit a special block in stage 2, branching to the continue block is definitely a breaking construct. Accept immediately.

Post-dominance shenanigans

Here we adopt the same idea as loop analysis. If a post-dominance frontier can reach a domination frontier before reaching merge block, that is a very strong indication we’re on a breaking path. For example, if we consider this CFG:

Node 6 here attempts to break out to 9.pred, but 8 does not. This is obvious to human inspection, but why is that the case here? The key lies in post-domination frontiers. For 6, we have 4 as a PDF. This node has a domination frontier 8, so 4 is in a structured flow, thus we have shown that we went from a normal path (4) to a breaking path (6). For 8, 4 is in the post-domination frontier as well, but 4 does not dominate 8, so we don’t consider it. To understand what 8 is, we need to analyze further …

Do we post-dominate any useful work?

This is an esoteric check. This is mostly to catch more scenarios in stage 5, since accidental ladder merge blocks can get really weird. Essentially, if we don’t post-dominate anything useful, it’s probably a break construct.

Tie-break patterns

If we still don’t know what is going on (node 8), we attempt a final check. We back-trace from 9.pred and try to see if we can back-trace to a node that is not our candidate but can end up reaching 8. In this case, we will back-trace through 6, and find 4. 4 can reach 8. In some sense, we’re just trying to query if our neighbor predecessors of 9.pred are break blocks instead.

Stage 3 – De-interleave unrolled loop breaks

Time for more gnarly code that will cause mild brain damage (sorry!). Dealing with breaks isn’t bad enough, what if I told you it was possible to have 2 or more different break merge scopes? This is a situation where nothing will help except a complete rewrite of the CFG. Here’s an example:

RWStructuredBuffer<uint> RW : register(u0);

[numthreads(1, 1, 1)]
void main(uint id : SV_DispatchThreadID)
{
   uint v;
   uint w = 1;
   uint dummy = 0;

   [unroll]
   for (int i = 0; i < 4; i++)
   {
      InterlockedAdd(RW[0], w, v); w = v;

      [branch]
      if (w & 13)
      {
         InterlockedAdd(RW[0], w, v); w = v;
         dummy = 1;
         break;
      }

      [branch]
      if (w & 1)
      {
         [branch]
         if (w & 4)
            InterlockedOr(RW[0], w, v); w = v;
         dummy = 2;
         break;
      }

      [branch]
      if (w & 2)
      {
         InterlockedOr(RW[0], w, v); w = v;
         dummy = 3;
         break;
      }
   }

   InterlockedAdd(RW[0], w, v); w = v;
   InterlockedAdd(RW[0], dummy, v);
}

Spaghetti, yum! Fun fact: if I unroll this to 32 or more loop iterations, graphviz takes minutes to complete, and without a proper fix dxil-spirv explodes, creating endless ladder blocks. It is truly the Stairway to Heaven of CFGs! I ran into this with a game that used 128 iterations instead, and the shader never completed. Of course, this was an SSR shader, and I cannot fathom why anyone would want to unroll a massive loop with several breaks 128 times; it’s pure and utter chaos, but here we are …

5 is the merge block of our “loop”. 2, 4 and 7 are all break blocks, and the predecessors interleave for every iteration. This is simply impossible to figure out with structured control flow as-is, so it’s time to take out the hammer, we simply have to rewrite this CFG into something extremely ugly instead.

bool CFGStructurizer::serialize_interleaved_merge_scopes()

We start similarly as stage 4 where we find selection merge candidates, identify the immediate dominator, and analyze the inner scopes. What we are looking for is two or more merge candidates that cannot reach each other with a post-domination frontier that partially overlaps. For example, 6 and 11 are PDFs of 2 and 10 is a PDF of 7. This is a bizarre case where we’re breaking in different directions all the time. In this case, we simply have to transpose the entire shader into this.

5.pred is a dispatch switch block. It creates a PHI variable where the predecessors signal which break block to dispatch to.

Technically, we could consider doing something similar in stage 6, where any merging break constructs are transposed outside the loop, but this transposition obliterates any hope of emitting readable code, so we reserve this only as a last resort.

Stage 7 – Rewrite transposed loops

More brain damage incoming, but just like stage 3, I’ll defer this until the bonus section since it’s ridiculous nonsense 🙂

Structurization passes

void CFGStructurizer::structurize(unsigned pass)

In all the passes above, we mostly focused on rewriting the CFG, tweaking it, etc, but now we’re mostly focused on emitting structure information. This is done twice. In pass 0, we focus on analysis, rewrite the CFG to resolve any issues, and in pass 1, we only mark structurization information. If we are forced to modify the CFG at that point, we have a bug in the implementation.

The natural merge and actual merge

Before digging into conditional branches, switches and loops, we have to elaborate on the system used in dxil-spirv to deal with multi-break.

Switches and loops are easy to identify, but in terms of structurization there is a difficult requirement that any merge block we pick must be dominated by the header. For loops, the merge block must also post-dominate the header, discounting any blocks that terminate the CFG like returns and DXR intersection termination. Switch blocks are a little more relaxed since it’s legal in SPIR-V to break from inside a switch block to the outer loop merge block.

For a more concrete example of what I’m blabbing on about:

3 is our loop header. 4 is clearly a breaking construct that breaks to the exit block, 5 is the back-edge and 6 is the “natural” loop merge if all breaks were removed. As expected, we cannot just say that 6 is the merge block of 3 and call it a day, since the branch from 4 to 1 is simply illegal in structured control flow.

For every loop, we essentially start considering two kinds of merges.

  • The natural merge (CFGNode::loop_ladder_block), where the loop should merge.
  • The actual merge (CFGNode::loop_merge_block), equal to the common post dominator of all loop exits.

If these two don’t match, we’ve got trouble. The idea is that in pass 0, we find these issues, fix them up, and if we do this analysis again in pass 1, these nodes must match.

Switch blocks have similar considerations where we find the merge point through common post domination analysis, and a natural merge, where it makes sense to merge.

auto *merge = find_common_post_dominator(node->succ);
auto *natural_merge = find_natural_switch_merge_block(node, merge);

For example, here is a monster switch where the natural merge seems to be 86, but 24 is the post-dominator. This is a funny case with multi-way case fall-through, which is not legal in structured control flow.

I’m sorry about these examples being a bit random and unfocused, but I find these by adding breakpoints where needed and run the entire game repro suite over it until some shader triggers it. The lucky winner is publicly shamed here 🙂 It’s very hard to come up with HLSL that will trigger these exact scenarios. AAA shaders are unhinged enough to trip all these cases.

Marking header and merge block relationships

The actual merge location needs to be recorded so that we can fix things up if there is a problem. The basic idea is that the header records which kind of merge it is (loop or selection), and points to the block in question. In the first pass, the header does not necessarily need to dominate the merge block.

The merge block has a list of headers associated with it. If that number is larger than one, we have a problem, which needs to be corrected later. With any kind of breaking construct, it is inevitable that we will see a merge block with two or more headers associated with it.

In this example, 1 will be associated with both 3 and .entry as relevant headers.

The passes

void CFGStructurizer::structurize(unsigned pass)
{
   if (find_switch_blocks(pass))
   {
      // Can rewrite the CFG in slightly weird ways.
      recompute_cfg();
      if (find_switch_blocks(pass))
      {
         LOGE("Fatal, detected infinite loop.\n");
         abort();
      }
   }

   find_loops();
   find_selection_merges(pass);
   fixup_broken_selection_merges(pass);
   if (pass == 0)
      split_merge_blocks();
}

Switch blocks

Switch blocks can get extremely tricky since they support fall-through, and fall-throughs look deceptively like a merge. The most interesting bit is analyzing case fall-through.

CFGNode *CFGStructurizer::find_natural_switch_merge_block(
    CFGNode *node, CFGNode *post_dominator) const

  • Sort all case labels (and default) by post visit index order.
  • If a case label can reach another case label, it must be sorted right before. (SPIR-V requirement)
  • If a conflict arises where two case labels can reach another case label, consider that the natural merge, since it’s the only thing that makes sense.
  • If a dominance frontier of a case label can be reached by another case label, and that dominance frontier is not another case label, we need to consider it a natural merge. In the massive example above, 86 is such a dominance frontier. 86 is the dominance frontier of 27, and 86 is also reachable by 26, uh oh!

Certainly, this cannot be robust against all possible insane code you can emit with unstructured switch statements if there are multiple such “natural merge” candidates, but I’ll have to tackle that when I see it. I’d go bald if I had to anticipate every possible scenario like that.

In this particular case, we can take advantage of the fact that it is valid to break out of a loop from inside a switch construct in SPIR-V. This is not allowed in high level shading languages, but SPIRV-Cross works around it.

Selections
void CFGStructurizer::find_selection_merges(unsigned pass)

This is pretty simple. Start at any block with more than one predecessor, find the idom and mark the selection. There can be conflicts, however: if the idom is already marked as a loop header or another selection header, we need to do something.

In this shader, 9.pred and 8 are both merge candidates where 3 is the idom. Since 3 is already a loop header, we must create an inner block, 3.succ, which can serve that purpose. 8 is then analyzed, which finds 3.succ to be its idom. Since it already has a selection merge associated with it, we create yet another succ helper, 3.succ.succ. 3.succ is promoted to a loop block header. To remember that 3.succ is indeed a loop header despite having no back-edge, we use the freeze_structured_analysis flag.

In hindsight, it might have been possible to use a single block switch block for 3.succ here instead.

Fixing up broken selection merges
void CFGStructurizer::fixup_broken_selection_merges(unsigned pass)

While we have accounted for all merge blocks, not all selection headers have been marked in any way. This typically happens for break constructs, since these blocks don’t dominate much code at all. In the example above, 4 is a great example. We’re mostly interested in two things here:

  • Figure out if one branch is a “break” and the other path is not. We are forced to merge somewhere, and we must merge in the direction away from the break.
  • Find common post-dominator of all successors so that we can resolve weird breaks properly later. In this case, 4 would be marked as a header for 9.pred.

Unfortunately, there are half a million edge cases here as well, but I’ll omit them here since it’s just endless tie-break rules to figure out which branch (or neither) gets to be the lucky merge.

Loops

Loops are quite horrible since code quite often breaks out of multiple loops at once. A case study that covers a fair amount of interesting scenarios would be:

uint func(uint3 dispatch)
{
   [loop]
   for (int i = 0; i < 10; i++)
   {
      [loop]
      for (int j = 0; j < 20; j++)
      {
         [branch]
         if (dispatch.y < 10)
            return 50;

         [branch]
         if (dispatch.z < 10)
            return 70;
         dispatch.y++;
      }

      dispatch.x++;
   }

   return 80;
}

RWStructuredBuffer<uint> buf : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 dispatch : SV_DispatchThreadID)
{
   buf[dispatch.x] = func(dispatch);
}

To begin, we find all blocks with a back edge. These must be loop headers. The meat of the analysis happens in

auto result = analyze_loop(node);

where we study all exits from the loop body, and deduce what kind of exit it is. As discussed earlier, a loop exit happens when we can no longer reach the back-edge.

We’re mostly interested in whether an exit is a break or not. We aim to find one block that we consider the natural merge (CFGNode::loop_ladder_block), and the post-dominator of all exits (CFGNode::loop_merge_block). If the back-edge can exit the loop, that exit must be our natural merge; otherwise, we should choose a block that the loop header dominates. This can happen if the loop only exits from the header, or if there is an infinite loop that only exits by breaking, for example.

In this particular example:

  • Loop header 2, loop_ladder_block = 7, loop_merge_block = 5
  • Loop header 1, loop_ladder_block = 8, loop_merge_block = 5
Resolving merge conflicts
void CFGStructurizer::split_merge_blocks()

In the case where more than one header is associated with a merge block, we have to do some horrible things. 5 is the lucky winner this time and we get to show off triple breaks. The headers are {1, 2, 4}. The basic idea is to work inside out so we start at block 4. Looking at our loop scopes, we know that we can only break so far. Scanning through the header list from right to left, we find 2.

// Start from innermost scope,
// and rewrite all escape branches to a merge block
// which is dominated by the loop header in question.
// The merge block for the loop must have
// a ladder block before the old merge block.
// This ladder block will break to outer scope,
// or keep executing the old merge block.
for (size_t i = node->headers.size() - 1; i; i--)
{
   auto *current_node = node->headers[i];

   // Find innermost loop header scope we can
   // break to when resolving ladders.
   CFGNode *target_header =
     get_target_break_block_for_inner_header(node, i);
   ...
}

The end result ends up looking quite weird as seen below. I should mention that spirv-opt needs to do similar things when inlining functions, but of course, inlining already structured code is quite a bit simpler.

This is where we make use of the distinction of loop_ladder_block versus loop_merge_block. The basic idea is that we can create a pred helper of loop_ladder_block which then dispatches either another break or continues the “normal” path of branching to loop_ladder_block. If the outermost header is not a loop itself, we have to force one. 1.pred here is a helper block that allows the final break. The branch from 4 to 5 (via 3) therefore punches through the intermediate blocks. 7.pred.pred and 8.pred create PHI variables which signal if the break is a multi-break or not.
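To make the ladder idea a bit more concrete, here is a hand-written illustration of roughly what the rewrite amounts to at a high level. This is not actual dxil-spirv output, and the predicates are made up:

void multi_break_sketch(bool wants_double_break, bool inner_done, bool outer_done)
{
  bool outer_break = false;   // conceptually, the boolean PHI in the outer ladder
  for (;;)                    // outer loop (forced into existence if needed)
  {
    bool inner_break = false; // conceptually, the boolean PHI in the inner ladder
    for (;;)                  // inner loop
    {
      if (wants_double_break)
      {
        inner_break = true;   // breaking predecessor feeds 'true' into the PHI
        break;                // we can only break one construct at a time
      }
      if (inner_done)
        break;                // normal predecessor feeds 'false'
    }
    if (inner_break)          // the ladder either forwards the break ...
    {
      outer_break = true;
      break;
    }
    if (outer_done)           // ... or carries on with the normal loop exit
      break;
  }
  (void)outer_break;          // a triple break just adds one more ladder level
}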

Mark branches that are expected

// Before we start splitting and rewriting branches,
// we need to know which preds are considered "normal",
// and which branches are considered ladder breaking branches
// (rewritten branches).
// This will influence if a pred block gets false or true
// when emitting ladder breaking blocks later.
Vector<UnorderedSet<const CFGNode *>> normal_preds(node->headers.size());
for (size_t i = 0; i < node->headers.size(); i++)
   if (node->headers[i]->loop_ladder_block)
      for (auto *pred : node->headers[i]->loop_ladder_block->pred)
         normal_preds[i].insert(pred);

Conditionally break

auto *ladder = create_helper_pred_block(loop_ladder);
// Merge to ladder instead.
traverse_dominated_blocks_and_rewrite_branch(header, node, ladder);
ladder->ir.terminator.type = Terminator::Type::Condition;
ladder->ir.terminator.conditional_id = module.allocate_id();
ladder->ir.terminator.false_block = loop_ladder;
// True target is the innermost loop construct.

PHI magic for predecessors after rewriting branches

for (auto *pred : ladder->pred)
{
   IncomingValue incoming = {};
   incoming.block = pred;
   bool is_breaking_pred = normal_preds.count(pred) == 0;
   incoming.id = module.get_builder().makeBoolConstant(is_breaking_pred);
   phi.incoming.push_back(incoming);
}
ladder->ir.phi.push_back(std::move(phi));

This is trivial for compilers to optimize away if they don’t need this pedantic level of structured control flow. The conditional branch in these blocks is a boolean PHI which takes constant inputs based on the predecessors. Thus, we can statically deduce that 4 -> 5 is a direct branch.

Of course, this is a well-behaved example without any serious edge cases, but the basic idea remains the same.

The idea of a ladder block like 7.pred.pred and 8.pred that dispatches shows up in several places.

Structurization summary

This mostly concludes the structurization passes. It’s not exhaustive by any means, but the broad strokes should be covered, and I think I somehow managed to explain the overall architecture.

I don’t expect anyone to grasp all of this by just reading this. I can barely understand what is going on myself, and it’s not like any of this is backed by best practices. This approach just happened to work and has been refined one horrible AAA shader at a time.

Rewriting PHI nodes

After rewriting the CFG, PHI nodes will break since the original predecessors no longer match reality. Fortunately, this part isn’t as esoteric, and well-known algorithms exist for it.

void CFGStructurizer::insert_phi()

Fixup broken value dominance

void CFGStructurizer::fixup_broken_value_dominance()

This isn’t related to PHI, but it is related to SSA. SSA values created in one block can be consumed in another as long as the creating block dominates the consumer. If that’s no longer the case, we have to do something. In my case, I just lowered it directly to an OpVariable with explicit OpStore / OpLoad instead. It’s quite rare for this to happen in practice.

This might be a hint that keeping SSA throughout was a mistake. (._.)

Reinserting PHI values

void CFGStructurizer::insert_phi(PHINode &node)

Given that we have N SSA values created in N blocks, we’ll have to insert new PHI nodes such that every predecessor of the original PHI node gets a proper value. Any mem2reg algorithm does this, but I’ll try to summarize the basic ideas. There’s also one special edge case that mem2reg would never consider.

End state

The algorithm is iterative, and we’re done when every predecessor of the target node can find an input value created by a node that dominates the predecessor, at which point we can emit the OpPhi. The algorithm will add more input values until we satisfy this condition.
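A rough sketch of that termination check; type and member names are taken loosely from the snippets in this post, block_dominates() is a placeholder, and the real code also has to treat the back-edge pred of a loop header as an input:

bool phi_inputs_are_complete_sketch(const PHI &phi, const CFGNode *target)
{
  for (auto *pred : target->pred)
  {
    bool pred_has_input = false;
    for (auto &input : phi.incoming)
    {
      if (block_dominates(input.block, pred))
      {
        pred_has_input = true;
        break;
      }
    }
    if (!pred_has_input)
      return false; // Keep placing inputs at dominance frontiers.
  }
  return true; // Every pred sees a dominating input, emit the OpPhi.
}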

Finding the next frontier

For all our current input nodes, look at their dominance frontiers. We pick the dominance frontier that’s earliest in the CFG (highest post visit index) and can reach the final node. Reachability is tricky when the PHI node is in a loop header. We have to consider the back-edge as well since it’s also an input.

A frontier is only considered if it makes forward progress, i.e. we ask the question if having a frontier here is useful at all.

bool CFGStructurizer::phi_frontier_makes_forward_progress(
   const PHI &phi, const CFGNode *frontier,
   const CFGNode *end_node) const

For example, if the frontier cannot reach any existing input, it makes forward progress. Also, if a frontier is such that any path from input to final node must go through the frontier, it also makes forward progress, because we can remove that input from consideration in the next iteration.

There’s probably a more elegant solution, but I couldn’t be arsed to study the algorithm in detail.

Placing a new input value at frontier

Look at all preds of the frontier and find a candidate input value. If there is no candidate (quite possible and expected in some cases), use OpUndef.

OpSelect workaround

Due to how we rewrite branches through ladders, it’s possible that code which did not reach each other before now does. Thus we can have a situation where a frontier dominates an existing input, and the only way to resolve this is through OpSelect. Either we keep the existing input or use the PHI’d value from predecessors, depending on the intent of the predecessor.

When testing this, I realized that this might be a bogus case that actually never happens, because I was not able to trigger it in any shader in the repro suite … Might as well just remove the dead code and see what happens 😀

Bonus round – brain damaging edge cases

I teased loop transposition as a thing, so I’ll try to explain the problem. I don’t think I can properly explain the solution though.

All of these cases are exceptionally rare. I’ve only seen a single example of each case across the entire repro shader corpus.

Outer loop transposition

void CFGStructurizer::rewrite_transposed_loop_outer(
   CFGNode *node, CFGNode *impossible_merge_target,                                                 
   const LoopMergeAnalysis &analysis)

This code is truly cursed, because it pokes a hole in the model of unstructured loops and structured loops. The loop in question is 6. The loop ladder block is 16 since the back-edge (14) branches there. However, we observe a bizarre merge to 12. Normally, we would be able to consider 12 to be a break target, but 12 does not post-dominate anything, so we’ll never deal with this in split_merge_blocks(). 5 is the post-dominator of 6‘s loop exits. We also cannot just split up 12 and treat it as a break block, because:

  • 16 can reach 12, clearly it’s outside the loop
  • It has a conditional branch on its own, duplicating the block brings its own set of problems

We’re in a situation where 12 is technically inside the loop in structured control flow (we never merged), but outside the loop in unstructured control flow. On top of this, 16 can reach 12, so in a structured sense, we’re branching back into our own loop … So much fun!

The approach is similar to split_merge_blocks(): create a ladder block, rewrite all branches there, and dispatch based on the predecessor. It leads to hilariously unreadable code, but it’s our only option here.

The final output also needs to deal with multi-break, and the resulting CFG collapses into white noise.

Inner loop transposition

void CFGStructurizer::rewrite_transposed_loop_inner(
   CFGNode *node, CFGNode *impossible_merge_target,
   const LoopMergeAnalysis &analysis)

This is very similar to outer loop transposition, but it is somehow more cursed.

This CFG is incomprehensible, and I honestly have no idea how this can possibly be generated from HLSL, but here we are. 113 and 121 are inner nested loops. The odd scenario here is that the normal loop exit of 120 actually breaks further out than what we would expect. 129 is a break block that breaks out of the two loops, but it doesn’t break as far as 120 does, since it ends up in 89. This is extremely weird. I solved this similarly to the outer loop scenario. Collect all branches in a ladder and dispatch …

Ladders, ladders everywhere! This CFG is pure nonsense; I’m just including it as an example of how cursed some shaders turn out to be.

Merge to outer loop continue

This is similar to inner loop transposition, where we have to be very careful. The loop in 16 can exit in 22, which confuses the algorithm, since we have to find the common post-dominator of all loop exits. The back-edge 25 does not have any forward successors, which also trips up the analysis, but we specifically detect 22 and transpose the loop so that we can merge execution before breaking out of the outer loop.

I omitted the final CFG here since it looks like a complete mess. A spider web of code if you will (ahem) :V

Conclusion

Joke’s on me, because with my luck, a new game probably ships that breaks everything, but for now, this ends my brain dump of the truly cursed corner of vkd3d-proton which is the DXIL -> SPIR-V conversion.

At some point, all of this probably has to be rewritten and I’ll have to abandon any hope of having somewhat debuggable codegen, but for now, this is my questionable take on structurization.

My personal hell of translating DXIL to SPIR-V – part 5

Time for despair

It has taken way too long to get to this point, but we must address the final challenge of translating DXIL, a claustrophobic mess that haunts me to this day, the most nightmarish set of poorly understood heuristics which come together to put pretty pixels on screen. It is of course, the act of untangling spaghetti and turning unstructured control flow into structured control flow.

True impostor syndrome kicks in

Doing compiler things without knowing what you’re doing is agonizing. You can never trust yourself, and whatever I did for DXIL structurization has turned out to fail 30+ times; each time I fix a critical bug and move on. The only thing I can say for sure is that it’s becoming asymptotically correct, and I have been able to make 100k+ AAA shaders compile and validate correctly, but every new insane scenario DXC (LLVM) throws at me could be the one to break everything. If you cannot mathematically prove a compiler algorithm, you might be extremely screwed, because the edge cases are truly insane. The true compiler nerds are probably giggling in the corner 🙂

Goal for this post

The goal for this post is to go through the basic theory needed to understand the structurizer. The topic itself will take many posts to fully cover, so this is a good start.

GraphViz – visual debugging

GraphViz is a cornerstone of debugging control flow, so the very first thing we need is the ability to dump a CFG to a PDF. Fortunately, there is just the tool for the job.

For example we can start with an extremely basic shader from the test suite:

RWByteAddressBuffer Buf : register(u0);

[numthreads(1, 1, 1)]
void main(uint3 index : SV_DispatchThreadID)
{
  uint result = 0;

  // Single branch
  [branch]
  if (index.x == 10)
    result = 40;
  else
    result = 50;

  // 2-level branch
  [branch]
  if (index.x == 20)
  {
    [branch]
    if (index.y == 30)
      result += 20;
    else
      result *= result;
  }
  else
  {
    [branch]
    if (index.y == 40)
      result += 70;
    else
      result *= 10;
  }

  Buf.Store(0, result);
}

If we dump this input to GraphViz we get:

digraph {
1 [label="1", shape="box"];
2 [label=".entry", shape="box"];
2 -> 1;
3 [label="2", shape="box"];
2 -> 3;
3 -> 1;
4 [label="3", shape="box"];
1 -> 4;
5 [label="4", shape="box"];
1 -> 5;
6 [label="7", shape="box"];
5 -> 6;
7 [label="8", shape="box"];
5 -> 7;
8 [label="9", shape="box"];
7 -> 8;
6 -> 8;
9 [label="5", shape="box"];
4 -> 9;
10 [label="6", shape="box"];
4 -> 10;
10 -> 8;
9 -> 8;
}

Then, we have this trivial Python script to dump to PDF:

#!/usr/bin/env python3

from graphviz import Source
import sys

def main():
  s = Source.from_file(sys.argv[1])
  s.view()

if __name__ == '__main__':
  main()

For debugging purposes, we can dump out a GraphViz after every major transformation step, of which there are many. For reference, here’s the final structured output, which we will study in greater detail. Of course, this is baby mode compared to what we have to deal with later, but gotta start somewhere.

SPIRV-Tools also has a tool to extract a GraphViz from a SPIR-V module.

spirv-cfg -o test.graph test.spv

This is super handy as well! The graph can be visualized with the same Python script.

SPIRV-Tools is also critical since we lean heavily on the SPIR-V validator to verify that our SPIR-V output conforms to structured rules.

Basic theory for the non-compiler nerds

When trying to learn this stuff, you will encounter terminology from graph theory and any learning material surrounding this topic will usually come from CS textbooks or lectures which assume you have all that background knowledge (I have an EE background, not CS or fancy math). These concepts become the basic primitives of any algorithms we have to perform on the CFG. None of this is particularly difficult, sure, but needs to be understood at a deep level.

Reachability

Asks the question of “can block A reach block B”? In the example above, 4 can reach 9 by branching via either 7 or 8, but 4 cannot reach 6. This is quite intuitive.

Back-edge

In loops, a back-edge branches back to the beginning of the loop. This is detected when A can reach B and B branches to A.

Here is an example where 3 has a back-edge to itself.

Note: I tend to mix up back-edge as a block (continue statement) and the branch itself.

This makes the CFG cyclic. We can eliminate back-edges from a DXIL CFG and achieve an acyclic CFG.

I believe this property is what makes the CFG reducible. Irreducible CFGs can have completely random gotos that make it impossible to ensure an acyclic graph without some extra preprocessing. DXIL fortunately does not support irreducible CFGs. It is also its only saving grace …

In our implementation, back-edges are treated as special branches that happen in loops, and for the sake of reachability analysis we only consider the acyclic graph.

There can only be one back-edge branch to a particular block. I think this is what LLVM defines. I have never seen a counter example, but if it happens, we can trivially rewrite the graph to only have one back-edge.

Predecessors

Usually abbreviated “preds”. If A branches directly to B, A is a predecessor of B. Back-edges are not counted in our implementation, they are recorded separately as a pred_back_edge.

Successors

Usually abbreviated “succs”. If A branches directly to B, B is a successor of A. Back-edges are not counted in our implementation, they are recorded separately as a succ_back_edge.

Domination

Domination happens when we have two blocks A and B and in order to reach B, control flow must pass through A as well. A block dominates itself. Strict domination is when A != B.

Post-domination

Post-domination happens when we have two blocks A and B, and if control flow reaches block A, control flow must then flow into B.

Immediate dominator

Among all blocks which dominate a block A, the immediate dominator is the closest one. Often called idom.

Immediate post-dominator

Same as idom, but post-dominance rather than dominance.
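With the idom links in place, a domination query reduces to a simple chain walk. A minimal sketch, assuming the immediate_dominator member used in the code later in this post (the entry block is its own idom):

bool dominates_sketch(const CFGNode *a, const CFGNode *b)
{
  // Walk up b's immediate dominator chain until we hit a or the entry block.
  while (b != a && b->immediate_dominator != b)
    b = b->immediate_dominator;
  return b == a;
}

The same walk over immediate post-dominators answers post-domination queries.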

Domination frontier

Assume we have a block A which dominates a set of blocks B. If a block in B branches to a block C, and A does not dominate C, C is in the domination frontier set of A. This is somewhat abstract, but it’s an extremely useful analysis tool for later. E.g.:

For block 4, 9 is in the domination frontier. For block 7, 9.4.ladder is in the domination frontier.

Post-domination frontier

Same idea, but for post-dominance. E.g. for block 7, 4 would be in the post-domination frontier. For 9.4.ladder, 1 would be in the frontier, etc. The post-domination frontier is particularly useful as an analysis tool to figure out which control flow decisions contribute to a block being executed.

Common (post-)dominator

Given candidates B and C, finds the closest block A which (post-)dominates both B and C.

Loop analysis

Loops in unstructured control flow can be very awkward to reason about, but we have a way to specify loops precisely.

The loop header

A loop header is any block which has a back-edge branching to it. In reducible CFGs which DXIL requires, the loop header must dominate the back-edge.

The continue block

The back-edge block is the continue block. SPIR-V allows a “continuing construct” which means it can span multiple blocks, but we don’t use that.

Defining the loop body

When we have a loop header and a continue block, we can deduce the loop body as any block which is dominated by header and can reach the continue block. Once we can no longer reach the continue block, we are breaking out of the loop. We call these blocks “loop exits”.

For example, 3 is a loop header, 6 is a continue block. The loop body is {3, 4, 5, 7, 8, 6}. 9 is the only loop exit.
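A minimal sketch of that definition; dominates_sketch() is the chain walk from above, and reaches() stands in for the reachability query we build later in this post:

bool is_in_loop_body_sketch(const CFGNode *header, const CFGNode *continue_block,
                            const CFGNode *candidate)
{
  // In the loop: dominated by the header, and the continue block (the
  // back-edge block) is still reachable. Once it no longer is, we have
  // found a loop exit.
  return dominates_sketch(header, candidate) &&
         reaches(candidate, continue_block);
}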

Issues for later …

As we will see later, this definition of a loop is fundamentally flawed when we consider subgroup operations in non-uniform control flow. A single-threaded CPU execution model that LLVM caters to does not care about any of this, but GPU execution certainly does since semantics change depending on how we define convergence.

Computing the CFG

In the implementation, we will recompute the CFG at many points, usually after performing transformations on the CFG. Here are some of the things we need to do:

Compute post-visit order

Essentially, a depth-first traversal, where each node is assigned a visitation index. The rule is if A can reach B, post_visit_index(A) > post_visit_index(B). We also detect back edges this way. A node is only visited once.
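A minimal sketch of that traversal; member and helper names are made up for illustration, and back-edge handling is simplified:

void compute_post_visit_order_sketch(CFGNode &node, uint32_t &counter)
{
  node.visited = true;
  for (auto *succ : node.succ)
  {
    // Already-visited successors are either numbered already, or they sit on
    // the current DFS path, i.e. back-edges, which are recorded separately.
    if (!succ->visited)
      compute_post_visit_order_sketch(*succ, counter);
  }
  // Number the node only after everything it can reach has been numbered,
  // which gives post_visit_index(A) > post_visit_index(B) when A reaches B.
  node.forward_post_visit_order = counter++;
}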

If we don’t visit a node, it means it’s not reachable anymore (can happen after rewriting the CFG), and we remove the node, which means purging dead predecessors.

We can also build an array of nodes which we can traverse through and know that we will walk the tree in an appropriate order. Iterating over the post-visit order backwards will ensure that dominators are visited before our dominated blocks for example. This is used when we …

Compute immediate dominators

The algorithm for this is pretty easy once we have post-visit orders in place.

void CFGNode::recompute_immediate_dominator()
{
  if (pred.empty())
  {
    // For entry block only.
    immediate_dominator = this;
  }
  else
  {
    immediate_dominator = nullptr;

    for (auto *edge : pred)
    {
      if (immediate_dominator)
        immediate_dominator =
          CFGNode::find_common_dominator(immediate_dominator, edge);
      else
        immediate_dominator = edge;
    }
  }
}

Finding a common dominator (simplified) is done by:

while (a != b)
{
  if (a->post_visit_order < b->post_visit_order)
    a = a->immediate_dominator;
  else
    b = b->immediate_dominator;
}
Compute reachability

Here we build a simple LUT where we can test if post_visit_order A can reach post_visit_order B.

void CFGStructurizer::visit_reachability(const CFGNode &node)
{
  uint32_t *dst_reachability =
    &reachability_bitset[node.forward_post_visit_order *
                         reachability_stride];

  for (auto *succ : node.succ)
  {
    // Inherit reachability from all successors.
    const uint32_t *src_reachability =
      &reachability_bitset[succ->forward_post_visit_order *
                           reachability_stride];
    for (unsigned i = 0; i < reachability_stride; i++)
      dst_reachability[i] |= src_reachability[i];
  }

  // We can reach ourselves.
  dst_reachability[node.forward_post_visit_order / 32] |=
    1u << (node.forward_post_visit_order & 31u);
}

void CFGStructurizer::build_reachability()
{
  reachability_stride = (forward_post_visit_order.size() + 31) / 32;
  reachability_bitset.clear();
  reachability_bitset.resize(reachability_stride *
                             forward_post_visit_order.size());
  for (auto *node : forward_post_visit_order)
    visit_reachability(*node);
}

It’s nice how the structures we build feed into each other. Reachability composes: if B can reach C, and A can reach B, then A can reach C.
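For completeness, a sketch of what a query against this bitset looks like; the actual helper in dxil-spirv may be named and structured differently:

bool CFGStructurizer::query_reachability_sketch(const CFGNode &from,
                                                const CFGNode &to) const
{
  // Each node owns reachability_stride words. Bit N set in from's row means
  // the node with forward_post_visit_order == N is reachable (self included,
  // back-edges excluded).
  const uint32_t *row =
    &reachability_bitset[from.forward_post_visit_order * reachability_stride];
  uint32_t bit = to.forward_post_visit_order;
  return (row[bit / 32] & (1u << (bit & 31u))) != 0;
}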

Compute dominance frontiers

Similarly, the dominance frontier is built bottom-up.

void CFGStructurizer::recompute_dominance_frontier(CFGNode *node)
{
  // If we don't dominate our successor, it's a frontier.
  for (auto *succ : node->succ)
    if (succ->immediate_dominator != node)
      node->mark_dominance_frontier(succ);

  // Inherit frontiers.
  if (auto *idom = node->immediate_dominator)
    for (auto *frontier_node : node->dominance_frontier)
      if (!idom->dominates(frontier_node))
        idom->mark_dominance_frontier(frontier_node);
}

The recursive definition here is that if A dominates B, then any dominance frontier of B is also in the dominance frontier of A.

Congratulations! You can now say dominance frontier casually with a straight face :V

The backwards traversal

To compute post-dominance and post-dominance frontiers, we have to do the same thing, just backwards. Flip successors into predecessors and vice versa, but we have some problems. The normal traversal has one unique entry node, but there can be multiple leaf nodes (think returns from a branch)! This means there are many entry blocks when we traverse backwards, and this breaks the algorithm.

Another problem is loops. Consider this example, which demonstrates both cases:

Here, 6 is not reachable through backwards traversal from 1. 7 is a return block. What we need to solve the backwards traversal is to ensure we have a unique start node. For this purpose, we will invent an EXIT node. Any return will be considered to branch to that EXIT node. Back-edges are special. For backwards traversal, we will instead invent a “fake” branch from the back-edge to the common post-dominator of all loop exits. (That is a mouthful.)

This is the first of many silly ideas I came up with … After this point, assume I have no idea what I’m talking about. 😐

Dotted lines are fake branches we can use for backwards traversal.

What does it mean to be structured?

A structured CFG basically means the control flow closely maps to higher level languages without gotos. Rather than spaghetti, the control flow is like an onion. It’s somewhat difficult to precisely define it (study the SPIR-V specification), but the basic gist is something like:

Every block exists in a scope

A block can be enclosed by many scopes. A scope could be a selection construct or a loop construct.

Selection construct

When a node has more than one successor, we begin a selection construct, where that node is a header. A selection construct has a merge node where execution reconvenes. A header must dominate its merge node. The selection construct is whatever the header dominates, minus whatever the merge node dominates. Intuitively:

// HEADER
if (blah)
{
  // IN CONSTRUCT
  if_true();
}
else
{
  // IN CONSTRUCT
  if_false();
}
// MERGE, not part of construct

No crossing streams

A block from outside a construct must not branch into the construct, except through the header block.

No shared merge block

A block cannot be the merge block for multiple headers.

Loop construct

A loop header begins the construct. The loop construct contains all blocks which are dominated by the loop header, minus whatever is dominated by the merge block. See example earlier.

Notice how this is (fatally) different from unstructured loops. Unstructured loops leave the construct when they can no longer reach the continue block. Structured loops don’t exit until they reach the merge block.

Leaving a construct

A loop defines a “break” target as well as a “continue” target. The only valid ways to leave a selection construct are through its own merge block, the innermost break target, or the innermost continue target.

Unlike some languages, SPIR-V does not allow “multi-break”. A loop construct functions as a barrier to how far we can break. We can only break out of one loop construct at a time (except with return, but DXIL code is all inlined). Solving this is … an ordeal 🙂

Switch constructs

Switch blocks are selection constructs, but they have a special property where the merge target of switch also serves as a break block. Oddly enough, you can break directly out of a switch and a loop at the same time, which is not allowed in normal high level languages, but oh well.

Edge cases for selection constructs

It is possible that a selection construct has one branch being a “break” or “continue”, and the other branch is as normal. This special case does not require a selection construct.

Another situation is when both paths of a selection construct end up breaking. Execution will never merge in this case, so it is possible to set up the merge block as an unreachable block with OpUnreachable instead.

Why do merge blocks matter?

In a world with SIMT-style execution of GPUs, two sets of invocations may branch off, but at some point they might want to join back again. Defining this can be done naturally through merge blocks. Without a notion of merge blocks, there is no way to meaningfully distinguish

from

In a scalar world such as CPUs this has the same behavior, but certainly not in a SIMT world. If 4 does anything using subgroups or just normal texturing with implicit LOD, well … everything breaks.

From my understanding, OpenCL has a notion of divergence and convergence, but any well defined convergence must come through a merge block. In completely unstructured control flow, convergence is not well defined. From SPIR-V:

Uniform Control Flow: Uniform control flow (or converged control flow) occurs if all invocations (in the invocation group, unless otherwise stated) execute the same dynamic instance of an instruction. Uniform control flow is the initial state at the entry point, and lasts until a conditional branch takes different control paths for different invocations (non-uniform or divergent control flow). Such divergence can reconverge, with all the invocations once again executing the same control-flow path, and this re-establishes the existence of uniform control flow. If control flow is uniform upon entry into a structured loop or selection, and all invocations leave that loop or selection via the header block’s declared merge block, then control flow reconverges to be uniform at that merge block.

So here, we have explicit wording that we must use merge blocks to get guaranteed re-convergence. I couldn’t find any strong wording about this in CUDA, but the closest from skimming the docs was this:

For divergent control flow, the optimizing code generator automatically determines points of re-convergence

This is interesting, because this is exactly what we are going to have to do as well in dxil-spirv, but this is not exactly a strong guarantee. In practice, it should work fine, but I think this presents a good reason why structured control flow is a thing. I just wish IRs did not take the lazy route of just derping everything into unstructured control flow and expect implementations to guess the semantics. Sigh …

The classic subgroup breakage – maximal convergence gone awry

It’s time to go through the classic example of why we can reach catastrophic derpiness with unstructured control flow.

StructuredBuffer<uint> RO : register(t0);
RWStructuredBuffer<uint> RW : register(u0);

[numthreads(16, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
    uint v = RO[thr];
    uint result;
    while (true)
    {
        uint first = WaveReadLaneFirst(v);
        if (v == first)
        {
            result = WaveActiveSum(v);
            break;
        }
    }

    RW[thr] = result;
}

It’s crystal clear what the programmer’s intention is here. We partition a wave into sets, which then participate in a wave-op in isolation, then exit the loop. The loop is iterated N times. This is a very common waterfall pattern, but any experienced shader programmer will avoid this exact implementation of said pattern because it breaks over half the compilers out there … The real problem here is that according to unstructured control flow, the WaveActiveSum() is not inside the loop, because it cannot reach the continue block. Therefore we end up with completely broken code:

define void @main() {
...
  br label %6

; The loop, it does not do anything.
  %7 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %5)
  %8 = icmp eq i32 %5, %7
  br i1 %8, label %9, label %6

; Outside the loop.
  %10 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %5, i8 0, i8 1)
  call void @dx.op.bufferStore.i32(...)
  ret void
}

The wave operation now happens outside the loop, with all threads active, no matter what. It is impossible to deduce the programmer’s intention at this point. Structured control flow does not have these ambiguities, because the loop does not end until we reach the merge block.

The high level code would look something like:

// A degenerate loop.
for (;;)
{
  if (RO[thr] == subgroupBroadcastFirst(RO[thr]))
  {
    break;
  }
  else
  {
  }
}
RW[thr] = subgroupAdd(RO[thr]);

Indeed, the ACO ISA is devoid of branches here as the loop does nothing.

Funnily enough, while writing this, I found that the latest DXC versions actually go to great lengths to try (and fail!) to make this work somehow, through an absolutely disgusting trick:

@dx.break.cond = internal constant [1 x i32] zeroinitializer

define void @main() {
  %1 = load i32, i32* getelementptr inbounds ([1 x i32], [1 x i32]* @dx.break.cond, i32 0, i32 0)
  %2 = icmp eq i32 %1, 0
  %3 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 1, i32 0, i32 0, i1 false)
  %4 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
  %5 = call i32 @dx.op.threadId.i32(i32 93, i32 0)
  %6 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %4, i32 %5, i32 0)
  %7 = extractvalue %dx.types.ResRet.i32 %6, 0
  br label %8

; <label>:8 ; preds = %13, %0
  %9 = call i32 @dx.op.waveReadLaneFirst.i32(i32 118, i32 %7)
  %10 = icmp eq i32 %7, %9
  br i1 %10, label %11, label %13

; <label>:11 ; preds = %8
  %12 = call i32 @dx.op.waveActiveOp.i32(i32 119, i32 %7, i8 0, i8 1)
  br i1 %2, label %14, label %13

; <label>:13 ; preds = %11, %8
  br label %8

; <label>:14 ; preds = %11
  call void @dx.op.bufferStore.i32(i32 69, %dx.types.Handle %3, i32 %5, i32 0, i32 %12, i32 undef, i32 undef, i32 undef, i8 1)
  ret void
}

What on earth is going on here? Somehow, DXC recognizes the horrible mistake of unstructured control flow, and tries to hack around it by forcing the breaking path to have a reachable path to the continue block. It does this by conditionally branching to continue or breaking depending on a value that is always 0. It seems to do this in the most ridiculous way possible, by reading from a 1-element Private array which is zeroinitialized. This is probably done in an attempt to avoid optimizations which remove the branch. The high-level equivalent now avoids the maximal convergence:

const uint _13[1] = uint[](0u);

bool _21 = _13[0u] == 0u; // Always evaluates to true.
for (;;)
{
  _52 = _46.value == subgroupBroadcastFirst(_46.value);
  if (_52)
  {
    _53 = subgroupAdd(_46.value);
    if (_21)
      break;
    else
      continue;
  }
}

However, we now have a situation where convergence depends entirely on how smart the backend compiler is. A compiler can detect that _21 is constant and eliminate the continue branch as dead code. At the same time however, this also means that we regress back to maximal convergence and everything breaks.

This particular case works as expected on AMD/Intel D3D12 drivers, but apparently NVIDIA’s D3D12 driver optimizes this back to maximal convergence and even this nasty hack fails. Fun times … I’m not even sure I can call that a bug. It’s not like any of this is well defined in DXIL 🙂

This isn’t a fantasy example, Microsoft’s own examples trip this scenario as well:

// Is equivalent to writing a loop like this:
while (true) {
  if (WaveReadLaneFirst(expr) == expr) {
    sum = WavePrefixSum(value);
    break;
  }
}

Sigh … Loops are a real PITA.

Conclusion

This concludes CFG 101. Unfortunately, we still have not reached the actual juicy algorithms, but it took way too many words just to go through the bare minimum of CFG theory.

 

My personal hell of translating DXIL to SPIR-V – part 4

It has been quite a while since the last post where I discussed the binding model shenanigans of D3D12, and as promised, it’s time to look at AMD code-gen for various API patterns we find in D3D12 when translated to Vulkan and the ACO compiler in RADV.

At a fundamental level, the AMD ISAs haven’t changed all that much since the GCN 1.0 days, for good reason. Arguably, it’s still the cleanest and most flexible descriptor model of any GPU architecture and it’s a very useful architecture to understand at a low level.

As discussed last time, the same DXIL code can translate to many different code patterns in SPIR-V. The Root Signature object in D3D12 not only concerns itself with mapping binding points, but also controls the physical implementation of how resources are accessed. This means we have to deal with all of this junk in dxil-spirv somehow.

Using Fossilize as our cross-vendor RGA – fossilize-synth and fossilize-disasm

For the ISA experiments we conduct here, we’ll use fossilize-synth and fossilize-disasm. These CLI tools are part of the Fossilize project and they handle this kind of experiment quite handily. Unlike RGA, fossilize-synth can automatically generate a compatible VkPipelineLayout, VkRenderPass and PSO state for given shaders and record this as a Fossilize archive. We can then replay it with fossilize-disasm, which can target ISA output and dump it on any driver that fully supports VK_KHR_pipeline_executable_properties.

$ dxil-spirv --output test.spv test.dxil
$ fossilize-synth --output test.foz --comp test.spv
Fossilize INFO: Successfully synthesized a FOZ archive to /tmp/test.foz.
$ fossilize-disasm --target isa --output /tmp test.foz
Fossilize INFO: Dumping disassembly to: /tmp/888210fc26602b56.main.a2d87808e64b9164.comp

How many ways can we implement Constant Buffer View?

Welcome to hell.

Constant buffers can be implemented a million different ways, so let’s have a look at what we generate for an RDNA2 chip and some alternatives we have to use for other vendors. As a test shader, we start with this HLSL compute shader:

cbuffer Cbuf : register(b0)
{
  uint4 v;
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  RW[thr] *= v;
}

1. Root constant – Global Root Signature

Root constants translate naturally to push constants in Vulkan. The Vulkan GLSL representation of dxil-spirv output is:

#version 450
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _13;

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
} registers;

void main()
{
  uvec4 _33 = uvec4(registers._m0, registers._m1,
    registers._m2, registers._m3);
  _13._m0[gl_GlobalInvocationID.x] = 
    uvec4(_13._m0[gl_GlobalInvocationID.x].x * _33.x,
          _13._m0[gl_GlobalInvocationID.x].y * _33.y,
          _13._m0[gl_GlobalInvocationID.x].z * _33.z,
          _13._m0[gl_GlobalInvocationID.x].w * _33.w);
}
s_mov_b32 s0, s3
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], 0x0
v_lshl_add_u32 v0, s7, 6, v0
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)

// Interesting work begins here
buffer_load_dwordx4 v[4:7], v0, s[8:11], 0 offen
s_waitcnt vmcnt(0)
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s4
v_mul_lo_u32 v10, v6, s5
v_mul_lo_u32 v11, v7, s6
buffer_store_dwordx4 v[8:11], v0, s[8:11], 0 offen
s_endpgm

The first thing to understand about GCN and RDNA is the execution model. On basically any modern GPU, threads are executed in groups. “Waves”, “warps” and “subgroups” are all names for the same thing. “Wave” is the preferred AMD naming, so we’ll use that. An instruction here is either a wide vector instruction or a scalar instruction. Scalar instructions run once in the small scalar unit, and vector instructions run wide.

The split between scalar and vector registers matters here. v registers are per-lane (called VGPRs), and s registers are stored once per wave (called SGPRs). To translate this to x86-speak, SGPRs are normal registers like rax, rbx, etc. VGPRs would be AVX-2048 monsters, crunching 64 operations in parallel.

A critical performance consideration when targeting AMD is “scalarization”. When data is the same across a wave, we can place it in scalar registers. This keeps register pressure down. We have scalar registers in abundance, but using too many vector registers leads to poor occupancy. Shifting work from vector registers to scalar registers is therefore a sound strategy.

With this in mind, let’s look at the interesting bit:

s_mov_b32 s0, s3
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s4
v_mul_lo_u32 v10, v6, s5
v_mul_lo_u32 v11, v7, s6

This is the meat of the work. When execution begins, the root constants are already placed in s3, s4, s5 and s6. This is ideal. No need to load anything explicitly. This is possible because of “User SGPRs”. There is however a strict limit to how much “User SGPR” data we can hold. If we have too many constants, the driver is forced to insert indirection.

For example, consider:

cbuffer Cbuf : register(b0)
{
  uint4 v0;
  uint4 v1;
  uint4 v2;
  uint4 v3;
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  RW[thr] *= v0;
  RW[thr] *= v1;
  RW[thr] *= v2;
  RW[thr] *= v3;
}

Now we end up with:

s_mov_b32 s0, s3
s_movk_i32 s1, 0x8000
s_clause 0x1
s_load_dwordx4 s[16:19], s[0:1], 0x20
s_load_dwordx4 s[20:23], s[0:1], 0x30
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[0:3], s[2:3], 0x0
v_lshl_add_u32 v0, s12, 6, v0
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)
buffer_load_dwordx4 v[4:7], v0, s[0:3], 0 offen
s_waitcnt vmcnt(0)
v_mul_lo_u32 v4, v4, s4
v_mul_lo_u32 v5, v5, s5
v_mul_lo_u32 v6, v6, s6
v_mul_lo_u32 v1, v7, s7
v_mul_lo_u32 v4, v4, s8
v_mul_lo_u32 v5, v5, s9
v_mul_lo_u32 v2, v6, s10
v_mul_lo_u32 v1, v1, s11
v_mul_lo_u32 v3, v4, s16
v_mul_lo_u32 v4, v5, s17
v_mul_lo_u32 v2, v2, s18
v_mul_lo_u32 v1, v1, s19
v_mul_lo_u32 v8, v3, s20
v_mul_lo_u32 v9, v4, s21
v_mul_lo_u32 v10, v2, s22
v_mul_lo_u32 v11, v1, s23
buffer_store_dwordx4 v[8:11], v0, s[0:3], 0 offen
s_endpgm

Some of the root constants were promoted to User SGPR, but some were not and the driver had to spill those to memory somewhere, and load them.

s_load_dwordx4 s[16:19], s[0:1], 0x20
s_load_dwordx4 s[20:23], s[0:1], 0x30

Since the constants are the same for all threads, we can load them directly into SGPRs instead of doing silly broadcasts.

Scalar loads on AMD have special properties. First, they target the constant cache (K$) instead of the normal data caches. The K$ is not coherent with the data caches, so we can only use it if the address to load is scalar and we know the load does not alias with other writes. SSBO loads can actually use this path too if these constraints are met, which will come in handy in the next section.

2. Root descriptor – Global Root Signature – Buffer Device Address (BDA)

Root descriptors in D3D12 are designed to be raw pointers to data and they take up 64 bits in the root signature. It is up to the implementation to figure out what to do about the raw pointer, and in vkd3d-proton we prefer to translate this to push constant BDA on AMD.

Let’s first try with dynamically uniform address:

cbuffer Cbuf : register(b0)
{
  uint4 v[8];
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= v[wg];
}

The output Vulkan GLSL is … funky:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

struct AddCarry
{
  uint _m0;
  uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerUint4NonWrite;
layout(buffer_reference, buffer_reference_align = 16, std430)
readonly buffer PhysicalPointerUint4NonWrite
{
  uvec4 value;
};

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14;

layout(push_constant, std430) uniform RootConstants
{
  uvec2 _m0;
} registers;

void main()
{
  AddCarry _33;

  // Pointer arithmetic is fun >:|
  _33._m0 = uaddCarry(registers._m0.x, gl_WorkGroupID.x * 16u, _33._m1);
  PhysicalPointerUint4NonWrite _40 =
    PhysicalPointerUint4NonWrite(uvec2(_33._m0, registers._m0.y +
    _33._m1));
  _14._m0[gl_GlobalInvocationID.x] =
    uvec4(_14._m0[gl_GlobalInvocationID.x].x * _40.value.x,
          _14._m0[gl_GlobalInvocationID.x].y * _40.value.y,
          _14._m0[gl_GlobalInvocationID.x].z * _40.value.z,
          _14._m0[gl_GlobalInvocationID.x].w * _40.value.w);
}

In the BDA path of dxil-spirv, we encode the BDA in push constants, perform address calculation, cast that to a pointer and load. The critical part here is the NonWritable decoration. This signals to the compiler that the memory region can be assumed to not have been written to, even with aliasing in this shader module. This is subtly different from C const. Const can still observe aliasing through other pointers, but not in SPIR-V. This allows ACO to generate ideal code even from raw pointers:

s_mov_b32 s0, s3                   // S[3:4] holds BDA
s_lshl4_add_u32 s0, s5, s0         // S5 holds SV_GroupID.x here
s_add_u32 s1, s4, src_scc          // Handle carry for 64-bit + 32-bit
s_load_dwordx4 s[0:3], s[0:1], 0x0 // Scalar load from pointer
s_waitcnt vmcnt(0) lgkmcnt(0)
v_mul_lo_u32 v8, v4, s0
v_mul_lo_u32 v9, v5, s1
v_mul_lo_u32 v10, v6, s2
v_mul_lo_u32 v11, v7, s3

Nice. If we remove the NonWritable from the SPIR-V assembly, we get the depressing vector load instead, because the compiler can no longer prove there is no potential aliased write (we’re doing pointer arithmetic after all), so the load has to go through the vector unit:

 global_load_dwordx4 v[4:7], v[2:3], off

Of course, if the shader loads uniforms with a varying offset, we also need vector loads, e.g.:

cbuffer Cbuf : register(b0)
{
  uint4 v[8];
};

RWStructuredBuffer<uint4> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= v[thr & 7];
}
v_lshl_add_u32 v0, s5, 6, v0
v_and_b32_e32 v1, 7, v0
v_add_nc_u32_e32 v2, s0, v1                // Vectorized 64-bit + 32-bit
v_add_co_u32 v1, s[0:1], s0, v1
v_add_co_ci_u32_e64 v3, vcc, 0, s4, s[0:1]
global_load_dwordx4 v[4:7], v[2:3], off

The code-gen isn’t perfect here, but it accomplishes what we need for Root CBV. I am considering rewriting the pointer arithmetic for CBV to just use access chains instead, but that’s for another time …

(The existing codegen was designed for Root SRV / UAV which can do unaligned loads and stores in weird ways, and it is easier to do it this way with pointer arithmetic ._.)

(The reason we use uvec2 here is quirky. At the time, ACO only promoted 32-bit uints to User SGPRs, not 64-bit.)

3. Root constant – Global Root Signature – INLINE_UNIFORM_BUFFER workaround

If the driver does not support 256 byte push constants, we might be forced to fall back to inline UBO blocks instead of using push constants for massive root signatures. The generated code would look more or less the same as the BDA scenario shown above. Inline UBO on AMD is implemented in a way where the UBO data is stored at an offset from the descriptor set pointer.
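For reference, a hedged sketch of what the Vulkan side of that fallback roughly looks like. This is plain API usage for illustration, not vkd3d-proton’s actual code; 256 bytes is just the D3D12 root signature size limit:

#include <vulkan/vulkan.h>

VkDescriptorSetLayoutBinding make_root_parameter_inline_ubo_binding()
{
  VkDescriptorSetLayoutBinding binding = {};
  binding.binding = 0;
  // For inline uniform blocks, descriptorCount is the size in bytes.
  binding.descriptorType = VK_DESCRIPTOR_TYPE_INLINE_UNIFORM_BLOCK_EXT;
  binding.descriptorCount = 256;
  binding.stageFlags = VK_SHADER_STAGE_ALL;
  return binding;
}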

Now that BDA is widespread, we might want to nuke this workaround and instead implement it as a push constant BDA pointing to memory we allocate ourselves. This removes the need to deal with annoying descriptor sets just to handle this case.

4. Root descriptor – Global Root Signature – Plain Descriptor

Push descriptors aren’t very special on AMD, so they get lumped in with the others here. It just means the driver needs to allocate descriptor memory behind the scenes, which means extra CPU overhead. Descriptor paths on AMD have two indirections. This is why we prefer using BDA. For the example above with dynamically uniform address:

 RW[thr] *= v[wg];
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], 0x0
s_buffer_load_dwordx4 s[0:3], s[8:11], s0

Pointers to descriptor sets can be pushed in SGPRs. Descriptor memory lives in a 4 GB virtual memory range, so RADV only needs to push a 32-bit pointer per descriptor set; the upper half is synthesized from a constant. From here, we load a UBO descriptor. With the descriptor in SGPRs, we load the actual UBO data.

A thing to note here is how trivial it is to extend this to bindless / descriptor indexing. Just adjust the offset when loading the UBO descriptor and that’s all we need. No magic involved, descriptors are just memory in this model.

This path is preferred on NVIDIA however. Push descriptors are more “magical” there and the BDA path is very slow compared to that. 🙁

5. Root table – Global Root Signature

In root tables, we need to tap into descriptor indexing. The table parameter in D3D12 works like an offset into the ID3D12DescriptorHeap. This offset is placed in push constants, which we then use to offset into the heap. E.g. given this HLSL shader:

struct C { uint4 value; };
ConstantBuffer<C> CBVs[] : register(b10);
RWStructuredBuffer<uint4> RW : register(u1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBVs[wg].value;
}

we end up with Vulkan GLSL looking like:

#version 450
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// If VK_VALVE_mutable_descriptor_type, SSBO and BindlessCBV can share
// same descriptor binding.
layout(set = 4, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14[];

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _22[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
  uint _m4;
  uint _m5;
  uint _m6;
  uint _m7;
} registers;

void main()
{
  uint _29 = registers._m4 + 1u;
  uvec4 _52 = floatBitsToUint(_22[registers._m5 +
    (gl_WorkGroupID.x + 10u)]._m0[0u]);
  _14[_29]._m0[gl_GlobalInvocationID.x] =
    uvec4(_14[_29]._m0[gl_GlobalInvocationID.x].x * _52.x,
          _14[_29]._m0[gl_GlobalInvocationID.x].y * _52.y,
          _14[_29]._m0[gl_GlobalInvocationID.x].z * _52.z,
          _14[_29]._m0[gl_GlobalInvocationID.x].w * _52.w);
}

The ISA looks like:

v_lshl_add_u32 v0, s6, 6, v0 // Compute DispatchThreadID.x
s_add_u32 s0, s6, s5         // Add GroupID.x + root table offset
s_lshl_b32 s0, s0, 4         // sizeof(UBO) == 16
s_mov_b32 s6, s3             // Build pointer to set
s_movk_i32 s7, 0x8000

// Load descriptor by index.
  // Missing constant offset here. LLVM disassembler bug I think ...
  // Raw opcode is f4080203 000000a0
s_load_dwordx4 s[8:11], s[6:7], s0

...
s_buffer_load_dwordx4 s[4:7], s[8:11], 0x0

Fundamentally, this is no different from legacy descriptors. The offset is just no longer constant.

6. Root table – Global Root Signature – NonUniformResourceIndex

Adding more spice, we’ll make the resource index non-uniform. Descriptors on AMD have to live in scalar registers for good reasons, so for this case we will see a common pattern in advanced compute: waterfall loops.

struct C { uint4 value; };
ConstantBuffer<C> CBVs[] : register(b10);
RWStructuredBuffer<uint4> RW : register(u1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBVs[NonUniformResourceIndex(thr % 3)].value;
}

This just translates to nonuniformEXT() in Vulkan GLSL, so there’s nothing interesting to see here, but the ISA is unrecognizable.

BB0:
s_lshl_b32 s4, s4, 4
s_mov_b32 s0, s3
s_movk_i32 s3, 0x8000
s_load_dwordx4 s[8:11], s[2:3], s4
v_lshl_add_u32 v0, s6, 6, v0
v_mul_hi_u32 v1, v0, 0xaaaaaaab
v_lshrrev_b32_e32 v1, 1, v1
v_lshl_add_u32 v1, v1, 1, v1
v_sub_nc_u32_e32 v1, v0, v1
v_lshlrev_b32_e32 v0, 4, v0
s_waitcnt lgkmcnt(0)
buffer_load_dwordx4 v[4:7], v0, s[8:11], 0 offen
v_add_lshl_u32 v1, s5, v1, 4
v_add_nc_u32_e32 v1, 0xa0, v1

// REGION OF INTEREST - BEGIN
s_mov_b64 s[2:3], exec
BB1:
v_readfirstlane_b32 s1, v1
v_cmp_eq_i32_e32 vcc, s1, v1
s_and_saveexec_b64 s[4:5], vcc
s_cbranch_execz BB9
BB2:
s_mov_b32 s6, s0
s_movk_i32 s7, 0x8000
s_load_dwordx4 s[12:15], s[6:7], s1
s_waitcnt lgkmcnt(0)
s_buffer_load_dwordx4 s[12:15], s[12:15], 0x0
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v1, s15
v_mov_b32_e32 v2, s14
v_mov_b32_e32 v3, s13
v_mov_b32_e32 v8, s12
s_andn2_b64 s[4:5], s[4:5], exec
s_cbranch_scc0 BB10
BB9:
s_mov_b64 exec, s[4:5]
s_branch BB1

// REGION OF INTEREST - END
BB10:
s_mov_b64 exec, s[2:3]
s_waitcnt vmcnt(0)

// Actually do the math.
v_mul_lo_u32 v8, v4, v8
v_mul_lo_u32 v9, v5, v3
v_mul_lo_u32 v10, v6, v2
v_mul_lo_u32 v11, v7, v1

// Store result.
buffer_store_dwordx4 v[8:11], v0, s[8:11], 0 offen
s_endpgm

This is … a bit more complicated, but now we get to study the cornerstone of SIMD-style control flow – execution masking – in more detail.

The principle behind waterfall loops is that we need to partition a wave into sets which have the same value. The more divergence in a wave, the more loop iterations are needed. For the sake of simplicity in the example below, we can assume a wave size of 8 (it’s actually 64 here).

The descriptor indices could be something like: index = { 0, 1, 2, 0, 1, 2, 0, 1 }. The waterfall loop looks like this in pseudo-code:

uint dynamic_index = thr % 3;
for (;;)
{
  uint index = subgroupBroadcastFirst(dynamic_index);
  if (index == dynamic_index)
  {
    desc = load_descriptor(index);
    data = load_data_from_descriptor(desc);
    break; // Stay tuned for structurization blogs, harharharhar :V
  }
}

In the first iteration, we look at our active thread mask (0xff), find the first bit set, which is 0. Lane 0 holds the value 0.

v_readfirstlane_b32 s1, v1 // A compute enthusiast's best friend!

Any other lane with the same value can go ahead, so lane 0, 3 and 6 will enter the branch.

v_cmp_eq_i32_e32 vcc, s1, v1
s_and_saveexec_b64 s[4:5], vcc

The index is now known to be constant over the wave, so we can now load the UBO descriptor and data, but the result must be moved to vector registers since we don’t load from wave uniform resource indices here:

s_buffer_load_dwordx4 s[12:15], s[12:15], 0x0
// The v_mov is only executed for active lanes.
v_mov_b32_e32 v1, s15
v_mov_b32_e32 v2, s14
v_mov_b32_e32 v3, s13
v_mov_b32_e32 v8, s12

The lanes which participate are now masked out, and the exec mask would be 0xff & ~(0x01 | 0x08 | 0x40). In the next iteration, we look at the first active lane, which has the value 1, and so we roll on until all unique values have been processed. Neat!

s_cbranch_scc0 BB10 // Keep going until we've masked out all exec bits.

7. ResourceDescriptorHeap – SM 6.6

This is essentially the same as the root table case, so I’ll omit the ISA here. The main problem is DXIL having to define a completely different set of opcodes for SM 6.6, but that’s another story <_<. Overall it is simpler. Root tables are translated to SM 6.6 automatically in dxil-spirv after all.

struct C { uint4 value; };
RWStructuredBuffer<uint4> RW : register(u0, space1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  ConstantBuffer<C> CBV = ResourceDescriptorHeap[wg];
  RW[thr] *= CBV.value;
}
#version 450
#extension GL_EXT_nonuniform_qualifier : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _13[];

layout(set = 1, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _18;

void main()
{
  uvec4 _34 = floatBitsToUint(_13[gl_WorkGroupID.x]._m0[0u]);
  _18._m0[gl_GlobalInvocationID.x] =
    uvec4(_18._m0[gl_GlobalInvocationID.x].x * _34.x,
          _18._m0[gl_GlobalInvocationID.x].y * _34.y,
          _18._m0[gl_GlobalInvocationID.x].z * _34.z,
          _18._m0[gl_GlobalInvocationID.x].w * _34.w);
}

8. Root table / ResourceDescriptorHeap – Pascal workaround

On older NVIDIA hardware, we cannot use bindless UBO at all, as hardware does not support it. The workaround is to use bindless SSBO instead, with hilarious performance loss as a result :’)
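
As an illustration, here is a rough hand-written sketch (in the style of the SM 6.6 example above, not actual dxil-spirv output, with made-up block names) of what the SSBO fallback could look like:

#version 450
#extension GL_EXT_nonuniform_qualifier : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// Same binding point as the BindlessCBV UBO array would use, but declared as a
// read-only SSBO so hardware without bindless UBO support can still index it.
layout(set = 0, binding = 0, std430) readonly buffer BindlessCBVFallback
{
  vec4 _m0[];
} cbvs[];

layout(set = 1, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} outputs;

void main()
{
  uvec4 value = floatBitsToUint(cbvs[gl_WorkGroupID.x]._m0[0u]);
  outputs._m0[gl_GlobalInvocationID.x] *= value;
}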

9. Root constant – Local Root Signature

Local root signatures are an abomination. There is no fossilize-synth support for RTPSOs at this time, so I’ll resort to just showing the Vulkan GLSL equivalent. In Vulkan, the shader record block works just like an SSBO, but D3D12 insists on the abstracted model of root signatures, so when emitting code we need to consider if a resource belongs to the local root signature or global root signature.

struct CBVData { float4 v; };
ConstantBuffer<CBVData> SBTRootConstant : register(b0, space15);
ConstantBuffer<CBVData> SBTRootConstant2 : register(b1, space15);

struct Payload
{
  float4 color;
  int index;
};

[shader("miss")]
void RayMiss(inout Payload payload)
{
  payload.color = SBTRootConstant.v;
  payload.color += SBTRootConstant2.v;
}
#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct _19
{
  vec4 _m0;
  uint _m1;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  // ... Local root signature parameters.
  // The exact binary layout is implied by the root signature
  // parameter order.
} SBT;

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  // ... Global root signature parameters
} registers;

layout(location = 0) rayPayloadInEXT _19 payload;

vec4 _67;

void main()
{
  vec4 _41 = uintBitsToFloat(uvec4(SBT._m0[0u], SBT._m0[1u], SBT._m0[2u], SBT._m0[3u]));
  vec4 _57 = uintBitsToFloat(uvec4(SBT._m1[0u], SBT._m1[1u], SBT._m1[2u], SBT._m1[3u]));
  // Ah, yes, the classic LLVM composite insert pattern from undef vector.
  // You thought DXIL didn't support composites? Think again! :D
  vec4 _66 = _67;
  _66.x = _41.x + _57.x;
  vec4 _68 = _66;
  _68.y = _41.y + _57.y;
  vec4 _69 = _68;
  _69.z = _41.z + _57.z;
  vec4 _70 = _69;
  _70.w = _41.w + _57.w;
  payload._m0 = _70;
}

10. Root descriptor – Local Root Signature

Unlike the global root signature, we are forced to use BDA here. It is physically impossible to translate this to push descriptors since we have no chance to observe this pointer on CPU side. The GPU can generate this pointer on its own. It’s a bit hilarious that we can use raw pointers like this on D3D12 since forever, but we cannot actually use the pointer directly in the shader 😐

The SPIR-V is similar to the global root signature case; we just have to source the BDA from the shader record block instead.

AddCarry _31;
_31._m0 = uaddCarry(SBT._m6.x, 0u * 16u, _31._m1);
PhysicalPointerFloat4NonWrite _38 =
  PhysicalPointerFloat4NonWrite(uvec2(_31._m0, SBT._m6.y + _31._m1));
vec4 _57 = _58;
_57.x = payload._m0.x + _38.value.x;
vec4 _59 = _57;
_59.y = payload._m0.y + _38.value.y;
vec4 _60 = _59;
_60.z = payload._m0.z + _38.value.z;
vec4 _61 = _60;
_61.w = payload._m0.w + _38.value.w;
payload._m0 = _61;

11. Root table – Local Root Signature

Cursed mode. In global root signatures, a table costs 1 DWORD, but in local it’s 2, because we are expected to consume a full GPU VA which points to descriptors. Since we work with offsets, we synthesize a fake GPU VA instead in ID3D12DescriptorHeap::GetGPUDescriptorHandleForHeapStart(), which we can trivially convert to an offset and then sample the global heap, SM 6.6 style … The GPU VA is faked with the formula (unique_value << 32) + (offset * DescriptorIncrement).
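
As a rough sketch (hypothetical helper name, not actual dxil-spirv output), recovering the heap offset from such a fake VA is just a matter of looking at the low 32 bits, assuming the descriptor increment is a power of two:

// The fake VA is carried in the shader record as a uvec2.
// Low 32 bits: offset * DescriptorIncrement, high 32 bits: unique_value.
uint heap_offset_from_fake_va(uvec2 fake_va, uint log2_descriptor_increment)
{
  return fake_va.x >> log2_descriptor_increment;
}

In the generated GLSL below, this decode shows up as a plain shift of the shader record entry.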

struct CBVData { float4 v; };
ConstantBuffer<CBVData> SBTCBVs[] : register(b4, space15);

struct Payload
{
  float4 color;
  int index;
};

[shader("miss")]
void RayMiss(inout Payload payload)
{
  payload.color += SBTCBVs[payload.index].v;
}
#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct _25
{
  vec4 _m0;
  uint _m1;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  uvec2 _m2;
  uvec2 _m3;
  uvec2 _m4;
  uvec2 _m5;
  uvec2 _m6;
  uvec2 _m7;
  uvec2 _m8;
  uvec2 _m9;
  uvec2 _m10;
} SBT;

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
  vec4 _m0[4096];
} _24[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  // ... Global root parameters
} registers;

layout(location = 0) rayPayloadInEXT _25 payload;

vec4 _62;

void main()
{
  // Extract offset in descriptor heap from VA.
  // The shift amount is configurable.
  uint _42 = ((SBT._m9.x >> 5u) + 13u) + payload._m1;
  vec4 _61 = _62;
  // Force nonuniformEXT here when indexing from SBT
  // since it's a bit unclear what dynamically uniform even means for RT.
  _61.x = payload._m0.x + _24[nonuniformEXT(_42)]._m0[0u].x;
  vec4 _63 = _61;
  _63.y = payload._m0.y + _24[nonuniformEXT(_42)]._m0[0u].y;
  vec4 _64 = _63;
  _64.z = payload._m0.z + _24[nonuniformEXT(_42)]._m0[0u].z;
  vec4 _65 = _64;
  _65.w = payload._m0.w + _24[nonuniformEXT(_42)]._m0[0u].w;
  payload._m0 = _65;
}

So there we have it, at least 11 different ways we can implement CBV in Vulkan from the exact same DXIL input … Oh never mind, make that 12 if we add the -no-legacy-cbuf-layout flag to DXC.

12. CBufferLoad vs LoadLegacy

Non-legacy loads force scalar loads for everything 🙁 It’s also horribly buggy in DXC and drivers, but that’s a story for another time. With the legacy model, we only really had to consider loading 128 bits at once with a clean index into a float4 array, but now we have a byte-offset based interface. Fortunately, basically no content uses this feature. It’s still opt-in in DXC despite being introduced with SM 6.0.

struct C { uint4 value; };
struct C64 { uint64_t4 value; };
RWStructuredBuffer<uint4> RW : register(u2);
ConstantBuffer<C> CBV : register(b0);
ConstantBuffer<C64> CBV64 : register(b1);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID, uint wg : SV_GroupID)
{
  RW[thr] *= CBV.value;
  RW[thr] *= uint4(CBV64.value);
}
#version 460
#extension GL_ARB_gpu_shader_int64 : require
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_scalar_block_layout : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 4, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _14[];

// We got descriptor aliasing, no problem!
// We also have actual scalar packing of uniforms :)
layout(set = 5, binding = 0, scalar) uniform BindlessCBV
{
  float _m0[16384];
} _21[];

layout(set = 5, binding = 0, scalar) uniform _25_28
{
  double _m0[8192];
} _28[];

layout(push_constant, std430) uniform RootConstants
{
  uint _m0;
  uint _m1;
  uint _m2;
  uint _m3;
  uint _m4;
  uint _m5;
  uint _m6;
  uint _m7;
} registers;

void main()
{
  uint _35 = registers._m4 + 2u;
  uint _42 = registers._m5 + 1u;

  // Please make it stop ...

  _14[_35]._m0[gl_GlobalInvocationID.x] = uvec4(
    _14[_35]._m0[gl_GlobalInvocationID.x].x *
      floatBitsToUint(_21[registers._m5]._m0[0u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].y *
      floatBitsToUint(_21[registers._m5]._m0[1u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].z *
      floatBitsToUint(_21[registers._m5]._m0[2u]),
    _14[_35]._m0[gl_GlobalInvocationID.x].w *
      floatBitsToUint(_21[registers._m5]._m0[3u]));

  _14[_35]._m0[gl_GlobalInvocationID.x] = uvec4(
    _14[_35]._m0[gl_GlobalInvocationID.x].x *
      uint(doubleBitsToUint64(_28[_42]._m0[0u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].y *
      uint(doubleBitsToUint64(_28[_42]._m0[1u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].z *
      uint(doubleBitsToUint64(_28[_42]._m0[2u])),
   _14[_35]._m0[gl_GlobalInvocationID.x].w *
      uint(doubleBitsToUint64(_28[_42]._m0[3u])));
}

Ugh … There isn’t too much we can do though; the DXIL is hardcoded to be scalar here despite all other buffer load interfaces being vectorized to some degree at least.

%5 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 0, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
// Note the broken alignment, cute. :')
%6 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 4, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%7 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 8, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%8 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 12, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

Without robustness, ACO manages to vectorize this garbage at least:

s_buffer_load_dwordx4 s[4:7], s[8:11], 0x0
s_buffer_load_dwordx4 s[8:11], s[12:15], 0x0
s_buffer_load_dwordx4 s[12:15], s[12:15], 0x10

This also holds even for robustness2, but if we start dynamically indexing the CBV like this, things collapse into noise since the DXIL is doing insane things with the byte offset here.

%5 = shl i32 %4, 4
%6 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %5, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%7 = or i32 %5, 4
%8 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %7, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

// ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ. I'm out!
%9 = add nsw i32 %7, 4

%10 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %9, i32 8) ; CBufferLoad(handle,byteOffset,alignment)
%11 = or i32 %5, 12
%12 = call i32 @dx.op.cbufferLoad.i32(i32 58, %dx.types.Handle %3, i32 %11, i32 8) ; CBufferLoad(handle,byteOffset,alignment)

Sad compiler noises …

Bonus round

Other descriptor types?

At a fundamental level, once we have studied CBVs, there isn’t that much more to cover, except for some key edge cases. The only real differences are how big the descriptors are and the opcodes used with the descriptors, like image_sample, image_load, image_write, etc., but that’s not terribly interesting for this post, so I’ll leave it here.

Raw SRV / UAV vectorization

DXIL raw buffer opcodes are in general quite unfriendly to human-readable representations, and they map very poorly to SSBOs. The robustness rules are also horrible to work with, and making this all work was not fun. After a significant rework, dxil-spirv is somewhat competent now, I think.

The raw buffer interfaces do make sense from an ISA-targeting compiler point-of-view though, I just wish it was a bit more structured for the sake of my sanity … :V

Byte address buffers

ByteAddressBuffers are very annoying. In HLSL, there is only scalar alignment required for a load-store with no way for HLSL shaders to signal intended alignment. There is also a very strange robustness rule. Even for a vector load-store, robustness is per-component at a 16 byte granularity. (why …)

With a simple shader like this, we can do a good job:

RWByteAddressBuffer RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  float4 values = RW.Load<float4>(thr * 16 + 16);
  values += 1.0;
  RW.Store<float4>(thr * 16 + 16, values);
}

When we attempt to vectorize these loads and stores, we must be able to prove that the address expression is divisible by the load-store size, here 16. We can split the addition and compute the resulting element index: thr + 1. We then declare an aliased SSBO variant as float4[].

#version 460
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(set = 0, binding = 0, std430) buffer SSBO
{
  uvec4 _m0[];
} _10;

void main()
{
  vec4 _29 = uintBitsToFloat(_10._m0[gl_GlobalInvocationID.x + 1u]);
  _10._m0[gl_GlobalInvocationID.x + 1u] = uvec4(
    floatBitsToUint(_29.x + 1.0),
    floatBitsToUint(_29.y + 1.0),
    floatBitsToUint(_29.z + 1.0),
    floatBitsToUint(_29.w + 1.0));
}

The way we used to implement this was through 4 unrolled scalar load-stores. E.g. if the address is changed to something weird like thr * 4 + 20 for the store, we cannot prove vectorization and we fall back to the old, bad code path.

#version 460
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// These are declared with Aliased decorations in SPIR-V.
layout(set = 0, binding = 0, std430) buffer SSBO
{
  uint _m0[];
} _9;

layout(set = 0, binding = 0, std430) buffer _12_14
{
  uvec4 _m0[];
} _14;

void main()
{
  uint _22 = gl_GlobalInvocationID.x << 4u;
  uvec4 _30 = _14._m0[gl_GlobalInvocationID.x + 1u];
  vec4 _33 = uintBitsToFloat(_30);
  uint _46 = gl_GlobalInvocationID.x + 5u;
  _9._m0[_46] = floatBitsToUint(_33.x + 1.0);
  _9._m0[_46 + 1u] = floatBitsToUint(_33.y + 1.0);
  _9._m0[_46 + 2u] = floatBitsToUint(_33.z + 1.0);
  _9._m0[_46 + 3u] = floatBitsToUint(_33.w + 1.0);
}

This is a case where robustness2 can matter quite a lot actually … Without robustness, ACO will generate:

v_add_f32_e32 v8, 1.0, v4
v_add_f32_e32 v9, 1.0, v5
v_add_f32_e32 v10, 1.0, v6
v_add_f32_e32 v11, 1.0, v7
buffer_store_dwordx4 v[8:11], v0, s[4:7], 0 offen offset:20

Memory operations on AMD are very flexible w.r.t. alignment fortunately! With robustness2 however, we cannot safely vectorize 🙁

buffer_store_dword v4, v0, s[4:7], 0 offen offset:20
buffer_store_dword v5, v0, s[4:7], 0 offen offset:24
buffer_store_dword v6, v0, s[4:7], 0 offen offset:28
buffer_store_dword v1, v0, s[4:7], 0 offen offset:32

If we emit the vectorized load-store ourselves in the SPIR-V, we get the vectorized load-store as expected.

The case with float3[] vectorization is very special. With scalar block layout we can implement it, but we need to consider the 16 byte robustness rule. On AMD, we get per-component robustness in hardware, but not on NVIDIA it seems, so for ByteAddressBuffer, float3[] can only conditionally be vectorized. We’re probably going a bit out of spec on Vulkan here I think, but it’s backed up by tests in the vkd3d-proton test suite.

RW.Store<float3>(thr * 12, values.xyz);

ends up as

layout(set = 0, binding = 0, scalar) buffer _12_14
{
  uvec3 _m0[];
} _14;

 _14._m0[gl_GlobalInvocationID.x] = uvec3(
  floatBitsToUint(_41.x + 1.0),
  // Individual components might be sliced here. >_<
  floatBitsToUint(_41.y + 1.0),
  floatBitsToUint(_41.z + 1.0));
buffer_store_dwordx3 v[4:6], v0, s[4:7], 0 offen

Structured buffer

Structured buffers are a bit simpler, but still annoying in practice. While the HLSL declares something like:

struct T { float4 a; float3 b; float c; };
RWStructuredBuffer<T> Buf;

This information is lost in the raw DXIL form except in some reflection metadata that we cannot fully trust. The interface for structured buffers is just:

  • Index (usually not constant)
  • Offset into structured element (usually constant)
  • Load-store size
  • Alignment (ignored, DXC emits nonsense here)

Ideally, we would be able to emit the equivalent in SSBO form, but DXIL’s raw form no longer cares about the data types. We can only get effective vectorization for composites like this if the stride aligns with the vector size.

An example here would be:

struct T { float4 a; float3 b; float c; };
RWStructuredBuffer<T> RW : register(u0);

[numthreads(64, 1, 1)]
void main(uint thr : SV_DispatchThreadID)
{
  T t = RW[thr];
  t.a += 1.0;
  t.b += 2.0;
  t.c += 3.0;
  RW[thr] = t;
}
// Alignment of 4? Sure ... Except we trivially know it's 32 ...
// Told you we cannot trust DXC here! <_<
%3 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 0, i8 15, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

%8 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 16, i8 7, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

%12 = call %dx.types.ResRet.f32 @dx.op.rawBufferLoad.f32(i32 139, %dx.types.Handle %1, i32 %2, i32 28, i8 1, i32 4) ; RawBufferLoad(srv,index,elementOffset,mask,alignment)

Since the stride is 32, we can vectorize the float4, but not the float3.

// float4 load
vec4 _29 = uintBitsToFloat(_14._m0[gl_GlobalInvocationID.x * 2u]);

// float3 load
uint _37 = (gl_GlobalInvocationID.x * 8u) + 4u;
uint _40 = _9._m0[_37];
uint _44 = _9._m0[_37 + 1u];
uint _47 = _9._m0[_37 + 2u];
vec3 _50 = uintBitsToFloat(uvec3(_40, _44, _47));

// float load
uint _58 = _9._m0[(gl_GlobalInvocationID.x * 8u) + 7u];

// float4 store
_14._m0[gl_GlobalInvocationID.x * 2u] = uvec4(
  floatBitsToUint(_29.x + 1.0), floatBitsToUint(_29.y + 1.0),
  floatBitsToUint(_29.z + 1.0), floatBitsToUint(_29.w + 1.0));

// float3 store
uint _79 = (gl_GlobalInvocationID.x * 8u) + 4u;
_9._m0[_79] = floatBitsToUint(_50.x + 2.0);
_9._m0[_79 + 1u] = floatBitsToUint(_50.y + 2.0);
_9._m0[_79 + 2u] = floatBitsToUint(_50.z + 2.0);

// float store
_9._m0[(gl_GlobalInvocationID.x * 8u) + 7u] =
  floatBitsToUint(uintBitsToFloat(_58) + 3.0);

buffer_store_dwordx4 v[12:15], v0, s[4:7], 0 offen
buffer_store_dwordx4 v[4:7], v0, s[4:7], 0 offen offset:16

ACO is being an absolute champ here though, and even with robustness2 it can handle it. I think this happens because it can prove the address increments cannot overflow due to a POT index multiplier.

One saving grace of structured buffers is that it is undefined behavior to straddle an element boundary. This means there is no byte address buffer hell of having to support per-component robustness. We can safely vectorize e.g.

RWStructuredBuffer<float3> Buf;
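
A hand-written sketch (not actual dxil-spirv output) of how that could look with scalar block layout, assuming element-granular robustness is good enough on the target GPU:

#version 460
#extension GL_EXT_scalar_block_layout : require
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

// scalar layout gives the uvec3[] a tight 12 byte stride, matching the
// StructuredBuffer<float3> element size. Robustness only needs to be checked
// once per element, since straddling an element boundary is undefined anyway.
layout(set = 0, binding = 0, scalar) buffer SSBO
{
  uvec3 _m0[];
} _10;

void main()
{
  vec3 v = uintBitsToFloat(_10._m0[gl_GlobalInvocationID.x]);
  _10._m0[gl_GlobalInvocationID.x] = floatBitsToUint(v + 1.0);
}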

Root descriptor of ray tracing acceleration structure

If you ever wonder why OpConvertUToAccelerationStructureKHR exists, you can thank yours truly for creating a truly horrible monster.

struct Payload
{
  float4 color;
};

RaytracingAccelerationStructure AS_RootDesc : register(t0, space0);

[shader("raygeneration")]
void RayGen()
{
  RayDesc ray;
  ray.Origin = float3(1, 2, 3);
  ray.Direction = float3(0, 0, 1);
  ray.TMin = 1.0;
  ray.TMax = 4.0;

  Payload p;
  p.color = float4(1, 2, 3, 4);
  TraceRay(AS_RootDesc, RAY_FLAG_NONE, 0, 0, 0, 0, ray, p);
}
#version 460
#extension GL_EXT_ray_tracing : require
#extension GL_EXT_nonuniform_qualifier : require

struct _16
{
  vec4 _m0;
};

layout(shaderRecordEXT, std430) buffer SBTBlock
{
  uint _m0[5];
  uint _m1[6];
  uvec2 _m2; // Why hello there. I'm a cursed RTAS.
} SBT;

layout(location = 0) rayPayloadEXT _16 _18;

void main()
{
  _18._m0 = vec4(1.0, 2.0, 3.0, 4.0);
  traceRayEXT(accelerationStructureEXT(SBT._m2),
    0u, 0u, 0u, 0u, 0u,
    vec3(1.0, 2.0, 3.0), 1.0,
    vec3(0.0, 0.0, 1.0), 4.0, 0);
}

Oh no!

Conclusion

Now you probably know a little too much about descriptors, and you can go bug IHVs about this with confidence. D:

There’s a million different ways to do things when it comes to descriptors and D3D12 and Vulkan took completely different approaches here. Developers asked how things work and D3D12 said “… yes”. In response I say ÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆÆ.

My personal hell of translating DXIL to SPIR-V – part 3

It’s time we tackle one of the big problems of DXIL: the binding model. The D3D12 binding model is completely foreign to most people who know the Vulkan binding model, and vice versa. I don’t think there are that many people in the world who can claim to fully grasp the binding models in both APIs. Translating every last detail of the D3D12 binding model to Vulkan is extremely painful, and I feel D3D12 made some critical design mistakes which bite us (and native drivers?) hard. Whenever I hear people naively claim D3D12 and Vulkan are basically the same API, I cringe hard and cry a little inside. Translating low level APIs is hellish when the details don’t map 1:1 exactly, and the binding model is the perfect vehicle to demonstrate it.

I hope this blog post can serve as a definitive document on the insanity we need to go through in vkd3d-proton to make all of this work well. We have landed on a solution I feel is quite solid for AMD, but perhaps less so on other IHVs … A lot of credit here goes to Doitsujin who went through the insane task of rewriting the entire binding model in vkd3d-proton to the full TIER_3 binding model last year.

Earlier posts in the series: part 1 and part 2.

The starting point of D3D12 – allow reuse of D3D11-era shaders

Before we can discuss D3D12, we must have a basic understanding of the D3D11 binding style we find in SM 5.0. From there, we will see how this was extended to the Frankenstein monster that is SM 5.1 DXBC, and how this was cribbed wholesale in SM 6.0+.

Binding resources to slots

The binding model in D3D11 was extremely simple, simple enough that we have no problems implementing it in Vulkan 1.0 as-is. Essentially, you can bind a fixed number of:

  • SRV (t#): Generic read-only resources, sampled images, read-only SSBO and uniform texel buffers
  • UAV (u#): Generic read-write resources, storage images, read-write SSBO and storage texel buffers
  • Samplers (s#): Separate samplers, same as Vulkan
  • CBV (b#): Constant buffer views, i.e. UBOs in Vulkan

Each resource in the HLSL declares which register it binds to, easy enough! This maps very easily to Vulkan. The main difference is that the bindings are per stage in D3D11, but these details are easy to work around. Another awkward detail of D3D11 is that SRVs and UAVs are catch-alls for any kind of read-only or read-write resource, even though they cover many different descriptor types. Sampled images, raw buffers and typed buffers are quite different things! Of course, with a slot-based binding model, the implementation can resolve these details easily since we know all accessed resources up front. No bindless hell to deal with.

Now, we imagine that game developers are sitting on a large repository of D3D11-era shaders and D3D12 rolls around. Very few were going to risk going D3D12 exclusive any time soon, so it had to be possible to ship the exact same DXBC shaders to target both APIs. Even to this day, most new D3D12 games ship DXBC to target both APIs! (Maintaining two completely different shader backends is soooo much fun and productive use of time …) The center of gravity is slowly shifting towards DXIL, and it’ll probably take a few more years before DXIL is the main bytecode games ship. I still haven’t seen a single Unreal Engine title shipping DXIL-exclusive yet for example.

One key aspect of the D3D11 model is that the shaders say nothing about how data is accessed. Especially buffer data can be implemented in many different ways, and it would have been up to the D3D11 driver to figure out how to implement constant buffers efficiently for example. Of course, with the highly managed model of that API, the driver had ample opportunity to do so, at the cost of CPU overhead. With explicit APIs like D3D12 and Vulkan, there is far less room for drivers to optimize because the application developer is given certain freedoms, but they are expected to do some work in return. In reality, the IHV will probably add hacks behind your back if your performance is important enough to them, but that’s another story.

With buffer data like constant buffers for example, there are three primary indirection levels where we can trade off speed vs. space:

  • Preloaded registers: The absolute fastest mode of execution. If a constant buffer is tiny (a few u32s), it might as well fit entirely inside the register bank of the shader core, no memory operation required in the shader!
  • Store a pointer in preloaded registers: The second fastest mode. We can take one indirection to be able to access a larger chunk of data with a simple load instruction from pointer.
  • Access a descriptor: The “slowest” mode, with two indirections. First, we need to load a descriptor, then, we load data based on that descriptor. We also get robustness guarantees, which is another special kind of hell to handle.

As we see here, these three styles of buffer access have different trade-offs. The smaller the data we need to consider, the more efficient we can make the access. A few u32s of data? Preload registers. A few KiB? Store a pointer to it in registers. Many different buffers with different base pointers? A descriptor is the way to go.

The awkward part of this simplified model is that on certain hardware, using descriptors might be the fast path, so we’ll have to be careful. IHVs have optimized these paths to death over the decades.

The critical part for D3D12 here is that there is no way to express these concerns in the HLSL shader itself, whereas we certainly can in GLSL / SPIR-V!

A “virtual” vs. “physical” binding model

In Vulkan, we can express concerns like these in the SPIR-V for the most part.

  • Preloaded registers: Push constants is designed to map directly to this scheme. Some drivers might also be able to promote INLINE_UNIFORM_BUFFER to registers, especially if the descriptor set is not UPDATE_AFTER_BIND.
  • Store a pointer in preloaded registers: In Vulkan, we can place buffer device addresses in a push constant block and load directly from that. This is somewhat esoteric, but we use this a lot in vkd3d-proton. INLINE_UNIFORM_BUFFER also maps to this scheme. If we consider a descriptor set to be a pointer, inline UBO data can be placed directly in that descriptor memory, and we get one indirection. Implementation details between IHVs tends to be wildly different for inline UBO.
  • Access a descriptor: Just normal descriptor sets. If the descriptor set is not UPDATE_AFTER_BIND, a driver might be able to promote a buffer to a pointer in preloaded registers. This model has two indirections. One to load a descriptor, one to load the data from that descriptor.

Since we’re able to express ourselves fairly explicitly in Vulkan in the shader itself, I’ll call this a “physical” binding model, or at least a close enough approximation. A true physical binding model would be one where descriptors are accessed through raw pointers, but good luck convincing N IHVs to agree on how that should work 🙂 Inline UBO is the odd exception to this rule where the pipeline layout specifies access patterns which are not expressed in SPIR-V.
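
To make the mapping concrete, here is a hand-written GLSL sketch (not vkd3d-proton output, all names made up) showing the three styles side by side: root-constant-like data living directly in push constants, a buffer device address stored in the same push constant block, and a plain UBO descriptor:

#version 460
#extension GL_EXT_buffer_reference : require

// Style 3: access through a descriptor, two indirections.
layout(set = 0, binding = 0, std140) uniform PerDraw
{
  vec4 tint;
} per_draw;

// Style 2: a raw pointer we load through, one indirection.
layout(buffer_reference, std430) readonly buffer MaterialData
{
  vec4 base_color;
};

// Style 1: data preloaded into registers via push constants.
// The same block also carries the device address used for style 2.
layout(push_constant, std430) uniform Registers
{
  MaterialData material;
  vec4 constant_color;
} registers;

layout(location = 0) out vec4 out_color;

void main()
{
  out_color = registers.constant_color +
    registers.material.base_color +
    per_draw.tint;
}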

One fair criticism of Vulkan’s model is that there are a lot of ways to do effectively the same thing. Should you use push constants, inline UBO, UBO, UBO with dynamic offset, push constant buffer device address, push descriptors or normal pool descriptors? Hard to say without lots of experience and profiling. At least we have the tools available in the API, even if it’s not always obvious what the optimal choice is. In vkd3d-proton we use almost all of these, depending on the context.

To lament the state of graphics programming a bit, loading constant data optimally is still an unsolved problem in graphics, but we have accelerated ray traced global illumination at least, so yay? 🙂 To be fair, it’s hard to go wrong with plain old Vulkan 1.0 UBOs. Any perf gains beyond that tends to be minor and highly situational.

The abstract nature of D3D12 binding model

D3D12’s model is far more abstract, but at the same time, it maps to extremely specific restrictions on an implementation. It’s a weird model where it feels like an abstraction, but it actually isn’t. We can see right through it. We’ll explore this point in the root signature section …

Shader model 5.1 – I wanna bindless too!

5.1 is a weird version of DXBC. It changes some critical things:

  • Resources can be declared as arrays with dynamic indexing. Unbounded array size is also supported. Bindless, baby!
  • register() bindings now take an optional space parameter. This seems similar to DescriptorSet in SPIR-V, but it actually isn’t. It’s meaningless on its own. It is necessary however when using a lot of unbounded array size declarations. In DXBC, all t# or u# registers for example would be exhausted by unbounded array sizes, but by using different register spaces we can declare multiple unbounded arrays. When referring to a resource, we need to consider its binding, space, shader stage and type (SRV, UAV, CBV vs Sampler).

5.1 is kinda obsolete now that DXIL (6.0) has been around for a while, but it is still shipped in various titles.

The root signature

As mentioned earlier, HLSL shaders give no hints to the driver about how resources are actually accessed, so this information has to come from somewhere. In D3D12, we provide all this information in the root signature. The root signature is analogous to Vulkan’s VkPipelineLayout, but the details are completely different.

I think it’s helpful to understand D3D12 root signatures from a top-down view, and the point I would like to get across is that we should see the root signature as a very fancy push constant layout declaration.

The root signature defines an ABI for shaders. We lay down up to 256 bytes in memory, and then define how those 256 bytes can be used to access resources in various ways. The weird way root signatures work is that there’s a very explicit limit of a certain number of bytes, but we cannot actually access those bytes ourselves in a shader. The 256 bytes are organized in terms of 64 DWORDs. Each DWORD can be viewed in a certain way.

  • 32_BIT_CONSTANT: Consume a certain number of DWORDs and assign them to a CBV with specific register binding. That CBV’s data cannot be accessed with any dynamic array indices. This enforces the hidden implementation detail that the data can be mapped directly to registers. As an interesting tidbit, Vulkan push constants actually don’t have this restriction. In Vulkan, you can still dynamically index push constants as long as the access is dynamically uniform (yes, some hardware can dynamically index registers, it’s neat!)
  • Root descriptor: Consume 2 DWORDs to store a 64-bit GPU virtual address to any kind of raw buffer. CBV, UAV and SRV all work as long as they are untyped. The key thing to note here is that there is no room for a buffer size, so buffer robustness is completely disabled for root descriptors. All of this is just a very roundabout way to say that we store a pointer and access it as a pointer. This is basically a very restricted form of VK_KHR_buffer_device_address which is hidden from the developer. The driver is of course free to turn this into something like a Vulkan push descriptor if it so chooses. In vkd3d-proton, we’d love if games actually made use of root CBVs, but oh well 🙁 Some engines got the memo though 🙂
  • Table pointer: The cost is 1 DWORD, even though it should be 2 (we’ll get to this later, it gets confusing!). This is where the D3D12 binding model starts to get esoteric and diverge completely from Vulkan. We’ll need a separate section to cover this.
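
To make the byte budget concrete: a root signature with, say, 8 root constants (8 DWORDs), two root descriptors (2 × 2 DWORDs) and three descriptor tables (3 × 1 DWORD) would consume 15 of the 64 available DWORDs.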

The global descriptor heaps and table pointers

Descriptors are managed in a completely different way than Vulkan and this is the point that gives us a pretty big headache.

In D3D12, you’re expected to allocate one heap which contains N descriptors. N goes up to 1,000,000 in the highest binding tier. All descriptors you access in a shader must exist inside this heap, and the heap must be explicitly bound to a command list before use. For a given command list, you’re not really expected to change it, but you can suballocate, and 1 million descriptors should be enough unless we take the Nanite meme far enough where we need to render one material per pixel. Samplers have their own heap, and there’s a limit of a few thousand here.

From the global heap, you’re able to bind sub-ranges of that heap. These sub-ranges are represented as the table pointers in the root signature. Essentially, a table pointer encodes a u32 offset into the global heap – even if you give the API a GPU VA – which is why it costs 1 DWORD instead of 2 (for now …).

Apparently, this idea is a workaround for certain GPUs. There exists a literal “descriptor palette” in some hardware and we can only read descriptors from this “palette” essentially. D3D12 seems to cater to that model, at the cost of flexibility. You’re asked as a developer to own that palette / heap and allocate memory on top of that.

View objects work completely differently as well. In D3D12, a view object is created directly into a descriptor heap, rather than having separate VkImageView and VkBufferView objects. Copying descriptors is a critical part of D3D12 descriptor management. It’s possible to create non-shader-visible descriptor heaps, which is basically just a fancy malloc(). Here, it’s intended that we can stream out descriptors to the shader-visible descriptor heap by calling CopyDescriptors over and over, often over 10000 times per frame! Making this path efficient enough has been a ton of work and has placed some serious restrictions on our implementation. There is only one way to do it “fast” and correctly; it just took 3-4 iterations to get there.

In Vulkan, the approach is quite different. We create view objects as standalone objects, which are then written into descriptor sets with vkUpdateDescriptorSets. If we squint enough, the Vulkan view objects represent the non-shader visible descriptor heaps, and vkUpdateDescriptorSets is closer to CopyDescriptors (non-shader visible → shader visible) than Create*View. Profiling native drivers shows this clearly. Create*View() is ~50-100x slower than CopyDescriptors, similar to Vulkan if you look at vkCreate*View() vs. vkUpdateDescriptorSets.

In the Vulkan model, there is no concept of a “descriptor heap” which has to be bound; however, we can figure out what is really going on under the hood. We can allocate a lot of view objects in Vulkan, but eventually, we will actually hit out-of-memory conditions on some hardware. This is a good indication that we have actually exhausted the internal descriptor “palette”. The views merely contain references to that global palette instead.

Another headache is that view objects in D3D12 do not have a lifetime of their own. The lifetime is tied to the descriptor heap itself, as a descriptor is essentially treated as plain old data. This is not the case with Vulkan. The unfortunate side effect of this scenario is that we need to maintain a per-ID3D12Resource hash-map in vkd3d-proton where we keep view objects alive until we know for sure that we can destroy the views, i.e. at resource destruction time. The older implementation in vkd3d-proton used reference-counted views, but as you can expect, when games copy 10000+ descriptors per frame in many threads concurrently, the overhead was unusably large. The shit hit the fan so to speak in Death Stranding where we spent >80% of CPU time copying descriptors, not a good look. The only solution was to use the descriptor copy functionality (VkCopyDescriptorSet) of vkUpdateDescriptorSets, which is considered quite esoteric in Vulkan.

So how do shaders access the global descriptor heap? Well, through a lot of fixed function jank, that’s how! This jank is more or less fixed in SM 6.6 as we’ll get to later, but now we just have yet another API to implement, support and test, sigh …

First, we start with the table parameter that is pushed to the GPU through the command list API. This represents an offset into the descriptor heap. A table entry has a certain number of descriptor ranges associated with it. These ranges specify that a certain subset of descriptors will be consumed as a certain type of descriptor (SRV, UAV, CBV or sampler). For example, consider a Texture2D[] bound at register(t10, space4). A table entry in the root signature might say “Parameter #5 has an SRV range which begins at space #4, register #8, contains an unbounded number of descriptors, with a constant offset of 15 descriptors from the table entry.” When accessing this resource, we should access the descriptor heap at descriptor #tableEntry5 + (10 – 8) + 15 + dynamicIndex in the shader. As we see, the registers and spaces in the HLSL have no physical meaning until the root signature gives them meaning.

Codegen to SPIR-V

We can map the entire root signature into two parts: the root parameters and the immutable sampler declaration.

For example, take this shader:

cbuffer CPush : register(b0) { float4 cpush; };
cbuffer C : register(b1) { float4 c; };
cbuffer C2 : register(b2) { float4 c2; };

Texture2D<float4> T[] : register(t40);
SamplerState S : register(s30);

float4 main(uint index : INDEX, float uv : UV) : SV_Target
{
	return T[NonUniformResourceIndex(index)].Sample(S, uv) +
		cpush + c + c2;
}

A typical codegen here would be something like (via SPIRV-Cross):

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct AddCarry
{
    uint _m0;
    uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerFloat4NonWrite;
layout(buffer_reference, std430) readonly buffer PhysicalPointerFloat4NonWrite
{
    vec4 value;
};

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
    vec4 _m0[4096];
} _21[];

// The root parameters map to push constants.
layout(push_constant, std430) uniform RootConstants
{
    uvec2 _m0; // Root CBV for b1
    uint _m1;  // Various table offset root parameters
    uint _m2;  // ...
    ...
    uint _m9;  // Root constants
    uint _m10;
    uint _m11;
    uint _m12;
} registers;

layout(set = 0, binding = 0) uniform texture2D _14[];
layout(set = 2, binding = 0) uniform sampler _25[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in float UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    uint _46 = registers._m6 + 2u;

    // Textures can only be accessed through heap.
    // Here we see address computation being based on all three inputs.
    // Push constant + constant + dynamic index
    // All of this is wrapped in a nonuniform thing for extra spice.
    vec4 _66 = texture(nonuniformEXT(sampler2D(_14[registers._m1 + (INDEX + 40u)], _25[registers._m3 + 30u])), vec2(UV));

    // Root constants, directly copy from push constants.
    // The cbufferLoad instruction gives us constant offsets, so we can
    // directly access push constant members here.
    vec4 _84 = uintBitsToFloat(uvec4(
        registers._m8, registers._m9, registers._m10, registers._m11));

    // Root CBV, here implemented with buffer device address,
    // kinda ugly SPIR-V, but gotta go fast.
    // This works great on AMD, not so much on NVIDIA … but we can
    // use a normal plain UBO instead if we want.
    // Ah, the luxury of giving the implementation a choice … :)
    
    AddCarry _98;
    _98._m0 = uaddCarry(registers._m0.x, 0u * 16u, _98._m1);
    PhysicalPointerFloat4NonWrite _105 = PhysicalPointerFloat4NonWrite(uvec2(_98._m0, registers._m0.y + _98._m1));

    // Bindless UBO action, table CBV here we go …
    // Not a great look for performance on some GPUs.
    // On Pascal in particular, we have to emit bindless SSBO instead, rip perf.
    // Not much we can do sadly.
    SV_Target.x = ((_84.x + _66.x) + _105.value.x) + _21[_46]._m0[0u].x;
    SV_Target.y = ((_84.y + _66.y) + _105.value.y) + _21[_46]._m0[0u].y;
    SV_Target.z = ((_84.z + _66.z) + _105.value.z) + _21[_46]._m0[0u].z;
    SV_Target.w = ((_84.w + _66.w) + _105.value.w) + _21[_46]._m0[0u].w;
}


It’s common to think of the table parameters as equivalents to Vulkan descriptor sets, but that is misleading. The correct way to think about it is that D3D12 has one huge descriptor set (technically two, counting the sampler heap), and for convenience you can push offsets into that heap as desired. In Vulkan, descriptor sets can be allocated and bound in complete isolation. Shader Model 6.6 removes all of this pretense and shows it as it really is. Everything is just an index into the heap, and the pre-6.6 world tried to hide this fact.

My main problems with the binding model

Very loose coupling between root signature and table descriptor access

The only thing the root signature does is define how offsets into the heap are computed. There is no knowledge about the array-ed-ness of a resource, nor whether every resource is even valid at record time (the exception here is STATIC descriptors, but alas, those are extremely rare in the wild). Unfortunately, this loose nature of root signature and shader matching complicates descriptor hoisting, and when you complicate hoisting, you better pray the target hardware has pristine bindless support (pro-tip, it doesn’t).

A root signature could say that “we have 100 CBVs in this range”, but shaders might just happen to use a few CBVs from that range. Doing descriptor work per shader is pretty gross and goes against the spirit of the modern graphics APIs. Aggressive hoisting and repacking is CPU overhead that we shouldn’t consider except in exceptional circumstances. A native driver might have a much easier time dealing with these things since they can modify their own command streams at the last minute if they want. In D3D12, SIMULTANEOUS_ACCESS command lists don’t exist, so the driver has this option. Perhaps this restriction was put in place precisely for these scenarios? Who knows, we can only armchair these things!

The D3D12 docs also encourage developers to use somewhat generic root signatures and rarely change them, so a root signature is likely going to be quite “fat” compared to an ideal Vulkan pipeline layout, making useful workarounds even more difficult …

Awkward and unnecessary aliasing

In a D3D12 descriptor heap, there is no descriptor type. Descriptor #10 can be an SRV and descriptor #11 can be a CBV; it’s all a few bits between friends. Surely, games are not going to commit the grave sin of placing the wrong descriptor type at the wrong index, right? … Right?! Of course they do. Debugging insanity like this was the prelude to vkd3d-proton’s fast descriptor QA checking mechanism. It’s actually quite fast. In AAA games, we have code like this running at 40 FPS usually. More than good enough I’d say when hunting for bugs.

void descriptor_qa_report_fault(uint fault_type, uint heap_offset, uint cookie, uint heap_index, uint descriptor_type, uint actual_descriptor_type, uint instruction)
{
    uint _63 = atomicAdd(QAGlobalData.fault_atomic, 1u);
    if (_63 == 0u)
    {
        QAGlobalData.failed_cookie = cookie;
        QAGlobalData.failed_offset = heap_offset;
        QAGlobalData.failed_heap = heap_index;
        QAGlobalData.failed_descriptor_type_mask = descriptor_type;
        QAGlobalData.actual_descriptor_type_mask = actual_descriptor_type;
        QAGlobalData.failed_instruction = instruction;
        QAGlobalData.failed_shader_hash = uvec2(291u, 0u);
        memoryBarrierBuffer();
        QAGlobalData.fault_type = fault_type;
    }
}

uint descriptor_qa_check(uint heap_offset, uint descriptor_type_mask, uint instruction)
{
    uint _98 = QAHeapData.descriptor_count;
    uint _100 = QAHeapData.heap_index;
    uvec2 _102 = QAHeapData.cookies_descriptor_info[heap_offset];
    uint _110 = QAGlobalData.live_status_table[_102.x >> 5u];
    // Fault mask: bit 0 = heap index out of bounds, bit 1 = descriptor type mismatch,
    // bit 2 = resource is no longer live (already destroyed).
    uint _121 = (uint(heap_offset >= _98) |
                 (((_102.y & descriptor_type_mask) == descriptor_type_mask) ? 0u : 2u)) |
                (((_110 & (1u << (_102.x & 31u))) != 0u) ? 0u : 4u);
    if (_121 != 0u)
    {
        descriptor_qa_report_fault(_121, heap_offset, _102.x, _100, descriptor_type_mask, _102.y, instruction);
        // Redirect the access to the reserved NULL descriptor at the end of the heap.
        return _98;
    }
    return heap_offset;
}

void main()
{
    uint _45 = descriptor_qa_check(registers._m1 + (INDEX + 40u), 1u, 1u);
    vec4 _132 = texture(nonuniformEXT(sampler2D(_14[_45], _18[registers._m3 + 30u])), vec2(UV));
    SV_Target.x = _132.x;
    SV_Target.y = _132.y;
    SV_Target.z = _132.z;
    SV_Target.w = _132.w;
}


With this scheme we can check for:

  • Is the descriptor type correct?
  • Is the resource destroyed already?
  • Did we index the heap out of bounds?

If we hit failures, we get a neat atomic cookie in host memory which we can sample at regular intervals on the CPU. It works surprisingly well and is tailor-made for our needs, which is likely why it’s so fast compared to generic alternatives. We also only report the first failure. With this mode enabled, we allocate N + 1 descriptors, and reserve the last descriptor as a NULL descriptor which we can OpSelect on failure to avoid GPU hangs. (Well, if someone did a faulty atomic compswap loop on a NULL descriptor I guess we’d time out, but whatever, not much we can do about that.)

Forcing the addition of VK_VALVE_mutable_descriptor_type

The combination of weirdly aliased descriptor types and full bindless is pretty deadly since Vulkan has typed descriptors, for good reason! This unfortunate mismatch forced us to create many descriptor sets, one per type, which looks very dumb. It consumes enough memory that we actually ran into serious performance issues on AMD hardware in the beginning. To fix all the immediate issues, we developed VK_VALVE_mutable_descriptor_type, which is kinda gross, but it lets us implement D3D12’s weird aliased descriptors directly, and all the problems went away. It’s a pretty goofy model though and not something Vulkan developers should be using directly. Not all descriptors have the same size in hardware, so there are gaps between descriptors, which I’m pretty sure is the case for native D3D12 as well, rip K$ … The more natural model would have been to have descriptor heaps where equally sized objects were allocated in different heaps, but we have to deal with whatever problems D3D12 throws our way and make it work …

Another “nice” side effect of using mutable was that some GPU hangs went away. If the game used the wrong descriptor type, it seemed to at least not read a descriptor that pointed to already freed memory, but rather just a descriptor of the wrong type. Somehow, this helped certain games to run around the time the extension was released. It is deeply disturbing that games can ship in this state. :\

A game engine targeting Vulkan can be smarter about all these things. For example, having 1M sampled images seems reasonable, as that’s the number one thing you’d use bindless for, but do you need 1M CBVs, fully bindless? I doubt it. I’m far more in favor of a model where normal, packable, hoistable descriptors are used by default, and bindless is only tapped into as needed.

Enforces a fully bindless implementation

When implementing this binding model in other APIs, we’re forced to make everything fully bindless. Everything is accessed in terms of the global descriptor heap, and even if 99% of shaders use a very traditional D3D11-style binding model, we cannot do anything smart here because of one critical design flaw of D3D12 …

The great VOLATILE mistake of Root Signature 1.0

Root Signature 1.0 really doubles down on forcing full bindless everywhere, even when shaders don’t actually need it. All descriptors are considered VOLATILE, which means that we require all the Vulkan flags that are opt-in:

  • UPDATE_AFTER_BIND
  • UPDATE_WHILE_PENDING
  • PARTIALLY_BOUND
  • VARIABLE_COUNT

This fact completely blocks us from hoisting descriptors on the fly. Even if the shader is simple enough, we’re royally screwed because there is no guarantee that descriptors are actually valid at the time of command recording. Almost all games still cling to Root Signature 1.0, and probably rely on driver heroics to work around any performance implication. Table CBV on Pascal GPUs for example is disgustingly slow in vkd3d-proton. In Vulkan, these GPUs don’t even support UPDATE_AFTER_BIND UBOs! I can only wonder what utter depravities the D3D12 driver engineers had to do to make this work well … Hoist descriptors late with device generated commands based on the PSO? Bleh. The API forces us to implement these as bindless SSBO, and I die a little inside every time I have to think about this. Root CBV works perfectly fine, yet games insist on using table CBVs with VOLATILE just because …

On AMD, everything is fully bindless anyways, as descriptors are just memory, but it’s unfortunate that the D3D12 API went VOLATILE by default. It really should have been opt-in. It’s not like the poor Pascal souls can actually buy new GPUs these days even if they wanted to 🙂

The second mistake of Root Signature 1.1 – STATIC

The mistake of VOLATILE-by-default was recognized by the time 1.1 rolled around, and STATIC was now made the default. STATIC is the nicest mode possible for drivers, since it fully allows hoisting of descriptors, but you actually lose robustness guarantees (!). Most big game engines probably looked at this for 2 seconds and noped out.

D3D in general has gone to extreme lengths to ensure robustness guarantees, and engines of course rely on every esoteric OOB scenario to not fail. This is hopefully not by design, but engine bugs get masked for years if they never cause issues in practice. Losing that guarantee means no one is going to risk random GPU crashes in the wild for just a potential performance increase.

The actual fix that no game ever uses – STATIC + BOUNDS_CHECKED

The actual fix is fairly obscure. At some point a more reasonable mode was added which preserves bounds checking. Of course, no game I know of actually uses this. vkd3d-proton can take advantage of STATIC / BOUNDS_CHECKED CBVs in some cases where we will hoist them to push descriptors for significant NV perf gains, but alas, when no games use the APIs as intended, it might as well not exist …

Immutable sampler jank

The docs imply that the driver is supposed to implement this with an internal hashmap of sampler objects. Oddly enough, these sampler objects live outside the normal sampler heap you bind to the command list, and in fact, you don’t have to bind anything to use immutable samplers; it’s intended to be implemented through driver magic.

This magic is certainly odd, but in Vulkan we have to allocate and bind immutable samplers to a command buffer. This is because very little hardware actually supports true immutable samplers in the sense that they are embedded in the shader code itself. While we still have to allocate descriptor sets, at least we don’t have to actually write the immutable samplers to the descriptor set.

On AMD, the descriptors are read directly from scalar registers, so there’s nothing stopping a motivated compiler from emitting a bunch of immediate register moves and going to town, but sadly, most hardware doesn’t work like this, and I don’t think drivers even take advantage of this possibility yet.

My big question is how this is even supposed to work on D3D12 if the samplers have to live somewhere in memory. If the shader is simply storing constant offsets into some hidden sampler heap, then surely those constants are not stable across different runs, and thus shader caching fails. Patching in new constants seems pretty gross, but I can only speculate … Either way, in vkd3d-proton we end up creating a VkDescriptorSet per root signature which holds all immutable samplers for a given root signature. This is automatically bound when flushing dirty state in a draw or dispatch call.

Painful and awkward raw buffer types

In Vulkan we are blessed with the SSBO, the flexible and versatile buffer. D3D12 is a bit weirder and has extremely specific alignment and robustness guarantees which don’t map cleanly to anything.

First, we have ByteAddressBuffer.

  • Alignment requirement of buffer binding, 16 bytes (good)
  • Load1,2,3,4 variants, with 4 byte alignment requirement (2 for 16-bit). SM 6.2 improved this a fair bit, with a templated type, but for us in DXIL land, very little changes.
  • Robustness is checked per-component (why ._.)
  • Loads and stores are done at specific byte offsets, which is super ugly, but mostly for shader authors

The only natural way to implement this is a uint[] array, where we unroll loads and stores per component. Fun times. I would be surprised if hardware actually supported vectorized load-store with per-component robustness at scalar alignment …

The second style is StructuredBuffer which represents a buffer as a T data[] array. One deeply frustrating aspect of this is that data structures are tightly packed, meaning T could be float3, and we’d get a 12 byte stride. StructuredBuffers must be bound in terms of elements, not byte offsets, so it’s perfectly valid to bind a structured buffer at a 12 byte offset. Sadly, this cannot be implemented safely on GPUs which report 16 byte SSBO alignment. Due to all this, it’s almost impossible to cleanly declare a vec3 data[]; inside the SPIR-V, even if we have scalar block layout support. Just like ByteAddressBuffer, we kinda have to unroll loads and stores with uint data[] again, shame … We have gone to great lengths to support an “offset” buffer, which means we can bind an SSBO at 16 byte alignment, and nudge the offset inside the shader to fix things up, but it is horrible.

There are some mind-melting optimizations in flight for dxil-spirv which can vectorize these things, but it’s not a perfect solution. Weird robustness and alignment rules force us to do some pretty horrible things, and we only have a few escape hatches we can exploit; dxil-spirv will try hard to squeeze every last legal vectorization opportunity out of the existing rules.

For example, StructuredBuffers have a rule saying that robustness only needs to be checked once, based on the structure element that is being accessed. For dynamic access into the element itself, it is undefined behavior to straddle the element boundary. Think StructuredBuffer<float4> where you do a dynamic access into the components, and the index overflows.

I’m wondering if hardware actually has a special descriptor type it uses to implement structured buffers. Perhaps the same one which is used for vertex attribute fetch? And yes, even AMD uses tbuffer with the stride pulled from descriptors, so saying it’s fully programmable fetch is a bit misleading I think.

Texel buffers are not a valid solution either. They can be bound at scalar alignment, which is nice, but we cannot safely use vectorized load-store either. Texel buffers can only be read and written fully. No write masks allowed. Another problem is mixing and matching 16-bit load-store, which completely broke any hope of using texel buffers. In Vulkan, we can redeclare an SSBO multiple times with different data layouts, which is basically a human-readable version of ByteAddressBuffer.

It’s a bit frustrating to deal with these problems, because almost no sane application is going to run into the edge cases, or need all the esoteric robustness features, but we have to make every possible edge case work. We cannot take shortcuts. To demonstrate the insanity we have to go through on certain GPUs …

StructuredBuffer<float3> B;

float3 main(uint index : INDEX) : SV_Target
{
	return B[index];
}

Since this can be bound at offset 12, we might see this horrible code. The offset buffer is associated with the descriptor heap, and we allocate a separate side channel buffer for this purpose.

layout(set = 1, binding = 0, std430) restrict readonly buffer SSBO_Offsets
{
    uvec2 _m0[];
} _13;

layout(set = 1, binding = 1, std430) restrict readonly buffer SSBO
{
    uint _m0[];
} _18[];

layout(push_constant, std430) uniform RootConstants
{
   // root parameters
} registers;

layout(location = 0) flat in uint INDEX;
layout(location = 0) out vec3 SV_Target;

void main()
{
    // Load extra offset + actual range. We cannot bind the SSBO tightly enough
    // on some GPUs.
    // Since the access is not nonuniform, we can give a strong hint to compiler
    // that the offsets can be loaded as a broadcast.
    uvec2 _37 = _13._m0[subgroupBroadcastFirst(registers._m1)] >> uvec2(2u);
    uint _41 = INDEX * 3u;
    // Trip explicit OOB if we fail.
    uint _47 = (_41 < _37.y) ? (_41 + _37.x) : 1073741820u;
    // Unroll 3 loads. ;_;
    vec3 _60 = uintBitsToFloat(uvec3(_18[registers._m1]._m0[_47], _18[registers._m1]._m0[_47 + 1u], _18[registers._m1]._m0[_47 + 2u]));
    SV_Target.x = _60.x;
    SV_Target.y = _60.y;
    SV_Target.z = _60.z;
}

The horrible workarounds

We have a fair amount of issues to deal with. A lot of invalid API behavior in D3D12 happens to work on native drivers, and many games ship with subtle bugs which force us to do horrible things. IHVs are going to have a fun time in 5 years when hardware details change 🙂

What is a raw buffer anyways?

A surprisingly common bug is that D3D12 does not complain if you create a typed buffer when you actually intended to create a raw buffer (ByteAddressBuffer), and vice versa. The difference is just one flag away and is easily missed. For us, that means SSBO vs texel buffer, a catastrophic error, but apparently D3D12 does not care, and drivers will happily consume typed buffers as raw and vice versa. I have no idea how this works, but somehow it does (I wrote tests)! Either way, it got bad enough that when emitting buffer descriptors, we have to emit both an SSBO variant and a texel buffer variant, and then the shader can pick the descriptor it wants. Of course, this being bindless and decoupled, we cannot pick the correct descriptor at draw time, sigh … Sometimes I think “bindless all the things” is going to be considered a huge mistake in 10 years from a software ecosystem point of view.
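Shader-side, the two buffer flavors look like this (a made-up illustration of mine); the HLSL declaration is unambiguous, but whether a texel buffer or an SSBO descriptor actually lands in the heap depends on that one view-creation flag, which is why we end up writing both variants:

// Typed buffer: expects a texel buffer descriptor on the Vulkan side.
Buffer<float4> Typed : register(t0);
// Raw buffer: expects an SSBO descriptor on the Vulkan side.
ByteAddressBuffer Raw : register(t1);

float4 main(uint index : INDEX) : SV_Target
{
	// If the application created either view with the wrong flag,
	// native drivers still tend to cope, so each heap entry has to
	// carry both an SSBO variant and a texel buffer variant.
	return Typed[index] + asfloat(Raw.Load4(index * 16));
}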

What is a NULL descriptor anyways?

A reasonably common bug is the NULL descriptor with wrong type. Technically in D3D12, NULL descriptors are still typed and you have to pass in a valid resource desc when creating one, but mistakes happen all the time. Using the wrong type works perfectly fine on existing hardware in almost all cases (I wrote a test!). For this reason, vkd3d-proton takes the conservative approach, and writing a NULL descriptor means splatting out NULL descriptors to all the different types of descriptors we have in the heap. For MUTABLE, this means two sets, for non-mutable, six! Voila, various GPU hangs disappeared, just like that.

VOLATILE is really bad for validation

Having simple validation is critical for development, and I think the real hidden mistake of VOLATILE being the default is that only GPU-assisted validation can catch any bugs! GPU validation is very heavy and usually avoided when possible. I feel a ton of these bugs could have been caught if descriptors could have been validated on the CPU timeline. As long as an invalid descriptor is not dynamically accessed by a shader, it’s all OK! Dynamically accessed is the critical wording here. With this requirement, we can only validate at submit time if we can prove ahead of time that a shader will generate invocations which then statically access an invalid descriptor. Good luck proving that fragment shader invocations happen 100% of the time! Don’t get me wrong, bindless is very powerful and great for the use cases it enables, but derping everything into bindless doesn’t seem like the correct approach to me.
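To make the “dynamically accessed” wording concrete, here is a contrived shader of my own; as long as the branch never executes at run-time, the descriptor at index can be complete garbage and that is still legal, and proving on the CPU timeline whether those invocations will ever happen is hopeless:

Texture2D<float4> Tex[] : register(t0);

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	// The descriptor at Tex[index] may be stale or never written.
	// That is only a problem if this branch actually runs, which
	// cannot be proven on the CPU timeline at submit time.
	if (index != ~0u)
		return Tex[NonUniformResourceIndex(index)].Load(int3(uv, 0));
	return float4(0.0, 0.0, 0.0, 0.0);
}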

Local root signatures – one great kludge to rule them all

Local root signatures add yet another layer of hell to the binding model. It was annoying enough to implement a dozen different ways to access resources, and now we have to do it all over again, but this time replacing push constants with SBTs, plus some additional edge cases that are specific to local root signatures!

Local root signatures are used specifically for the shader record tables in ray tracing pipelines. In Vulkan, this data is accessed directly as a buffer, but D3D12 went the way of describing a view instead, where each parameter is laid out in memory one after the other. As usual, table pointers are a thing, but this time they consume 2 DWORDs instead of 1. Why on earth were table pointers described in terms of GPU VA and not just a simple u32 “offset into heap”? Sometimes I find D3D12’s obsession with GPU VA-s quite bizarre and counter-productive. It doesn’t even have buffer device address support, yet we’re using VA-s everywhere, even when it doesn’t make much sense at all. In vkd3d-proton, we had to do the janky thing of returning artificial GPU VA-s for descriptor heaps so that it would be fast to convert a VA into an offset.
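For reference, a local root signature can be declared directly in HLSL as a state object subobject, roughly like this sketch of mine (names are made up); in the shader record, the 32 byte shader identifier is followed by the root parameters laid out one after the other, with the descriptor table handle taking a full 8 byte GPU VA (the 2 DWORDs mentioned above):

LocalRootSignature MyLocalRS =
{
	"RootConstants(num32BitConstants = 4, b1), DescriptorTable(SRV(t2, numDescriptors = 4))"
};

SubobjectToExportsAssociation MyLocalRSAssociation =
{
	"MyLocalRS", "MyHitGroup"
};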

While root descriptors in global root signatures can be implemented as plain descriptors, root descriptors in local root signatures force us to use buffer device addresses, since the VA is literally sourced from buffer memory. The fun thing is that you can even place RT acceleration structures here as a literal VA! And this is how OpConvertUToAccelerationStructureKHR was born … 🙂

Local root signatures also add the headache of supporting immutable samplers! This time, we have to defer the creation of a second immutable sampler descriptor set to RTPSO creation, which is super annoying. At bind time, we might therefore run the risk of having to bind one immutable sampler set for the global root signature, and then another which is tied to the union of all local root signatures for the RTPSO, so much fun! >_<

Implementing SM 6.6 bindless

SM 6.6 is actually straightforward. Our model in vkd3d-proton was always to map pre-6.6 style bindings to an SM 6.6 style “access the heap directly” model, so this translation is actually not that bad at all. The main annoyance is that there are now 5 (!) opcodes in DXIL to deal with resource handle creation, each with its own idiosyncrasies. I’m pretty sure all of this could have been squeezed into 2 …

  • CreateHandle
  • CreateHandleForLib
  • CreateHandleFromHeap
  • CreateHandleFromBinding
  • AnnotateHandle

Fun times indeed …

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	Texture2D<float4> T = ResourceDescriptorHeap[index];
	return T.Load(int3(uv, 0));
}

#version 460
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_samplerless_texture_functions : require

layout(set = 0, binding = 0) uniform texture2D _9[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in vec2 UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _32 = texelFetch(_9[INDEX], ivec2(uvec2(uint(int(UV.x)), uint(int(UV.y)))), int(0u));
    SV_Target.x = _32.x;
    SV_Target.y = _32.y;
    SV_Target.z = _32.z;
    SV_Target.w = _32.w;
}

Nice. I recently completed SM 6.6 support for vkd3d-proton, have fun with that!

How this maps to AMD GPUs (teaser for next post)

If anything, it feels like the D3D12 binding model is tailored pretty well to AMD hardware. It probably makes sense given the Mantle heritage. The binding model on GCN is extremely neat and tidy and was designed with full bindless in mind since the early days, quite forward looking indeed. Descriptors are just memory that is loaded into scalar registers and passed directly to the various operations which take descriptors. Very little magic is going on here, and once you understand how this works, the modern API binding models start making some sense. Root parameters and push constants map nicely to user SGPRs, and as long as we don’t use too many u32s, we have a very direct mental model of how this is going to map to hardware.

Where D3D12 suffers I think is on hardware where there are still remnants of legacy descriptor slots in hardware. We cannot take advantage of this due to poor design decisions made in D3D12, but it is what it is. There is only so much we can do as a translation layer.

We’ll explore this topic in a future post …

Conclusion

It has taken many, many months, if not over a year to get where we are today in vkd3d-proton. Handling all of this insanity pushed us to the breaking point, but somehow it works pretty well. There are no shortcuts we can take.

Despite the mammoth length of this post, I don’t think we’re quite done with the subject of binding models. I’ll probably need another post to cover codegen examples for AMD ISA as well as an overview of the million ways we can convert the same DXIL code to different SPIR-V depending on the root signature.