My personal hell of translating DXIL to SPIR-V – part 3

It’s time we tackle one of the big problems of DXIL, the binding model. The D3D12 binding model is completely foreign to most people who know the Vulkan binding model, and vice versa. I don’t think there are that many people in the world who can claim to fully grasp the binding models in both APIs. Translating every last detail of the D3D12 binding model to Vulkan is extremely painful, and I feel D3D12 made some critical design mistakes which bite us (and native drivers?) hard. Whenever I hear people naively claim D3D12 and Vulkan is basically the same API, I cringe hard and cry a little inside. Translating low level APIs is hellish when the details don’t map 1:1 exactly and the binding model is the perfect vehicle to demonstrate it.

I hope this blog post can serve as a definitive document on the insanity we need to go through in vkd3d-proton to make all of this work well. We have landed on a solution I feel is quite solid for AMD, but perhaps less so on other IHVs … A lot of credit here goes to Doitsujin who went through the insane task of rewriting the entire binding model in vkd3d-proton to the full TIER_3 binding model last year.

Earlier posts in the series:

The starting point of D3D12 – allow reuse of D3D11-era shaders

Before we can discuss D3D12, we must have a basic understanding of the D3D11 binding style we find in SM 5.0. From there, we will see how this was extended to the frankenstein monster that is SM 5.1 DXBC and how this was cribbed wholesale in SM 6.0+.

Binding resources to slots

The binding model in D3D11 was extremely simple, simple enough that we have no problems implementing that in Vulkan 1.0 as-is. Essentially, you can bind a fixed amount of:

  • SRV (t#): Generic read-only resources, sampled images, read-only SSBO and uniform texel buffers
  • UAV (u#): Generic read-write resources, storage images, read-write SSBO and storage texel buffers
  • Samplers (s#): Separate samplers, same as Vulkan
  • CBV (b#): Constant buffer views, i.e. UBOs in Vulkan

Each resource in the HLSL declares which register it binds to, easy enough! This maps very easily to Vulkan. The main difference is that the bindings are per stage in D3D11, but these details are easy to work around. Another awkward detail of D3D11 is that SRVs and UAVs are catch alls for any kind of read-only or read-write resource, even if they supports many different descriptor types. Sampled images, raw buffers and typed buffers are quite different things! Of course, with a slot based binding model, the implementation can resolve these details easily since we know all accessed resources up front. No bindless hell to deal with.

Now, we imagine that game developers are sitting on a large repository of D3D11-era shaders and D3D12 rolls around. Very few were going to risk going D3D12 exclusive any time soon, so it had to be possible to ship the exact same DXBC shaders to target both APIs. Even to this day, most new D3D12 games ship DXBC to target both APIs! (Maintaining two completely different shader backends is soooo much fun and productive use of time …) The center of gravity is slowly shifting towards DXIL, and it’ll probably take a few more years before DXIL is the main bytecode games ship. I still haven’t seen a single Unreal Engine title shipping DXIL-exclusive yet for example.

One key aspect of the D3D11 model is that the shaders say nothing about how data is accessed. Especially buffer data can be implemented in many different ways, and it would have been up to the D3D11 driver to figure out how to implement constant buffers efficiently for example. Of course, with the highly managed model of that API, the driver had ample opportunity to do so, at the cost of CPU overhead. With explicit APIs like D3D12 and Vulkan, there is far less room for drivers to optimize because the application developer is given certain freedoms, but they are expected to do some work in return. In reality, the IHV will probably add hacks behind your back if your performance is important enough to them, but that’s another story.

With buffer data like constant buffers for example, there are three primary indirection levels where we can trade off speed vs. space:

  • Preloaded registers: The absolute fastest mode of execution. If a constant buffer is tiny (a few u32s), it might as well fit entirely inside the register bank of the shader core, no memory operation required in the shader!
  • Store a pointer in preloaded registers: The second fastest mode. We can take one indirection to be able to access a larger chunk of data with a simple load instruction from pointer.
  • Access a descriptor: The “slowest” mode, with two indirections. First, we need to load a descriptor, then, we load data based on that descriptor. We also get robustness guarantees, which is another special kind of hell to handle.

As we see here, these three styles of buffer access have different trade-offs. We can get more efficient access the smaller data we need to consider. A few u32s of data? Preload registers. A few KiB? Store a pointer to it in registers. Many different buffers with different base pointers? Descriptor is the way to go.

The awkward part of this simplified model is that on certain hardware, using descriptors might be the fast path, so we’ll have to be careful. IHVs have optimized these paths to death over the decades.

The critical part for D3D12 here is that there is no way to express these concerns in the HLSL shader itself, whereas we certainly can in GLSL / SPIR-V!

A “virtual” vs. “physical” binding model

In Vulkan, we can express concerns like these in the SPIR-V for the most part.

  • Preloaded registers: Push constants is designed to map directly to this scheme. Some drivers might also be able to promote INLINE_UNIFORM_BUFFER to registers, especially if the descriptor set is not UPDATE_AFTER_BIND.
  • Store a pointer in preloaded registers: In Vulkan, we can place buffer device addresses in a push constant block and load directly from that. This is somewhat esoteric, but we use this a lot in vkd3d-proton. INLINE_UNIFORM_BUFFER also maps to this scheme. If we consider a descriptor set to be a pointer, inline UBO data can be placed directly in that descriptor memory, and we get one indirection. Implementation details between IHVs tends to be wildly different for inline UBO.
  • Access a descriptor: Just normal descriptor sets. If the descriptor set is not UPDATE_AFTER_BIND, a driver might be able to promote a buffer to a pointer in preloaded registers. This model has two indirections. One to load a descriptor, one to load the data from that descriptor.

Since we’re able to express ourselves fairly explicitly in Vulkan in the shader itself, I’ll call this a “physical” binding model, or at least a close enough approximation. A true physical binding model would be one where descriptors are accessed through raw pointers, but good luck convincing N IHVs to agree on how that should work 🙂 Inline UBO is the odd exception to this rule where the pipeline layout specifies access patterns which are not expressed in SPIR-V.

One fair criticism of Vulkan’s model is that there are a lot of ways to do effectively the same thing. Should you use push constants, inline UBO, UBO, UBO with dynamic offset, push constant buffer device address, push descriptors or normal pool descriptors? Hard to say without lots of experience and profiling. At least we have the tools available in the API, even if it’s not always obvious what the optimal choice is. In vkd3d-proton we use almost all of these, depending on the context.

To lament the state of graphics programming a bit, loading constant data optimally is still an unsolved problem in graphics, but we have accelerated ray traced global illumination at least, so yay? 🙂 To be fair, it’s hard to go wrong with plain old Vulkan 1.0 UBOs. Any perf gains beyond that tends to be minor and highly situational.

The abstract nature of D3D12 binding model

D3D12’s model is far more abstract, but at the same time, it maps to extremely specific restrictions on an implementation. It’s a weird model where it feels like an abstraction, but it actually isn’t. We can see right through it. We’ll explore this point in the root signature section …

Shader model 5.1 – I wanna bindless too!

5.1 is a weird version of DXBC. It changes some critical things:

  • Resources can be declared as arrays with dynamic indexing. Unbounded array size is also supported. Bindless, baby!
  • register() bindings now take an optional space parameter. This seems similar to DescriptorSet in SPIR-V, but it actually isn’t. It’s meaningless on its own. It is necessary however when using a lot of unbounded array size declarations. In DXBC, all t# or u# registers for example would be exhausted by unbounded array sizes, but by using different register spaces we can declare multiple unbounded arrays. When referring to a resource, we need to consider its binding, space, shader stage and type (SRV, UAV, CBV vs Sampler).

5.1 is kinda obsolete now that DXIL (6.0) has been around for a while, but it is still shipped in various titles.

The root signature

As mentioned earlier, HLSL shaders give no hints to the driver about how resources are actually accessed, so this information has to come from somewhere. In D3D12, we provide all this information in the root signature. The root signature is analogous to Vulkan’s VkPipelineLayout, but the details are completely different.

I think it’s helpful to understand D3D12 root signatures from a top-down view, and the point I would like to get across is that we should see the root signature as a very fancy push constant layout declaration.

The root signature defines an ABI for shaders. We lay down up to 256 bytes in memory, and then define how those 256 bytes can be used to access resources in various ways. The weird way root signatures work is that there’s a very explicit limit of a certain number of bytes, but we cannot actually access those bytes ourselves in a shader. The 256 bytes are organized in terms of 64 DWORDs. Each DWORD can be viewed in a certain way.

  • 32_BIT_CONSTANT: Consume a certain number of DWORDs and assign them to a CBV with specific register binding. That CBV’s data cannot be accessed with any dynamic array indices. This enforces the hidden implementation detail that the data can be mapped directly to registers. As an interesting tidbit, Vulkan push constants actually don’t have this restriction. In Vulkan, you can still dynamically index push constants as long as the access is dynamically uniform (yes, some hardware can dynamically index registers, it’s neat!)
  • Root descriptor: Consume 2 DWORDs to store a 64-bit GPU virtual address to any kind of raw buffer. CBV, UAV and SRV all work as long as they are untyped. The key thing to note here is that there is no room for a buffer size, so buffer robustness is completely disabled for root descriptors. All of this is just a very roundabout way to say that we store a pointer and access it as a pointer. This is basically a very restricted form of VK_KHR_buffer_device_address which is hidden from the developer. The driver is of course free to turn this into something like a Vulkan push descriptor if it so chooses. In vkd3d-proton, we’d love if games actually made use of root CBVs, but oh well 🙁 Some engines got the memo though 🙂
  • Table pointer: The cost is 1 DWORD, even though it should be 2 (we’ll get to this later, it gets confusing!). This is where the D3D12 binding model starts to get esoteric and diverge completely from Vulkan. We’ll need a separate section to cover this.

The global descriptor heaps and table pointers

Descriptors are managed in a completely different way than Vulkan and this is the point that gives us a pretty big headache.

In D3D12, you’re expected to allocate one heap which contains N descriptors. N goes up to 1,000,000 in the highest binding tier. All descriptors you access in a shader must exist inside this heap, and the heap must be explicitly bound to a command list before use. For a given command list, you’re not really expected to change it, but you can suballocate, and 1 million descriptors should be enough unless we take the Nanite meme far enough where we need to render one material per pixel. Samplers have their own heap, and there’s a limit of a few thousand here.

From the global heap, you’re able to bind sub-ranges of that heap. These sub-ranges are represented as the table pointers in the root signature. Essentially, a table pointer encodes a u32 offset into the global heap – even if you give the API a GPU VA – which is why it costs 1 DWORD instead of 2 (for now …).

Apparently, this idea is a workaround for certain GPUs. There exists a literal “descriptor palette” in some hardware and we can only read descriptors from this “palette” essentially. D3D12 seems to cater to that model, at the cost of flexibility. You’re asked as a developer to own that palette / heap and allocate memory on top of that.

View objects work completely differently as well. In D3D12, a view object is created directly into a descriptor heap, rather than having separate VkImageView and VkBufferView objects. Copying descriptors is a critical part of D3D12 descriptor management. It’s possible to create non-shader visible descriptor heaps, which is basically just a fancy malloc(). Here, it’s intended that we can stream out descriptors to the shader-visible descriptor heap by calling CopyDescriptors over and over, often over 10000 times per frame! Making this path efficient enough has been a ton of work and has placed some serious restrictions on our implementation. There is only one way to do it “fast” and correct, it just took 3-4 iterations to get there.

In Vulkan, the approach is quite different. We create view objects as standalone objects, which are then written into descriptor sets with vkUpdateDescriptorSets. If we squint enough, the Vulkan view objects represent the non-shader visible descriptor heaps, and vkUpdateDescriptorSets is closer to CopyDescriptors (non-shader visible → shader visible) than Create*View. Profiling native drivers shows this clearly. Create*View() is ~50-100x slower than CopyDescriptors, similar to Vulkan if you look at vkCreate*View() vs. vkUpdateDescriptorSets.

In the Vulkan model, there is no concept of a “descriptor heap” which has to be bound, however, we can figure out what is really going on under the hood. We can allocate a lot of view objects in Vulkan, but eventually, we will actually hit out-of-memory conditions on some hardware. This is a good indication that we have actually exhausted the internal descriptor “palette”. The views merely contain references to that global palette instead.

Another headache is that view objects in D3D12 do not have a lifetime or their own. The lifetime is tied to the descriptor heap itself, as a descriptor is essentially treated as plain old data. This is not the case with Vulkan. The unfortunate side effect of this scenario is that we need to maintain a per ID3D12Resource hash-map in vkd3d-proton where we keep view objects alive until we know for sure that we can destroy the views, i.e. at resource destruction time. The older implementation in vkd3d-proton used reference counted views, but as you can expect, when games copy 10000+ descriptors per frame in many threads concurrently, the overhead was unusably large. The shit hit the fan so to say in Death Stranding where we spent >80% of CPU time copying descriptors, not a good look. The only solution was to use VkCopyDescriptors functionality of vkUpdateDescriptorSets, which is considered quite esoteric in Vulkan.

So how do shaders access the global descriptor heap? Well, through a lot of fixed function jank, that’s how! This jank is more or less fixed in SM 6.6 as we’ll get to later, but now we just have yet another API to implement, support and test, sigh …

First, we start with the table parameter that is pushed to the GPU through the command list API. This represents an offset into the descriptor heap. A table entry has a certain number of descriptor ranges associated with it. These ranges specify that a certain subset of descriptors will be consumed as descriptors. For example, consider a Texture2D[] bound at register(t10, space4). A table entry in the root signature might say “Parameter #5 has an SRV range which begins at space #4, register #8, contains unbounded number of descriptors, with a constant offset of 15 descriptors from the table entry.” When accessing this resource, we should access the descriptor heap at descriptor #tableEntry5 + (10 – 8) + 15 + dynamicIndex in shader. As we see, the registers and spaces in the HLSL have no physical meaning until the root signature gives it meaning.

Codegen to SPIR-V

We can map the entire root signature into two parts, the root parameters and immutable sampler declaration.

For example, take this shader:

cbuffer CPush : register(b0) { float4 cpush; };
cbuffer C : register(b1) { float4 c; };
cbuffer C2 : register(b2) { float4 c2; };

Texture2D<float4> T[] : register(t40);
SamplerState S : register(s30);

float4 main(uint index : INDEX, float uv : UV) : SV_Target
{
	return T[NonUniformResourceIndex(index)].Sample(S, uv) +
		cpush + c + c2;
}

A typical codegen here would be something like (via SPIRV-Cross):

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct AddCarry
{
    uint _m0;
    uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerFloat4NonWrite;
layout(buffer_reference, std430) readonly buffer PhysicalPointerFloat4NonWrite
{
    vec4 value;
};

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
    vec4 _m0[4096];
} _21[];

// The root parameters map to push constants.
layout(push_constant, std430) uniform RootConstants
{
    uvec2 _m0; // Root CBV for b1
    uint _m1;  // Various table offset root parameters
    uint _m2;  // ...
    ...
    uint _m9;  // Root constants
    uint _m10;
    uint _m11;
    uint _m12;
} registers;

layout(set = 0, binding = 0) uniform texture2D _14[];
layout(set = 2, binding = 0) uniform sampler _25[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in float UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    uint _46 = registers._m6 + 2u;

    // Textures can only be accessed through heap.
    // Here we see address computation being based on all three inputs.
    // Push constant + constant + dynamic index
    // All of this is wrapped in a nonuniform thing for extra spice.
    vec4 _66 = texture(nonuniformEXT(sampler2D(_14[registers._m1 + (INDEX + 40u)], _25[registers._m3 + 30u])), vec2(UV));

    // Root constants, directly copy from push constants.
    // The cbufferLoad instruction gives us constant offsets, so we can
    // directly access push constant members here.
    vec4 _84 = uintBitsToFloat(uvec4(
        registers._m8, registers._m9, registers._m10, registers._m11));

    // Root CBV, here implemented with buffer device address,
    // kinda ugly SPIR-V, but gotta go fast.
    // This works great on AMD, not so much on NVIDIA … but we can
    // Use a normal plain UBO instead if we want.
    // Ah, the luxury of giving the implementation a choice … :)
    
    AddCarry _98;
    _98._m0 = uaddCarry(registers._m0.x, 0u * 16u, _98._m1);
    PhysicalPointerFloat4NonWrite _105 = PhysicalPointerFloat4NonWrite(uvec2(_98._m0, registers._m0.y + _98._m1));

    // Bindless UBO action, table CBV here we go …
    // Not a great look for performance on some GPUs.
    // On Pascal in particular, we have to emit bindless SSBO instead, rip perf.
    // Not much we can do sadly.
    SV_Target.x = ((_84.x + _66.x) + _105.value.x) + _21[_46]._m0[0u].x;
    SV_Target.y = ((_84.y + _66.y) + _105.value.y) + _21[_46]._m0[0u].y;
    SV_Target.z = ((_84.z + _66.z) + _105.value.z) + _21[_46]._m0[0u].z;
    SV_Target.w = ((_84.w + _66.w) + _105.value.w) + _21[_46]._m0[0u].w;
}

 

It’s common to think of the Table parameter as equivalents to Vulkan descriptor sets, but that is misleading. The correct way to think about it is that D3D12 has one huge (technically two for Sampler heap) descriptor set, and for convenience you can push offsets into that heap as desired. In Vulkan, descriptor sets can be allocated and bound in complete isolation. Shader model 6.6 removes all of this pretense and shows it how it really is. Everything is just an index to the heap, and the pre-6.6 world tried to hide this fact.

My main problems with the binding model

Very loose coupling between root signature and table descriptor access

The only thing the root signature does is to define how offsets into the heap is computed. There is no knowledge about the array-ed-ness of a resource, nor if every resource is even valid at record time (exception here is STATIC descriptors, but alas, extremely rare in the wild). Unfortunately, this loose nature of root signature and shader matching complicates descriptor hoisting, and when you complicate hoisting, you better pray the target hardware has pristine bindless support (pro-tip, they don’t).

A root signature could say that “we have 100 CBVs in this range”, but shaders might just happen to use a few CBVs from that range. Doing descriptor work per shader is pretty gross and goes against the spirit of the modern graphics APIs. Aggressive hoisting and repacking is CPU overhead that we shouldn’t consider except in exceptional circumstances. A native driver might have a much easier time dealing with these things since they can modify their own command streams at the last minute if they want. In D3D12, SIMULTANEOUS_ACCESS command lists don’t exist, so the driver has this option. Perhaps this restriction was put in place precisely for these scenarios? Who knows, we can only armchair these things!

The D3D12 docs also encourage developers to use somewhat generic root signatures and rarely change them, so a root signature is likely going to be quite “fat” compared to an ideal Vulkan pipeline layout, making useful workarounds even more difficult …

Awkward and unnecessary aliasing

In a D3D12 descriptor heap, there is no descriptor type. Descriptor #10 can be an SRV and descriptor #11 can be a CBV, it’s all a few bits between friends. Surely, games are not going to commit the grave sin of placing the wrong descriptor type in the wrong index, right? … Right?! Of course they do. Debugging insanity like this was the prelude to vkd3d-proton’s fast descriptor QA checking mechanism. It’s actually quite fast. In AAA games, we have code like this running at 40 FPS usually. More than good enough I’d say when hunting for bugs.

void descriptor_qa_report_fault(uint fault_type, uint heap_offset, uint cookie, uint heap_index, uint descriptor_type, uint actual_descriptor_type, uint instruction)
{
    uint _63 = atomicAdd(QAGlobalData.fault_atomic, 1u);
    if (_63 == 0u)
    {
        QAGlobalData.failed_cookie = cookie;
        QAGlobalData.failed_offset = heap_offset;
        QAGlobalData.failed_heap = heap_index;
        QAGlobalData.failed_descriptor_type_mask = descriptor_type;
        QAGlobalData.actual_descriptor_type_mask = actual_descriptor_type;
        QAGlobalData.failed_instruction = instruction;
        QAGlobalData.failed_shader_hash = uvec2(291u, 0u);
        memoryBarrierBuffer();
        QAGlobalData.fault_type = fault_type;
    }
}

uint descriptor_qa_check(uint heap_offset, uint descriptor_type_mask, uint instruction)
{
    uint _98 = QAHeapData.descriptor_count;
    uint _100 = QAHeapData.heap_index;
    uvec2 _102 = QAHeapData.cookies_descriptor_info[heap_offset];
    uint _110 = QAGlobalData.live_status_table[_102.x >> 5u];
    uint _121 = (uint(heap_offset >= _98) | (((_102.y & descriptor_type_mask) == descriptor_type_mask) ? 0u : 2u)) | (((_110 & (1u << (_102.x & 31u))) != 0u) ? 0u : 4u);
    if (_121 != 0u)
    {
        descriptor_qa_report_fault(_121, heap_offset, _102.x, _100, descriptor_type_mask, _102.y, instruction);
        return _98;
    }
    return heap_offset;
}

void main()
{
    uint _45 = descriptor_qa_check(registers._m1 + (INDEX + 40u), 1u, 1u);
    vec4 _132 = texture(nonuniformEXT(sampler2D(_14[_45], _18[registers._m3 + 30u])), vec2(UV));
    SV_Target.x = _132.x;
    SV_Target.y = _132.y;
    SV_Target.z = _132.z;
    SV_Target.w = _132.w;
}

 

With this scheme we can check for:

  • Is the descriptor type correct?
  • Is the resource destroyed already?
  • Did we index the heap out of bounds?

If we hit failures we get a neat atomic cookie in host memory we can sample at regular intervals on CPU. Works surprisingly well and tailor made for our needs, which is likely why it’s so fast compared to generic alternatives. We also only report the first failure. With this mode enabled, we allocate N + 1 descriptors, and reserve the last descriptor as a NULL descriptor which we can OpSelect on failure to avoid GPU hangs. (well, if someone did a faulty atomic compswap loop on a NULL descriptor I guess we’d timeout, but whatever, not much we can do about that).

Forcing the addition of VK_VALVE_mutable_descriptor_type

The combination of weirdly aliased descriptor types and full bindless is pretty deadly since Vulkan has typed descriptors, for good reason! This unfortunate mismatch forced us to create many descriptor sets, one per type, which looks very dumb. It consumes enough memory that we actually ran into serious performance issues on AMD hardware in the beginning. To fix all the immediate issues, we developed VK_VALVE_mutable_descriptor_type, which is kinda gross, but it lets us implement the D3D12 weird aliased descriptors directly, and all the problems went away. It’s a pretty goofy model though and not something Vulkan developers should be using directly. Not all descriptors have the same size in hardware, so there is gaps between descriptors, which is what I’m pretty sure is the case for native D3D12 as well, rip K$ … The more natural model would have been to have descriptor heaps where equally sized objects were allocated in different heaps, but we have to deal with whatever problems D3D12 throws our way and make it work …

Another “nice” side effect of using mutable was that some GPU hangs went away. If the game used the wrong descriptor type, it seemed to at least not read a descriptor that pointed to already freed memory, but rather just a descriptor of wrong type. Somehow, this helped certain games to run around the time the extension was released. It is deeply disturbing that games can ship in this state. :\

A game engine targeting Vulkan can be smarter about all these things. For example, having 1M sampled images seems reasonable, as that’s the number one thing you’d use bindless for, but do you need 1M CBVs, fully bindless? I doubt it. I’m far more in favor of a model where normal, packable, hoistable descriptors are used by default, and bindless is only tapped into as needed.

Enforces a fully bindless implementation

When implementing this binding model in other APIs, we’re forced to make everything fully bindless. Everything is accessed in terms of the global descriptor heap, and even if 99% of shaders use a very traditional D3D11-style binding model, we cannot do anything smart here because of one critical design flaw of D3D12 …

The great VOLATILE mistake of Root Signature 1.0

Root Signature 1.0 really doubles down on forcing full bindless everywhere, even when shaders don’t actually need it. All descriptors are considered VOLATILE, which means that we require all the Vulkan flags that are opt-in:

  • UPDATE_AFTER_BIND
  • UPDATE_WHILE_PENDING
  • PARTIALLY_BOUND
  • VARIABLE_COUNT

This fact completely blocks us from hoisting descriptors on the fly. Even if the shader is simple enough, we’re royally screwed because there is no guarantee that descriptors are actually valid at the time of command recording. Almost all games still cling to Root Signature 1.0, and probably rely on driver heroics to work around any performance implication. Table CBV on Pascal GPUs for example is disgustingly slow in vkd3d-proton. In Vulkan, these GPUs don’t even support UPDATE_AFTER_BIND UBOs! I can only wonder what utter depravities the D3D12 driver engineers had to do to make this work well … Hoist descriptors late with device generated commands based on the PSO? Bleh. The API forces us to implement these as bindless SSBO, and I die a little inside every time I have to think about this. Root CBV works perfectly fine, yet games insist on using table CBVs with VOLATILE just because …

On AMD, everything is fully bindless anyways, as descriptors are just memory, but it’s unfortunate that the D3D12 API went VOLATILE by default. It really should have been opt-in. It’s not like the poor Pascal souls can actually buy new GPUs these days even if they wanted to 🙂

The second mistake of Root Signature 1.1 – STATIC

The mistake of VOLATILE-by-default was recognized by the time 1.1 rolled around, and STATIC was now made the default. STATIC is the nicest mode possible for drivers, since it fully allows hoisting of descriptors, but you actually lose robustness guarantees (!). Most big game engines probably looked at this for 2 seconds and noped out.

D3D in general has gone to extreme lengths to ensure robustness guarantees, and engines of course rely on every esoteric OOB scenario to not fail. This is hopefully not by design, but engine bugs get masked for years if they never cause issues in practice. Losing that guarantee means no one is going to risk random GPU crashes in the wild for just a potential performance increase.

The actual fix that no game ever uses – STATIC + BOUNDS_CHECKED

The actual fix is fairly obscure. At some point a more reasonable mode was added which preserves bounds checking. Of course, no game I know of actually uses this. vkd3d-proton can take advantage of STATIC / BOUNDS_CHECKED CBVs in some cases where we will hoist them to push descriptors for significant NV perf gains, but alas, when no games use the APIs like intended, it might as well not exist …

Immutable sampler jank

The docs imply that the driver is supposed to implement this with an internal hashmap of sampler objects. Oddly enough, these sampler objects live outside the normal sampler heap you bind to the command list, and in fact, you don’t have to bind anything to use immutable samplers, it’s intended to be implemented through driver magic.

This magic is certainly odd, but in Vulkan we have to allocate and bind immutable samplers to a command buffer. This is because very little hardware actually supports true immutable samplers in the sense that they are embedded in the shader code itself. While we still have to allocate descriptor sets, at least we don’t have to actually write the immutable samplers to the descriptor set.

On AMD, the descriptors are read directly from scalar registers, so there’s nothing stopping a motivated compiler from emitting a bunch of immediate register moves and go to town, but sadly, most hardware don’t work like this, and I don’t think drivers even take advantage of this possibility yet.

My big question is how this is even supposed to work on D3D12 if the samplers have to live somewhere in memory. If the shader is simply storing constant offsets into some hidden sampler heap, then surely those constants are not stable across different runs, and thus shader caching fails. Patching in new constants seems pretty gross, but I can only speculate … Either way, in vkd3d-proton we end up creating a VkDescriptorSet per root signature which holds all immutable samplers for a given root signature. This is automatically bound when flushing dirty state in a draw or dispatch call.

Painful and awkward raw buffer types

In Vulkan we are blessed with the SSBO, the flexible and versatile buffer. D3D12 is a bit more weird and has extremely specific alignment and robustness guarantees which don’t map cleanly to anything.

First, we have ByteAddressBuffer.

  • Alignment requirement of buffer binding, 16 bytes (good)
  • Load1,2,3,4 variants, with 4 byte alignment requirement (2 for 16-bit). SM 6.2 improved this a fair bit, with a templated type, but for us in DXIL land, very little changes.
  • Robustness is checked per-component (why ._.)
  • Loads and stores are done at specific byte offsets, which is super ugly, but mostly for shader authors

The only natural way to implement this is a uint[] array, where we unroll loads and store per-component. Fun times. I would be surprised if hardware actually supported vectorized load-store with per-component robustness at scalar alignment …

The second style is StructuredBuffer which represents a buffer as a T data[] array. One deeply frustrating aspect of this is that data structures are tightly packed, meaning T could be float3, and we’d get 12 byte stride. StructuredBuffers must be bound in terms of elements, not byte offsets, so it’s perfectly valid to bind a structured buffer at 12 byte offset. This cannot be implemented safely on GPUs which report 16 byte SSBO alignment sadly. Due to all this, it’s almost impossible to cleanly declare a vec3 data[]; inside the SPIR-V, even if we have scalar block layout support. Just like ByteAddressBuffer, we kinda have to unroll loads and stores with uint data[] again, shame … We have gone to great lengths to support an “offset” buffer, which means we can bind an SSBO at 16 byte alignment, and nudge the offset inside the shader to fix things up, but it is horrible.

There are some mind-melting optimizations in flight for dxil-spirv which can vectorize these things, but it’s not a perfect solution. Weird robustness and alignment rules forces us to do some pretty horrible things, and we only have a few escape hatches which we can exploit, and dxil-spirv will try hard to squeeze every last legal vectorization opportunity out of the existing rules.

For example, StructuredBuffers have a rule saying that robustness only needs to be checked once, based on the structure element that is being accessed. For dynamic access into the element itself, it is undefined behavior to straddle the element boundary. Think StructuredBuffer<float4> where you do a dynamic access into the components, and index overflows.

I’m wondering if hardware actually has a special descriptor type they use to implement structured buffers. Perhaps the same ones which are used for vertex attribute fetch? And yes, even AMD uses tbuffer with stride pulled from descriptors, so saying it’s fully programmable fetch is a bit misleading I think.

Texel buffers are not a valid solution either. They can be bound at scalar alignment, which is nice, but we cannot safely use vectorized load-store either. Texel buffers can only be read and written to fully. No write masks allowed. Another problem is mix and match of 16-bit load-store which completely broke any hope of using texel buffers. In Vulkan, we can redeclare an SSBO multiple times with different data layouts, which is basically a human readable version of ByteAddressBuffer.

It’s a bit frustrating to deal with these problems, because almost no sane application is going to run into the edge cases, or need all the esoteric robustness features, but we have to make every possible edge case work. We cannot take shortcuts. To demonstrate the insanity we have to go through on certain GPUs …

StructuredBuffer<float3> B;

float3 main(uint index : INDEX) : SV_Target
{
	return B[index];
}

Since this can be bound at offset 12, we might see this horrible code. The offset buffer is associated with the descriptor heap, and we allocate a separate side channel buffer for this purpose.

layout(set = 1, binding = 0, std430) restrict readonly buffer SSBO_Offsets
{
    uvec2 _m0[];
} _13;

layout(set = 1, binding = 1, std430) restrict readonly buffer SSBO
{
    uint _m0[];
} _18[];

layout(push_constant, std430) uniform RootConstants
{
   // root parameters
} registers;

layout(location = 0) flat in uint INDEX;
layout(location = 0) out vec3 SV_Target;

void main()
{
    // Load extra offset + actual range. We cannot bind the SSBO tightly enough
    // on some GPUs.
    // Since the access is not nonuniform, we can give a strong hint to compiler
    // that the offsets can be loaded as a broadcast.
    uvec2 _37 = _13._m0[subgroupBroadcastFirst(registers._m1)] >> uvec2(2u);
    uint _41 = INDEX * 3u;
    // Trip explicit OOB if we fail.
    uint _47 = (_41 < _37.y) ? (_41 + _37.x) : 1073741820u;
    // Unroll 3 loads. ;_;
    vec3 _60 = uintBitsToFloat(uvec3(_18[registers._m1]._m0[_47], _18[registers._m1]._m0[_47 + 1u], _18[registers._m1]._m0[_47 + 2u]));
    SV_Target.x = _60.x;
    SV_Target.y = _60.y;
    SV_Target.z = _60.z;
}

The horrible workarounds

We have a fair amount of issues to deal with. A lot of invalid API behavior in D3D12 happens to work on native drivers, and many games ship with subtle bugs which force us to do horrible things. IHVs are going to have a fun time in 5 years when hardware details change 🙂

What is a raw buffer anyways?

A surprisingly common bug is that D3D12 does not complain if you create a typed buffer, but actually intended to create a raw buffer (ByteAddressBuffer) and vice versa. The difference is just one flag away and is easily missed. For us, that means SSBO vs texel buffer, a catastrophic error, but apparently, D3D12 does not care and drivers will happily consume typed buffers as raw and vice versa. I have no idea how this works, but somehow it does (I wrote tests)! Either way, it got bad enough that when emitting buffer descriptors, we have to emit both an SSBO variant and texel buffer variant and then the shader can pick the correct descriptor it wants. Of course, this being bindless and decoupled, we cannot pick the correct descriptor at draw time, sigh … Sometimes I think bindless all the things is going to be considered a huge mistake in 10 years from a software ecosystem point of view.

What is a NULL descriptor anyways?

A reasonably common bug is the NULL descriptor with wrong type. Technically in D3D12, NULL descriptors are still typed and you have to pass in a valid resource desc when creating one, but mistakes happen all the time. Using the wrong type works perfectly fine on existing hardware in almost all cases (I wrote a test!). For this reason, vkd3d-proton takes the conservative approach, and writing a NULL descriptor means splatting out NULL descriptors to all the different types of descriptors we have in the heap. For MUTABLE, this means two sets, for non-mutable, six! Voila, various GPU hangs disappeared, just like that.

VOLATILE is really bad for validation

Having simple validation is critical for development, and I think the real hidden mistake of VOLATILE being default, is that only GPU-assisted validation can catch any bugs! GPU validation is very heavy and usually avoided when possible. I feel a ton of these bugs could have been caught if descriptors could have been validated on CPU timeline. As long as an invalid descriptor is not dynamically accessed by a shader, it’s all OK! Dynamically accessed is the critical wording here. With this requirement, we can only validate on submit time if we can prove ahead of time that a shader will generate invocations which then statically access an invalid descriptor. Good luck proving that fragment shader invocations happen 100% of the time! Don’t get me wrong, bindless is very powerful and great for the use cases it enables, but derping everything into bindless doesn’t seem like the correct approach to me.

Local root signatures – one great kludge to rule them all

Local root signatures add yet another layer of hell to the binding model. It was annoying enough to implement a dozen different ways to access resources, and now we have do it all over again, but this time, replacing push constants with SBTs + some additional edge cases that are specific to local root signatures!

Local root signatures are used specifically for the shader record tables in ray tracing pipelines. In Vulkan, this data is accessed directly as a buffer, but D3D12 went the way of describing a view instead where each parameter is laid out in memory one after the other. As usual, table pointers is a thing, but this time, they consume 2 DWORDs instead of 1. Why on earth was table pointers described in terms of GPU VA and not just a simple u32 “offset into heap”? Sometimes I find D3D12’s obsession with GPU VA-s quite bizarre and counter-productive. It doesn’t even have buffer device address support, yet we’re using them everywhere, even when it doesn’t make much sense at all. In vkd3d-proton, we had to do the janky thing of returning artificial GPU VA-s for descriptor heaps so that it was fast to convert the VA into an offset.

While root descriptors in global root signatures can be implemented as descriptors, we are forced to use buffer device addresses since the VA is literally sourced from buffer memory. The fun thing is that you can even place RT acceleration structures here as a literal VA! And this is how OpConvertUToAccelerationStructureKHR was born … 🙂

Local root signatures also add the headache of also supporting immutable samplers! This time, we have to defer the creation of a second immutable sampler descriptor set to RTPSO creation, which is super annoying. At bind time, we therefore might run the risk of having to bind one immutable sampler set for the global root signature, then one which is tied to the union of all local root signatures for the RTPSO, so much fun! >_<

Implementing SM 6.6 bindless

SM 6.6 is actually straight forward. Our model in vkd3d-proton was always to map pre-6.6 style bindings to a SM 6.6 style “access the heap directly” model, so this translation is actually not that bad at all. The main annoyance is how there are now 5 (!) opcodes in DXIL to deal with resource handle creation, each with their own idiosyncrasies. I’m pretty sure all of this could have been squeezed into 2 …

  • CreateHandle
  • CreateHandleForLib
  • CreateHandleFromHeap
  • CreateHandleFromBinding
  • AnnotateHandle

Fun times indeed …

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	Texture2D<float4> T = ResourceDescriptorHeap[index];
	return T.Load(int3(uv, 0));
}

#version 460
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_samplerless_texture_functions : require

layout(set = 0, binding = 0) uniform texture2D _9[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in vec2 UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _32 = texelFetch(_9[INDEX], ivec2(uvec2(uint(int(UV.x)), uint(int(UV.y)))), int(0u));
    SV_Target.x = _32.x;
    SV_Target.y = _32.y;
    SV_Target.z = _32.z;
    SV_Target.w = _32.w;
}

Nice. I recently completed SM 6.6 support for vkd3d-proton, have fun with that!

How this maps to AMD GPUs (teaser for next post)

If anything, it feels like the D3D12 binding model is tailored pretty well to AMD hardware. It probably makes sense given the Mantle heritage. The binding model on GCN is extremely neat and tidy and was designed with full bindless in mind since the early days, quite forward looking, indeed. Descriptors are just memory that is loaded into scalar registers and passed directly to various operations which take descriptors. Very little magic is going on here and once you understand how this work, the modern API binding models start making some sense. Root parameters and push constants map nicely to user SGPRs, and as long as we don’t use too many u32s, we have a very direct mental model of how this is going to map to hardware.

Where D3D12 suffers I think is on hardware where there are still remnants of legacy descriptor slots in hardware. We cannot take advantage of this due to poor design decisions made in D3D12, but it is what it is. There is only so much we can do as a translation layer.

We’ll explore this topic in a future post …

Conclusion

It has taken many, many months, if not over a year to get where we are today in vkd3d-proton. Handling all of this insanity pushed us to the breaking point, but somehow it works pretty well. There are no shortcuts we can take.

Despite the mammoth length of this post, I don’t think we’re quite done with the subject of binding models. I’ll probably need another post to cover codegen examples for AMD ISA as well as an overview of the million ways we can convert the same DXIL code to different SPIR-V depending on the root signature.