My personal hell of translating DXIL to SPIR-V – part 3

It’s time we tackle one of the big problems of DXIL, the binding model. The D3D12 binding model is completely foreign to most people who know the Vulkan binding model, and vice versa. I don’t think there are that many people in the world who can claim to fully grasp the binding models in both APIs. Translating every last detail of the D3D12 binding model to Vulkan is extremely painful, and I feel D3D12 made some critical design mistakes which bite us (and native drivers?) hard. Whenever I hear people naively claim D3D12 and Vulkan is basically the same API, I cringe hard and cry a little inside. Translating low level APIs is hellish when the details don’t map 1:1 exactly and the binding model is the perfect vehicle to demonstrate it.

I hope this blog post can serve as a definitive document on the insanity we need to go through in vkd3d-proton to make all of this work well. We have landed on a solution I feel is quite solid for AMD, but perhaps less so on other IHVs … A lot of credit here goes to Doitsujin who went through the insane task of rewriting the entire binding model in vkd3d-proton to the full TIER_3 binding model last year.

Earlier posts in the series:

The starting point of D3D12 – allow reuse of D3D11-era shaders

Before we can discuss D3D12, we must have a basic understanding of the D3D11 binding style we find in SM 5.0. From there, we will see how this was extended to the frankenstein monster that is SM 5.1 DXBC and how this was cribbed wholesale in SM 6.0+.

Binding resources to slots

The binding model in D3D11 was extremely simple, simple enough that we have no problems implementing that in Vulkan 1.0 as-is. Essentially, you can bind a fixed amount of:

  • SRV (t#): Generic read-only resources, sampled images, read-only SSBO and uniform texel buffers
  • UAV (u#): Generic read-write resources, storage images, read-write SSBO and storage texel buffers
  • Samplers (s#): Separate samplers, same as Vulkan
  • CBV (b#): Constant buffer views, i.e. UBOs in Vulkan

Each resource in the HLSL declares which register it binds to, easy enough! This maps very easily to Vulkan. The main difference is that the bindings are per stage in D3D11, but these details are easy to work around. Another awkward detail of D3D11 is that SRVs and UAVs are catch alls for any kind of read-only or read-write resource, even if they supports many different descriptor types. Sampled images, raw buffers and typed buffers are quite different things! Of course, with a slot based binding model, the implementation can resolve these details easily since we know all accessed resources up front. No bindless hell to deal with.

Now, we imagine that game developers are sitting on a large repository of D3D11-era shaders and D3D12 rolls around. Very few were going to risk going D3D12 exclusive any time soon, so it had to be possible to ship the exact same DXBC shaders to target both APIs. Even to this day, most new D3D12 games ship DXBC to target both APIs! (Maintaining two completely different shader backends is soooo much fun and productive use of time …) The center of gravity is slowly shifting towards DXIL, and it’ll probably take a few more years before DXIL is the main bytecode games ship. I still haven’t seen a single Unreal Engine title shipping DXIL-exclusive yet for example.

One key aspect of the D3D11 model is that the shaders say nothing about how data is accessed. Especially buffer data can be implemented in many different ways, and it would have been up to the D3D11 driver to figure out how to implement constant buffers efficiently for example. Of course, with the highly managed model of that API, the driver had ample opportunity to do so, at the cost of CPU overhead. With explicit APIs like D3D12 and Vulkan, there is far less room for drivers to optimize because the application developer is given certain freedoms, but they are expected to do some work in return. In reality, the IHV will probably add hacks behind your back if your performance is important enough to them, but that’s another story.

With buffer data like constant buffers for example, there are three primary indirection levels where we can trade off speed vs. space:

  • Preloaded registers: The absolute fastest mode of execution. If a constant buffer is tiny (a few u32s), it might as well fit entirely inside the register bank of the shader core, no memory operation required in the shader!
  • Store a pointer in preloaded registers: The second fastest mode. We can take one indirection to be able to access a larger chunk of data with a simple load instruction from pointer.
  • Access a descriptor: The “slowest” mode, with two indirections. First, we need to load a descriptor, then, we load data based on that descriptor. We also get robustness guarantees, which is another special kind of hell to handle.

As we see here, these three styles of buffer access have different trade-offs. We can get more efficient access the smaller data we need to consider. A few u32s of data? Preload registers. A few KiB? Store a pointer to it in registers. Many different buffers with different base pointers? Descriptor is the way to go.

The awkward part of this simplified model is that on certain hardware, using descriptors might be the fast path, so we’ll have to be careful. IHVs have optimized these paths to death over the decades.

The critical part for D3D12 here is that there is no way to express these concerns in the HLSL shader itself, whereas we certainly can in GLSL / SPIR-V!

A “virtual” vs. “physical” binding model

In Vulkan, we can express concerns like these in the SPIR-V for the most part.

  • Preloaded registers: Push constants is designed to map directly to this scheme. Some drivers might also be able to promote INLINE_UNIFORM_BUFFER to registers, especially if the descriptor set is not UPDATE_AFTER_BIND.
  • Store a pointer in preloaded registers: In Vulkan, we can place buffer device addresses in a push constant block and load directly from that. This is somewhat esoteric, but we use this a lot in vkd3d-proton. INLINE_UNIFORM_BUFFER also maps to this scheme. If we consider a descriptor set to be a pointer, inline UBO data can be placed directly in that descriptor memory, and we get one indirection. Implementation details between IHVs tends to be wildly different for inline UBO.
  • Access a descriptor: Just normal descriptor sets. If the descriptor set is not UPDATE_AFTER_BIND, a driver might be able to promote a buffer to a pointer in preloaded registers. This model has two indirections. One to load a descriptor, one to load the data from that descriptor.

Since we’re able to express ourselves fairly explicitly in Vulkan in the shader itself, I’ll call this a “physical” binding model, or at least a close enough approximation. A true physical binding model would be one where descriptors are accessed through raw pointers, but good luck convincing N IHVs to agree on how that should work 🙂 Inline UBO is the odd exception to this rule where the pipeline layout specifies access patterns which are not expressed in SPIR-V.

One fair criticism of Vulkan’s model is that there are a lot of ways to do effectively the same thing. Should you use push constants, inline UBO, UBO, UBO with dynamic offset, push constant buffer device address, push descriptors or normal pool descriptors? Hard to say without lots of experience and profiling. At least we have the tools available in the API, even if it’s not always obvious what the optimal choice is. In vkd3d-proton we use almost all of these, depending on the context.

To lament the state of graphics programming a bit, loading constant data optimally is still an unsolved problem in graphics, but we have accelerated ray traced global illumination at least, so yay? 🙂 To be fair, it’s hard to go wrong with plain old Vulkan 1.0 UBOs. Any perf gains beyond that tends to be minor and highly situational.

The abstract nature of D3D12 binding model

D3D12’s model is far more abstract, but at the same time, it maps to extremely specific restrictions on an implementation. It’s a weird model where it feels like an abstraction, but it actually isn’t. We can see right through it. We’ll explore this point in the root signature section …

Shader model 5.1 – I wanna bindless too!

5.1 is a weird version of DXBC. It changes some critical things:

  • Resources can be declared as arrays with dynamic indexing. Unbounded array size is also supported. Bindless, baby!
  • register() bindings now take an optional space parameter. This seems similar to DescriptorSet in SPIR-V, but it actually isn’t. It’s meaningless on its own. It is necessary however when using a lot of unbounded array size declarations. In DXBC, all t# or u# registers for example would be exhausted by unbounded array sizes, but by using different register spaces we can declare multiple unbounded arrays. When referring to a resource, we need to consider its binding, space, shader stage and type (SRV, UAV, CBV vs Sampler).

5.1 is kinda obsolete now that DXIL (6.0) has been around for a while, but it is still shipped in various titles.

The root signature

As mentioned earlier, HLSL shaders give no hints to the driver about how resources are actually accessed, so this information has to come from somewhere. In D3D12, we provide all this information in the root signature. The root signature is analogous to Vulkan’s VkPipelineLayout, but the details are completely different.

I think it’s helpful to understand D3D12 root signatures from a top-down view, and the point I would like to get across is that we should see the root signature as a very fancy push constant layout declaration.

The root signature defines an ABI for shaders. We lay down up to 256 bytes in memory, and then define how those 256 bytes can be used to access resources in various ways. The weird way root signatures work is that there’s a very explicit limit of a certain number of bytes, but we cannot actually access those bytes ourselves in a shader. The 256 bytes are organized in terms of 64 DWORDs. Each DWORD can be viewed in a certain way.

  • 32_BIT_CONSTANT: Consume a certain number of DWORDs and assign them to a CBV with specific register binding. That CBV’s data cannot be accessed with any dynamic array indices. This enforces the hidden implementation detail that the data can be mapped directly to registers. As an interesting tidbit, Vulkan push constants actually don’t have this restriction. In Vulkan, you can still dynamically index push constants as long as the access is dynamically uniform (yes, some hardware can dynamically index registers, it’s neat!)
  • Root descriptor: Consume 2 DWORDs to store a 64-bit GPU virtual address to any kind of raw buffer. CBV, UAV and SRV all work as long as they are untyped. The key thing to note here is that there is no room for a buffer size, so buffer robustness is completely disabled for root descriptors. All of this is just a very roundabout way to say that we store a pointer and access it as a pointer. This is basically a very restricted form of VK_KHR_buffer_device_address which is hidden from the developer. The driver is of course free to turn this into something like a Vulkan push descriptor if it so chooses. In vkd3d-proton, we’d love if games actually made use of root CBVs, but oh well 🙁 Some engines got the memo though 🙂
  • Table pointer: The cost is 1 DWORD, even though it should be 2 (we’ll get to this later, it gets confusing!). This is where the D3D12 binding model starts to get esoteric and diverge completely from Vulkan. We’ll need a separate section to cover this.

The global descriptor heaps and table pointers

Descriptors are managed in a completely different way than Vulkan and this is the point that gives us a pretty big headache.

In D3D12, you’re expected to allocate one heap which contains N descriptors. N goes up to 1,000,000 in the highest binding tier. All descriptors you access in a shader must exist inside this heap, and the heap must be explicitly bound to a command list before use. For a given command list, you’re not really expected to change it, but you can suballocate, and 1 million descriptors should be enough unless we take the Nanite meme far enough where we need to render one material per pixel. Samplers have their own heap, and there’s a limit of a few thousand here.

From the global heap, you’re able to bind sub-ranges of that heap. These sub-ranges are represented as the table pointers in the root signature. Essentially, a table pointer encodes a u32 offset into the global heap – even if you give the API a GPU VA – which is why it costs 1 DWORD instead of 2 (for now …).

Apparently, this idea is a workaround for certain GPUs. There exists a literal “descriptor palette” in some hardware and we can only read descriptors from this “palette” essentially. D3D12 seems to cater to that model, at the cost of flexibility. You’re asked as a developer to own that palette / heap and allocate memory on top of that.

View objects work completely differently as well. In D3D12, a view object is created directly into a descriptor heap, rather than having separate VkImageView and VkBufferView objects. Copying descriptors is a critical part of D3D12 descriptor management. It’s possible to create non-shader visible descriptor heaps, which is basically just a fancy malloc(). Here, it’s intended that we can stream out descriptors to the shader-visible descriptor heap by calling CopyDescriptors over and over, often over 10000 times per frame! Making this path efficient enough has been a ton of work and has placed some serious restrictions on our implementation. There is only one way to do it “fast” and correct, it just took 3-4 iterations to get there.

In Vulkan, the approach is quite different. We create view objects as standalone objects, which are then written into descriptor sets with vkUpdateDescriptorSets. If we squint enough, the Vulkan view objects represent the non-shader visible descriptor heaps, and vkUpdateDescriptorSets is closer to CopyDescriptors (non-shader visible → shader visible) than Create*View. Profiling native drivers shows this clearly. Create*View() is ~50-100x slower than CopyDescriptors, similar to Vulkan if you look at vkCreate*View() vs. vkUpdateDescriptorSets.

In the Vulkan model, there is no concept of a “descriptor heap” which has to be bound, however, we can figure out what is really going on under the hood. We can allocate a lot of view objects in Vulkan, but eventually, we will actually hit out-of-memory conditions on some hardware. This is a good indication that we have actually exhausted the internal descriptor “palette”. The views merely contain references to that global palette instead.

Another headache is that view objects in D3D12 do not have a lifetime or their own. The lifetime is tied to the descriptor heap itself, as a descriptor is essentially treated as plain old data. This is not the case with Vulkan. The unfortunate side effect of this scenario is that we need to maintain a per ID3D12Resource hash-map in vkd3d-proton where we keep view objects alive until we know for sure that we can destroy the views, i.e. at resource destruction time. The older implementation in vkd3d-proton used reference counted views, but as you can expect, when games copy 10000+ descriptors per frame in many threads concurrently, the overhead was unusably large. The shit hit the fan so to say in Death Stranding where we spent >80% of CPU time copying descriptors, not a good look. The only solution was to use VkCopyDescriptors functionality of vkUpdateDescriptorSets, which is considered quite esoteric in Vulkan.

So how do shaders access the global descriptor heap? Well, through a lot of fixed function jank, that’s how! This jank is more or less fixed in SM 6.6 as we’ll get to later, but now we just have yet another API to implement, support and test, sigh …

First, we start with the table parameter that is pushed to the GPU through the command list API. This represents an offset into the descriptor heap. A table entry has a certain number of descriptor ranges associated with it. These ranges specify that a certain subset of descriptors will be consumed as descriptors. For example, consider a Texture2D[] bound at register(t10, space4). A table entry in the root signature might say “Parameter #5 has an SRV range which begins at space #4, register #8, contains unbounded number of descriptors, with a constant offset of 15 descriptors from the table entry.” When accessing this resource, we should access the descriptor heap at descriptor #tableEntry5 + (10 – 8) + 15 + dynamicIndex in shader. As we see, the registers and spaces in the HLSL have no physical meaning until the root signature gives it meaning.

Codegen to SPIR-V

We can map the entire root signature into two parts, the root parameters and immutable sampler declaration.

For example, take this shader:

cbuffer CPush : register(b0) { float4 cpush; };
cbuffer C : register(b1) { float4 c; };
cbuffer C2 : register(b2) { float4 c2; };

Texture2D<float4> T[] : register(t40);
SamplerState S : register(s30);

float4 main(uint index : INDEX, float uv : UV) : SV_Target
{
	return T[NonUniformResourceIndex(index)].Sample(S, uv) +
		cpush + c + c2;
}

A typical codegen here would be something like (via SPIRV-Cross):

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct AddCarry
{
    uint _m0;
    uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerFloat4NonWrite;
layout(buffer_reference, std430) readonly buffer PhysicalPointerFloat4NonWrite
{
    vec4 value;
};

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
    vec4 _m0[4096];
} _21[];

// The root parameters map to push constants.
layout(push_constant, std430) uniform RootConstants
{
    uvec2 _m0; // Root CBV for b1
    uint _m1;  // Various table offset root parameters
    uint _m2;  // ...
    ...
    uint _m9;  // Root constants
    uint _m10;
    uint _m11;
    uint _m12;
} registers;

layout(set = 0, binding = 0) uniform texture2D _14[];
layout(set = 2, binding = 0) uniform sampler _25[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in float UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    uint _46 = registers._m6 + 2u;

    // Textures can only be accessed through heap.
    // Here we see address computation being based on all three inputs.
    // Push constant + constant + dynamic index
    // All of this is wrapped in a nonuniform thing for extra spice.
    vec4 _66 = texture(nonuniformEXT(sampler2D(_14[registers._m1 + (INDEX + 40u)], _25[registers._m3 + 30u])), vec2(UV));

    // Root constants, directly copy from push constants.
    // The cbufferLoad instruction gives us constant offsets, so we can
    // directly access push constant members here.
    vec4 _84 = uintBitsToFloat(uvec4(
        registers._m8, registers._m9, registers._m10, registers._m11));

    // Root CBV, here implemented with buffer device address,
    // kinda ugly SPIR-V, but gotta go fast.
    // This works great on AMD, not so much on NVIDIA … but we can
    // Use a normal plain UBO instead if we want.
    // Ah, the luxury of giving the implementation a choice … :)
    
    AddCarry _98;
    _98._m0 = uaddCarry(registers._m0.x, 0u * 16u, _98._m1);
    PhysicalPointerFloat4NonWrite _105 = PhysicalPointerFloat4NonWrite(uvec2(_98._m0, registers._m0.y + _98._m1));

    // Bindless UBO action, table CBV here we go …
    // Not a great look for performance on some GPUs.
    // On Pascal in particular, we have to emit bindless SSBO instead, rip perf.
    // Not much we can do sadly.
    SV_Target.x = ((_84.x + _66.x) + _105.value.x) + _21[_46]._m0[0u].x;
    SV_Target.y = ((_84.y + _66.y) + _105.value.y) + _21[_46]._m0[0u].y;
    SV_Target.z = ((_84.z + _66.z) + _105.value.z) + _21[_46]._m0[0u].z;
    SV_Target.w = ((_84.w + _66.w) + _105.value.w) + _21[_46]._m0[0u].w;
}

 

It’s common to think of the Table parameter as equivalents to Vulkan descriptor sets, but that is misleading. The correct way to think about it is that D3D12 has one huge (technically two for Sampler heap) descriptor set, and for convenience you can push offsets into that heap as desired. In Vulkan, descriptor sets can be allocated and bound in complete isolation. Shader model 6.6 removes all of this pretense and shows it how it really is. Everything is just an index to the heap, and the pre-6.6 world tried to hide this fact.

My main problems with the binding model

Very loose coupling between root signature and table descriptor access

The only thing the root signature does is to define how offsets into the heap is computed. There is no knowledge about the array-ed-ness of a resource, nor if every resource is even valid at record time (exception here is STATIC descriptors, but alas, extremely rare in the wild). Unfortunately, this loose nature of root signature and shader matching complicates descriptor hoisting, and when you complicate hoisting, you better pray the target hardware has pristine bindless support (pro-tip, they don’t).

A root signature could say that “we have 100 CBVs in this range”, but shaders might just happen to use a few CBVs from that range. Doing descriptor work per shader is pretty gross and goes against the spirit of the modern graphics APIs. Aggressive hoisting and repacking is CPU overhead that we shouldn’t consider except in exceptional circumstances. A native driver might have a much easier time dealing with these things since they can modify their own command streams at the last minute if they want. In D3D12, SIMULTANEOUS_ACCESS command lists don’t exist, so the driver has this option. Perhaps this restriction was put in place precisely for these scenarios? Who knows, we can only armchair these things!

The D3D12 docs also encourage developers to use somewhat generic root signatures and rarely change them, so a root signature is likely going to be quite “fat” compared to an ideal Vulkan pipeline layout, making useful workarounds even more difficult …

Awkward and unnecessary aliasing

In a D3D12 descriptor heap, there is no descriptor type. Descriptor #10 can be an SRV and descriptor #11 can be a CBV, it’s all a few bits between friends. Surely, games are not going to commit the grave sin of placing the wrong descriptor type in the wrong index, right? … Right?! Of course they do. Debugging insanity like this was the prelude to vkd3d-proton’s fast descriptor QA checking mechanism. It’s actually quite fast. In AAA games, we have code like this running at 40 FPS usually. More than good enough I’d say when hunting for bugs.

void descriptor_qa_report_fault(uint fault_type, uint heap_offset, uint cookie, uint heap_index, uint descriptor_type, uint actual_descriptor_type, uint instruction)
{
    uint _63 = atomicAdd(QAGlobalData.fault_atomic, 1u);
    if (_63 == 0u)
    {
        QAGlobalData.failed_cookie = cookie;
        QAGlobalData.failed_offset = heap_offset;
        QAGlobalData.failed_heap = heap_index;
        QAGlobalData.failed_descriptor_type_mask = descriptor_type;
        QAGlobalData.actual_descriptor_type_mask = actual_descriptor_type;
        QAGlobalData.failed_instruction = instruction;
        QAGlobalData.failed_shader_hash = uvec2(291u, 0u);
        memoryBarrierBuffer();
        QAGlobalData.fault_type = fault_type;
    }
}

uint descriptor_qa_check(uint heap_offset, uint descriptor_type_mask, uint instruction)
{
    uint _98 = QAHeapData.descriptor_count;
    uint _100 = QAHeapData.heap_index;
    uvec2 _102 = QAHeapData.cookies_descriptor_info[heap_offset];
    uint _110 = QAGlobalData.live_status_table[_102.x >> 5u];
    uint _121 = (uint(heap_offset >= _98) | (((_102.y & descriptor_type_mask) == descriptor_type_mask) ? 0u : 2u)) | (((_110 & (1u << (_102.x & 31u))) != 0u) ? 0u : 4u);
    if (_121 != 0u)
    {
        descriptor_qa_report_fault(_121, heap_offset, _102.x, _100, descriptor_type_mask, _102.y, instruction);
        return _98;
    }
    return heap_offset;
}

void main()
{
    uint _45 = descriptor_qa_check(registers._m1 + (INDEX + 40u), 1u, 1u);
    vec4 _132 = texture(nonuniformEXT(sampler2D(_14[_45], _18[registers._m3 + 30u])), vec2(UV));
    SV_Target.x = _132.x;
    SV_Target.y = _132.y;
    SV_Target.z = _132.z;
    SV_Target.w = _132.w;
}

 

With this scheme we can check for:

  • Is the descriptor type correct?
  • Is the resource destroyed already?
  • Did we index the heap out of bounds?

If we hit failures we get a neat atomic cookie in host memory we can sample at regular intervals on CPU. Works surprisingly well and tailor made for our needs, which is likely why it’s so fast compared to generic alternatives. We also only report the first failure. With this mode enabled, we allocate N + 1 descriptors, and reserve the last descriptor as a NULL descriptor which we can OpSelect on failure to avoid GPU hangs. (well, if someone did a faulty atomic compswap loop on a NULL descriptor I guess we’d timeout, but whatever, not much we can do about that).

Forcing the addition of VK_VALVE_mutable_descriptor_type

The combination of weirdly aliased descriptor types and full bindless is pretty deadly since Vulkan has typed descriptors, for good reason! This unfortunate mismatch forced us to create many descriptor sets, one per type, which looks very dumb. It consumes enough memory that we actually ran into serious performance issues on AMD hardware in the beginning. To fix all the immediate issues, we developed VK_VALVE_mutable_descriptor_type, which is kinda gross, but it lets us implement the D3D12 weird aliased descriptors directly, and all the problems went away. It’s a pretty goofy model though and not something Vulkan developers should be using directly. Not all descriptors have the same size in hardware, so there is gaps between descriptors, which is what I’m pretty sure is the case for native D3D12 as well, rip K$ … The more natural model would have been to have descriptor heaps where equally sized objects were allocated in different heaps, but we have to deal with whatever problems D3D12 throws our way and make it work …

Another “nice” side effect of using mutable was that some GPU hangs went away. If the game used the wrong descriptor type, it seemed to at least not read a descriptor that pointed to already freed memory, but rather just a descriptor of wrong type. Somehow, this helped certain games to run around the time the extension was released. It is deeply disturbing that games can ship in this state. :\

A game engine targeting Vulkan can be smarter about all these things. For example, having 1M sampled images seems reasonable, as that’s the number one thing you’d use bindless for, but do you need 1M CBVs, fully bindless? I doubt it. I’m far more in favor of a model where normal, packable, hoistable descriptors are used by default, and bindless is only tapped into as needed.

Enforces a fully bindless implementation

When implementing this binding model in other APIs, we’re forced to make everything fully bindless. Everything is accessed in terms of the global descriptor heap, and even if 99% of shaders use a very traditional D3D11-style binding model, we cannot do anything smart here because of one critical design flaw of D3D12 …

The great VOLATILE mistake of Root Signature 1.0

Root Signature 1.0 really doubles down on forcing full bindless everywhere, even when shaders don’t actually need it. All descriptors are considered VOLATILE, which means that we require all the Vulkan flags that are opt-in:

  • UPDATE_AFTER_BIND
  • UPDATE_WHILE_PENDING
  • PARTIALLY_BOUND
  • VARIABLE_COUNT

This fact completely blocks us from hoisting descriptors on the fly. Even if the shader is simple enough, we’re royally screwed because there is no guarantee that descriptors are actually valid at the time of command recording. Almost all games still cling to Root Signature 1.0, and probably rely on driver heroics to work around any performance implication. Table CBV on Pascal GPUs for example is disgustingly slow in vkd3d-proton. In Vulkan, these GPUs don’t even support UPDATE_AFTER_BIND UBOs! I can only wonder what utter depravities the D3D12 driver engineers had to do to make this work well … Hoist descriptors late with device generated commands based on the PSO? Bleh. The API forces us to implement these as bindless SSBO, and I die a little inside every time I have to think about this. Root CBV works perfectly fine, yet games insist on using table CBVs with VOLATILE just because …

On AMD, everything is fully bindless anyways, as descriptors are just memory, but it’s unfortunate that the D3D12 API went VOLATILE by default. It really should have been opt-in. It’s not like the poor Pascal souls can actually buy new GPUs these days even if they wanted to 🙂

The second mistake of Root Signature 1.1 – STATIC

The mistake of VOLATILE-by-default was recognized by the time 1.1 rolled around, and STATIC was now made the default. STATIC is the nicest mode possible for drivers, since it fully allows hoisting of descriptors, but you actually lose robustness guarantees (!). Most big game engines probably looked at this for 2 seconds and noped out.

D3D in general has gone to extreme lengths to ensure robustness guarantees, and engines of course rely on every esoteric OOB scenario to not fail. This is hopefully not by design, but engine bugs get masked for years if they never cause issues in practice. Losing that guarantee means no one is going to risk random GPU crashes in the wild for just a potential performance increase.

The actual fix that no game ever uses – STATIC + BOUNDS_CHECKED

The actual fix is fairly obscure. At some point a more reasonable mode was added which preserves bounds checking. Of course, no game I know of actually uses this. vkd3d-proton can take advantage of STATIC / BOUNDS_CHECKED CBVs in some cases where we will hoist them to push descriptors for significant NV perf gains, but alas, when no games use the APIs like intended, it might as well not exist …

Immutable sampler jank

The docs imply that the driver is supposed to implement this with an internal hashmap of sampler objects. Oddly enough, these sampler objects live outside the normal sampler heap you bind to the command list, and in fact, you don’t have to bind anything to use immutable samplers, it’s intended to be implemented through driver magic.

This magic is certainly odd, but in Vulkan we have to allocate and bind immutable samplers to a command buffer. This is because very little hardware actually supports true immutable samplers in the sense that they are embedded in the shader code itself. While we still have to allocate descriptor sets, at least we don’t have to actually write the immutable samplers to the descriptor set.

On AMD, the descriptors are read directly from scalar registers, so there’s nothing stopping a motivated compiler from emitting a bunch of immediate register moves and go to town, but sadly, most hardware don’t work like this, and I don’t think drivers even take advantage of this possibility yet.

My big question is how this is even supposed to work on D3D12 if the samplers have to live somewhere in memory. If the shader is simply storing constant offsets into some hidden sampler heap, then surely those constants are not stable across different runs, and thus shader caching fails. Patching in new constants seems pretty gross, but I can only speculate … Either way, in vkd3d-proton we end up creating a VkDescriptorSet per root signature which holds all immutable samplers for a given root signature. This is automatically bound when flushing dirty state in a draw or dispatch call.

Painful and awkward raw buffer types

In Vulkan we are blessed with the SSBO, the flexible and versatile buffer. D3D12 is a bit more weird and has extremely specific alignment and robustness guarantees which don’t map cleanly to anything.

First, we have ByteAddressBuffer.

  • Alignment requirement of buffer binding, 16 bytes (good)
  • Load1,2,3,4 variants, with 4 byte alignment requirement (2 for 16-bit). SM 6.2 improved this a fair bit, with a templated type, but for us in DXIL land, very little changes.
  • Robustness is checked per-component (why ._.)
  • Loads and stores are done at specific byte offsets, which is super ugly, but mostly for shader authors

The only natural way to implement this is a uint[] array, where we unroll loads and store per-component. Fun times. I would be surprised if hardware actually supported vectorized load-store with per-component robustness at scalar alignment …

The second style is StructuredBuffer which represents a buffer as a T data[] array. One deeply frustrating aspect of this is that data structures are tightly packed, meaning T could be float3, and we’d get 12 byte stride. StructuredBuffers must be bound in terms of elements, not byte offsets, so it’s perfectly valid to bind a structured buffer at 12 byte offset. This cannot be implemented safely on GPUs which report 16 byte SSBO alignment sadly. Due to all this, it’s almost impossible to cleanly declare a vec3 data[]; inside the SPIR-V, even if we have scalar block layout support. Just like ByteAddressBuffer, we kinda have to unroll loads and stores with uint data[] again, shame … We have gone to great lengths to support an “offset” buffer, which means we can bind an SSBO at 16 byte alignment, and nudge the offset inside the shader to fix things up, but it is horrible.

There are some mind-melting optimizations in flight for dxil-spirv which can vectorize these things, but it’s not a perfect solution. Weird robustness and alignment rules forces us to do some pretty horrible things, and we only have a few escape hatches which we can exploit, and dxil-spirv will try hard to squeeze every last legal vectorization opportunity out of the existing rules.

For example, StructuredBuffers have a rule saying that robustness only needs to be checked once, based on the structure element that is being accessed. For dynamic access into the element itself, it is undefined behavior to straddle the element boundary. Think StructuredBuffer<float4> where you do a dynamic access into the components, and index overflows.

I’m wondering if hardware actually has a special descriptor type they use to implement structured buffers. Perhaps the same ones which are used for vertex attribute fetch? And yes, even AMD uses tbuffer with stride pulled from descriptors, so saying it’s fully programmable fetch is a bit misleading I think.

Texel buffers are not a valid solution either. They can be bound at scalar alignment, which is nice, but we cannot safely use vectorized load-store either. Texel buffers can only be read and written to fully. No write masks allowed. Another problem is mix and match of 16-bit load-store which completely broke any hope of using texel buffers. In Vulkan, we can redeclare an SSBO multiple times with different data layouts, which is basically a human readable version of ByteAddressBuffer.

It’s a bit frustrating to deal with these problems, because almost no sane application is going to run into the edge cases, or need all the esoteric robustness features, but we have to make every possible edge case work. We cannot take shortcuts. To demonstrate the insanity we have to go through on certain GPUs …

StructuredBuffer<float3> B;

float3 main(uint index : INDEX) : SV_Target
{
	return B[index];
}

Since this can be bound at offset 12, we might see this horrible code. The offset buffer is associated with the descriptor heap, and we allocate a separate side channel buffer for this purpose.

layout(set = 1, binding = 0, std430) restrict readonly buffer SSBO_Offsets
{
    uvec2 _m0[];
} _13;

layout(set = 1, binding = 1, std430) restrict readonly buffer SSBO
{
    uint _m0[];
} _18[];

layout(push_constant, std430) uniform RootConstants
{
   // root parameters
} registers;

layout(location = 0) flat in uint INDEX;
layout(location = 0) out vec3 SV_Target;

void main()
{
    // Load extra offset + actual range. We cannot bind the SSBO tightly enough
    // on some GPUs.
    // Since the access is not nonuniform, we can give a strong hint to compiler
    // that the offsets can be loaded as a broadcast.
    uvec2 _37 = _13._m0[subgroupBroadcastFirst(registers._m1)] >> uvec2(2u);
    uint _41 = INDEX * 3u;
    // Trip explicit OOB if we fail.
    uint _47 = (_41 < _37.y) ? (_41 + _37.x) : 1073741820u;
    // Unroll 3 loads. ;_;
    vec3 _60 = uintBitsToFloat(uvec3(_18[registers._m1]._m0[_47], _18[registers._m1]._m0[_47 + 1u], _18[registers._m1]._m0[_47 + 2u]));
    SV_Target.x = _60.x;
    SV_Target.y = _60.y;
    SV_Target.z = _60.z;
}

The horrible workarounds

We have a fair amount of issues to deal with. A lot of invalid API behavior in D3D12 happens to work on native drivers, and many games ship with subtle bugs which force us to do horrible things. IHVs are going to have a fun time in 5 years when hardware details change 🙂

What is a raw buffer anyways?

A surprisingly common bug is that D3D12 does not complain if you create a typed buffer, but actually intended to create a raw buffer (ByteAddressBuffer) and vice versa. The difference is just one flag away and is easily missed. For us, that means SSBO vs texel buffer, a catastrophic error, but apparently, D3D12 does not care and drivers will happily consume typed buffers as raw and vice versa. I have no idea how this works, but somehow it does (I wrote tests)! Either way, it got bad enough that when emitting buffer descriptors, we have to emit both an SSBO variant and texel buffer variant and then the shader can pick the correct descriptor it wants. Of course, this being bindless and decoupled, we cannot pick the correct descriptor at draw time, sigh … Sometimes I think bindless all the things is going to be considered a huge mistake in 10 years from a software ecosystem point of view.

What is a NULL descriptor anyways?

A reasonably common bug is the NULL descriptor with wrong type. Technically in D3D12, NULL descriptors are still typed and you have to pass in a valid resource desc when creating one, but mistakes happen all the time. Using the wrong type works perfectly fine on existing hardware in almost all cases (I wrote a test!). For this reason, vkd3d-proton takes the conservative approach, and writing a NULL descriptor means splatting out NULL descriptors to all the different types of descriptors we have in the heap. For MUTABLE, this means two sets, for non-mutable, six! Voila, various GPU hangs disappeared, just like that.

VOLATILE is really bad for validation

Having simple validation is critical for development, and I think the real hidden mistake of VOLATILE being default, is that only GPU-assisted validation can catch any bugs! GPU validation is very heavy and usually avoided when possible. I feel a ton of these bugs could have been caught if descriptors could have been validated on CPU timeline. As long as an invalid descriptor is not dynamically accessed by a shader, it’s all OK! Dynamically accessed is the critical wording here. With this requirement, we can only validate on submit time if we can prove ahead of time that a shader will generate invocations which then statically access an invalid descriptor. Good luck proving that fragment shader invocations happen 100% of the time! Don’t get me wrong, bindless is very powerful and great for the use cases it enables, but derping everything into bindless doesn’t seem like the correct approach to me.

Local root signatures – one great kludge to rule them all

Local root signatures add yet another layer of hell to the binding model. It was annoying enough to implement a dozen different ways to access resources, and now we have do it all over again, but this time, replacing push constants with SBTs + some additional edge cases that are specific to local root signatures!

Local root signatures are used specifically for the shader record tables in ray tracing pipelines. In Vulkan, this data is accessed directly as a buffer, but D3D12 went the way of describing a view instead where each parameter is laid out in memory one after the other. As usual, table pointers is a thing, but this time, they consume 2 DWORDs instead of 1. Why on earth was table pointers described in terms of GPU VA and not just a simple u32 “offset into heap”? Sometimes I find D3D12’s obsession with GPU VA-s quite bizarre and counter-productive. It doesn’t even have buffer device address support, yet we’re using them everywhere, even when it doesn’t make much sense at all. In vkd3d-proton, we had to do the janky thing of returning artificial GPU VA-s for descriptor heaps so that it was fast to convert the VA into an offset.

While root descriptors in global root signatures can be implemented as descriptors, we are forced to use buffer device addresses since the VA is literally sourced from buffer memory. The fun thing is that you can even place RT acceleration structures here as a literal VA! And this is how OpConvertUToAccelerationStructureKHR was born … 🙂

Local root signatures also add the headache of also supporting immutable samplers! This time, we have to defer the creation of a second immutable sampler descriptor set to RTPSO creation, which is super annoying. At bind time, we therefore might run the risk of having to bind one immutable sampler set for the global root signature, then one which is tied to the union of all local root signatures for the RTPSO, so much fun! >_<

Implementing SM 6.6 bindless

SM 6.6 is actually straight forward. Our model in vkd3d-proton was always to map pre-6.6 style bindings to a SM 6.6 style “access the heap directly” model, so this translation is actually not that bad at all. The main annoyance is how there are now 5 (!) opcodes in DXIL to deal with resource handle creation, each with their own idiosyncrasies. I’m pretty sure all of this could have been squeezed into 2 …

  • CreateHandle
  • CreateHandleForLib
  • CreateHandleFromHeap
  • CreateHandleFromBinding
  • AnnotateHandle

Fun times indeed …

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	Texture2D<float4> T = ResourceDescriptorHeap[index];
	return T.Load(int3(uv, 0));
}

#version 460
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_samplerless_texture_functions : require

layout(set = 0, binding = 0) uniform texture2D _9[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in vec2 UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _32 = texelFetch(_9[INDEX], ivec2(uvec2(uint(int(UV.x)), uint(int(UV.y)))), int(0u));
    SV_Target.x = _32.x;
    SV_Target.y = _32.y;
    SV_Target.z = _32.z;
    SV_Target.w = _32.w;
}

Nice. I recently completed SM 6.6 support for vkd3d-proton, have fun with that!

How this maps to AMD GPUs (teaser for next post)

If anything, it feels like the D3D12 binding model is tailored pretty well to AMD hardware. It probably makes sense given the Mantle heritage. The binding model on GCN is extremely neat and tidy and was designed with full bindless in mind since the early days, quite forward looking, indeed. Descriptors are just memory that is loaded into scalar registers and passed directly to various operations which take descriptors. Very little magic is going on here and once you understand how this work, the modern API binding models start making some sense. Root parameters and push constants map nicely to user SGPRs, and as long as we don’t use too many u32s, we have a very direct mental model of how this is going to map to hardware.

Where D3D12 suffers I think is on hardware where there are still remnants of legacy descriptor slots in hardware. We cannot take advantage of this due to poor design decisions made in D3D12, but it is what it is. There is only so much we can do as a translation layer.

We’ll explore this topic in a future post …

Conclusion

It has taken many, many months, if not over a year to get where we are today in vkd3d-proton. Handling all of this insanity pushed us to the breaking point, but somehow it works pretty well. There are no shortcuts we can take.

Despite the mammoth length of this post, I don’t think we’re quite done with the subject of binding models. I’ll probably need another post to cover codegen examples for AMD ISA as well as an overview of the million ways we can convert the same DXIL code to different SPIR-V depending on the root signature.

My personal hell of translating DXIL to SPIR-V – part 2

In the previous blog post, I began a long form exploration of the DXIL shader format, and how it translates to SPIR-V. In this part, we’ll look more closely at the LLVM format, how it is parsed, and how to interpret the parsed result.

The LLVM IR binary format is mostly undocumented. Very early on we have to dig through the source to understand what is going on. LLVM IR was never intended to be used as a “standard” format that is shipped between different software stacks. It’s clearly an ad-hoc serialization format that serves the purpose of LLVM internals. The IR format is backwards compatible at the very least, which is why we can parse DXIL LLVM 3.7 modules with modern LLVM versions.

As we’ll see, LLVM IR is very complex to parse compared to SPIR-V. There are some interesting similarities however, as SPIR-V shares some DNA of LLVM.

Layered architecture

LLVM IR is parsed in multiple layers. At the lower level is a compression scheme which feels somewhat like LZ compression. The bit-stream teaches the decoder how to decode the stream, by emitting “templates” (or “code book entries”), and these templates can then be instantiated to form complete records.

The low-level bit-stream parser

The initial part of the LLVM IR puzzle is pulled from RenderDoc’s implementation. The basic gist of it is documented here. To summarize however, the idea is that a module consists of one top-level “block”. A block is a structure of blocks and records. A record has an ID with an array of uint64_t operands (quite similar to SPIR-V, except SPIR-V is array of 32-bit operands).

Storing full uint64_t operands is of course very wasteful, and this is where the primitives types of LLVM IR come in. We can express primitive types compactly with:

  • Variable length integers (configurable chunk size)
  • Fixed width integers (configurable bit width)
  • 6-bit chars (useful for C-style identifiers, i.e. a-z, A-Z, 0-9 and _)

Variable length integers are encoded in a scheme where we look at N bits at a time, N – 1 bits contain useful data, and the MSB marks whether to keep looking at N more bits.

Blocks and records are invoked in an esoteric way, which is where abbreviations come in. When we parse, we’re parsing abbreviations one at a time, which either results in some action:

  • 0 – END_BLOCK – Ends block scope
  • 1 – ENTER_SUBBLOCK – Begins a new scope, can nest arbitrarily
  • 2 – DEFINE_ABBREV – Defines a template for how to build new records. For example, we can specify that a record is [vbr4, char6, literal constant] or something. That abbreviation implicitly gets a new ID, starting with 4, which can be invoked when parsing new abbreviations. When decoding this abbreviation, the parser knows ahead of time how to decode the bits into arguments. For char6 strings in particular, it’s also possible to specify an array abbreviation.
  • 3 – UNABBREV_RECORD – YOLO mode, directly decodes a record with a bunch of variable length integers, fairly inefficient. DXC seems to love to emit these 🙂
  • 4+ – Invoke user abbreviations.

In typical LLVM IR fashion, the number of bits used to encode the abbreviation ID is variable. It starts at 2 bits (since there are no user abbreviations to worry about yet), but can grow as needed. Fun!

The details are not super interesting for this post, but suffice to say, there’s a decent amount of detail that goes into parsing this.

The calling code ends up looking something like this:

LLVMBC::BitcodeReader reader(static_cast<const uint8_t *>(data), size);
LLVMBC::BlockOrRecord toplevel = reader.ReadToplevelBlock();

// The top-level block must be MODULE_BLOCK.
if (KnownBlocks(toplevel.id) != KnownBlocks::MODULE_BLOCK)
   return nullptr;

// We should have consumed all bits, only one top-level block.
if (!reader.AtEndOfStream())
   return nullptr;

The BlockOrRecord struct is fairly straight forward, simplified here as:

struct BlockOrRecord
{
  uint32_t id; // What kind of record or block is this?
  Type type; // block or record
  dxil_spv::Vector<BlockOrRecord> children; // If block
  dxil_spv::Vector<uint64_t> ops; // If record
};

Higher level parser

Now, we’re at a level where we have recovered structure from the bit-stream, and now we need to turn the BlockOrRecord structs into actual API objects, llvm::Module, llvm::Function, llvm::Value, etc, etc … dxil-spirv implements a LLVM C++ API drop-in replacement to be able to cross-reference our implementation against the reference implementation at any time (which has saved me many times). The implementation only implements exactly what is needed for DXIL however, so don’t expect too much of it. 🙂

Refer to objects by ID

Very similar to SPIR-V, types and values are referred to by an uint64_t ID. The annoying part however is that types and values implicitly allocate their own IDs, meaning that forgetting to parse something can be fatal. On top of this, IDs may or may not refer to other IDs through deltas relative to their own values or absolute values. It is somewhat context sensitive which one to use, which gets quite annoying to deal with.

Decoding llvm::Type

LLVM IR has a type hierarchy similar to SPIR-V. You start by declaring fundamental types like ints and float, and then upgrade them to vectors, array, pointers or structs. While parsing, the top-level block can contain TYPE blocks, which contains a bunch of records.

for (auto &child : toplevel.children)
{
   if (child.IsBlock())
   {
      switch (KnownBlocks(child.id))
      {
      case KnownBlocks::TYPE_BLOCK:
         for (auto &entry : child.children)
             parse_type(entry);
         break;
      }
   }
}

bool ModuleParseContext::parse_type(const BlockOrRecord &child)
{
   Type *type = nullptr;
   switch (TypeRecord(child.id))
   {
   case TypeRecord::VOID_TYPE:
   case TypeRecord::HALF:
   case TypeRecord::INTEGER:
   case TypeRecord::POINTER:
   case TypeRecord::ARRAY:
   case TypeRecord::FUNCTION:
   // you get the idea
   }
   types.push_back(type);
}

Integers deserve special mention, because they are somewhat whacky in LLVM. First, they have no signage associated with them. This kinda makes sense, since signage only actually matters in certain opcodes, like signed min/max, signed compare, arithmetic vs logical right shift, signed vs unsigned float <-> int conversion, etc. SPIR-V maintains sign for its integer type, but we can ignore it in most scenarios. (There is an esoteric exception to this however where DXIL kinda breaks down, once we dig into relaxed precision signed integers!) Another annoying exception we have to deal with all the time is stage IO and resource variables which are explicitly signed or unsigned in DXIL.

As the grizzled C programmer will know, signed overflow is undefined, but unsigned overflow is not. Does LLVM just not care? Well, it does. LLVM can mark operations as being “no signed wrap”, or “no unsigned wrap” for optimization purposes, but we don’t have to care about those at all fortunately.

Booleans are expressed as 1-bit integers, which kind of makes sense, but at the same time feels like a very LLVM thing to do … Logical operations reduce to simple arithmetic operations on 1-bit values instead.

The final whack part is that you can declare non-POT integer sizes. There are shaders in the wild which declare 11-bit integers and rely on wrapping on these values to work! (dear lord … <_<) I even tried to compile this to x86_64 and yes, it does actually deal with it correctly. I’m kind of amazed, and scared at the same time.

Overall though, type declaration in LLVM IR is pretty easy to understand if you understand SPIR-V.

Decoding constants

Similar as types, constants are records within a block. They can appear at function scope or global scope.

bool ModuleParseContext::parse_constants_record(
    const BlockOrRecord &entry)
{
    llvm::Constant *value = nullptr;
    switch (ConstantsRecord(entry.id))
    {
        case ConstantsRecord::SETTYPE:
        case ConstantsRecord::CONST_NULL:
        case ConstantsRecord::UNDEF:
        case ConstantsRecord::INTEGER:
        // ...
    }
    values.push_back(value);
}

Roughly speaking, this looks very similar, with some quirks. SETTYPE informs subsequent constant blocks which type is actually used. CONST_NULL is example of a fully context sensitive constant, similar to OpConstantNull in SPIR-V.

INTEGERs are converted through sign rotations. Since small negative numbers would be horribly inefficient to encode with VBR otherwise, the first bit is the sign bit, encoded in a sign magnitude scheme. -0 is interpreted as INT64_MIN.

Where LLVM constants get disgusting however, is the pseudo-specialization constant operation support. It is possible to encode a constant cast operation, or constant access chain into a global object (wtf?) this way. I don’t understand the motivation behind this, but there are lots of super weird edge cases here that took some time to iron out.

AGGREGATE is the first time we start to see how value IDs are referenced.

Vector<Value *> constants;
constants.reserve(entry.ops.size());

for (auto &op : entry.ops)
{
   constants.push_back(get_value(op, element_type,
      /* force absolute IDs */ true));
}

Value *value;
// Ah, yes. Why have VECTOR and ARRAY types when you can
// have a context sensitive one instead.
if (current_constant_type_is_vector)
{
   value = context->construct<ConstantDataVector>(
      get_constant_type(), std::move(constants));
}
else
{
   value = context->construct<ConstantDataArray>(
      get_constant_type(), std::move(constants));
}

get_value() is quite sneaky. In LLVM IR, it is valid to forward reference an ID, as long as the type is known. This leads to ProxyValue objects being created, which are resolved later. get_value() can be relatively indexed, or absolutely indexed depending on the context, which is always fun.

Global variables

Global variables are declared in top-level records. Typically these are only used for groupshared variables. In DXIL, a special pointer address space is reserved for this purpose. Global variables can also be used for global look-up tables. Global variables can also have optional initializers. This is very similar to SPIR-V overall. The equivalent is an OpVariable with either Workgroup or Private storage class.

Resource handles are declared in a completely different way … unless we’re DXR (more on that later, sigh v_v …)

Function prototypes

We also get to declare function prototypes at this stage. Some functions only have prototypes, and the common case here is various prototypes which declare dx.op intrinsics functions. If a prototype is declared to also have a body, we place that in a queue.

We also have to parse parameter attribute lists (surprisingly tricky!), just in case the function declares LLVM attributes. The only case we have to care about here is FP32 denorm handling. Why that isn’t a metadata entry, I’ll never know. DXIL really likes splitting its implementation across two completely different systems for no good reason …

Parsing functions

A function body is a block, consisting of records (which express normal opcodes), and other blocks (e.g. constant blocks). The first record we’ll typically see is the DECLAREBLOCKS one, which specifies the number of basic blocks in the function.

Basic blocks

A basic block is a fundamental building block of SSA-based IRs. A basic block enters execution at the first instruction, and executes in a straight line fashion until a terminator instruction executes. A terminator instruction can be anything which transfers control like a direct branch, conditional branch, switch statement, returns, etc. If you know SPIR-V, this is nothing new. It’s the exact same concept.

Unlike SPIR-V, where we have explicit OpLabel opcodes to begin a new block, LLVM makes this implicit. When we observe a terminator, the next instruction will be added to the next basic block.

Context sensitive parsing

The opcodes in the IR match closely to the type hierarchy of LLVM, let’s look at parsing llvm::BinaryOperation. A binary operation is any c = op(a, b) kind of instruction, it’s not necessarily just and/or/xor, etc. This is a catch all for FAdd, IMul, IAdd, Xor, etc.

case FunctionRecord::INST_BINOP:
{
   unsigned index = 0;
   auto lhs = get_value_and_type(entry.ops, index);
   if (!lhs.first)
      return false;
   auto *rhs = get_value(entry.ops, index, lhs.second);
   if (!lhs.first || !rhs)
      return false;
   if (index == entry.ops.size())
      return false;
   auto op = BinOp(entry.ops[index++]);
   auto *value = context->construct<BinaryOperator>(lhs.first, rhs, translate_binop(op, lhs.second));
   if (index < entry.ops.size())
   {
      // Only relevant for FP math,
      // but we only look at fast math state for
      // FP operations anyways.
      auto fast_math_flags = entry.ops[index];
      bool fast = (fast_math_flags &
          (FAST_MATH_UNSAFE_ALGEBRA_BIT |
           FAST_MATH_ALLOW_CONTRACT_BIT)) != 0;
      value->set_fast_math(fast);
   }
   if (!add_instruction(value))
      return false;
   break;
}

In SPIR-V a binary operation would be encoded as:

%id = OpMyBinOp %type %operand_a %operand_b

Very explicit and understandable. LLVM IR on the other hand is more clever, for better or worse.

First, the result %id is implicit, and is allocated linearly as new opcodes come in. The type of an instruction is context sensitive. First, we parse %operand_a. If we have seen this ID already, %type is deduced directly from the operand. If it is a forward reference, the type of %operand_a is encoded in the record explicitly.

IDs in most opcodes are encoded with a relative scheme. Since SSA requires that declarations of an ID dominates all uses of it, the common case is that uses of an ID come after the declaration of it, so this is a decent compression scheme. The implementation of get_value_and_type() is something like:

std::pair<Value *, Type *> ModuleParseContext::get_value_and_type(
   const Vector<uint64_t> &ops, unsigned &index)
{
   if (index >= ops.size())
      return {};

   uint64_t op = ops[index++];
   // Context sensitive, for backwards compat mostly, but
   // modules can choose to use absolute or relative encoding.
   if (use_relative_id)
      op = uint32_t(values.size() - op);

   if (op < values.size())
   {
      // Normal reference.
      return { values[op], values[op]->getType() };
   }
   else
   {
      // Forward reference, the type is encoded in the next element.
      if (index >= ops.size())
         return {};

      auto *type = get_type(ops[index++]);
      auto *proxy = context->construct<ValueProxy>(type, *this, op);
      pending_forward_references.push_back(proxy);
      return { proxy, type };
   }
}

I had tons of bugs where I didn’t handle the possible forward references. Very awkward. I was of the impression that only PHI instructions could possibly have forward references, but of course, it’s never that simple.

Speaking of PHI, get_value() changes here to a signed-aware variant, where the relative value ID is encoded with sign_rotation, just like for INTEGER constants. This is because we expect that PHI inputs are forward referenced just as often as backwards referenced.

Overall, it’s just a grind to implement all relevant opcodes. DXIL only uses a subset, but it’s not well documented which subset of LLVM IR is actually used. We just have to implement new stuff as it comes in. DXIL.rst has a list of which LLVM instructions are supported, but this list cannot be trusted because in DXR, DXC emits various vector instructions (so much for being a scalar IR format) and the unreachable terminator which is missing from the table.

Metadata

Metadata lives in its own block hierarchy and has a completely different set of types, llvm::MDNode, llvm::MDOperand, llvm::ConstantAsMetadata, etc.

At the top of the hierarchy we can declare NamedMDNodes, which we see in the LLVM assembly as:

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.entryPoints = !{!7}

NamedNodes contain a list of MDNodes, which nest into more MDNodes, or terminate in constant values. These correspond to NAMED_NODE, NODE and VALUE record types, not too many surprises here.

Emitting SPIR-V opcodes

After we get through the parsing step, LLVM IR and SPIR-V has many similarities, and translating opcodes isn’t particularly difficult. For each LLVM basic block, we emit a basic block in SPIR-V and translate the opcodes one by one. We preserve the SSA nature as-is. There are of course a lot of details here, but they’re not very interesting, and too many details to enumerate. The two biggest problems we need to focus on are:

Resource access

Accessing textures, constant buffers, structured buffers and the weird and wonderful zoo of resource types is insanely intricate, and I’ll need to dedicate an entire blog post to this.

Control flow structurization

Another massive issue is the control flow. In LLVM, there is no structurization information whatsoever, and we’ll have to reconstruct this somehow. This is at least one more blog post. After emitting SPIR-V code into basic blocks, we need to rewrite the control flow and annotate the basic blocks with merge information, which then allows us to emit a final SPIR-V module, ready for driver consumption.

Conclusion

This was a rough overview of LLVM IR and how it is parsed from scratch. I think it’s safe to say it’s far more difficult to parse than SPIR-V, which is literally just a stream of [N x uint32_t] opcodes. Parsing IR is not the most exciting part of DXIL to SPIR-V conversion, but it had to be done. On the other hand, it might be useful starting knowledge for other projects.

For next post, we’ll look at how to translate the D3D12 binding model into Vulkan.

My personal hell of translating DXIL to SPIR-V – part 1

In vkd3d-proton, we’re translating D3D12 to Vulkan, and the single biggest piece I’ve contributed so far is translating shader model 6 to Vulkan. We already had DXBC support, but what is a graphics API without two completely incompatible IR formats, right? Getting working DXIL support in vkd3d-proton was the biggest target when I began working on vkd3d at the time. The result of this work is a standalone library and tool, dxil-spirv.

Introduction

In this blog series I’d like to go through the DXIL format, explain the problems I’ve had to solve and hopefully serve as an introduction to basic compiler theory, explained by yours truly who has no idea what they’re talking about. What could go wrong!?

I never learned any of this formally, but through working on SPIRV-Cross and dxil-spirv, I just had to learn this by trying and failing. Most of this theory is locked behind academic and computer science jargon, because it is a domain where rigor is actually necessary, and the algorithms for most things have been well known since the 70s (60s?). This isn’t graphics programming where we can be “close enough” and hand wave problems away. The smallest inaccuracy will break anything you come up with. Mix this with recursive tree traversal algorithms and all hell breaks loose when you least expect it. I wonder where compiler engineers pick this stuff up. Is there a secret club I’m not invited in? :p

DXIL has been, and continues to be extremely painful to translate correctly, with new edge cases popping up every week it seems. The main reasons for these are related to one particular problem, which this blog series intends to address over the course of the blog series.

Rewrite goto soup to strict structured control flow

Of the three core problems, unstructured control flow, aka goto soup, is the real difficult part, and solving this problem is one of the single most painful problems I’ve ever worked on. When I think I’ve solved it, the next AAA game we see finds a new way to break it. At the very least, it seems like it gets a little easier with every iteration, so we’re getting asymptotically correct at the very least!

It’s very likely that the existing solution in dxil-spirv is just garbage, and it needs to be rewritten from scratch at some point, but at least the current implementation can play some vidya, so, eh.

LLVM 3.7 IR

LLVM is another problem, since we need to consume a barely documented IR format which was never designed to be consumed as a standard format to begin with. The official DXC compiler is literally a fork of LLVM, forever frozen at the 3.7 version.

It is of course possible (and intended) to use the LLVM library directly, parse the IR that way, and then iterate over the llvm::Module to create SPIR-V, but this requires shipping yet another vendored LLVM copy, and we all love those don’t we … It is just not acceptable to ship that from a practical point of view. It clocks in at a nice 40 MB+ blob, compared to the 2 MB d3d12.dll binary vkd3d-proton compiles to.

The first iteration of dxil-spirv did indeed target LLVM APIs directly for practical reasons. Not having to spend months figuring the arcane byte code format certainly helped! Later, I had to make a LLVM API replacement – with some help from RenderDoc’s low-level parser to get started – which does exactly what is required to parse DXIL. Overall, it has been very helpful to keep a native LLVM port alive through API replication, since I can always cross check the parsing results against the real deal with -DDXIL_SPIRV_NATIVE_LLVM=ON. It worked out very well in the end!

Virtualized resource hell

HLSL, and by extension DXIL does not concern itself one bit how resources are actually accessed by the implementation. Given similar code, there’s about 8 ways to codegen it depending on context obtained by the global or local root signatures. Fun! On top of this, we need vendor specific hackery for perf and feature workarounds.

In this post

As this post is just an introduction to the format, no difficult problems are presented yet. 🙂

Anatomy of a shader blob

In our journey we begin at the output from DXC, a raw shader binary.

// $ dxc -Tps_6_0 -Fo test.dxil test.frag
float4 main(float4 a : A) : SV_Target
{
    return a;
}

When creating a PSO in D3D12, you hand the API this blob of data. This blob is basically the same as DXBC. The only real difference is the DXIL tag. Based on this, we know how to dispatch compilation in vkd3d-shader.

00000000 44 58 42 43 00 00 00 00 00 00 00 00 00 00 00 00 |DXBC............|
.......
00000140 b0 04 00 00 60 00 00 00 2c 01 00 00 44 58 49 4c |....`...,...DXIL|
....

For development, it is then useful to extract the raw LLVM IR part. We can do this with dxil-extract from dxil-spirv:

$ dxil-extract --output test.bc test.dxil # Extract raw LLVM bytecode
$ llvm-dis test.bc # Disassemble to LLVM assembly
$ cat test.ll
; ModuleID = 'test.bc'
source_filename = "test.bc"
target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

define void @main() {
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%2 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 2, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 3, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %2)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %3)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %4)
ret void
}

; Function Attrs: nounwind readnone
declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0

; Function Attrs: nounwind
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1

attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.viewIdState = !{!4}
!dx.entryPoints = !{!5}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{[6 x i32] [i32 4, i32 4, i32 1, i32 2, i32 4, i32 8]}
!5 = !{void ()* @main, !"main", !6, null, null}
!6 = !{!7, !11, null}
!7 = !{!8}
!8 = !{i32 0, !"A", i8 9, i8 0, !9, i8 2, i32 1, i8 4, i32 0, i8 0, !10}
!9 = !{i32 0}
!10 = !{i32 3, i32 15}
!11 = !{!12}
!12 = !{i32 0, !"SV_Target", i8 9, i8 16, !9, i8 0, i32 1, i8 4, i32 0, i8 0, !10}

Convenient! This is the same stuff we would see in dxc if we don’t write to a file. The next question is what all of this means. The first entry point into understanding what is going on is DXIL.rst. Unfortunately, this documentation is only good for the bring-up phase of DXIL compilation. The documentation just … stops eventually. At that point we’re on our own, but it’s not like it’s that hard to figure out what is going on. We have the source after all and we can correlate test HLSL with the output.

What is DXIL, really?

At a fundamental level, DXIL is just LLVM 3.7 with some sugar on top. The sugar adds whatever magic that normal LLVM code cannot express, but LLVM has ways to express things like these in a generic way. The solutions feel very clunky at times, but it’s clearly a compromise to attempt to shoehorn a shading language into something very generic like LLVM.

Intrinsics

In this simple case, we’re starting to see code like:

declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)

If we cannot express features in raw LLVM, we use dx.op intrinsics, declared as external linkage. The first argument is a constant int, which encodes the actual opcode to use. The loadInput or storeOutput names are not significant here. There are over 200 opcodes like this, which is not surprising given how much junk we’ve accumulated in shading languages over the years. Some of the opcodes are documented, some are not. *shrug* 🙂

Scalar code

In SPIR-V, there is explicit support for vectors and matrices, but DXIL is flat and scalar (except when it isn’t, of course). With modern GPU architectures, 32-bit arithmetic is scalar anyways, but 16-bit and 8-bit arithmetic ops are packed operations on all modern GPUs I know of, so it feels a bit weird to go full scalar. I guess backend compilers just need to suck it up and learn the dark art of re-vectorization. We also end up with quite bloated code, since any vector operation has to be unrolled. On the upside, there are fewer cases to implement and test in dxil-spirv. As an example, we have a shader:

Texture2D<float4> T;
SamplerState S;

float4 main(float2 uv : TEXCOORD) : SV_Target
{
    return T.Sample(S, uv);
}

In places where we cannot ignore vectors, DXIL sometimes reaches for structs instead of the far more natural vector type.

// The resource API is quite peculiar, for later ...
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
%2 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 3, i32 0, i32 0, i1 false)

// Load UV, one component at a time
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)

// Sample
%5 = call %dx.types.ResRet.f32 @dx.op.sample.f32(i32 60, %dx.types.Handle %1, %dx.types.Handle %2, float %3, float %4, float undef, float undef, i32 0, i32 0, i32 undef, float undef)

// Extract components one by one ...
%6 = extractvalue %dx.types.ResRet.f32 %5, 0
%7 = extractvalue %dx.types.ResRet.f32 %5, 1
%8 = extractvalue %dx.types.ResRet.f32 %5, 2
%9 = extractvalue %dx.types.ResRet.f32 %5, 3
// There is actually a 5th member! If that member is statically used
// the sample opcode is actually a sparse residency query.
// Isn't this fun? Welcome to my personal hell :3

// Store result one by one ...
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %6)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %7)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %8)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %9)

With a simple spot check, we get binary sizes:

  • DXC (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug): 2063 bytes
  • FXC (-Tps_5_0 test.frag -Qstrip_reflect -Qstrip_debug): 268 bytes
  • DXC SPIR-V (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug -spirv): 744 bytes

Fun … DXBC is very compact, at least it has that going for it.

Metadata

In SPIR-V, we have execution modes, decoration and various instructions to encode metadata. These instructions are instructions like any other opcode, but LLVM has a separate system that lives on the side, the metadata nodes. In the IR assembly, we see a spider web of weird data structures.

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.viewIdState = !{!10}
!dx.entryPoints = !{!11}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{!5, null, null, !8}
!5 = !{!6}
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}
!7 = !{i32 0, i32 9}
!8 = !{!9}
!9 = !{i32 0, %struct.SamplerState* undef, !"", i32 4, i32 3, i32 1, i32 0, null}
!10 = !{[4 x i32] [i32 2, i32 4, i32 15, i32 15]}
!11 = !{void ()* @main, !"main", !12, !4, null}
!12 = !{!13, !17, null}
!13 = !{!14}
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}
!15 = !{i32 0}
!16 = !{i32 3, i32 3}
!17 = !{!18}
!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}
!19 = !{i32 3, i32 15}

What does this even mean? These nodes encode the entry point(s), which resources exist, the stage IO variables, and everything like that. It is at least documented, just awkward to read.

!dx.entryPoints = !{!11}
!11 = !{
   void ()* @main, /* Pointer to function */
   !"main", /* Name of entry point. Not relevant before DXR. */
   !12 /* List of stage IO types */,
   !4 /* Resource lists */,
   null}

!12 = !{
   !13, /* Stage input */
   !17, /* Stage output */
   null /* Patch IO for tessellation */}

/* Encode things like semantic, register offsets, component offset, etc ... */
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}

!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}

!4 = !{
   !5, /* SRVs */
   null, null, /* UAV, CBV */
   !8 /* Samplers */}

/* Encode register space, register #t, component types, etc, etc ... */
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}

The instructions which interact with resources work very different than SPIR-V. In SPIR-V, resources are represented as plain variables, but it is a bit more indirect in DXIL. For example:

%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)

Here we specify:

  • Refer to stage input by index (directly refer to metadata)
  • Refer to row (if the stage input is an array)
  • Refer to component (must be a compile time constant, handy!)

I kinda like this approach actually. It gives us a specific place to deal with stage IO translation, so we don’t have to analyze and track random memory instructions in LLVM, which would be kinda horrible. Especially builtin semantics can get pretty weird in translation and being able to handle all addressing logic in one go is a life saver. I think this is also done so that we avoid the problem of having to deal with vectors. (But again, as we shall see later, someone missed the memo when implementing DXR.)

Resources work similarly:

%1 = call %dx.types.Handle @dx.op.createHandle(i32 57,
   i8 0, /* SRV */
   i32 0, /* Index into metadata */
   i32 0, /* Array offset */
   i1 false) /* NonUniformResourceIndex? */

Again, we request a handle where we refer directly to the metadata. In this case, we request SRV with metadata index #0. This isn’t bad. (But again, as we shall see, there is a completely separate system for accessing resources in DXR … ._. Why …)

As a convenient feature, dxil-spirv can emit SPIR-V -> GLSL through SPIRV-Cross’s API, so we can see the resulting codegen, which in this case closely resembles a 1:1 instruction conversion.

#version 460

layout(set = 1, binding = 2) uniform texture2D _8;
layout(set = 3, binding = 4) uniform sampler _11;

layout(location = 0) in vec2 TEXCOORD;
layout(location = 0) out vec4 SV_Target;

void main()
{
  vec4 _31 = texture(sampler2D(_8, _11), vec2(TEXCOORD.x, TEXCOORD.y));
  // le sigh <_<
  SV_Target.x = _31.x;
  SV_Target.y = _31.y;
  SV_Target.z = _31.z;
  SV_Target.w = _31.w;
}

I’ve certainly seen much worse (DXBC cross compilation comes to mind … *shudder*), but this is not the true codegen that vkd3d-proton would use, since we need to transform this into fully bindless code. That’s for later.

Conclusion

This is a very basic view of how the format is put together. Next time, I’ll probably have a deeper look into the raw LLVM IR and how that is parsed. Better bring bleach for your eyes!

Compressed GPU texture formats – a review and compute shader decoders – part 3/3

This is a long overdue blog post. Last year, I began a blog series where I explored the weird and wonderful world of texture compression formats. Unfortunately, I didn’t have time and motivation to finish it at the time, but it’s time I get back to finish these posts.

Up until now, I’ve covered the major families: DXT (BC 1-5), ETC2 and BPTC (BC 6-7) in part 1 and part 2. Complexity has been steadily increasing, but now we’re finally going to tackle the end boss of texture compression, ASTC.

ASTC

ASTC, or Adaptive Scalable Texture Compression is the result of a fever dream which attempts to answer the question of “how complex can we make a texture compression format and still trick IHVs to make it into silicon”. Complexity in codec design generally translates to better quality / bit at the cost of more expensive decode/encode. As hardware designs evolve, we can afford more complex codecs. Finding a sweet spot is incredibly hard, and ASTC cranked it to 11, for better or worse.

It began life on the Mali Midgard GPU series way back in the day, but has since been widely adopted by the mobile GPU ecosystem by all relevant vendors in the Khronos sphere, at least the core LDR profile.

Profiles you say, in my texture formats? What is this, H.264? Well … we’ll get to that 🙂

Notably, it’s also supported on the Nintendo Switch (Tegra supports Android after all), causing all sorts of fun issues for emulators … There’s actually multiple independent compute shader decoders of ASTC out there in the wild!

ASTC (released in 2012) is still considered the state of the art, and I don’t think anyone else have made a serious attempt at introducing a new hardware accelerated texture compression format since. The state of the art in ASTC compression is still being developed with recent improvements from Pete Harris on astc-encoder, the official reference encoder, and that is to my knowledge the only encoder implementation of ASTC HDR/3D profiles.

Desktop IHVs other than Intel have so far refused to expose support for ASTC, which is annoying, since it seems we’re forever stuck with two different worlds of texture compression, BC variant zoo on desktop and ASTC on mobile.

The compute shader decoder

Last year, around the time I wrote blogs on BPTC, I also implemented an ASTC LDR/HDR decoder that produced bit-exact results compared to the reference implementation and real hardware in my validation suite. Compared to the BC6 and 7 shaders, the shader complexity is ridiculous, and somehow it works. It’s integrated in Granite such that I can load ASTC compressed scenes transparently, which is useful for testing in a “because I can!” sense.

The shader is certainly not optimized for speed. I focused on readability and debuggability, since it was hard enough to implement this as-is. It’s more than fast enough for my purposes at least.

Enter the horror here. Here be dragons.

One format to bind them all

ASTC has a goal of better quality / bit than every existing format at the same time. This is not easy, considering that BC went the route of many different specialized formats designed to solve specific texture compression needs and you would expect each format to excel at their specific use case. As far as I know, ASTC succeeds here, given a sufficiently sophisticated encoder.

To supplant every format, you’d need to support these features at the very least:

  • R (BC4, EAC), RG (BC5, EAC), RGB (BC1, ETC2), RGBA (BC3, ETC2, BC7)
  • 4bpp (BC1, ETC2 RGB, BC4), 8bpp (BC3, BC5, ETC2 RGBA, BC6, BC7), and even lower (2bpp PVRTC?)
  • LDR vs HDR (BC6)
  • Decent multi-partition support
  • Decorrelated channel support

ASTC goes beyond these requirements in an attempt to future-proof the format. Bit-rate is fine grained. From sub 1bpp, up to 8bpp in fine steps. 2D is boring, so ASTC also supports 3D blocks, although adoption of this is non-existing outside Arm as far as I know. For SDF rendering, 3D block compression seems useful in theory at least.

Codec concepts

The ASTC specification in the Khronos Data Format document is 35 pages (!). There’s a lot of ground to cover when implementing this format, but fortunately, the codec features are mostly orthogonal, i.e. the complexity is additive, not exponential, making it somewhat manageable.

Bit-rate scalability

The ASTC format like every other format we’ve seen up until now is block based, where each block consumes a fixed number of bits. ASTC chose 128 bits, but bit-rate scalability is achieved through different block sizes. No longer can we rely on nice 4×4 blocks :(. Instead, we have ridiculous things like 8×6 blocks or 5×5 blocks, all the way up to 12×12 blocks. Very little actually changes depending on block size, we just get more weights to decode.

In practice from what I have seen, encoders tend to focus on specific block sizes like 4×4, 6×6 and 8×8. In an ideal world, we’d be able to change the block size dynamically within a texture (think H.264 macro blocks being split into smaller blocks adaptively), but we need random access to stay sane.

Good old color endpoint + weight architecture

At a fundamental level, ASTC does not change how texture compression works. There’s still the concept of encoding color endpoints on a block level, and then per-texel weights are applied to interpolate between them. However, as we’ll see, a ton of complexity goes into efficiently encoding these color entry points and weights.

If you’re familiar with video encoding, there are some fun parallels to draw here, since H.264 and beyond have just piled on more and more complexity on top of the basic motion compensated DCT blocks since the 80s. Texture compression is very similar. Keep piling codec features on top of the same fundamental architecture forever, what could go wrong!

Bits are boring, trits and quints is where it’s at!

Up until now, color endpoints and weights have been encoded with bits, but especially for weights, it’s very hard to achieve fine grained bit rate deltas. What if we could spend a fractional number of bits instead? ASTC achieves this through trinary and quintary (is this a word?) numbers, encoded into a binary representation.

Every encoded weight or endpoint is given N bits as LSBs, and optionally one trit or quint as the MSBs.

The binary encoding works by grouping values together in blocks. For trits, we’re aiming to encode 5 values together into a single number, which is 3^5 = 243 combinations. If we encode this into 8 bits, we’re pretty close to the theoretically optimal encoding.

This is where we start running into sprinkles of incomprehensible code snippets in the spec which is a mix of C and Verilog designed to be fast in HDL, but unreadable nonsense in software. I have no idea what this code is trying to do, so to decode, I just build a LUT which converts 8 bits into 5 trits and call it a day. Quints are similar, where we encode 3 values together, which is 5^3 = 125 combinations, and fits snugly into 7 bits.

As an example, here’s what I ended up with in GLSL:

// Trit-decoding.
// quant.x = number of bits per value
int block = idiv5_floor(index); // Classic cute trick: (v * 0x3334) >> 16
int offset = index - block * 5;
start_bit += block * (5 * quant.x + 8);

int t0_t1_offset = start_bit + (quant.x * 1 + 0);
int t2_t3_offset = start_bit + (quant.x * 2 + 2);
int t4_offset = start_bit + (quant.x * 3 + 4);
int t5_t6_offset = start_bit + (quant.x * 4 + 5);
int t7_offset = start_bit + (quant.x * 5 + 7);

// ;__;
int t = (extract_bits(payload, t0_t1_offset, 2) << 0) |
    (extract_bits(payload, t2_t3_offset, 2) << 2) |
    (extract_bits(payload, t4_offset, 1) << 4) |
    (extract_bits(payload, t5_t6_offset, 2) << 5) |
    (extract_bits(payload, t7_offset, 1) << 7);

// LUT magic
t = int(texelFetch(LUTTritQuintDecode, t).x);
t = (t >> (3 * offset)) & 7;

int m_offset = offset * quant.x;
m_offset += idiv5_ceil(offset * 8); // (1) Explanation below

if (quant.x != 0)
{
    int m = extract_bits(payload, m_offset + start_bit, quant.x);
    ret = (t << quant.x) | m;
}

… and similar garbage code for quints.

Note for (1): The reason the T value is scattered around is a special feature where the bit stream can be terminated early if the block is only partially filled. Every trit value requires less than 8/5 bits to encode, so after a value is emitted, we know that we must have emitted ceil(8 * count / 5) bits to encode the trits in binary.

Complex weight un-quantization

For purposes of orthogonality, it’s generally desirable that one part of the decoding process does not affect other parts of the decoding process. Weights is one such case. The color interpolator shouldn’t have to care how many bits the weights were encoded with, and thus we decode weights to a fixed range. In our case, weights have a range of [0, 64].

This is very similar to BPTC. In BPTC, the codec defines un-quantization LUTs for 2bpp, 3bpp and 4bpp like this:

const int weight_table2[4] = int[](0, 21, 43, 64);
const int weight_table3[8] = int[](0, 9, 18, 27, 37, 46, 55, 64);
const int weight_table4[16] = int[](0, 4, 9, 13, 17, 21, 26, 30, 34, 38, 43, 47, 51, 55, 60, 64);

Why not just bit-replication you ask? Well, dividing by non-POT values after weight scale is a PITA, that’s why, and we cannot bit-replicate trits and quints either way. ASTC un-quantizes in a more roundabout way, but uses the same idea from BPTC where weight range is normalized to [0, 64].

Again, the un-quantization step is gnarly enough that I just made a LUT for that as well, because why not:

int decode_weight(uvec4 payload, int weight_index, ivec4 quant)
{
    int primary_weight = decode_integer_sequence(payload, 0,
        weight_index, quant.xyz);
    // quant.w is offset into unquant LUT.
    primary_weight = int(texelFetch(LUTWeightUnquantize,
        primary_weight + quant.w).x);
    return primary_weight;
}

Color endpoints

Color endpoints are weird in that there are multiple phases to it:

  • Decode N values of integer sequence; which N values to use depends on which partition the texel belongs to
  • Un-quantize N values to 8 bits in range [0, 0xff]
  • Interpret N 8-bit values in some magical way depending on the endpoint format (of which there are 16!)
  • Emit multiple RGBA values in UNORM8 (LDR) or UNORM12 (HDR)
void decode_endpoint(out ivec4 ep0, out ivec4 ep1, out int decode_mode,
    uvec4 payload, int bit_offset, ivec4 quant, int ep_mode,
    int base_endpoint_index, int num_endpoint_bits)
{
    num_endpoint_bits += bit_offset;
    // Need to explicitly terminate bitstream for color end points.
    payload &= build_bitmask(num_endpoint_bits);

    // Could of course use an array,
    // but that doesn't lower nicely to indexed registers on all GPUs.
    int v0, v1, v2, v3, v4, v5, v6, v7;

    // End point mode is designed so that the top 2 bits encode
    // how many value pairs are required.
#define DECODE_EP(i) \
    int(texelFetch(LUTEndpointUnquantize, quant.w + \
        decode_integer_sequence(payload, bit_offset, i + \
        base_endpoint_index, quant.xyz)).x)

    int hi_bits = ep_mode >> 2;
    v0 = DECODE_EP(0); v1 = DECODE_EP(1);
    if (hi_bits >= 1) { v2 = DECODE_EP(2); v3 = DECODE_EP(3); }
    if (hi_bits >= 2) { v4 = DECODE_EP(4); v4 = DECODE_EP(5); }
    if (hi_bits >= 3) { v6 = DECODE_EP(6); v7 = DECODE_EP(7); }

    switch (ep_mode) { ... }
}

The end point modes are defined as:

Every endpoint has code snippets explaining how to take the n values and turn them into RGBA values, which are to be fed to the weight interpolator stage. As expected, we just need a horrible switch statement which handles every case. :<

Each partition can even select its own encoding format, which is pretty wild. Poor encoder that has to consider all these scenarios …

LDR endpoints

The endpoint decoding process starts off with trivial cases like luminance endpoints.

Mode 0 (direct)

e0 = (v0, v0, v0, 0xFF);
e1 = (v1, v1, v1, 0xFF);

Since we have an explicit decode step here, ASTC also allows us to take advantage of redundancy between the endpoint values, e.g.:

Mode 1 (base + offset)

L0 = (v0 >> 2) | (v1 & 0xC0);
L1 = L0 + (v1 & 0x3F);
if (L1 > 0xFF) { L1 = 0xFF; }
e0 = (L0, L0, L0, 0xFF);
e1 = (L1, L1, L1, 0xFF);

This encoding scheme improves precision when L0 and L1 are sufficiently close, i.e. correlated. BC7 don’t really attempt to take advantage of correlation between end points, outside the minimal shared endpoint / subset bits. BC6 does however, through use of delta bits.

One question that immediately pops into my head is how this is supposed to work in practice for trits and quint encoding. Since bits are reinterpreted and shuffled around like this, any un-quantization that’s not just bit replication would probably create unexpected results in the MSBs.

Luminance + Alpha (Mode 4/5)

ASTC is a little awkward in how 2-component textures are encoded. The common case for 2-component textures is normal maps, where we normally encode it as RG, ala BC5 or EAC. ASTC only supports luminance alpha, so in order to efficiently encode these formats, we have to pre-swizzle the texture into (R, R, R, G), and apply a (R, W, x, x) swizzle in the VkImageView.

Similar to luminance-only, there is a direct mode, and a correlated mode.

RGB

Other highlights from the LDR formats are base + scale, where two endpoints are encoded RGB and RGB * A. Seems very nice for luminance gradients, and quite compact! We also have direct modes, with a special feature which takes some ideas from YUV. Blue contraction improves color accuracy close to gray. Of course, there’s a base + offset mode that we already saw for luma and luma + alpha formats.

RGBA

RGBA encoding is very similar to RGB, with two tacked on alpha values, nothing particularly interesting here.

HDR endpoint insanity

While LDR encoding modes are pretty straight forward once you stare at it long enough, HDR is just incomprehensible. Where BC6 was fairly naive w.r.t. endpoints, ASTC is the complete opposite.

Similar to BC6, HDR is implemented in a way where we interpret a floating point format as UNORM. This means we’re interpolating in a pseudo-logarithmic domain, i.e. the perceptual domain.

While BC6 basically directly interpolates in U/SNORM16 with a direct conversion to FP16, ASTC is a little more … particular about it.

When decoding HDR end points, UNORM12 values are generated. It starts innocent enough in mode 2 where 8-bit inputs are just shifted in place:

int y0, y1;
if (v1 >= v0)
{
    y0 = v0 << 4;
    y1 = v1 << 4;
}
else
{
   // Oh hai thar BPTC shared bits +
   // BC1 symmetry exploitation
   y0 = (v1 << 4) + 8;
   y1 = (v0 << 4) - 8;
}

ep0 = ivec4(ivec3(y0), 0x780 /* 1.0 */);
ep1 = ivec4(ivec3(y1), 0x780 /* 1.0 */);

but eventually you just have to give up trying to reason about the spec when this is the wall of code that greets you.

This is the point where you yell WTF out loud, cry a little inside, contemplate your life decisions and copy paste the spec.

Partition hell

Like BPTC, ASTC supports partitions. As we’ve seen before, this feature is designed to deal with sharp transitions in the block. BPTC made it pretty simple where it was possible to spend 6 bits to select one of the 64 partition layouts. This works fine for BPTC since it’s locked to 4×4 blocks and using LUTs makes sense.

ASTC does not have fixed block sizes, so what to do? One LUT for every possible combination? No. ASTC went the route of procedurally generated partition assignments using a 10 bit seed. This works for any block size and partition count, which “solves” that problem. Again, the spec has a long, incomprehensible code snippet defining the process.

Screw this. As usual, look-up texture it is. ASTC can support up to 4 partitions which is pretty wild. No idea how useful this is in practice, as we’ll probably end up spending all bits just encoding color endpoints at that rate …

This seems like a nightmare for encoders. Most of the resulting partitions are worthless garbage, and exhaustively testing over 1000 partition combinations per block does not seem very practical …

Mode hell

Speaking of many combinations, there’s a lot of different modes. BC7 has only 8 modes, and BC6H has 14. From what I’ve seen, even BPTC encoders just focus on 2 or 3 modes at most. ASTC has several thousand modes if we follow the same logic! 😀

Most of the mode bits are spent on encoding things like:

  • Quantization mode of weight grid
  • Resolution of weight grid [see below]
  • Decorrelated colors
  • Void-extent [see below]
  • Reserved [see below]

Once we have looked at mode bits, we can compute how many bits are required to encode weights, and based on that result, we also know how many bits are required to encode color endpoints. The color endpoint quantization level is thus implicit (so much fun …). There are also tons of edge cases we have to account for and handle error cases …

Weight grid interpolation

A special ASTC feature is that the weight grid isn’t actually 1:1 with texels like any other format we’ve seen so far. The weight grid can be much lower resolution than the block itself, and the real weights are reconstructed with bi-linear interpolation. This seems quite useful for scenarios where the block encodes a smooth gradient and the weights are highly predictable. Might as well spend more bits on encoding color endpoints instead. The spec outlines a specific fixed-point scheme that must be followed exactly, and finally the code snippets in the spec actually make some sense!

This feature is a must-have for larger block sizes. There just isn’t room for a full weight grid when we start to hit big boy block sizes like 8×8. At this low bit-rate, the only tool we have to encode high frequency features is partitions …

Void extent

Encoding constant high-precision colors can actually be kind of awkward in most texture compression formats with limited endpoint precision, but ASTC has a special mode for it, where exact FP16 values can be encoded if desired. The feature even goes so far as to specify that a region outside the block also has the same color, which theoretically allows a texture filtering unit to short circuit. This doesn’t sound very practical, and I’m not sure if this feature is actually used.

Painful error handling

Another “feature” of ASTC is how errors are handled. Unlike other texture compression formats, there are many encodings which are illegal, and a correct decoder must detect all error scenarios and return the proper error color. This error color is either saturated magenta or NaN (ææææææææ!) depending on the implementation and whether or not HDR is supported.

Inconsistent decode formats

ASTC decoders will either decode to UNORM8 or FP16, depending on the profile supported by the GPU. This is kinda annoying, especially when we consider formats like SRGB8 where apparently the RGB portion is supposed to decode to UNORM8 viewed as SRGB, but alpha is still FP16? (._.) Not even the reference decoder seems to get that right, so I just decode alpha as UNORM8 in that case.

Interpolation and final encode

Finally, a breather. Once color endpoints and weights are fully decoded, we enter interpolation. It’s not all trivial though. That would be boring.

The first step is always to expand the UNORM8 or UNORM12 results into UNORM16. This is context sensitive on things like the endpoint format as well as how we’re encoding the value after interpolation. This final encode is the raw data that is fed into the texture filtering unit.

if (decode_mode == MODE_HDR)
{
    ep0 <<= 4; // Simple expansion 12 -> 16-bit
    ep1 <<= 4;
}
else if (decode_mode == MODE_HDR_LDR_ALPHA)
{
    ep0.rgb <<= 4;
    ep1.rgb <<= 4;
    ep0.a *= 0x101; // Bit replicate UNORM8 to UNORM16,
    ep1.a *= 0x101; // for FP16 conv later
}
else if (DECODE_8BIT)
{
    ep0 = (ep0 << 8) | ivec4(0x80); // Treat the data as 8.8 fixed point.
    ep1 = (ep1 << 8) | ivec4(0x80); // Also bake in 0.5 rounding factor.
}
else
{
    ep0 *= 0x101;
    ep1 *= 0x101;
}

// This is why weights have [0, 64] range and not [0, 63].
ivec4 color = (ep0 * (64 - weight) + ep1 * weight + 32) >> 6;

Now we have a complete UNORM16 color value that needs to be encoded.

Encode to UNORM8 / SRGB8

For 8-bit decode, we make use of the 8.8 fixed point expansion, and just shift the result down. Easy!

 imageStore(OutputImage, coord.xy, uvec4(final_color >> 8));

Encode LDR to FP16

This one is annoying. When interpolating, we did bit replication, and we need to convert the UNORM16 value to FP16. It’s not as simple as just dividing by 0xffff, of course. We explicitly need to treat 0xffff as 1.0, and for other values, we divide by 2^16 with round-to-zero semantics. We have to do this by hand of course. Soft-float is so much fun! <_<

// ASTC has a very peculiar way of converting the decoded result to FP16.
// 0xffff -> 1.0, and for everything else we get
// roundDownQuantizeFP16(vec4(c) / vec4(0x10000)).
ivec4 msb = findMSB(color);
ivec4 shamt = msb;
ivec4 m = ((color << 10) >> shamt) & 0x3ff;
ivec4 e = msb - 1;
uvec4 decoded = m | (e << 10);
uvec4 denorm_decode = color << 8;
decoded = mix(decoded, uvec4(denorm_decode), lessThan(e, ivec4(1)));
decoded = mix(decoded, uvec4(0x3c00), equal(color, ivec4(0xffff)));
return decoded;

Encode HDR to FP16

The intention is that the interpolation is logarithmic, but floats are just piecewise logarithmic. BC6 doesn’t care at all and just reinterprets the resulting UNORM16 interpolation as FP16, but ASTC tries to be a little smarter. The mantissa is tweaked a bit.

// Interpret the value as FP16, but with some extra fixups along the way to make the interpolation more
// logarithmic (apparently). From spec:
ivec4 e = color >> 11;
ivec4 m = color & 0x7ff;
ivec4 mt = 4 * m - 512;
mt = mix(mt, ivec4(3 * m), lessThan(m, ivec4(512)));
mt = mix(mt, ivec4(5 * m - 2048), greaterThanEqual(m, ivec4(1536)));

From what I can tell, there is no signed float support in ASTC, except for the void extent.

Decode format extensions

In Vulkan, there is a weird extension called VK_EXT_astc_decode_mode which allows a VkImageView to select which format an ASTC block should decode to. Why is this even a thing? One quirk of ASTC is that even UNORM blocks full of LDR data must be decoded to bit-exact FP16. This means that if a GPU architecture decodes compressed textures into L1 cache, we would end up consuming a lot more cache than strictly necessary. The whole point of the decode mode extension is to allow us to explicitly decode to a lower precision format to improve cache hit rates, presumably. (That makes me wonder what GPUs actually do in practice …)

In Granite, I setup the decode mode as:

// ...
if (format_is_srgb(create_info.format))
   return true;

if (format_is_compressed_hdr(create_info.format))
{
   auto &features = device->get_device_features();
   if (features.astc_decode_features.decodeModeSharedExponent)
      astc_info.decodeMode = VK_FORMAT_E5B9G9R9_UFLOAT_PACK32; // Nice
   else
      astc_info.decodeMode = VK_FORMAT_R16G16B16A16_SFLOAT;
}
else
{
   astc_info.decodeMode = VK_FORMAT_R8G8B8A8_UNORM;
}
// ...

Fortunately, there are now special SFLOAT format variants for ASTC in the VkFormat enum, which are tied to the ASTC HDR extension, and can be used to notify applications if the blocks likely contain HDR blocks. The awkward part of ASTC is that even when using the UNORM format, you can still decode HDR blocks, and it just happens to work on GPUs which support ASTC HDR. Profiles are fun, right? ._. With decode mode, we can at least clamp down on these shenanigans. HDR blocks fail to decode in UNORM8 mode.

Conclusion

So far, this is the endgame of texture compression, and I really don’t want to look at this again. 🙂 It’s been a year since I looked into this last time, so I might have missed some details.

There’s tons of smart ideas in this format, but I also feel it’s too clever for its own good. Writing a decoder was difficult enough, and I don’t even want to think about how painful an encoder would be. A format is only as good as its encoding ecosystem after all.

Compressed GPU texture formats – a review and compute shader decoders – part 2

This is the second part of the blog series I started in part 1. We have covered the S3TC, RGTC and ETC family of formats. This served as a good introduction to the topic of texture compression, but from here, the complexity will explode.

BPTC

This post will be dedicated to the BPTC compressed formats. These formats represent BC6 and BC7 formats in Vulkan, and is the state-of-the-art in texture compression on desktop GPUs, completing RGB and RGBA compression. BC1 and BC3 were the only proper desktop alternatives before these formats came along, and the quality of BC1 and BC3 aren’t … great.

BPTC also adds support for HDR. This is a big deal as before BPTC it was not possible to properly compress HDR data.

I pondered a bit over which format I should present first, but I think it’s appropriate to start with BC6 since it is much simpler than BC7 in terms of overall complexity. I feel like when BC6 was designed it sacrificed complexity from BC7 in order to add HDR.

Common ideas

Partitions

As we saw in ETC2, there was a very crude and early attempt to add partitions into the block format. The T and H formats in particular let you specify two different color endpoint pairs which could be selected at will using 1 bit per texel. Rather than letting the partition be dictated by the 2×4 / 4×2 sub-block, you could select any partition scheme.

Specifying an entire bit per texel to select a partition is quite overkill however, especially for a format which just gives you 64 bits for color. This makes the T and H modes crude hacks on top of ETC1. BPTC instead adds the concept of a pre-made table of partition shapes, and support for more than 2 partitions. Rather than specifying the partition index for each texel, we specify say, 64 common shapes, and just spend up to 6 bits per block (6 / 16 bits per texel) to encode the information. Of course, this idea falls flat if the exact partition we need doesn’t exist, but BPTC’s partitions seem well thought out. This is just a pain to implement since it means copy-pasting over a lot of tables ._.

Variable bit depth for endpoints and weights

Earlier formats were extremely rigid in how the blocks were to be encoded. You get X number of bits to specify endpoints and 2 bits per texel for the weights, a neat 32/32 bit split between the control block and texel data. Overall, this is a good setup for 64 bit block formats like BC1 and ETC, but it falls a bit short for 128 bit block formats like BPTC. Now we have more freedom and headroom to either spend lots of bits on endpoint accuracy, more weight bits per texel, or perhaps more partitions — which requires a lot more bits to encode multiple endpoints. Essentially, everything becomes a trade-off. More partitions can deal better with uncorrelated texels, more endpoint precision can deal with smooth gradients and more weight bits improves general PSNR when dynamic range in the block is large.

How all of this is configured is done through …

Mode hell

As we also saw with ETC2, there were multiple “modes” a block could enter into, depending on the encoded bits. We have the “differential”, “individual”, “T”, “H” and “Planar” modes. This already adds a lot of complexity to an encoder — which needs to figure out which mode is best through heuristics and trial and error — and decoder — which has to implement everything.

BPTC makes the mode idea more explicit and a certain number of bits are reserved to express the “mode” a block will be in.

Exploiting endpoint symmetry

While earlier formats used endpoint symmetry as a way to enable different modes (e.g. BC1), BPTC has already encoded the mode explicitly. Instead, we exploit symmetry to save one weight bit. We can simply assume that the MSB weight bit of the first texel is 0. We can always flip the order of the endpoints to make this work. Actually, we can save one bit per partition, since one texel per partition can be assumed to have MSB weight bit of 0. Which texels to treat specially is all done through another look-up table (sigh …) defined by the specification. Unfortunately, this means bitfield extraction from the 128-bit payload becomes very awkward and irregular … For this purpose I had to make a bitfield helper function. I doubt it’s very efficient, but you gotta do what you gotta do …

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bitextract.h

Normalized weight un-quantization

Up until now there hasn’t been any good way of interpolating between endpoints in any format. Interpolation in S3TC and RGTC involves a non-POT divider, which is awkward to make bit-exact, and ETC technically does not interpolate between endpoints, as there is an offset table (which is basically the same thing in practice, but it doesn’t concern itself with a divider). While ETC has bit-exact decoding, S3TC and RGTC do not, as the interpolation is defined in terms of floating point.

With the number of weight bits being variable in BPTC, it makes sense to normalize the weights to a range which is easier to work with, and can easily support bit-exact decoding. Values in the range of [0, 64] were chosen. BPTC can use either 2, 3 or 4-bit weights, and the specification defines a table which normalizes the weights onto the [0, 64] range. Interpolation can then be easily done in fixed point as:

interpolated = (a * (64 - weight) + b * weight + 32) >> 6;

This avoids the non-POT divider headache we’ve seen earlier. I tried finding a neat arithmetic expression to re-normalize the weights without a LUT, but I gave up pretty quickly.

No planar mode?

One thing I found rather puzzling was the lack of a planar mode in BPTC. ETC2 has it and seems kinda useful …

BC6 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc6.comp

BC6 is laser focused on compressing HDR data (FP16). There is only RGB support, no RGBA at all.

Representing floating point without floating point

In the other formats we have looked at, we have used normalized integers as a way to represent endpoints and colors. Interpolation of integer endpoints is very easy since it’s just fixed point math. When we start introducing floating point into the mix, there is a question of how we represent this efficiently. As we have seen, we cannot spend too many bits on representing endpoints, even with 8-bit color. With FP16, representing endpoints in full 16-bits per channel is extremely wasteful. As with 8-bit color, we need to have a simple and efficient solution for quantization of FP16 values with an arbitrary amount of bits.

A hypothetical solution for FP16 could be achieved by storing the exponent (5 bits) and just quantize the mantissa accordingly. BC6 isn’t far from this idea, although, it is much simpler than that. BC6 exploits the fact that floating point representations of finite numbers is monotonic for positive numbers.

For example, if we consider that the internal representation of endpoints is 16-bit unsigned, we interpolate endpoints as 16-bit integers, and perform a scaling operation and bitcast to FP16:

fp16_bits = (interpolated_value * 31) / 64;
fp16_texel = bitcast<fp16>(fp16_bits);

What this effectively does is to map 0 to 0.0 in FP16 and 0xffff to 0x7bff, which is 65504.0 and the largest finite representable value in FP16 (0x7c00 is +Inf). Everything in-between is monotonically increasing. The interpolation itself ends up being non-linear (closer to logarithmic), but for HDR data this should be fine. Linear light values are perceptually logarithmic anyways, so it might actually make more sense to interpolate in a logarithmic space rather than linear here, a very nice hack indeed! Of course, this is just a crude approximation as the FP16 representation is only piece-wise logarithmic.

For the signed format (BC6H_SFLOAT), we do a very similar fix-up step, but it is a bit more involved since we need to take care of the sign bit. The interpolated value is assumed to be a signed integer in [-0x8000, 0x7fff] range.

signed = interpolated_value < 0;
fp16_bits = (abs(interpolated_value) * 31) / 32;
fp16_bits |= signed ? 0x8000 : 0;
fp16_texel = bitcast<fp16>(fp16_bits);

A recognized bug in the specification (or feature depending on the situation) is that -0x8000 will be translated to -Inf (0x7c00 | sign_bit). It is not possible to represent +Inf.

Transformed endpoints

A thing BC6 can do is to assume correlation between two endpoints. This is similar to the “differential” mode in ETC2, but BC6’s variant of it is far more flexible. Rather than giving us X number of bits per component, we encode the first endpoint with more bits, and then the other endpoint as an offset from the first, with fewer bits. This combines very well with multiple partitions, since the other partition’s endpoints are also encoded as differentials from the base endpoint.

Mode “hell”

BC6 has a lot of modes to choose from, 14 to be specific. However, that is only at first glance. I like to separate these into 2 major types:

  • 2 partition modes
  • 1 partition modes

After selecting how many partitions you have, most of the modes just let you specify how many bits are spent on the base endpoint color, and how many bits are spent for each transformed endpoint (delta bits). Essentially, modes 0, 1, 2, 6, 10, 14, 18, 22, 26 are all the same, with different bit-allocation schemes. Mode 30 appears to be slightly different on first glance since it does not have transformed endpoints, but that’s just because delta bits == endpoint bits at this point, so it’s meaningless to transform the endpoints. The story is the same for the modes where number of partitions is 1.

Based on the mode, we either get 2 bits per texel of weights or 3 bits.

The major issue in decoding BC6 is that the bit layout for each mode is completely nonsensical and irregular. The bits are packed in seemingly random places, so unfortunately the decoder ends up with:

if ((mode & 2) == 0)
{
    if ((mode & 1) != 0)
        interp = decode_bc6_mode1(payload, linear_pixel, part, anchor_pixel);
    else
        interp = decode_bc6_mode0(payload, linear_pixel, part, anchor_pixel);
}
else
{
    switch (mode)
    {
    case 2:
        interp = decode_bc6_mode2(payload, linear_pixel, part, anchor_pixel);
        break;
    case 3:
        interp = decode_bc6_mode3(payload, linear_pixel);
        break;
    case 6:
        interp = decode_bc6_mode6(payload, linear_pixel, part, anchor_pixel);
        break;
    case 7:
        interp = decode_bc6_mode7(payload, linear_pixel);
        break;
    case 10:
        interp = decode_bc6_mode10(payload, linear_pixel, part, anchor_pixel);
        break;
    case 11:
        interp = decode_bc6_mode11(payload, linear_pixel);
        break;
    case 14:
        interp = decode_bc6_mode14(payload, linear_pixel, part, anchor_pixel);
        break;
    case 15:
        interp = decode_bc6_mode15(payload, linear_pixel);
        break;
    case 18:
        interp = decode_bc6_mode18(payload, linear_pixel, part, anchor_pixel);
        break;
    case 22:
        interp = decode_bc6_mode22(payload, linear_pixel, part, anchor_pixel);
        break;
    case 26:
        interp = decode_bc6_mode26(payload, linear_pixel, part, anchor_pixel);
        break;
    case 30:
        interp = decode_bc6_mode30(payload, linear_pixel, part, anchor_pixel);
        break;
    default:
        interp = DecodedInterpolation(ivec3(0), ivec3(0), 0);
        break;
    }
}

Not pretty 🙁 Most of the code in bc6.comp is used to decode all the weird variants.

Endpoint quantization

Once we have the endpoints, we un-quantize them to full 16-bit integer range by a simple shift. At this point, we can interpolate the endpoints using the normalized weights, and apply the BC6 fixup to turn it into a final FP16 value.

ivec3 unquantize_endpoint(ivec3 ep, int bits)
{
    ivec3 unq;
    // Specialization constant
    if (SIGNED)
    {
        // Sign-extend
        ep = bitfieldExtract(ep, 0, bits);
        if (bits < 16)
        {
            ivec3 sgn = 1 - ((ep >> 30) & 2); // 1 or -1
            ivec3 abs_ep = abs(ep);
            unq = ((abs_ep << 15) + 0x4000) >> (bits - 1);
            // Special cases. Boolean mix FTW.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0x7fff), greaterThanEqual(abs_ep, ivec3((1 << (bits - 1)) - 1)));
            unq *= sgn;
        }
        else
            unq = ep;
    }
    else
    {
        // Zero-extend
        ep = ivec3(bitfieldExtract(uvec3(ep), 0, bits));
        if (bits < 15)
        {
            unq = ((ep << 15) + 0x4000) >> (bits - 1);
            // Special-cases.
            unq = mix(unq, ivec3(0), equal(ep, ivec3(0)));
            unq = mix(unq, ivec3(0xffff), equal(ep, ivec3((1 << bits) - 1)));
        }
        else
            unq = ep;
    }
    return unq;
}

Summary

BC6 introduces a lot of new concepts that BC7 will make use of as well. The main complication is all the different modes and the awkward bit packing layouts. After extracting the quantized endpoints, the process from there is quite straight forward.

BC7 – 4×4 – 128 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/bc7.comp

BC7 is similar to BC6 in many ways, and uses many of the same ideas.

8-bit endpoints

Just like BC6 endpoints are un-quantized to 16 bits before interpolation, BC7 un-quantizes to 8-bit integers. This is done through the typical “bit-replication” algorithm that we often see when converting RGB444 and RGB565. The weight interpolation is done the exact same way as BC6 with a [0, 64] range of weights.

Support for 3 partitions

BC6 only supports 2 partitions, but BC7 can support 3. There are separate tables for 3 partition modes compared to 2 partition modes. (Fun …)

Flexible alpha channel

Unlike BC6, BC7 fully supports alpha, and finally we have a way to encode RGB and A together. This is a big improvement over the older formats like S3TC and ETC2 which encode alpha completely separately in different block formats. When we encode RGB and A together, we can allocate bits between them as we please, e.g.:

  • No alpha. If alpha is constant 1.0 inside a block, we can use all 128 bits for RGB.
  • Correlated RGB and A. If these components are correlated, we can encode endpoints as RGBA with one weight per texel. This saves a lot of weight bits.
  • Uncorrelated RGB and A. This is the case essentially assumed by BC2, BC3 and ETC2. However, BC7 will let us tune a little bit how many bits we spend on color and how many bits we spend on alpha.

Uncorrelated color channel support

Since we can encode RGB and A as uncorrelated endpoints, what if we could freely select which component is uncorrelated? BC7 can do this, and this is expressed through the rotation bits. After decoding, we can swap A out with any other component. That way we can technically encode (R, GB), (RB, G) or (GB, R) for example. This seems rather niche, but it’s there … I suspect the designers happened to have 2 extra bits lying around they could use for this purpose.

No transformed endpoints

In a somewhat curious design choice, there is no support for transformed (correlated) endpoints in BC7. I suspect that for this reason, the number of modes could be kept somewhat sensible. I also suspect the need for transformed endpoints isn’t as great since we’re not working with HDR values anymore.

Modes

There are 8 modes in BC7, which all seem to be carefully chosen based on their use cases. Each mode makes some choices like:

  • How many partitions? (1 – 3)
  • Is there alpha? (Mode [0, 3] don’t have, [4, 7] have).
  • How many bits per endpoint component? (Color and alpha have separate bit depth)
  • How many bits per index?
  • Is alpha uncorrelated or correlated? (Mode 4 and 5 have two weight indices)
  • Which component is uncorrelated? (For mode 4 and 5)

This is all tabulated by the specification, and from there the algorithms are similar for all the modes. Other than this, there might be some leftover bits, and I believe the BC7 designers just invented some use for them. In some modes, the bits can be used as a shared LSB of the endpoints for example, or it can be used to select how many weight bits color and alpha channels receive respectively in mode 4.

Summary

BPTC is a fairly advanced texture compression format. BC6 and BC7 share many ideas as expected and introduces a large, but not unmanageable configuration space. From an implementation point-of-view though, the large reliance on look-up tables is annoying, but understandable.

BPTC is focused on perfecting color texture encoding and leaves the other kind of encoding to RGTC, which I think does a great job with 1 and 2 channel textures already.

Up next?

In the third (probably not final) part we tackle the monster that is ASTC. Complexity will jump another 10x.

Compressed GPU texture formats – a review and compute shader decoders – part 1

Compressed texture formats is one of the esoteric aspects of graphics programming almost no one cares all that much about. Neither did I, however, I’ve recently taken an academic interest in the zoo of compressed texture formats.

During development in Granite, I occasionally find it useful to test scenes which target mobile on desktop and vice versa, and in Vulkan, where there are no fallback paths for unsupported compression formats, we gotta roll our own decompression.

While it really isn’t all that useful to write a decoder for these formats, my goal is to create a suite of reasonably understandable compute shader kernels which can decode all of the standard formats I care about. Of course, I could just use a Frankenstein decoder which merges together a lot of C reference decoders and call it a day, but that’s not aesthetically pleasing or interesting to me. By implementing these formats straight from the Khronos Data Format specification, I learned a lot of things I would not otherwise know about these formats.

There are several major families of formats we can consider multi-vendor and standardized. Each of them fill their own niche. Unfortunately, desktop and mobile each have their own timelines with different texture compression standards, which is not fully resolved to this day in GPU hardware. (Basis Universal is something I will need to study eventually as well as it aims to solve this problem in software.)

By implementing all these formats, I got to see the evolution of block compression formats, see the major differences and design decisions that went into each format.

The major format families

First, it is useful to summarize all the families of texture compression I’ve looked at.

S3TC / DXT

The simplest family of formats. These formats are also known as the “BC” formats in Vulkan, or rather, BC 1, 2 and 3. This is the granddad of texture compression, similar to how I view MPEG1 in the video compression world.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs, probably for historical patent reasons.

RGTC

A very close relative of S3TC. These formats are very simple formats which specialize in encoding 1 and 2 uncorrelated channels, perfect for normal maps, metallic-roughness maps, etc. It is somewhat questionable to call these a separate family of formats (the Data Format specification separates them), since the basic format is basically exactly equal to the alpha format of S3TC, except that it extends the format to also support SNORM (-1, 1 range) alongside UNORM. These formats represent BC4 and BC5 in Vulkan.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs.

ETC

The ETC family of formats is very similarly laid out to S3TC in how different texture types are supported, but the implementation detail is quite different (and ETC2 is quite the interesting format). To support encoding full depth alpha and 1/2-component textures, there is the EAC format, which mirrors the RGTC formats.

These formats are firmly rooted in mobile GPUs. ETC1 was originally the only mandated format for OpenGLES 2.0 implementations, and ETC2 was mandated for OpenGLES 3.0 GPUs. It has almost no support on desktop GPUs. Intel iGPU is an exception here.

BPTC

This is where complexity starts to explode and where things get interesting. BC6 and BC7 are designed to compress high quality color images at 8bpp. BC6 adds support for HDR, which is to this day, one of only two ways to compress HDR images.

On desktop, BPTC is the state of the art in texture compression and was introduced around 2010.

ASTC

ASTC is the final boss of texture compression, and is the current state of the art in texture compression. Its complexity is staggering and aims to dominate the world with 128 bits. Mere mortals are not supposed to understand this format.

ASTC’s roots are on Mali GPUs, but it was always a Khronos standard, and is widely supported now on mobile Vulkan implementation (and Intel iGPU :3), at least the LDR profile. What you say, profiles in a texture compression format? Yes … yes, this is just the beginning of the madness that is ASTC.

PVRTC?

PVRTC is a PowerVR-exclusive format that has had some staying power due to iOS and I will likely ignore it in this series. However, it seems like a very different kind of format to all the others and studying it might be interesting. However, there is zero reason to use this format in Granite, and I don’t want to chew over too much.

What is a texture compression format anyway?

In a texture compression format, the specification describes a process for taking random bits given to it, and how to decode the bit-soup into texels. There are fundamental constraints in texture compression which is unique to this problem domain, and these restrictions heavily influence the design of the formats in question.

Fixed block size

To be able to randomly access any texel in a texture, there must be an O(1) mapping from texture coordinate to memory address. The only reasonable way to do this is to have a fixed block size. In all formats, 4×4 is the most common one. (As you can guess, ASTC can do odd-ball block sizes like 6×5).

Similarly, for reasons of random access, the number of bits spent per block must be constant. The typical block sizes are 64-bits and 128-bits, which is 4bpp and 8bpp respectively at 4×4 block size.

Image and video compression has none of these restrictions. That is a major reason why image and video compression is so much more efficient.

A set of coding tools

Each format has certain things it can do. The more complex the operations the format can do, the more expensive the decoding hardware becomes (and complex a software decoder becomes), so there’s always a challenge to balance complexity with quality per bit when standardizing a format. The most typical way to add coding tools is to be able to select between different modes of operation based on the content of the block, where each mode is suited to certain patterns of input. Use the right tool for the job! As we will see in this study, the number of coding tools will increase exponentially, and it starts to become impossible to make good use of all the tools given to you by the format.

Encoding becomes an optimization task where we aim to figure out the best coding tools to use among the ones given to us. In simpler formats, there are very few things to try, and approaching the optimal solution becomes straight forward, but as we get into the more esoteric formats, the real challenge is to prune dead ends early, since brute forcing our way through a near-infinite configuration space is not practical (but maybe it is with GPU encode? :3)

Commonalities across formats

Image compression and video compression uses the Discrete Cosine Transform (DCT) even to this day. This fundamental compression technique has been with us since the 80s and refuses to die. All the new compression formats just keep piling on complexity on top of more complexity, but in the center of it all, we find the DCT, or some equivalent of it.

Very similarly, texture compression achieves its compression through interpolation between two color values. Somehow, the formats all let us specify two endpoints which are constant over the 4×4 block and interpolation weights, which vary per pixel. Most of the innovation in the formats all comes down to how complicated and esoteric we can make the process of generating the endpoints and weights.

The weight values are typically expressed with very few bits of precision per texel (usually 2 or 3), and this is the main way we will keep bits spent per pixel down. This snippet is the core coding tool in all the formats I have studied:

decoded_texel = mix(endpoint0, endpoint1, weight_between_0_and_1);

To correlate, or not to correlate?

The endpoint model blends all components in lock-step. Typically the endpoint will be an RGB value. We call this correlated, because this interpolation will only work well if chrominance remains fairly constant with luminance being the only component which varies significantly. In uncorrelated input, say, RGB with an alpha mask, many formats let us express decorrelated inputs with two sets of weights.

decoded_rgb = mix(endpoint0_rgb, endpoint1_rgb, rgb_weight);
decoded_alpha = mix(endpoint0_alpha, endpoint1_alpha, alpha_weight);

This costs a lot more bits to encode since alpha_weight is very different from rgb_weight, but it should be worth it.

Many formats let us express if there is correlation or not. Correlation should always be exploited.

Working around the horrible endpoint interpolation artifacts

Almost all formats beyond the most trivial ones try really hard to come up with ways to work around the fact that endpoint interpolation leads to horrible results in all but the simplest input. The most common approach here is to split the block into partitions, where each partition has its own endpoints.

S3TC – The basics

A compute shader decoder:

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/s3tc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC1 – 4×4 – 64 bits

The BC1 format is extremely simple and a good starting point. 32 bits is used to encode two RGB endpoints in RGB565 format. The other 32 bits encode 16 weights, with 2 bits allocated to each texel.

This lets us represent interpolation weights of 0, 1/3, 2/3 and 1.

Since there is a symmetry in this design, i.e.:

mix(a, b, l) == mix(b, a, 1.0 - l)

there would be two ways to specify the same block, where we swap endpoints and invert the weights to compensate. This is an extra bit of information we can exploit. Based on the integer representation of the two endpoints, we can check if one of greater than the other, and use a different decoding mode based on that information. This exploitation of symmetry will pop up again in many formats later! In the secondary mode, we add support for 1-bit alpha, called a punch-through in most formats. In this mode, the interpolation weights become 0, 1/2, 1 and BLACK. This lets us represent fully transparent pixels. However, color becomes BLACK, so this will only work with pre-multiplied alpha schemes, otherwise there will be black rings around textures. I don’t think this mode is used all that much these days, but it is an option.

That is it for this format, it really is that simple.

One thing to note is that the specification is defined in terms of floating point with under-specified requirements for precision, and thus there is no bit-exact representation of the decoded values. Almost all hardware decoders of this format will give slightly different results, which is unfortunate. MPEG1 and MPEG2 also made the same mistake back in the day, where the DCT is specified in terms of floating point.

BC2 – 4×4 – 128 bits

BC2 is a format which adds alpha support by splicing together two blocks. A BC1 block describes color, and a second block adds an alpha plane with 4-bit UNORM. This format is quite obscure since the next format, BC3, generally does a much better job at compressing alpha. A curious side effect of BC2 and 3 is that the punch-through mode in the BC1 block no longer exists, i.e. the symmetry exists, so we lose 1 bit of information. I wonder why that information bit was not used to toggle between BC2 alpha decode (noisy alpha) and BC3 alpha decode (smooth alpha) …

BC3 – 4×4 – 128 bits

The alpha encoding of BC3 is very similar to how BC1 works. It also forms the basis of RGTC. The 64 bits of the alpha block spends 16 bits to encode 2 8-bit endpoints for alpha. We now get 3 bits as interpolation weights instead of 2 since there’s 48 bits left for this purpose. Similar to BC1, we can exploit symmetry and get two different modes based on the endpoints. In the first mode, the interpolation weights are as expected, [0, 1, 2, …, 7] / 7. In the second mode, we have 6 interpolated values [0, 1, 2, …, 5] / 5, and two values are reserved to represent 0.0 and 1.0. This can be very useful. This is essentially a very early attempt to introduce partitions into the mix, as we can essentially split up a block into 3 partitions on demand: (Fully opaque texels, fully transparent texels, the in-betweens). This can let us specify a tighter range for the endpoints as there is never a need to use a full [0, 0xff] endpoint range.

Summary

The S3TC formats are very simple, but there are certainly things to note about them. Alpha support is just bolted on, it is not integrated into the format. This means that even though the block is 128 bits, there is no way to spend more than 64 bits on color, even if the alpha plane has a completely flat value.

RGTC

RGTC (red-green) is basically BC3’s alpha block format turned into its own thing. Their main use is with non-color textures, e.g. normal maps, metallic-roughness maps, luminance maps, etc. It is very simple, and quality is quite good.

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC4 – 4×4 – 64 bits

This is BC3’s alpha block format as its own format, which returns one component. The only real difference from BC3 alpha is that it also supports an SNORM variant, which is very useful for normal maps, although I only bothered with UNORM, since my shaders need to assume input can be from any format.

BC5 – 4×4 – 128 bits

RGTC assumes uncorrelated channels, and thus the only sensible choice was to just slap together two BC4 blocks side-by-side, and voila, we can encode 2 channels instead of 1.

Summary

RGTC is simple and nice. It only needs to consider single channels of data, and writing encoders for it is very easy, and it is probably the simplest format out there. For what they do, I really like these formats.

Like S3TC, there is no bit-exact decoding, which is rather unfortunate.

ETC – Refining S3TC

ETC, or Ericsson Texture Compression is a family with multiple generations. ETC2 is backwards compatible with ETC1 in that all valid ETC1 blocks will decode the same way in ETC2, but ETC2 exploits some undefined behavior of ETC1 to extend the format into something more interesting.

ETC has a bit-exact decode, which makes verification very easy. 😀

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/etc2.comp

ETC1 – 4×4 – 64 bits

In many ways, ETC1 is quite similar to BC1, but there are some key differences. Just like BC1, 32 bits are spent to encode endpoints, and 32 bits are spent to give 16 texels 2 bits of weight information each. The main difference between ETC1 and BC1 is how endpoints work.

Sub-blocks

As a very crude form of partitioning, ETC1 allows you to split a block into either 2×4 or 4×2 sub-blocks, where each sub-block has its own endpoints. To do this, endpoints are expressed in a more compact way than BC1. Rather than specifying two RGB values, ETC in general likes to express endpoints as RGB +/- delta-intensity, where delta-intensity is described by a table. This makes things far more compact since we enforce constant chrominance. By saving so many bits, we can express 4 endpoints in total, 2 for each sub-block.

Uncorrelated or correlated endpoints?

Since we have to specify two sets of endpoints, the format gives us a way to specify if the two endpoints have completely different colors, or if the endpoints should be specified in base + offset form. This is controlled with a single bit, which changes the encoding from ep0 = RGB444, ep1 = RGB444 to ep0 = RGB555, ep1 = ep0 + sign_extend(RGB333). These values are not allowed to overflow in any way, which is something ETC2 exploits to great effect later.

NOTE: I found it more instructional to call it uncorrelated and correlated endpoint modes, but the specification calls it “individual” and “differential” modes.

Summary

ETC1 is somewhat different than BC1, but overall, it’s quite similar. It turns out, if you add a few restrictions on top of ETC1, you get ETC1S, which can trivially be transcoded to BC1. Basically, enforce the correlated endpoint mode, set the RGB333 bits to 0, enforce that delta-luma is the same for both sub-blocks.

There is also no way to express alpha with ETC1, which is unfortunate, but ETC1 is completely obsolete for my use cases either way.

ETC2 RGB – 4×4 – 64 bits

As mentioned earlier, ETC1 is a sub-set of ETC2. ETC2 adds a bunch of new and curious modes which gives us a small glimpse into more flexible ways to express endpoints, and even adds a mode which I have never seen in any other formats ever since.

Exploiting undefined behavior

When you select the correlated (differential) endpoint mode, there were some restrictions on overflow. We can exploit this fact in order to gain 3 new modes of operation for the ETC2 color codec!

First, we check if R + sign_extend(dR) is outside [0, 31] range. If so, we activate the so-called “T” mode. In this mode, we essentially add a partitioning scheme to the codec. We now remove the concept of two sub-blocks and let all texels access all available endpoints. We encode two RGB444 values (A, B), and a delta value (d). We form a T-shape by specifying 4 possible color values as A, B, B + d, B – d. This can be useful if the block is smooth, except for some weird outliers. A would be the outlier color, and B represents the middle of the smooth colors.

If G + sign_extend(dG) overflows, we enter a very similar “H” mode. In this mode, we do the exact same thing, except that the 4 possible colors become A + d, A – d, B + d, B – d.

If B + sign_extend(dB) overflows, we enter a very interesting mode, which I have never seen again in future formats. I’m not sure why, since it seems very useful for expressing smooth gradients. Essentially, in this mode we don’t encode weights per texel, but rather express RGB at texel (0, 0), at texel (4, 0) and texel(0, 4), and just bilinearly interpolate across the block to obtain the actual color. This is very different from the other endpoint interpolation we’ve seen earlier, because that flattens everything into a single line in the color space, but now we can access an entire 2D plane in the space instead.

Punch-through alpha

Like BC1’s 1-bit alpha scheme, ETC2 is very similar. When this format is enabled, we remove the capability for uncorrelated endpoints (individual mode), and replace the bit with a selection to select if all texels are Opaque, or potentially transparent. This idea is the exact same as BC1. In the transparent mode, code == 2 marks the texel as being transparent black. It does not work in planar mode though, this bit is ignored there.

Alpha support

Very similar to BC3, ETC2 also supports full 8-bit alpha by slapping together a separate block alongside the color block. The way this works is very similar to how RGTC works, but instead of two endpoints, ETC2 encodes a center point, and then uses tables to expand that range into 8 possible values using a table selector and a multiplier. These 8 possible values for the block are then selected with 3bpp indices. We lose the capability to cleanly represent 0.0 and 1.0 though, which is somewhat curious.

EAC – 4×4 – 64 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/eac.comp

EAC is ETC2’s version of RGTC, it is designed as a way to encode 1 and 2 de-correlated channels, basically the exact same approach as RGTC where the alpha block format is reused for the 1/2-channel formats. EAC is a bit different in that the internal precision is 11 bits (for some reason).

Unfortunately, EAC is kinda awkward since it’s technically bit-exact in the final fixed point value between [0, 2047], but it specifies many different ways how this can be converted to a floating point value.

Next up, BPTC

S3TC, RGTC and ETC represents the simpler formats, and hopefully I’ve summarized what these formats can do, next up, I’ll go through the BPTC formats, which significantly increases complexity.

Clustered shading evolution in Granite

Shading many lights in a 3D engine is kinda hard once you step outside the bubble of classic deferred shading. Granite supports a fair amount of different ways to do lighting, mostly because I like to experiment with different rendering structures.

I presented some work on this topic at SIGGRAPH 2018 in the Moving Mobile Graphics course when I still worked at Arm: https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-20-66/5127.siggraph_2D00_2018_2D00_mmg_2D00_3_2D00_rendering_2D00_hanskristian.pdf. Unfortunately, some slides have videos embedded, and they didn’t seem to have made the transition unscathed to PDF.

Since then I’ve looked into VK_EXT_descriptor_indexing in order to remove some critical limitations with my older implementation and I ended up with some uncommon implementation details.

Classic deferred

Just to summarize what I mean by this, this is the good old method of rendering light volumes on top of a G-buffer and light is accumulated per-light by blending on top of the frame buffer. This method is considered completely obsolete on desktop these days, but it’s still quite viable on mobile with on-chip G-buffers.

Where classic deferred breaks down

  • Very high bandwidth on desktop due to fill-rate/bandwidth
  • No forward shading support (well, duh)
  • No transparent objects
  • No volumetric lighting

Thus, I explored some alternatives …

Why I don’t like tile deferred shading

This is a somewhat overloaded term, but this is the method where you send the G-buffer to a compute shader. The G-buffer is split into tiles, assigned to a workgroup, depth range (or multiple ranges) is found for the tile, which then forms a frustum. Lights are then culled against this frustum, and shaded in one go. This technique was a really great fit for the PS3 SPUs, as shown by Battlefield 3 back in the day.

It still doesn’t really solve the underlying issues of classic deferred except that it is far more bandwidth efficient on desktop-class GPUs.

Forward shading isn’t feasible unless you split the algorithm into multiple stages with Z-prepass -> build light list per tile -> resubmit geometry and shade, but then it’s probably called something entirely different … (Is this what’s known as Forward+ perhaps?)

Transparency isn’t doable unless you render all transparent geometry in a separate pass to find min/max depth per pixel or something, and forget about volumetrics.

On mobile TBDR architectures you lose all bandwidth savings from staying on-tile, and doing FRAGMENT -> COMPUTE barriers on a tile-based architecture is usually a terrible idea for pipelining. The exception here is tile shaders on recent iPhone hardware which seems almost designed to do this algorithm in hardware.

Clustered shading – the old implementation

Clustered shading is really nice in that it is completely agnostic to a depth buffer, so all the problems mentioned earlier just go away. Lights are assigned in 3D-space rather than screen-space. The original paper on the subject is from around the same time tile deferred was getting popular.

Abandoning the frustum structure

In Granite, I chose early on to abandon the “frustum” layout. Culling spot lights and point lights against elongated frustums analytically is very hard and computationally expensive, and getting the Z slices correct is also very fiddly.

A common workaround for culling is using conservative rasterization, but that is a feature I cannot rely on. I figured that using a more grid-like structure I could get away with much simpler culling math. Since all elements in the grid-based cluster are near-perfect cubes, I could get away with treating the elements as spheres, and https://bartwronski.com/2017/04/13/cull-that-cone/ makes spot light-to-sphere culling very cheap and quite tight. Here’s a visualization of the structure. This structure is stored in a 3D texture. Each “mip-level” is packed in the Z dimension as the resolution of each level is the same.

Bitscan loops instead of light lists

Light lists is the approach where each element in the cluster contains a (start, count) and all lights found in that range are shaded. Computing this list on GPU is rather messy. The memory footprint for a single element is unknown in CPU timeline, and we cannot deal properly with worst-case scenarios. This is easier when we cluster on the CPU instead, but that’s boring!

I really wanted to cluster lights on the GPU, so I landed on a bitmask approach instead. The worst case storage is just 1 bit per light per element rather than 32 bit.

The main limitation of this technique is still the number of lights we could feasibly support. With a bitmask structure we need to allocate for worst-case and it can get out of hand when we consider worst case with 1000+ lights. I only had modest ideas in the beginning, so I supported 32 spot lights and 32 point lights, which were encoded in RG32UI per element in a 3D texture. At a resolution of the cluster at 64x32x16x9, culling on the GPU is very fast, even on mobile. We can set the ceiling higher of course if we expand to RGBA32UI or use more texels per element.

Bitscan loops are great for scalarization

A thing I realized quickly when doing clustered forward shading is the importance of keeping VGPRs down on AMD hardware. The trick to move VGPRs to more plentiful SGPRs is to ensure that values are uniform across a subgroup. E.g. instead of doing this:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data must be loaded in VGRPs since lsb is a VGPR.
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

we can do a simple trick with subgroup operations:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

// OR over all active threads.
// As this is the same value for all threads, compiler promoted to SGPR.
light_bitmask = subgroupOr(light_bitmask);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data can be loaded into SGPRs instead.
    // Far better occupancy, much amaze, wow!
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

Uniformly loading light data from buffers is excellent. I’ve observed up to 15% uplift on AMD by doing this. The light list approach mentioned earlier has a much harder time employing this kind of optimization. We would have to scalarize on the cluster element itself, which could lead to very bad worst-case performance.

No bindless – ugly atlasing

Another problem with clustered shading (and tile deferred for that matter) is that we need to shade a lot of lights in one go, and those lights can have shadow maps. Without bindless, all shadow maps for spot lights must fit into one texture, and point lights must fit into one texture. Atlassing is the classic solution here, but it is a little too messy for my taste. As the number of lights was rather low, I just had a plain 2D texture for spot lights, and a cube array for point lights. Implementing variable resolution with an atlas is also rather annoying, and for point lights, I would be forced to flatten the cube down to 6 2D rects and do manual cube lookup instead without proper seam filtering, ugh.

Scaling to “arbitrary” number of lights

While performance for reasonable number of lights was quite excellent compared to alternative techniques, I couldn’t really scale number of lights arbitrarily, and it has been nagging me a bit. Memory becomes a concern, and while the “list of lights” approach is likely less memory hungry in the average case, it has even worse worst-case memory requirements, and it’s not very friendly for GPU culling.

Kinda clustered shading? New bindless hotness

The talk on Call of Duty’s renderer in Advances in Real-Time Rendering 2017 presents a very fresh idea on how to do shading, and it hits all the right buttons. Culling on GPU, bitscan loops, scalarization, scales to a lot of lights, a lot of things to like here.

I spent some time this holiday season implementing a new path for clustered shading based on this technique, and I ended up deviating in a few places. I’ll go through my implementation of it, but it will make more sense once you study the presentation first.

Decoupling XY culling from Z

The key feature of the Call of Duty implementation is how it partitions space. Rather than a full 3D cluster, we decouple XY and Z dimensions, so rather than O(X * Y * Z) we get O(X * Y + Z).

Z is also binned linearly, and we can have several thousand bins in Z. This makes everything much nicer to deal with later. Culling here is trivial, since we compute min/max Z in view space for each light, which is very simple.

For each Z-slice, we just need to figure out the minimum light index and maximum light index which hits our Z-slice. Of course, to make the ranges as tight as possible, we sort lights by Z distance.

Data structures

  • Each tile in XY needs a bitmask array, u32[ceil(num_lights / 32)] for each tile. This can be tightly packed in a single buffer.
  • A buffer containing Z-slices as described above.
  • Per-light information: position/radius/cone/light type/shadow matrix/etc

Going back to frustum culling

Now that we cluster in XY and Z separately, I went back to frustum partitioning, and now we need a way to do frustum culling against spot and point lights … Conservative rasterization really is the perfect extension to use here, just a shame it’s not widely available yet on all relevant hardware.

The presentation has an alternative for conservative rasterization as current consoles do not support this feature, which is mostly to render light volumes a-la classic deferred (not completely dead yet!) at full-resolution and splatting out bits with atomics as fragment threads are spawned. If you have depth information you can eliminate coverage using classic deferred techniques. However, I went in a completely different direction without using depth at all.

Compute shader conservative rasterization

It felt natural to do all the culling work in compute shaders. This is where most of the “fun” is. This is also a rather esoteric way of doing it, but I like doing esoteric stuff with compute shaders, see https://themaister.net/blog/2019/10/12/emulating-a-fake-retro-gpu-in-vulkan-compute/ as proof of that.

Conservative sphere rasterization

To solve this problem, we’re going to tackle it geometrically instead. First, we solve the screen-space bounding box problem. Fortunately, this problem is separable, so we can compute screen-space bounds in X and Y separately.

(Behold glorious Inkscape skillz ._.)

What we want to do is to figure out where P_lo and P_hi intersect the near plane. P can be rotated in 2D by the half-angle in two directions. This way we find tangent points on the circle. sin(theta) is conveniently equal to r / length(L), so building a 2×2 rotation matrix is very easy. After rotating P, we can project P_lo and P_hi, and now we have clip space bounds in one dimension. Compute separately for XZ and YZ dimensions and we have screen-space boundaries.

Projecting a sphere to clip space creates an ellipsis, and to compute the ellipsis, we need to rotate view space such that the sphere center lies perfectly on the X or Y axis. For simplicity, we orient it on the +X axis. We can then perform the range test, and an ellipsis is formed. We can now test any point directly against this ellipsis. If the sphere intersects the near plane in any way, we can fall back to screen-space bounding box.

Here’s a ShaderToy demonstrating the math. Of course, Inigo Quilez did it first :p

In the real implementation, we only need to compute setup data once for each point light, to rasterize a pixel, we apply a transform, and perform a conservative ellipsis test, which is rather straight forward.

Conservative spot light rasterization

I tried an analytical approach, but I gave up. Spot-Frustum culling is really hard if you want tight culling, so I went with the simpler approach of just straight up rasterizing 6 triangles which forms a spot light. We can rasterize in floating point since we don’t care about water-tight rasterization rules, and it’s conservative anyways. It’s not the prettiest thing in the world to do primitive clipping inside a shader, but you gotta do what you gotta do …

The … peculiar shader can be found here.

Binning shader

Once we have setup data for point lights and spot lights, we do the classic culling optimization where we bin N lights in parallel over a tile, broadcast the results to all threads, which then computes the relevant lights per pixel. Subgroup ballots is a nice trick here, which replaces the old shared memory approach. Each workgroup preferably works on 32 lights at a time to compute a 32-bit bitmask.

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless_binning.comp

The shading loop

The final loop to shade becomes something like:

uint cluster_mask_range(uint mask, uvec2 range, uint start_index)
{
	range.x = clamp(range.x, start_index, start_index + 32u);
	range.y = clamp(range.y + 1u, range.x, start_index + 32u);

	uint num_bits = range.y - range.x;
	uint range_mask = num_bits == 32 ?
		0xffffffffu :
		((1u << num_bits) - 1u) << (range.x - start_index);
	return mask & uint(range_mask);
}

vec3 shade_clustered(Material material, vec3 world_pos)
{
    ivec2 cluster_coord = compute_clustered_coord(gl_FragCoord.xy);
    int linear_cluster_coord = linearize_coord(cluster_coord);
    int z = compute_z_slice(dot(world_pos - camera_pos, camera_front));

    uvec2 z_range = cluster_range[z];

    // Find min/max light we need to consider when shading slice Z.
    // Make this uniform across subgroup. SGPR.
    int z_start = int(subgroupMin(z_range.x) >> 5u);
    int z_end = int(subgroupMax(z_range.y) >> 5u);

    for (int i = z_start; i <= z_end; i++)
    {
        // SGPR
        uint mask =
            cluster_bitmask[linear_cluster_coord *
                            num_lights_div_32 + i];

        // Restrict to lights within our Z-range. VGPR now.
        mask = cluster_mask_range(mask, z_range, 32u * i);
        // SGPR again.
        mask = subgroupOr(mask);

        // SGPR
        uint type_mask = cluster_transforms.type_mask[i];

        // Good old scalarized loop <3
        while (mask != 0u)
        {
            int bit_index = findLSB(mask);
            int light_index = 32 * i + bit_index;
            if ((type_mask & (1 << bit_index)) != 0)
            {
                result += compute_point_light(light_index,
                                              material,
                                              world_pos);
            }
            else
            {
                result += compute_spot_light(light_index,
                                             material,
                                             world_pos);
            }
        }
    }
}

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless.h

Potential performance problems

By decoupling XY and Z in culling there’s a lot of potential for false positives where large lights might dominate how Z-ranges are computed and trigger a lot of over-shading. I haven’t done much testing here though, but this is probably the only real weakness I can think of with this technique. Regular tile-deferred has similar issues.

Culling tightness

I’m using smaller lights here to demonstrate. Red or green light signifies that a light was computed for that pixel:

Placing point light in volumetric fog for good measure. The weird red/green “artifacting” around the edges is caused by the forward shading and subgroupOr logic when shading to ensure subgroup uniform behavior.

Potential improvements

Clipping Z-range calculation against view frustum might help a fair bit, since Z-range can be way too conservative for large positional lights like these. Especially spot lights which point to the side like the image above. Classic deferred has a very similar problem case unless ancient stencil culling techniques are used to get double sided depth tests.

Conclusion

I’m happy with the implementation. Performance seems very good, but I haven’t dug deep in analysis there yet. I was mostly concerned getting it to work. Just waiting for mobile GPU vendors to support bindless, so I can test there as well I guess.

YUV sampling in Vulkan – a niche and complicated feature – VK_KHR_sampler_ycbcr_conversion

Sometimes I like to implement Vulkan extensions in Granite just because. This time around we’re looking at YUV.

For anyone who have worked with multimedia before, YUV (or YCbCr as it’s also referred to) is a nightmare. The act of converting YUV to RGB is very simple as it’s just a color space transform with a 3×3 matrix, but YUV means many things. There is no end to how overloaded YUV can be, and making sure you know exactly which YUV flavor you’re dealing with can be quite tricky. From a system integration point of view, YUV is a serious pain.

When dealing with video content in Vulkan, you will have to consume YUV somehow.

YUV has some characteristics which all make sense for video compression, but complicate things.

  • Planar: Each color component is often packed in different 2D images.
  • Luma/Chroma: Y refers to luminance, UV (or CbCr) refers to chrominance (color). This is critical for video compression.
  • Downsampled chroma. Human eyes have higher resolution for luminance vs color, so less bandwidth on color is an easy way to save space.

There is a lot of variance here between different YUV formats, as there is no obvious standard for these kinds of things.

  • How many planes? 2 or 3 are common (Y, U, V) vs (Y, packed UV). Packed single plane can be used in some other obscure scenarios. (YUYV and variations on that).
  • For packed representations, which component comes first?
  • How many bits per component? 8 is the most common, but 10-bit YUV content can be found in the wild.
  • How much is chroma downsampled? 2x horizontally and vertically is by far the most common, often referred to as YUV420p (the naming convension in YUV makes no sense, don’t try to find any).
  • Where is the texel center for the chroma samples? Common values are co-sited at every other luma sample, or in the mid-point between groups of 2×2 luma samples.
  • What is the exact color space conversion matrix from YUV to RGB?
  • How is chroma reconstructed to full resolution?

Dealing with YUV without fancy extensions

To render a YUV frame in RGB is not necessarily a difficult task, but depending on how many formats you need to deal with, shader variants can quickly get out of hand. You’ll typically do something like this:

layout(binding = 0) uniform TexLuma;
layout(binding = 1) uniform TexCb;
layout(binding = 2) uniform TexCr;

layout(location = 0) out vec3 FragColor;
layout(location = 0) in vec2 TexCoord;

const mat3 yuv_to_rgb_matrix = mat3(....);

void main()
{
    float Luma = textureLod(TexLuma, TexCoord, 0.0).x;
    float Cb = textureLod(TexCb, TexCoord, 0.0).x; // For mid-point chroma
    float Cr = textureLod(TexCr, TexCoord, 0.0).x;
    vec3 yuv = vec3(Luma, Cb, Cr);
    // Possibly expand range here if using TV YUV range and not PC YUV range.
    yuv = rescale_yuv(yuv);
    FragColor = yuv_to_rgb_matrix * yuv;
}

This is fine, but there is some motivation to do better here. Since watching video is one of the most common operations a GPU does, it would be nice if we could do this more efficiently, especially on low-power mobile devices with batteries attached to them. It would also be really nice if we could sample a video texture in our shader and have the GPU just “deal with it”.

The planar texture formats

VK_KHR_sampler_ycbcr_conversion adds a lot of new texture formats to Vulkan. For YUV420p, we’re going to look at this format, VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM.

In planar formats, we see that 3 color components are spread out across 3 planes, with a “420” here to mean that the second and third components are half resolution. “GBR” is a bit strange, but G refers to Y, and B and R refer to Cb and Cr respectively. Green is the strongest contributor to the luma Y channel after all, so I guess it kinda makes sense like that …

By having planar texture formats, it means that a texture unit of the GPU is actually able to sample 3 samples at once, meaning we will put a lot less stress on the GPU texturing unit. If we think of rendering video, this is a large part of the work done on the GPU. For lower-end mobile devices rendering 4K video without melting, this is actually really important.

The image aspects

In order to be able to refer to each plane separately when copying data in and out of the texture, we can use VK_IMAGE_ASPECT_PLANE_{0,1,2}_BIT. When copying to and from plane 1 and 2 in YUV420p, the resolutions of those planes are halved. VK_IMAGE_ASPECT_COLOR_BIT refers to the whole “GBR” as a whole, and it’s only useful when sampling the image.

Disjoint image allocation

Since we have 3 image planes, it is not weird to desire combining three separate images together. E.g. we can allocate three R8_UNORM images and make it planar later. This is supported and we’ll have to create and allocate images in a particular way.

First, we create the R8_UNORM images with VK_IMAGE_CREATE_ALIAS_BIT. This means that we will alias the image meaningfully with another image, even when using OPTIMAL image layout and image layouts are shared across aliases! We’ll use this to alias with a plane inside the planar texture later.

Be aware of alignment requirement. I’ve found that planar textures can need larger alignment than the single-component textures might need, so either using standalone allocations per plane, or bumping alignment to something like 64k works around that.

When we create the planar texture, we specify DISJOINT_BIT and ALIAS_BIT. For disjoint, it means we need to query allocation requirements and bind memory separately for each plane. To do this, we need to use vkGetImageMemoryRequirements2 and vkBindImageMemory2. Here, we just bind the same memory we used for our separate textures.

Setting up a sampler conversion object

The vkCreateSamplerYcbcrConversion function creates an object which encodes exactly how we will convert the planar components into RGB values.

VkSamplerYcbcrConversionCreateInfo info = { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_CREATE_INFO };

// Which 3x3 YUV to RGB matrix is used?
// 601 is generally used for SD content.
// 709 for HD content.
// 2020 for UHD content.
// Can also use IDENTITY which lets you sample the raw YUV and
// do the conversion in shader code.
// At least you don't have to hit the texture unit 3 times.
info.ycbcrModel = VK_SAMPLER_YCBCR_MODEL_CONVERSION_YCBCR_709;

// TV (NARROW) or PC (FULL) range for YUV?
// Usually, JPEG uses full range and broadcast content is narrow.
// If using narrow, the YUV components need to be
// rescaled before it can be converted.
info.ycbcrRange = VK_SAMPLER_YCBCR_RANGE_ITU_NARROW;

// Deal with order of components.
info.components = {
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
};

// With NEAREST, chroma is duplicated to a 2x2 block for YUV420p.
// In fancy video players, you might even get bicubic/sinc
// interpolation filters for chroma because why not ...
info.chromaFilter = VK_FILTER_LINEAR;

// COSITED or MIDPOINT? I think normal YUV420p content is MIDPOINT,
// but not quite sure ...
info.xChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;
info.yChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;

// Not sure what this is for.
info.forceExplicitReconstruction = VK_FALSE;

// For YUV420p.
info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;

VkSamplerYcbcrConversion conversion;
table->vkCreateSamplerYcbcrConversionKHR(device, &info, nullptr,
                                         &conversion);

Passing along to VkImageView and VkSampler

Both an image view and sampler which are used with KHR_sampler_ycbcr_conversion must have this sampler conversion object passed into it via pNext. This is because part of this information will be split up between the two objects. Planar and swizzle information is likely part of image view, while filtering and chroma siting is likely part of sampler object.

Immutable sampler

Some information is also relevant for the shader compiler, so for YCbCr sampling there are some restrictions. You have to use a COMBINED_IMAGE_SAMPLER and the sampler must be immutable in the descriptor set layout. Since that immutable sampler contains the conversion object, this allows the shader compiler to see how to complete the transform where hardware support stops. This might just be performing the YUV to RGB conversion, or it lets the shader implement almost everything on its own.

In Granite, I pass along immutable sampler information in a rather crude way where I look at the declared name of a combined image sampler in the shader.

Putting it all together I can write shader code like this in Granite and sample YUV420p content directly:

#version 450

// "LinearYUV420P" guides reflection to use an immutable sampler.
// Not ideal since I need to know which YUV conversion to use ...
// But it works for now.
layout(set = 0, binding = 0) uniform sampler2D uSamplerLinearYUV420P;
layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;

void main()
{
    // It's just like sampling a regular texture :D
    FragColor = textureLod(uSamplerLinearYUV420P, vUV, 0.0);
}

Getting fancy with hardware video decoding

The natural way to extend this is to use hardware video decoding, export each plane as external memory, and import it in Vulkan. DISJOINT allocation comes in handy here since we don’t have to rely on a video decoder exporting a full planar image as one large unit. I haven’t looked into this yet though, but I suggest looking at MPV here. There seems to be some support for this kind of zero-copy flow with Vulkan.

Was all of this really all that useful?

Not really, but I have some ideas to add a “video texture material” in the future to Granite, so it might come in handy later. There might be some use for it in graphics programming where bandwidth is saved by encoding post-process render targets to YUV. That way, sampling the YUV post targets shouldn’t require a lot of manual hackery.

Emulating a fake retro GPU in Vulkan compute

RetroWarp – a fake retro GPU

Lately, I’ve been fiddling with a side project which ended being quite interesting.

The goal of this side project was to prototype out a system which implements software rasterization in compute shaders using modern GPU features like Vulkan 1.1 subgroups and async compute to improve performance. Then, I wanted to apply this to emulation of retro GPUs, in particular, a more low-level approach.

I believe compute shader rasterization has some key advantages in the domain of low-level emulation. Chasing full accuracy means not being able to make use of the key fixed function aspects of the graphics pipeline on Vulkan GPUs and most of the reasons to use fragment shaders goes away. With compute, there is no fixed function baggage to grapple with, but it does mean a lot of the things we take for granted must be implemented in software.

I didn’t aim to dive straight into a concrete retro GPU with this prototype, but rather I designed a straight forward rasterizer which supports the basic features found in particular old GPUs. My approach here is that this could be used as a starting point when going further and emulating a real legacy chip.

The repository is available on Github: https://github.com/Themaister/RetroWarp

The high level system

Rather than reiterate everything presented in the slide deck, I will link to it directly instead, however, it’s useful to discuss the system at a very high level.

Presentation slides

I presented this work at the Khronos Munich meetup in October 2019. You can find the presentation slides here.

Implementing Low Level GPU – Hans-Kristian – Munich 2019

Tile-based

Going tile-based is practically necessary for any compute shader implementation. I implemented 8×8 and 16×16 tile modes. Smaller tiles are more suitable for lower resolution like 320×240 and 640×480, but 16×16 was useful for 720p and up.

If you are familiar with tile deferred shading and friends, you know where I’m going with this.

Coarse-then-fine binning

To be tile based, we need to assign primitives to tiles. This is a quite intensive process when the tile size is small as time scales as resolution times number of primitives. To optimize, I bin at a low resolution (e.g. 64×64 tiles) and then refine the binning at full tile resolution.

Bitmap instead of primitive list

A common way to bin is to build an array of primitives which affects a tile, and then the renderer can just loop through that array of indices on a per-tile basis. This is problematic in the worst case where a lot of primitives end up filling the entire screen, there simply might not be memory available to store all these lists. We cannot allocate memory arbitrary on the GPU, and we really want to do tile binning on the GPU and not CPU.

Instead, each tile gets a fixed array of u32 bitmasks, where 1 bit is used per primitive. Bit-scan loops are used instead. To speed up the process where there are many gaps in the bitmap (there certainly is), there is a hierarchy, where the first hierarchy of bits marks a bit if any primitives is binned in groups of 32 primitives. If we find a bit set here, we go down the hierarchy and loop some more. A more concrete example here is:

  • Maximum of 16k primitives (arbitrary limit we choose)
  • Bitmap is u32[16k / 32] to contain all state
  • Coarse bitmap is u32[16k / (32 * 32)]

If more than 16k primitives are used, we can just split this into multiple render passes. These old GPUs don’t exactly support indirect rendering, so that’s not really a problem.

Ubershader vs. split shader architecture

After binning, we could simply implement an ubershader from doom where we deal with any possible render state the GPU supports in one 5000+ line monster. This is very problematic for performance, particularly with register pressure/occupancy on the shader cores.

One of my key deviations from the norm here was to implement a split shader architecture. Rather than rely on ubershaders, it is possible for the depth/blending state to consume pre-shaded tiles which contains color/depth/coverage information necessary to run these stages.

To create color/depth/coverage information, we can generate indirect dispatches and use specialization constants to carve out the code paths we need to run instead. This keeps register pressure down. The key downside of this approach is that we need to allocate memory and bandwidth for the pre-shaded data.

Async compute + graphics queue compute

Depth/blending is the only stage which needs to happen in-order. We can happily do binning and shading and feed the results to the final shading stage. I run everything except for depth/blending in the async compute queue, and depth/blending can run in the graphics queue.

Performance uplifts

See presentation slides for more detailed results.

Subgroup optimizations gave a solid ~20% uplift on AMD/NV/Intel. Async compute gave further 10-20% uplift on AMD/NV. Overall, I’m quite happy with this.

The weird world of shader divergence and LOD

Mip-mapping is hard – importance of keeping your quads full

Sampling textures with mip-mapping is ancient, but it’s still hard apparently. Implicit LOD calculation is the first instance where we poke a hole into the “single threaded” abstraction of high level shading languages like GLSL and HLSL and dive into the maddening world of warps, waves, quads and everything in-between. For fragment shading, at least a group of 2×2 threads (a quad) need to run side by side so we can have gradient information over the screen. On any modern GPU, these 2×2 threads are actually running in lock-step as we’ll see when looking at GPU ISA later …

The Vulkan/GL/GLES ecosystem has always specified that implicit LOD instructions must happen in dynamically uniform control flow. Dynamically uniform just means that either all threads have to execute a texture() instruction, or no threads do. This ensures that there is always 4 valid texture coordinates from which to compute derivatives. The easiest way to ensure that this guarantee holds is simply to never sample in control flow, but that’s not really practical in more interesting shaders.

If you’re sampling in control flow you better make sure you uphold the guarantees of the spec.

Having to be dynamically uniform over an entire draw call is a bit silly, so the Vulkan specification recently tightened the scope such that if you have subgroupSize >= 4, you only need to be dynamically uniform on a per-quad granularity. This makes sense. We only need correct derivatives in the quad, we shouldn’t have to care if some unrelated quad or even triangle is diverging.

An interesting case came up recently where apparently developers expect that you actually can sample with implicit LOD in diverging control flow. Apparently HLSL “defines” this control flow to be valid code.

vec2 uv = from_somewhere();
if (weight > 0.0)
    sum += weight * texture(Texture, uv);

The idea is that we shouldn’t have to sample the texture unless we’re going to use it, but it’s still nice to provide UV for LOD purposes. Unfortunately, there is no obvious way to express this optimization in high level languages. UV is well defined in the outer scope which is dynamically uniform, so that’s something … Intuitively, this code makes sense, but it gets really murky once we dig deeper.

With subgroup ops, we can probably get a good approximation on the HLL side.

bool quadAny(bool value)
{
    // Perhaps this can be translated into s_wqm on AMD
    // if compiler checks this pattern?
    return subgroupClusteredOr(int(value), 4) != 0;
}

vec2 uv = from_somewhere();
// Hoist texture sampling out of branch and force quad uniformity.
vec4 tex;
if (quadAny(weight > 0.0))
    tex = texture(Texture, uv);
if (weight > 0.0)
    sum += weight * tex;

Querying gradients and then sampling with that in the branch is fine as well, but it is slow, and not really a fix, at best a workaround.

HLSL seems a bit murky about if this kind of code is legal, it’s all “that one app did this thing that one time and now we’re screwed”. From my understanding compilers can do some heroics here to work around this in applications.

I wanted to try this kind of code on all Vulkan devices I have available to see what happens. We’re in undefined territory as far as LOD goes, so anything can happen. There’s three outcomes I’m looking for which seem like plausible HW behavior:

  • It just happens to work. This is kinda scary, since it’ll probably break in 5 years anyways.
  • The LOD computed is garbage.
  • The LOD is forced to some value on divergence.

Here’s the concrete shader I’m using, from https://github.com/Themaister/Granite/blob/master/tests/assets/shaders/divergent_lod.frag. A test to run the shader is https://github.com/Themaister/Granite/blob/master/tests/divergent_lod_test.cpp

#version 450

layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;
layout(set = 0, binding = 1) uniform sampler2D uSampler;
layout(set = 0, binding = 0, std140) uniform Weights
{
    vec4 weights[4];
};

void main()
{
    vec3 tex = vec3(0.0);
    float lod = -10.0;
    vec2 uv = vUV;
    if (weights[int(gl_FragCoord.x) + 2 * int(gl_FragCoord.y)].x > 0.0)
    {
        tex = texture(uSampler, uv).rgb;
        lod = textureQueryLod(uSampler, uv).y;
    }

    FragColor = vec4(tex, lod);
}

I render this on a 2×2 frame buffer with a full-screen “expanded triangle” to not get any helper lane shenanigans. Let’s try to run this across a wide range of hardware and see what happens. NOTE: any result here is equally valid in Vulkan, this is intentionally going out of spec.

AMD

I tested this on a Navi card. RDNA ISA seems similar enough to GCN … We effectively have 4 driver stacks for AMD cards now to test.

RADV (LLVM 10)

Garbage LOD

main:
BB16_0:
	s_mov_b64 s[0:1], exec    
	s_wqm_b64 exec, exec
	v_cvt_i32_f32_e32 v3, v3
	s_mov_b32 s6, s3
	s_movk_i32 s7, 0x8000          
	v_cvt_i32_f32_e32 v2, v2     
	v_mov_b32_e32 v5, 0xc1200000    
	s_load_dwordx4 s[8:11], s[6:7], 0x0    
	v_lshlrev_b32_e32 v3, 1, v3  
	v_add_lshl_u32 v2, v3, v2, 4   
	s_waitcnt lgkmcnt(0)    
	buffer_load_dword v4, v2, s[8:11], 0 offen    
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, v2      
	s_waitcnt vmcnt(0)        
	v_cmp_lt_f32_e32 vcc, 0, v4 against 0       
	v_mov_b32_e32 v4, v2     
	s_and_saveexec_b64 s[8:9], vcc
	s_cbranch_execz BB16_2
BB16_1:
	s_mov_b32 s3, s7      
	s_add_i32 s6, s2, 0x50           
	s_mov_b32 m0, s4           
	s_load_dwordx8 s[12:19], s[2:3], 0x0       
	s_load_dwordx4 s[20:23], s[6:7], 0x0          
	v_interp_p1_f32_e32 v7, v0, attr0.x
	v_interp_p1_f32_e32 v8, v0, attr0.y      
	v_interp_p2_f32_e32 v7, v1, attr0.x          
	v_interp_p2_f32_e32 v8, v1, attr0.y   
	s_waitcnt lgkmcnt(0)     
	image_sample v[2:4], v[7:8], s[12:19], s[20:23] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v5, v[7:8], s[12:19], s[20:23] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB16_2:
	v_nop                 
	s_or_b64 exec, exec, s[8:9]              
	s_and_b64 exec, exec, s[0:1]         
	s_waitcnt vmcnt(0)                
	exp mrt0 v2, v3, v4, v5 done vm          
	s_endpgm               

We see that v7 and v8 hold the UV coordinates, but they are actually only computed inside the branch (v_interp). The optimizer is allowed to place UV computation inside the branch here. If there is divergence in a quad, the disabled lanes won’t get correct values for v7 and v8 (since execution is masked), and LOD becomes garbage.

RADV (ACO)

Coupled with Navi cards, this is probably the most bleeding edge setup you can run. It’s a completely new compiler backend for AMD cards, not based on LLVM.

Just happens to work

BB0:
	s_wqm_b64 exec, exec 
	s_mov_b32 s0, s3    
	s_movk_i32 s1, 0x8000  
	s_load_dwordx4 s[8:11], s[0:1], 0x0 
	s_mov_b32 m0, s4 
	v_interp_p1_f32_e32 v4, v0, attr0.y 
	v_cvt_i32_f32_e32 v2, v2   
	v_cvt_i32_f32_e32 v3, v3   
	v_lshl_add_u32 v2, v3, 1, v2 
	v_lshlrev_b32_e32 v2, 4, v2 
	s_waitcnt lgkmcnt(0)   
	buffer_load_dword v2, v2, s[8:11], 0 offen 
	v_interp_p2_f32_e32 v4, v1, attr0.y 
	v_interp_p1_f32_e32 v0, v0, attr0.x  
	v_interp_p2_f32_e32 v0, v1, attr0.x 
	v_mov_b32_e32 v1, v4  
	s_waitcnt vmcnt(0)   
	v_cmp_lt_f32_e32 vcc, 0, v2  
	s_and_saveexec_b64 s[0:1], vcc     
	s_cbranch_execz BB3  
BB1:
	s_movk_i32 s3, 0x8000 
	s_load_dwordx8 s[4:11], s[2:3], 0x0  
	s_load_dwordx4 s[12:15], s[2:3], 0x50 
	s_waitcnt lgkmcnt(0)   
	image_sample v[2:4], v[0:1], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v0, v[0:1], s[4:11], s[12:15] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB3:
	s_andn2_b64 exec, s[0:1], exec 
	s_cbranch_execz BB6  
BB4:
	v_mov_b32_e32 v0, 0xc1200000 
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, 0   
	v_mov_b32_e32 v4, 0  
BB6:
	s_mov_b64 exec, s[0:1] 
	s_waitcnt vmcnt(0)   
	exp mrt0 v2, v3, v4, v0 done vm 
	s_endpgm

This time, UV is interpolated outside the branch, so sampling in divergent control flow ends up working after all. The registers are well defined as they enter the branch. For AMD, it seems like it just comes down to whether or not the lanes have correct values placed in them and not having them be clobbered by the time we get around to sampling. There doesn’t seem to be any hardware level checks for divergence.

AMDVLK

Garbage LOD

AMDVLK uses the same LLVM stack that RADV LLVM uses, and no surprise, same result, and basically same exact ISA is generated.

Windows

Also just happens to work

I guess it’s the exact same case as the ACO compiler here. No need to paste disassembly.

Intel

Tested on UHD 620 (8th gen mobile CPU I think).

Anvil (Mesa)

The Mesa compiler can spit out assembly, which is nice.

Just happens to work

ISA (a little too wide to embed): https://gist.github.com/Themaister/7c5b011cde3c7585459b089f80f897e2

From what I can make out of the ISA, the UV is interpolated outside control flow, and then only the sampling takes place in control flow. It seems like Intel has similar behavior as AMD here, in that just as long as the registers are valid, divergent sampling “works”.

Windows

Just happens to work

Doesn’t seem to be a way to get ISA from Windows driver, but I suppose it’s same as ANV here.

Nvidia

Tested on a Turing GPU on Linux. Didn’t bother testing on Windows as well considering the driver stack is basically the same.

LOD is clamped to 0, textureQueryLod returns -32.0.

Apparently, now we start seeing interesting behavior. Unfortunately, there is no public ISA to look at. The -32.0 LOD might look weird, but this is kind of expected. This is apparently the smallest possible representable LOD on this GPU. LOD is usually represented in some kind of fixed point, log2(0) = -inf after all.

I confirmed it worked as expected when using non-divergent execution as a sanity check.

Arm

Tested on Mali-G72.

LOD is clamped to 0, textureQueryLod returns -128.0.

Very similar behavior to Nvidia here, except the LOD is -128.0 rather than -32.0. I confirmed it worked as expected when using non-divergent execution as a sanity check.

QCOM

Tested on Adreno 506.

Garbage LOD

Again, no ISA to look at. I confirmed it worked as expected when using non-divergent execution as a sanity check.

Conclusion

Never ever rely on LOD behavior with divergent quads (EDIT: at least the way it’s specced out and implemented on Vulkan drivers right now) . You’d be contributing to the pain and suffering of compiler engineers the world over. Staying quad-uniform is fine though.