My personal hell of translating DXIL to SPIR-V – part 3

It’s time we tackle one of the big problems of DXIL, the binding model. The D3D12 binding model is completely foreign to most people who know the Vulkan binding model, and vice versa. I don’t think there are that many people in the world who can claim to fully grasp the binding models in both APIs. Translating every last detail of the D3D12 binding model to Vulkan is extremely painful, and I feel D3D12 made some critical design mistakes which bite us (and native drivers?) hard. Whenever I hear people naively claim D3D12 and Vulkan are basically the same API, I cringe hard and cry a little inside. Translating low level APIs is hellish when the details don’t map 1:1 exactly, and the binding model is the perfect vehicle to demonstrate it.

I hope this blog post can serve as a definitive document on the insanity we need to go through in vkd3d-proton to make all of this work well. We have landed on a solution I feel is quite solid for AMD, but perhaps less so on other IHVs … A lot of credit here goes to Doitsujin who went through the insane task of rewriting the entire binding model in vkd3d-proton to the full TIER_3 binding model last year.

Earlier posts in the series:

The starting point of D3D12 – allow reuse of D3D11-era shaders

Before we can discuss D3D12, we must have a basic understanding of the D3D11 binding style we find in SM 5.0. From there, we will see how this was extended to the Frankenstein monster that is SM 5.1 DXBC and how this was cribbed wholesale in SM 6.0+.

Binding resources to slots

The binding model in D3D11 was extremely simple, simple enough that we have no problems implementing that in Vulkan 1.0 as-is. Essentially, you can bind a fixed amount of:

  • SRV (t#): Generic read-only resources, sampled images, read-only SSBO and uniform texel buffers
  • UAV (u#): Generic read-write resources, storage images, read-write SSBO and storage texel buffers
  • Samplers (s#): Separate samplers, same as Vulkan
  • CBV (b#): Constant buffer views, i.e. UBOs in Vulkan

Each resource in the HLSL declares which register it binds to, easy enough! This maps very easily to Vulkan. The main difference is that the bindings are per stage in D3D11, but these details are easy to work around. Another awkward detail of D3D11 is that SRVs and UAVs are catch-alls for any kind of read-only or read-write resource, even though they cover many different descriptor types. Sampled images, raw buffers and typed buffers are quite different things! Of course, with a slot based binding model, the implementation can resolve these details easily since we know all accessed resources up front. No bindless hell to deal with.

Now, we imagine that game developers are sitting on a large repository of D3D11-era shaders and D3D12 rolls around. Very few were going to risk going D3D12 exclusive any time soon, so it had to be possible to ship the exact same DXBC shaders to target both APIs. Even to this day, most new D3D12 games ship DXBC to target both APIs! (Maintaining two completely different shader backends is soooo much fun and productive use of time …) The center of gravity is slowly shifting towards DXIL, and it’ll probably take a few more years before DXIL is the main bytecode games ship. I still haven’t seen a single Unreal Engine title shipping DXIL-exclusive yet for example.

One key aspect of the D3D11 model is that the shaders say nothing about how data is accessed. Especially buffer data can be implemented in many different ways, and it would have been up to the D3D11 driver to figure out how to implement constant buffers efficiently for example. Of course, with the highly managed model of that API, the driver had ample opportunity to do so, at the cost of CPU overhead. With explicit APIs like D3D12 and Vulkan, there is far less room for drivers to optimize because the application developer is given certain freedoms, but they are expected to do some work in return. In reality, the IHV will probably add hacks behind your back if your performance is important enough to them, but that’s another story.

With buffer data like constant buffers for example, there are three primary indirection levels where we can trade off speed vs. space:

  • Preloaded registers: The absolute fastest mode of execution. If a constant buffer is tiny (a few u32s), it might as well fit entirely inside the register bank of the shader core, no memory operation required in the shader!
  • Store a pointer in preloaded registers: The second fastest mode. We can take one indirection to be able to access a larger chunk of data with a simple load instruction from pointer.
  • Access a descriptor: The “slowest” mode, with two indirections. First, we need to load a descriptor, then, we load data based on that descriptor. We also get robustness guarantees, which is another special kind of hell to handle.

As we see here, these three styles of buffer access have different trade-offs. The smaller the data we need to consider, the more efficient the access can be. A few u32s of data? Preload registers. A few KiB? Store a pointer to it in registers. Many different buffers with different base pointers? Descriptors are the way to go.

The awkward part of this simplified model is that on certain hardware, using descriptors might be the fast path, so we’ll have to be careful. IHVs have optimized these paths to death over the decades.

The critical part for D3D12 here is that there is no way to express these concerns in the HLSL shader itself, whereas we certainly can in GLSL / SPIR-V!

A “virtual” vs. “physical” binding model

In Vulkan, we can express concerns like these in the SPIR-V for the most part.

  • Preloaded registers: Push constants is designed to map directly to this scheme. Some drivers might also be able to promote INLINE_UNIFORM_BUFFER to registers, especially if the descriptor set is not UPDATE_AFTER_BIND.
  • Store a pointer in preloaded registers: In Vulkan, we can place buffer device addresses in a push constant block and load directly from that. This is somewhat esoteric, but we use this a lot in vkd3d-proton (see the sketch after this list). INLINE_UNIFORM_BUFFER also maps to this scheme. If we consider a descriptor set to be a pointer, inline UBO data can be placed directly in that descriptor memory, and we get one indirection. Implementation details between IHVs tend to be wildly different for inline UBO.
  • Access a descriptor: Just normal descriptor sets. If the descriptor set is not UPDATE_AFTER_BIND, a driver might be able to promote a buffer to a pointer in preloaded registers. This model has two indirections. One to load a descriptor, one to load the data from that descriptor.
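To make the middle option concrete, here is a minimal Vulkan-side sketch of the "pointer in push constants" approach. The device, cmd, pipeline_layout and constant_buffer handles are assumed to already exist, and the shader side would consume the address through a GL_EXT_buffer_reference block:

// Reserve 8 bytes of push constant space for a GPU pointer.
VkPushConstantRange range = {};
range.stageFlags = VK_SHADER_STAGE_ALL;
range.offset = 0;
range.size = sizeof(VkDeviceAddress);

VkPipelineLayoutCreateInfo layout_info = { VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO };
layout_info.pushConstantRangeCount = 1;
layout_info.pPushConstantRanges = &range;
vkCreatePipelineLayout(device, &layout_info, NULL, &pipeline_layout);

// At record time, fetch the buffer's VA and push it.
// One indirection in the shader, no descriptor involved.
VkBufferDeviceAddressInfo addr_info = { VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO };
addr_info.buffer = constant_buffer;
VkDeviceAddress va = vkGetBufferDeviceAddress(device, &addr_info);
vkCmdPushConstants(cmd, pipeline_layout, VK_SHADER_STAGE_ALL, 0, sizeof(va), &va);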

Since we’re able to express ourselves fairly explicitly in Vulkan in the shader itself, I’ll call this a “physical” binding model, or at least a close enough approximation. A true physical binding model would be one where descriptors are accessed through raw pointers, but good luck convincing N IHVs to agree on how that should work 🙂 Inline UBO is the odd exception to this rule where the pipeline layout specifies access patterns which are not expressed in SPIR-V.

One fair criticism of Vulkan’s model is that there are a lot of ways to do effectively the same thing. Should you use push constants, inline UBO, UBO, UBO with dynamic offset, push constant buffer device address, push descriptors or normal pool descriptors? Hard to say without lots of experience and profiling. At least we have the tools available in the API, even if it’s not always obvious what the optimal choice is. In vkd3d-proton we use almost all of these, depending on the context.

To lament the state of graphics programming a bit, loading constant data optimally is still an unsolved problem in graphics, but we have accelerated ray traced global illumination at least, so yay? 🙂 To be fair, it’s hard to go wrong with plain old Vulkan 1.0 UBOs. Any perf gains beyond that tends to be minor and highly situational.

The abstract nature of D3D12 binding model

D3D12’s model is far more abstract, but at the same time, it maps to extremely specific restrictions on an implementation. It’s a weird model where it feels like an abstraction, but it actually isn’t. We can see right through it. We’ll explore this point in the root signature section …

Shader model 5.1 – I wanna bindless too!

5.1 is a weird version of DXBC. It changes some critical things:

  • Resources can be declared as arrays with dynamic indexing. Unbounded array size is also supported. Bindless, baby!
  • register() bindings now take an optional space parameter. This seems similar to DescriptorSet in SPIR-V, but it actually isn’t. It’s meaningless on its own. It is necessary however when using a lot of unbounded array size declarations. In DXBC, all t# or u# registers for example would be exhausted by unbounded array sizes, but by using different register spaces we can declare multiple unbounded arrays. When referring to a resource, we need to consider its binding, space, shader stage and type (SRV, UAV, CBV vs Sampler).

5.1 is kinda obsolete now that DXIL (6.0) has been around for a while, but it is still shipped in various titles.

The root signature

As mentioned earlier, HLSL shaders give no hints to the driver about how resources are actually accessed, so this information has to come from somewhere. In D3D12, we provide all this information in the root signature. The root signature is analogous to Vulkan’s VkPipelineLayout, but the details are completely different.

I think it’s helpful to understand D3D12 root signatures from a top-down view, and the point I would like to get across is that we should see the root signature as a very fancy push constant layout declaration.

The root signature defines an ABI for shaders. We lay down up to 256 bytes in memory, and then define how those 256 bytes can be used to access resources in various ways. The weird way root signatures work is that there’s a very explicit limit of a certain number of bytes, but we cannot actually access those bytes ourselves in a shader. The 256 bytes are organized in terms of 64 DWORDs. Each DWORD can be viewed in a certain way (a concrete sketch follows the list below).

  • 32_BIT_CONSTANT: Consume a certain number of DWORDs and assign them to a CBV with specific register binding. That CBV’s data cannot be accessed with any dynamic array indices. This enforces the hidden implementation detail that the data can be mapped directly to registers. As an interesting tidbit, Vulkan push constants actually don’t have this restriction. In Vulkan, you can still dynamically index push constants as long as the access is dynamically uniform (yes, some hardware can dynamically index registers, it’s neat!)
  • Root descriptor: Consume 2 DWORDs to store a 64-bit GPU virtual address to any kind of raw buffer. CBV, UAV and SRV all work as long as they are untyped. The key thing to note here is that there is no room for a buffer size, so buffer robustness is completely disabled for root descriptors. All of this is just a very roundabout way to say that we store a pointer and access it as a pointer. This is basically a very restricted form of VK_KHR_buffer_device_address which is hidden from the developer. The driver is of course free to turn this into something like a Vulkan push descriptor if it so chooses. In vkd3d-proton, we’d love if games actually made use of root CBVs, but oh well 🙁 Some engines got the memo though 🙂
  • Table pointer: The cost is 1 DWORD, even though it should be 2 (we’ll get to this later, it gets confusing!). This is where the D3D12 binding model starts to get esoteric and diverge completely from Vulkan. We’ll need a separate section to cover this.
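To make the ABI analogy concrete, here is a hedged sketch of a root signature that uses all three parameter types. Register numbers and counts are made up for illustration:

D3D12_ROOT_PARAMETER params[3] = {};

// 32_BIT_CONSTANT: 4 DWORDs of root constants mapped to b0.
params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
params[0].Constants.ShaderRegister = 0;
params[0].Constants.Num32BitValues = 4;
params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

// Root descriptor: 2 DWORDs holding a raw GPU VA for a CBV at b1. No size, no robustness.
params[1].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
params[1].Descriptor.ShaderRegister = 1;
params[1].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

// Table pointer: 1 DWORD. An unbounded SRV range starting at t8, space4,
// offset 15 descriptors from wherever the table points.
D3D12_DESCRIPTOR_RANGE srv_range = {};
srv_range.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
srv_range.NumDescriptors = UINT_MAX; // unbounded
srv_range.BaseShaderRegister = 8;
srv_range.RegisterSpace = 4;
srv_range.OffsetInDescriptorsFromTableStart = 15;

params[2].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
params[2].DescriptorTable.NumDescriptorRanges = 1;
params[2].DescriptorTable.pDescriptorRanges = &srv_range;
params[2].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

D3D12_ROOT_SIGNATURE_DESC rs_desc = {};
rs_desc.NumParameters = 3;
rs_desc.pParameters = params;
// Total cost: 4 + 2 + 1 = 7 of the 64 available DWORDs.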

The global descriptor heaps and table pointers

Descriptors are managed in a completely different way than Vulkan and this is the point that gives us a pretty big headache.

In D3D12, you’re expected to allocate one heap which contains N descriptors. N goes up to 1,000,000 in the highest binding tier. All descriptors you access in a shader must exist inside this heap, and the heap must be explicitly bound to a command list before use. For a given command list, you’re not really expected to change it, but you can suballocate, and 1 million descriptors should be enough unless we take the Nanite meme far enough where we need to render one material per pixel. Samplers have their own heap, and there’s a limit of a few thousand here.
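The setup on the application side is tiny; a minimal sketch, where the sizes and variable names are just illustrative:

D3D12_DESCRIPTOR_HEAP_DESC heap_desc = {};
heap_desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
heap_desc.NumDescriptors = 1000000; // TIER_3-style "one heap to rule them all"
heap_desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
device->CreateDescriptorHeap(&heap_desc, IID_PPV_ARGS(&resource_heap));

heap_desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER;
heap_desc.NumDescriptors = 2048; // the sampler heap is tiny by comparison
device->CreateDescriptorHeap(&heap_desc, IID_PPV_ARGS(&sampler_heap));

// Bound once per command list; root tables merely offset into these.
ID3D12DescriptorHeap *heaps[] = { resource_heap, sampler_heap };
command_list->SetDescriptorHeaps(2, heaps);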

From the global heap, you’re able to bind sub-ranges of that heap. These sub-ranges are represented as the table pointers in the root signature. Essentially, a table pointer encodes a u32 offset into the global heap – even if you give the API a GPU VA – which is why it costs 1 DWORD instead of 2 (for now …).

Apparently, this idea is a workaround for certain GPUs. There exists a literal “descriptor palette” in some hardware and we can only read descriptors from this “palette” essentially. D3D12 seems to cater to that model, at the cost of flexibility. You’re asked as a developer to own that palette / heap and allocate memory on top of that.

View objects work completely differently as well. In D3D12, a view object is created directly into a descriptor heap, rather than having separate VkImageView and VkBufferView objects. Copying descriptors is a critical part of D3D12 descriptor management. It’s possible to create non-shader visible descriptor heaps, which is basically just a fancy malloc(). Here, it’s intended that we can stream out descriptors to the shader-visible descriptor heap by calling CopyDescriptors over and over, often over 10000 times per frame! Making this path efficient enough has been a ton of work and has placed some serious restrictions on our implementation. There is only one way to do it “fast” and correct, it just took 3-4 iterations to get there.
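For reference, this is roughly what the streaming pattern looks like on the D3D12 side; a hedged sketch where texture, srv_desc, heap_offset and the shader-visible resource_heap from the earlier sketch are assumed to exist:

// A small non-shader-visible heap acts as CPU-side staging, i.e. the fancy malloc().
D3D12_DESCRIPTOR_HEAP_DESC staging_desc = {};
staging_desc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
staging_desc.NumDescriptors = 4096;
staging_desc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_NONE;
device->CreateDescriptorHeap(&staging_desc, IID_PPV_ARGS(&staging_heap));

// "View creation" writes descriptor data straight into heap memory, no view object.
device->CreateShaderResourceView(texture, &srv_desc,
    staging_heap->GetCPUDescriptorHandleForHeapStart());

// Stream it into the shader-visible heap. Multiply this by 10000+ per frame ...
UINT stride = device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
D3D12_CPU_DESCRIPTOR_HANDLE dst = resource_heap->GetCPUDescriptorHandleForHeapStart();
dst.ptr += heap_offset * stride;
device->CopyDescriptorsSimple(1, dst,
    staging_heap->GetCPUDescriptorHandleForHeapStart(),
    D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);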

In Vulkan, the approach is quite different. We create view objects as standalone objects, which are then written into descriptor sets with vkUpdateDescriptorSets. If we squint enough, the Vulkan view objects represent the non-shader visible descriptor heaps, and vkUpdateDescriptorSets is closer to CopyDescriptors (non-shader visible → shader visible) than Create*View. Profiling native drivers shows this clearly. Create*View() is ~50-100x slower than CopyDescriptors, similar to Vulkan if you look at vkCreate*View() vs. vkUpdateDescriptorSets.

In the Vulkan model, there is no concept of a “descriptor heap” which has to be bound, however, we can figure out what is really going on under the hood. We can allocate a lot of view objects in Vulkan, but eventually, we will actually hit out-of-memory conditions on some hardware. This is a good indication that we have actually exhausted the internal descriptor “palette”. The views merely contain references to that global palette instead.

Another headache is that view objects in D3D12 do not have a lifetime of their own. The lifetime is tied to the descriptor heap itself, as a descriptor is essentially treated as plain old data. This is not the case with Vulkan. The unfortunate side effect of this scenario is that we need to maintain a per ID3D12Resource hash-map in vkd3d-proton where we keep view objects alive until we know for sure that we can destroy the views, i.e. at resource destruction time. The older implementation in vkd3d-proton used reference counted views, but as you can expect, when games copy 10000+ descriptors per frame in many threads concurrently, the overhead was unusably large. The shit hit the fan so to say in Death Stranding where we spent >80% of CPU time copying descriptors, not a good look. The only solution was to use the VkCopyDescriptorSet functionality of vkUpdateDescriptorSets, which is considered quite esoteric in Vulkan.
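On the Vulkan side, the copy path boils down to something like this minimal sketch (set/binding layout and variable names are assumptions for illustration):

// Copy one descriptor from a host-only "staging" set into the big
// shader-visible bindless set, with no view objects involved at copy time.
VkCopyDescriptorSet copy = { VK_STRUCTURE_TYPE_COPY_DESCRIPTOR_SET };
copy.srcSet = staging_set;
copy.srcBinding = 0;
copy.srcArrayElement = src_index;
copy.dstSet = bindless_set;
copy.dstBinding = 0;
copy.dstArrayElement = dst_index;
copy.descriptorCount = 1;
vkUpdateDescriptorSets(device, 0, NULL, 1, &copy);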

So how do shaders access the global descriptor heap? Well, through a lot of fixed function jank, that’s how! This jank is more or less fixed in SM 6.6 as we’ll get to later, but now we just have yet another API to implement, support and test, sigh …

First, we start with the table parameter that is pushed to the GPU through the command list API. This represents an offset into the descriptor heap. A table entry has a certain number of descriptor ranges associated with it. These ranges specify that a certain subset of descriptors will be consumed as descriptors. For example, consider a Texture2D[] bound at register(t10, space4). A table entry in the root signature might say “Parameter #5 has an SRV range which begins at space #4, register #8, contains unbounded number of descriptors, with a constant offset of 15 descriptors from the table entry.” When accessing this resource, we should access the descriptor heap at descriptor #tableEntry5 + (10 – 8) + 15 + dynamicIndex in shader. As we see, the registers and spaces in the HLSL have no physical meaning until the root signature gives it meaning.

Codegen to SPIR-V

We can map the entire root signature into two parts, the root parameters and immutable sampler declaration.

For example, take this shader:

cbuffer CPush : register(b0) { float4 cpush; };
cbuffer C : register(b1) { float4 c; };
cbuffer C2 : register(b2) { float4 c2; };

Texture2D<float4> T[] : register(t40);
SamplerState S : register(s30);

float4 main(uint index : INDEX, float uv : UV) : SV_Target
{
	return T[NonUniformResourceIndex(index)].Sample(S, uv) +
		cpush + c + c2;
}

A typical codegen here would be something like (via SPIRV-Cross):

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_nonuniform_qualifier : require

struct AddCarry
{
    uint _m0;
    uint _m1;
};

layout(buffer_reference) buffer PhysicalPointerFloat4NonWrite;
layout(buffer_reference, std430) readonly buffer PhysicalPointerFloat4NonWrite
{
    vec4 value;
};

layout(set = 5, binding = 0, std140) uniform BindlessCBV
{
    vec4 _m0[4096];
} _21[];

// The root parameters map to push constants.
layout(push_constant, std430) uniform RootConstants
{
    uvec2 _m0; // Root CBV for b1
    uint _m1;  // Various table offset root parameters
    uint _m2;  // ...
    ...
    uint _m9;  // Root constants
    uint _m10;
    uint _m11;
    uint _m12;
} registers;

layout(set = 0, binding = 0) uniform texture2D _14[];
layout(set = 2, binding = 0) uniform sampler _25[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in float UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    uint _46 = registers._m6 + 2u;

    // Textures can only be accessed through heap.
    // Here we see address computation being based on all three inputs.
    // Push constant + constant + dynamic index
    // All of this is wrapped in a nonuniform thing for extra spice.
    vec4 _66 = texture(nonuniformEXT(sampler2D(_14[registers._m1 + (INDEX + 40u)], _25[registers._m3 + 30u])), vec2(UV));

    // Root constants, directly copy from push constants.
    // The cbufferLoad instruction gives us constant offsets, so we can
    // directly access push constant members here.
    vec4 _84 = uintBitsToFloat(uvec4(
        registers._m8, registers._m9, registers._m10, registers._m11));

    // Root CBV, here implemented with buffer device address,
    // kinda ugly SPIR-V, but gotta go fast.
    // This works great on AMD, not so much on NVIDIA … but we can
    // use a normal plain UBO instead if we want.
    // Ah, the luxury of giving the implementation a choice … :)
    
    AddCarry _98;
    _98._m0 = uaddCarry(registers._m0.x, 0u * 16u, _98._m1);
    PhysicalPointerFloat4NonWrite _105 = PhysicalPointerFloat4NonWrite(uvec2(_98._m0, registers._m0.y + _98._m1));

    // Bindless UBO action, table CBV here we go …
    // Not a great look for performance on some GPUs.
    // On Pascal in particular, we have to emit bindless SSBO instead, rip perf.
    // Not much we can do sadly.
    SV_Target.x = ((_84.x + _66.x) + _105.value.x) + _21[_46]._m0[0u].x;
    SV_Target.y = ((_84.y + _66.y) + _105.value.y) + _21[_46]._m0[0u].y;
    SV_Target.z = ((_84.z + _66.z) + _105.value.z) + _21[_46]._m0[0u].z;
    SV_Target.w = ((_84.w + _66.w) + _105.value.w) + _21[_46]._m0[0u].w;
}


It’s common to think of the Table parameter as equivalents to Vulkan descriptor sets, but that is misleading. The correct way to think about it is that D3D12 has one huge (technically two for Sampler heap) descriptor set, and for convenience you can push offsets into that heap as desired. In Vulkan, descriptor sets can be allocated and bound in complete isolation. Shader model 6.6 removes all of this pretense and shows it how it really is. Everything is just an index to the heap, and the pre-6.6 world tried to hide this fact.

My main problems with the binding model

Very loose coupling between root signature and table descriptor access

The only thing the root signature does is to define how offsets into the heap is computed. There is no knowledge about the array-ed-ness of a resource, nor if every resource is even valid at record time (exception here is STATIC descriptors, but alas, extremely rare in the wild). Unfortunately, this loose nature of root signature and shader matching complicates descriptor hoisting, and when you complicate hoisting, you better pray the target hardware has pristine bindless support (pro-tip, they don’t).

A root signature could say that “we have 100 CBVs in this range”, but shaders might just happen to use a few CBVs from that range. Doing descriptor work per shader is pretty gross and goes against the spirit of the modern graphics APIs. Aggressive hoisting and repacking is CPU overhead that we shouldn’t consider except in exceptional circumstances. A native driver might have a much easier time dealing with these things since they can modify their own command streams at the last minute if they want. In D3D12, SIMULTANEOUS_ACCESS command lists don’t exist, so the driver has this option. Perhaps this restriction was put in place precisely for these scenarios? Who knows, we can only armchair these things!

The D3D12 docs also encourage developers to use somewhat generic root signatures and rarely change them, so a root signature is likely going to be quite “fat” compared to an ideal Vulkan pipeline layout, making useful workarounds even more difficult …

Awkward and unnecessary aliasing

In a D3D12 descriptor heap, there is no descriptor type. Descriptor #10 can be an SRV and descriptor #11 can be a CBV, it’s all a few bits between friends. Surely, games are not going to commit the grave sin of placing the wrong descriptor type in the wrong index, right? … Right?! Of course they do. Debugging insanity like this was the prelude to vkd3d-proton’s fast descriptor QA checking mechanism. It’s actually quite fast. In AAA games, we have code like this running at 40 FPS usually. More than good enough I’d say when hunting for bugs.

void descriptor_qa_report_fault(uint fault_type, uint heap_offset, uint cookie, uint heap_index, uint descriptor_type, uint actual_descriptor_type, uint instruction)
{
    uint _63 = atomicAdd(QAGlobalData.fault_atomic, 1u);
    if (_63 == 0u)
    {
        QAGlobalData.failed_cookie = cookie;
        QAGlobalData.failed_offset = heap_offset;
        QAGlobalData.failed_heap = heap_index;
        QAGlobalData.failed_descriptor_type_mask = descriptor_type;
        QAGlobalData.actual_descriptor_type_mask = actual_descriptor_type;
        QAGlobalData.failed_instruction = instruction;
        QAGlobalData.failed_shader_hash = uvec2(291u, 0u);
        memoryBarrierBuffer();
        QAGlobalData.fault_type = fault_type;
    }
}

uint descriptor_qa_check(uint heap_offset, uint descriptor_type_mask, uint instruction)
{
    uint _98 = QAHeapData.descriptor_count;
    uint _100 = QAHeapData.heap_index;
    uvec2 _102 = QAHeapData.cookies_descriptor_info[heap_offset];
    uint _110 = QAGlobalData.live_status_table[_102.x >> 5u];
    uint _121 = (uint(heap_offset >= _98) | (((_102.y & descriptor_type_mask) == descriptor_type_mask) ? 0u : 2u)) | (((_110 & (1u << (_102.x & 31u))) != 0u) ? 0u : 4u);
    if (_121 != 0u)
    {
        descriptor_qa_report_fault(_121, heap_offset, _102.x, _100, descriptor_type_mask, _102.y, instruction);
        return _98;
    }
    return heap_offset;
}

void main()
{
    uint _45 = descriptor_qa_check(registers._m1 + (INDEX + 40u), 1u, 1u);
    vec4 _132 = texture(nonuniformEXT(sampler2D(_14[_45], _18[registers._m3 + 30u])), vec2(UV));
    SV_Target.x = _132.x;
    SV_Target.y = _132.y;
    SV_Target.z = _132.z;
    SV_Target.w = _132.w;
}


With this scheme we can check for:

  • Is the descriptor type correct?
  • Is the resource destroyed already?
  • Did we index the heap out of bounds?

If we hit failures we get a neat atomic cookie in host memory we can sample at regular intervals on CPU. Works surprisingly well and tailor made for our needs, which is likely why it’s so fast compared to generic alternatives. We also only report the first failure. With this mode enabled, we allocate N + 1 descriptors, and reserve the last descriptor as a NULL descriptor which we can OpSelect on failure to avoid GPU hangs. (well, if someone did a faulty atomic compswap loop on a NULL descriptor I guess we’d timeout, but whatever, not much we can do about that).

Forcing the addition of VK_VALVE_mutable_descriptor_type

The combination of weirdly aliased descriptor types and full bindless is pretty deadly since Vulkan has typed descriptors, for good reason! This unfortunate mismatch forced us to create many descriptor sets, one per type, which looks very dumb. It consumes enough memory that we actually ran into serious performance issues on AMD hardware in the beginning. To fix all the immediate issues, we developed VK_VALVE_mutable_descriptor_type, which is kinda gross, but it lets us implement the D3D12 weird aliased descriptors directly, and all the problems went away. It’s a pretty goofy model though and not something Vulkan developers should be using directly. Not all descriptors have the same size in hardware, so there are gaps between descriptors, which I’m pretty sure is the case for native D3D12 as well, rip K$ … The more natural model would have been to have descriptor heaps where equally sized objects were allocated in different heaps, but we have to deal with whatever problems D3D12 throws our way and make it work …
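For the curious, this is roughly how a mutable binding is declared; a hedged sketch where the exact type list and the giant descriptorCount are illustrative:

// One binding whose descriptors can be written as any of these types,
// mirroring the "it's all a few bits between friends" D3D12 heap.
VkDescriptorType mutable_types[] = {
    VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE,
    VK_DESCRIPTOR_TYPE_STORAGE_IMAGE,
    VK_DESCRIPTOR_TYPE_UNIFORM_TEXEL_BUFFER,
    VK_DESCRIPTOR_TYPE_STORAGE_TEXEL_BUFFER,
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,
    VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
};

VkMutableDescriptorTypeListVALVE type_list = { 6, mutable_types };

VkMutableDescriptorTypeCreateInfoVALVE mutable_info =
    { VK_STRUCTURE_TYPE_MUTABLE_DESCRIPTOR_TYPE_CREATE_INFO_VALVE };
mutable_info.mutableDescriptorTypeListCount = 1;
mutable_info.pMutableDescriptorTypeLists = &type_list;

VkDescriptorSetLayoutBinding binding = {};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_MUTABLE_VALVE;
binding.descriptorCount = 1000000;
binding.stageFlags = VK_SHADER_STAGE_ALL;

VkDescriptorSetLayoutCreateInfo set_layout_info =
    { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
set_layout_info.pNext = &mutable_info;
set_layout_info.bindingCount = 1;
set_layout_info.pBindings = &binding;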

Another “nice” side effect of using mutable was that some GPU hangs went away. If the game used the wrong descriptor type, it seemed to at least not read a descriptor that pointed to already freed memory, but rather just a descriptor of wrong type. Somehow, this helped certain games to run around the time the extension was released. It is deeply disturbing that games can ship in this state. :\

A game engine targeting Vulkan can be smarter about all these things. For example, having 1M sampled images seems reasonable, as that’s the number one thing you’d use bindless for, but do you need 1M CBVs, fully bindless? I doubt it. I’m far more in favor of a model where normal, packable, hoistable descriptors are used by default, and bindless is only tapped into as needed.

Enforces a fully bindless implementation

When implementing this binding model in other APIs, we’re forced to make everything fully bindless. Everything is accessed in terms of the global descriptor heap, and even if 99% of shaders use a very traditional D3D11-style binding model, we cannot do anything smart here because of one critical design flaw of D3D12 …

The great VOLATILE mistake of Root Signature 1.0

Root Signature 1.0 really doubles down on forcing full bindless everywhere, even when shaders don’t actually need it. All descriptors are considered VOLATILE, which means that we require all the Vulkan flags that are opt-in (see the sketch after this list):

  • UPDATE_AFTER_BIND
  • UPDATE_UNUSED_WHILE_PENDING
  • PARTIALLY_BOUND
  • VARIABLE_COUNT
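In Vulkan terms, every bindless set layout ends up being created roughly like this sketch:

VkDescriptorBindingFlags binding_flags =
    VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |
    VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT |
    VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |
    VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT;

VkDescriptorSetLayoutBindingFlagsCreateInfo flags_info =
    { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO };
flags_info.bindingCount = 1;
flags_info.pBindingFlags = &binding_flags;

VkDescriptorSetLayoutCreateInfo set_layout_info =
    { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
set_layout_info.flags = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
set_layout_info.pNext = &flags_info;
// ... bindingCount / pBindings as usual for the bindless binding.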

This fact completely blocks us from hoisting descriptors on the fly. Even if the shader is simple enough, we’re royally screwed because there is no guarantee that descriptors are actually valid at the time of command recording. Almost all games still cling to Root Signature 1.0, and probably rely on driver heroics to work around any performance implication. Table CBV on Pascal GPUs for example is disgustingly slow in vkd3d-proton. In Vulkan, these GPUs don’t even support UPDATE_AFTER_BIND UBOs! I can only wonder what utter depravities the D3D12 driver engineers had to do to make this work well … Hoist descriptors late with device generated commands based on the PSO? Bleh. The API forces us to implement these as bindless SSBO, and I die a little inside every time I have to think about this. Root CBV works perfectly fine, yet games insist on using table CBVs with VOLATILE just because …

On AMD, everything is fully bindless anyways, as descriptors are just memory, but it’s unfortunate that the D3D12 API went VOLATILE by default. It really should have been opt-in. It’s not like the poor Pascal souls can actually buy new GPUs these days even if they wanted to 🙂

The second mistake of Root Signature 1.1 – STATIC

The mistake of VOLATILE-by-default was recognized by the time 1.1 rolled around, and STATIC was now made the default. STATIC is the nicest mode possible for drivers, since it fully allows hoisting of descriptors, but you actually lose robustness guarantees (!). Most big game engines probably looked at this for 2 seconds and noped out.

D3D in general has gone to extreme lengths to ensure robustness guarantees, and engines of course rely on every esoteric OOB scenario to not fail. This is hopefully not by design, but engine bugs get masked for years if they never cause issues in practice. Losing that guarantee means no one is going to risk random GPU crashes in the wild for just a potential performance increase.

The actual fix that no game ever uses – STATIC + BOUNDS_CHECKED

The actual fix is fairly obscure. At some point a more reasonable mode was added which preserves bounds checking. Of course, no game I know of actually uses this. vkd3d-proton can take advantage of STATIC / BOUNDS_CHECKED CBVs in some cases where we will hoist them to push descriptors for significant NV perf gains, but alas, when no games use the APIs like intended, it might as well not exist …

Immutable sampler jank

The docs imply that the driver is supposed to implement this with an internal hashmap of sampler objects. Oddly enough, these sampler objects live outside the normal sampler heap you bind to the command list, and in fact, you don’t have to bind anything to use immutable samplers, it’s intended to be implemented through driver magic.

This magic is certainly odd, but in Vulkan we have to allocate and bind immutable samplers to a command buffer. This is because very little hardware actually supports true immutable samplers in the sense that they are embedded in the shader code itself. While we still have to allocate descriptor sets, at least we don’t have to actually write the immutable samplers to the descriptor set.
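A minimal sketch of what that looks like, assuming the immutable_samplers array has already been created from the root signature's static sampler descriptions:

// The samplers are baked into the set layout itself; at descriptor update
// time there is nothing to write for this binding.
VkDescriptorSetLayoutBinding binding = {};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_SAMPLER;
binding.descriptorCount = num_immutable_samplers;
binding.stageFlags = VK_SHADER_STAGE_ALL;
binding.pImmutableSamplers = immutable_samplers;

VkDescriptorSetLayoutCreateInfo set_layout_info =
    { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
set_layout_info.bindingCount = 1;
set_layout_info.pBindings = &binding;
vkCreateDescriptorSetLayout(device, &set_layout_info, NULL, &immutable_sampler_set_layout);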

On AMD, the descriptors are read directly from scalar registers, so there’s nothing stopping a motivated compiler from emitting a bunch of immediate register moves and going to town, but sadly, most hardware doesn’t work like this, and I don’t think drivers even take advantage of this possibility yet.

My big question is how this is even supposed to work on D3D12 if the samplers have to live somewhere in memory. If the shader is simply storing constant offsets into some hidden sampler heap, then surely those constants are not stable across different runs, and thus shader caching fails. Patching in new constants seems pretty gross, but I can only speculate … Either way, in vkd3d-proton we end up creating a VkDescriptorSet per root signature which holds all immutable samplers for a given root signature. This is automatically bound when flushing dirty state in a draw or dispatch call.

Painful and awkward raw buffer types

In Vulkan we are blessed with the SSBO, the flexible and versatile buffer. D3D12 is a bit more weird and has extremely specific alignment and robustness guarantees which don’t map cleanly to anything.

First, we have ByteAddressBuffer.

  • Alignment requirement of buffer binding, 16 bytes (good)
  • Load1,2,3,4 variants, with 4 byte alignment requirement (2 for 16-bit). SM 6.2 improved this a fair bit, with a templated type, but for us in DXIL land, very little changes.
  • Robustness is checked per-component (why ._.)
  • Loads and stores are done at specific byte offsets, which is super ugly, but mostly for shader authors

The only natural way to implement this is a uint[] array, where we unroll loads and stores per component. Fun times. I would be surprised if hardware actually supported vectorized load-store with per-component robustness at scalar alignment …

The second style is StructuredBuffer which represents a buffer as a T data[] array. One deeply frustrating aspect of this is that data structures are tightly packed, meaning T could be float3, and we’d get 12 byte stride. StructuredBuffers must be bound in terms of elements, not byte offsets, so it’s perfectly valid to bind a structured buffer at 12 byte offset. This cannot be implemented safely on GPUs which report 16 byte SSBO alignment sadly. Due to all this, it’s almost impossible to cleanly declare a vec3 data[]; inside the SPIR-V, even if we have scalar block layout support. Just like ByteAddressBuffer, we kinda have to unroll loads and stores with uint data[] again, shame … We have gone to great lengths to support an “offset” buffer, which means we can bind an SSBO at 16 byte alignment, and nudge the offset inside the shader to fix things up, but it is horrible.

There are some mind-melting optimizations in flight for dxil-spirv which can vectorize these things, but it’s not a perfect solution. Weird robustness and alignment rules forces us to do some pretty horrible things, and we only have a few escape hatches which we can exploit, and dxil-spirv will try hard to squeeze every last legal vectorization opportunity out of the existing rules.

For example, StructuredBuffers have a rule saying that robustness only needs to be checked once, based on the structure element that is being accessed. For dynamic access into the element itself, it is undefined behavior to straddle the element boundary. Think StructuredBuffer<float4> where you do a dynamic access into the components, and index overflows.

I’m wondering if hardware actually has a special descriptor type they use to implement structured buffers. Perhaps the same ones which are used for vertex attribute fetch? And yes, even AMD uses tbuffer with stride pulled from descriptors, so saying it’s fully programmable fetch is a bit misleading I think.

Texel buffers are not a valid solution either. They can be bound at scalar alignment, which is nice, but we cannot safely use vectorized load-store either. Texel buffers can only be read and written to fully. No write masks allowed. Another problem is mixing and matching 16-bit load-store, which completely broke any hope of using texel buffers. In Vulkan, we can redeclare an SSBO multiple times with different data layouts, which is basically a human readable version of ByteAddressBuffer.

It’s a bit frustrating to deal with these problems, because almost no sane application is going to run into the edge cases, or need all the esoteric robustness features, but we have to make every possible edge case work. We cannot take shortcuts. To demonstrate the insanity we have to go through on certain GPUs …

StructuredBuffer<float3> B;

float3 main(uint index : INDEX) : SV_Target
{
	return B[index];
}

Since this can be bound at offset 12, we might see this horrible code. The offset buffer is associated with the descriptor heap, and we allocate a separate side channel buffer for this purpose.

layout(set = 1, binding = 0, std430) restrict readonly buffer SSBO_Offsets
{
    uvec2 _m0[];
} _13;

layout(set = 1, binding = 1, std430) restrict readonly buffer SSBO
{
    uint _m0[];
} _18[];

layout(push_constant, std430) uniform RootConstants
{
   // root parameters
} registers;

layout(location = 0) flat in uint INDEX;
layout(location = 0) out vec3 SV_Target;

void main()
{
    // Load extra offset + actual range. We cannot bind the SSBO tightly enough
    // on some GPUs.
    // Since the access is not nonuniform, we can give a strong hint to compiler
    // that the offsets can be loaded as a broadcast.
    uvec2 _37 = _13._m0[subgroupBroadcastFirst(registers._m1)] >> uvec2(2u);
    uint _41 = INDEX * 3u;
    // Trip explicit OOB if we fail.
    uint _47 = (_41 < _37.y) ? (_41 + _37.x) : 1073741820u;
    // Unroll 3 loads. ;_;
    vec3 _60 = uintBitsToFloat(uvec3(_18[registers._m1]._m0[_47], _18[registers._m1]._m0[_47 + 1u], _18[registers._m1]._m0[_47 + 2u]));
    SV_Target.x = _60.x;
    SV_Target.y = _60.y;
    SV_Target.z = _60.z;
}

The horrible workarounds

We have a fair amount of issues to deal with. A lot of invalid API behavior in D3D12 happens to work on native drivers, and many games ship with subtle bugs which force us to do horrible things. IHVs are going to have a fun time in 5 years when hardware details change 🙂

What is a raw buffer anyways?

A surprisingly common bug is that D3D12 does not complain if you create a typed buffer when you actually intended to create a raw buffer (ByteAddressBuffer), and vice versa. The difference is just one flag away and is easily missed. For us, that means SSBO vs texel buffer, a catastrophic error, but apparently, D3D12 does not care and drivers will happily consume typed buffers as raw and vice versa. I have no idea how this works, but somehow it does (I wrote tests)! Either way, it got bad enough that when emitting buffer descriptors, we have to emit both an SSBO variant and a texel buffer variant, and then the shader can pick the correct descriptor it wants. Of course, this being bindless and decoupled, we cannot pick the correct descriptor at draw time, sigh … Sometimes I think bindless all the things is going to be considered a huge mistake in 10 years from a software ecosystem point of view.

What is a NULL descriptor anyways?

A reasonably common bug is the NULL descriptor with wrong type. Technically in D3D12, NULL descriptors are still typed and you have to pass in a valid resource desc when creating one, but mistakes happen all the time. Using the wrong type works perfectly fine on existing hardware in almost all cases (I wrote a test!). For this reason, vkd3d-proton takes the conservative approach, and writing a NULL descriptor means splatting out NULL descriptors to all the different types of descriptors we have in the heap. For MUTABLE, this means two sets, for non-mutable, six! Voila, various GPU hangs disappeared, just like that.

VOLATILE is really bad for validation

Having simple validation is critical for development, and I think the real hidden mistake of VOLATILE being default, is that only GPU-assisted validation can catch any bugs! GPU validation is very heavy and usually avoided when possible. I feel a ton of these bugs could have been caught if descriptors could have been validated on CPU timeline. As long as an invalid descriptor is not dynamically accessed by a shader, it’s all OK! Dynamically accessed is the critical wording here. With this requirement, we can only validate on submit time if we can prove ahead of time that a shader will generate invocations which then statically access an invalid descriptor. Good luck proving that fragment shader invocations happen 100% of the time! Don’t get me wrong, bindless is very powerful and great for the use cases it enables, but derping everything into bindless doesn’t seem like the correct approach to me.

Local root signatures – one great kludge to rule them all

Local root signatures add yet another layer of hell to the binding model. It was annoying enough to implement a dozen different ways to access resources, and now we have to do it all over again, but this time replacing push constants with SBTs + some additional edge cases that are specific to local root signatures!

Local root signatures are used specifically for the shader record tables in ray tracing pipelines. In Vulkan, this data is accessed directly as a buffer, but D3D12 went the way of describing a view instead where each parameter is laid out in memory one after the other. As usual, table pointers are a thing, but this time, they consume 2 DWORDs instead of 1. Why on earth were table pointers described in terms of GPU VA and not just a simple u32 “offset into heap”? Sometimes I find D3D12’s obsession with GPU VA-s quite bizarre and counter-productive. It doesn’t even have buffer device address support, yet we’re using them everywhere, even when it doesn’t make much sense at all. In vkd3d-proton, we had to do the janky thing of returning artificial GPU VA-s for descriptor heaps so that it was fast to convert the VA into an offset.

While root descriptors in global root signatures can be implemented as descriptors, we are forced to use buffer device addresses since the VA is literally sourced from buffer memory. The fun thing is that you can even place RT acceleration structures here as a literal VA! And this is how OpConvertUToAccelerationStructureKHR was born … 🙂

Local root signatures also add the headache of supporting immutable samplers! This time, we have to defer the creation of a second immutable sampler descriptor set to RTPSO creation, which is super annoying. At bind time, we therefore might run the risk of having to bind one immutable sampler set for the global root signature, then one which is tied to the union of all local root signatures for the RTPSO, so much fun! >_<

Implementing SM 6.6 bindless

SM 6.6 is actually straightforward. Our model in vkd3d-proton was always to map pre-6.6 style bindings to a SM 6.6 style “access the heap directly” model, so this translation is actually not that bad at all. The main annoyance is that there are now 5 (!) opcodes in DXIL to deal with resource handle creation, each with their own idiosyncrasies. I’m pretty sure all of this could have been squeezed into 2 …

  • CreateHandle
  • CreateHandleForLib
  • CreateHandleFromHeap
  • CreateHandleFromBinding
  • AnnotateHandle

Fun times indeed …

float4 main(uint index : INDEX, float2 uv : UV) : SV_Target
{
	Texture2D<float4> T = ResourceDescriptorHeap[index];
	return T.Load(int3(uv, 0));
}

#version 460
#extension GL_EXT_nonuniform_qualifier : require
#extension GL_EXT_samplerless_texture_functions : require

layout(set = 0, binding = 0) uniform texture2D _9[];

layout(location = 0) flat in uint INDEX;
layout(location = 1) in vec2 UV;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _32 = texelFetch(_9[INDEX], ivec2(uvec2(uint(int(UV.x)), uint(int(UV.y)))), int(0u));
    SV_Target.x = _32.x;
    SV_Target.y = _32.y;
    SV_Target.z = _32.z;
    SV_Target.w = _32.w;
}

Nice. I recently completed SM 6.6 support for vkd3d-proton, have fun with that!

How this maps to AMD GPUs (teaser for next post)

If anything, it feels like the D3D12 binding model is tailored pretty well to AMD hardware. It probably makes sense given the Mantle heritage. The binding model on GCN is extremely neat and tidy and was designed with full bindless in mind since the early days, quite forward looking indeed. Descriptors are just memory that is loaded into scalar registers and passed directly to various operations which take descriptors. Very little magic is going on here, and once you understand how this works, the modern API binding models start making some sense. Root parameters and push constants map nicely to user SGPRs, and as long as we don’t use too many u32s, we have a very direct mental model of how this is going to map to hardware.

Where D3D12 suffers I think is on hardware where there are still remnants of legacy descriptor slots in hardware. We cannot take advantage of this due to poor design decisions made in D3D12, but it is what it is. There is only so much we can do as a translation layer.

We’ll explore this topic in a future post …

Conclusion

It has taken many, many months, if not over a year to get where we are today in vkd3d-proton. Handling all of this insanity pushed us to the breaking point, but somehow it works pretty well. There are no shortcuts we can take.

Despite the mammoth length of this post, I don’t think we’re quite done with the subject of binding models. I’ll probably need another post to cover codegen examples for AMD ISA as well as an overview of the million ways we can convert the same DXIL code to different SPIR-V depending on the root signature.

My personal hell of translating DXIL to SPIR-V – part 2

In the previous blog post, I began a long form exploration of the DXIL shader format, and how it translates to SPIR-V. In this part, we’ll look more closely at the LLVM format, how it is parsed, and how to interpret the parsed result.

The LLVM IR binary format is mostly undocumented. Very early on we have to dig through the source to understand what is going on. LLVM IR was never intended to be used as a “standard” format that is shipped between different software stacks. It’s clearly an ad-hoc serialization format that serves the purpose of LLVM internals. The IR format is backwards compatible at the very least, which is why we can parse DXIL LLVM 3.7 modules with modern LLVM versions.

As we’ll see, LLVM IR is very complex to parse compared to SPIR-V. There are some interesting similarities however, as SPIR-V shares some DNA of LLVM.

Layered architecture

LLVM IR is parsed in multiple layers. At the lower level is a compression scheme which feels somewhat like LZ compression. The bit-stream teaches the decoder how to decode the stream, by emitting “templates” (or “code book entries”), and these templates can then be instantiated to form complete records.

The low-level bit-stream parser

The initial part of the LLVM IR puzzle is pulled from RenderDoc’s implementation. The basic gist of it is documented here. To summarize however, the idea is that a module consists of one top-level “block”. A block is a structure of blocks and records. A record has an ID with an array of uint64_t operands (quite similar to SPIR-V, except SPIR-V is array of 32-bit operands).

Storing full uint64_t operands is of course very wasteful, and this is where the primitive types of LLVM IR come in. We can express primitive types compactly with:

  • Variable length integers (configurable chunk size)
  • Fixed width integers (configurable bit width)
  • 6-bit chars (useful for C-style identifiers, i.e. a-z, A-Z, 0-9 and _)

Variable length integers are encoded in a scheme where we look at N bits at a time, N – 1 bits contain useful data, and the MSB marks whether to keep looking at N more bits.
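A decode sketch for the variable length integers might look like this, assuming a hypothetical read_bits() helper that pulls raw bits off the stream:

uint64_t decode_vbr(unsigned chunk_bits)
{
    uint64_t result = 0;
    unsigned shift = 0;
    for (;;)
    {
        uint64_t chunk = read_bits(chunk_bits);
        // The lower chunk_bits - 1 bits carry payload.
        result |= (chunk & ((1ull << (chunk_bits - 1)) - 1)) << shift;
        // MSB of the chunk set means another chunk follows.
        if (!(chunk & (1ull << (chunk_bits - 1))))
            break;
        shift += chunk_bits - 1;
    }
    return result;
}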

Blocks and records are invoked in an esoteric way, which is where abbreviations come in. When we parse, we’re parsing abbreviations one at a time, each of which results in some action:

  • 0 – END_BLOCK – Ends block scope
  • 1 – ENTER_SUBBLOCK – Begins a new scope, can nest arbitrarily
  • 2 – DEFINE_ABBREV – Defines a template for how to build new records. For example, we can specify that a record is [vbr4, char6, literal constant] or something. That abbreviation implicitly gets a new ID, starting with 4, which can be invoked when parsing new abbreviations. When decoding this abbreviation, the parser knows ahead of time how to decode the bits into arguments. For char6 strings in particular, it’s also possible to specify an array abbreviation.
  • 3 – UNABBREV_RECORD – YOLO mode, directly decodes a record with a bunch of variable length integers, fairly inefficient. DXC seems to love to emit these 🙂
  • 4+ – Invoke user abbreviations.

In typical LLVM IR fashion, the number of bits used to encode the abbreviation ID is variable. It starts at 2 bits (since there are no user abbreviations to worry about yet), but can grow as needed. Fun!

The details are not super interesting for this post, but suffice to say, there’s a decent amount of detail that goes into parsing this.

The calling code ends up looking something like this:

LLVMBC::BitcodeReader reader(static_cast<const uint8_t *>(data), size);
LLVMBC::BlockOrRecord toplevel = reader.ReadToplevelBlock();

// The top-level block must be MODULE_BLOCK.
if (KnownBlocks(toplevel.id) != KnownBlocks::MODULE_BLOCK)
   return nullptr;

// We should have consumed all bits, only one top-level block.
if (!reader.AtEndOfStream())
   return nullptr;

The BlockOrRecord struct is fairly straightforward, simplified here as:

struct BlockOrRecord
{
  uint32_t id; // What kind of record or block is this?
  Type type; // block or record
  dxil_spv::Vector<BlockOrRecord> children; // If block
  dxil_spv::Vector<uint64_t> ops; // If record
};

Higher level parser

Now, we’re at a level where we have recovered structure from the bit-stream, and now we need to turn the BlockOrRecord structs into actual API objects, llvm::Module, llvm::Function, llvm::Value, etc, etc … dxil-spirv implements a LLVM C++ API drop-in replacement to be able to cross-reference our implementation against the reference implementation at any time (which has saved me many times). The implementation only implements exactly what is needed for DXIL however, so don’t expect too much of it. 🙂

Refer to objects by ID

Very similar to SPIR-V, types and values are referred to by an uint64_t ID. The annoying part however is that types and values implicitly allocate their own IDs, meaning that forgetting to parse something can be fatal. On top of this, IDs may or may not refer to other IDs through deltas relative to their own values or absolute values. It is somewhat context sensitive which one to use, which gets quite annoying to deal with.

Decoding llvm::Type

LLVM IR has a type hierarchy similar to SPIR-V. You start by declaring fundamental types like ints and floats, and then upgrade them to vectors, arrays, pointers or structs. While parsing, the top-level block can contain TYPE blocks, which contain a bunch of records.

for (auto &child : toplevel.children)
{
   if (child.IsBlock())
   {
      switch (KnownBlocks(child.id))
      {
      case KnownBlocks::TYPE_BLOCK:
         for (auto &entry : child.children)
             parse_type(entry);
         break;
      }
   }
}

bool ModuleParseContext::parse_type(const BlockOrRecord &child)
{
   Type *type = nullptr;
   switch (TypeRecord(child.id))
   {
   case TypeRecord::VOID_TYPE:
   case TypeRecord::HALF:
   case TypeRecord::INTEGER:
   case TypeRecord::POINTER:
   case TypeRecord::ARRAY:
   case TypeRecord::FUNCTION:
   // you get the idea
   }
   types.push_back(type);
}

Integers deserve special mention, because they are somewhat whacky in LLVM. First, they have no signedness associated with them. This kinda makes sense, since signedness only actually matters in certain opcodes, like signed min/max, signed compare, arithmetic vs logical right shift, signed vs unsigned float <-> int conversion, etc. SPIR-V maintains sign for its integer type, but we can ignore it in most scenarios. (There is an esoteric exception to this however where DXIL kinda breaks down, once we dig into relaxed precision signed integers!) Another annoying exception we have to deal with all the time is stage IO and resource variables which are explicitly signed or unsigned in DXIL.

As the grizzled C programmer will know, signed overflow is undefined, but unsigned overflow is not. Does LLVM just not care? Well, it does. LLVM can mark operations as being “no signed wrap”, or “no unsigned wrap” for optimization purposes, but we don’t have to care about those at all fortunately.

Booleans are expressed as 1-bit integers, which kind of makes sense, but at the same time feels like a very LLVM thing to do … Logical operations reduce to simple arithmetic operations on 1-bit values instead.

The final whack part is that you can declare non-POT integer sizes. There are shaders in the wild which declare 11-bit integers and rely on wrapping on these values to work! (dear lord … <_<) I even tried to compile this to x86_64 and yes, it does actually deal with it correctly. I’m kind of amazed, and scared at the same time.

Overall though, type declaration in LLVM IR is pretty easy to understand if you understand SPIR-V.

Decoding constants

Similar to types, constants are records within a block. They can appear at function scope or global scope.

bool ModuleParseContext::parse_constants_record(
    const BlockOrRecord &entry)
{
    llvm::Constant *value = nullptr;
    switch (ConstantsRecord(entry.id))
    {
        case ConstantsRecord::SETTYPE:
        case ConstantsRecord::CONST_NULL:
        case ConstantsRecord::UNDEF:
        case ConstantsRecord::INTEGER:
        // ...
    }
    values.push_back(value);
}

Roughly speaking, this looks very similar to type parsing, with some quirks. SETTYPE informs subsequent constant blocks which type is actually used. CONST_NULL is an example of a fully context sensitive constant, similar to OpConstantNull in SPIR-V.

INTEGERs are converted through sign rotations. Since small negative numbers would be horribly inefficient to encode with VBR otherwise, the first bit is the sign bit, encoded in a sign magnitude scheme. -0 is interpreted as INT64_MIN.
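The decode is simple enough; a small sketch of the sign rotation described above:

int64_t decode_sign_rotated(uint64_t v)
{
    if ((v & 1) == 0)
        return int64_t(v >> 1); // Positive, just shift away the sign bit.
    if (v != 1)
        return -int64_t(v >> 1); // Negative magnitude.
    return INT64_MIN; // "-0" is repurposed to mean INT64_MIN.
}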

Where LLVM constants get disgusting however, is the pseudo-specialization constant operation support. It is possible to encode a constant cast operation, or constant access chain into a global object (wtf?) this way. I don’t understand the motivation behind this, but there are lots of super weird edge cases here that took some time to iron out.

AGGREGATE is the first time we start to see how value IDs are referenced.

Vector<Value *> constants;
constants.reserve(entry.ops.size());

for (auto &op : entry.ops)
{
   constants.push_back(get_value(op, element_type,
      /* force absolute IDs */ true));
}

Value *value;
// Ah, yes. Why have VECTOR and ARRAY types when you can
// have a context sensitive one instead.
if (current_constant_type_is_vector)
{
   value = context->construct<ConstantDataVector>(
      get_constant_type(), std::move(constants));
}
else
{
   value = context->construct<ConstantDataArray>(
      get_constant_type(), std::move(constants));
}

get_value() is quite sneaky. In LLVM IR, it is valid to forward reference an ID, as long as the type is known. This leads to ProxyValue objects being created, which are resolved later. get_value() can be relatively indexed, or absolutely indexed depending on the context, which is always fun.

Global variables

Global variables are declared in top-level records. Typically these are only used for groupshared variables. In DXIL, a special pointer address space is reserved for this purpose. Global variables can also be used for global look-up tables. Global variables can also have optional initializers. This is very similar to SPIR-V overall. The equivalent is an OpVariable with either Workgroup or Private storage class.

Resource handles are declared in a completely different way … unless we’re DXR (more on that later, sigh v_v …)

Function prototypes

We also get to declare function prototypes at this stage. Some functions only have prototypes, and the common case here is various prototypes which declare dx.op intrinsics functions. If a prototype is declared to also have a body, we place that in a queue.

We also have to parse parameter attribute lists (surprisingly tricky!), just in case the function declares LLVM attributes. The only case we have to care about here is FP32 denorm handling. Why that isn’t a metadata entry, I’ll never know. DXIL really likes splitting its implementation across two completely different systems for no good reason …

Parsing functions

A function body is a block, consisting of records (which express normal opcodes), and other blocks (e.g. constant blocks). The first record we’ll typically see is the DECLAREBLOCKS one, which specifies the number of basic blocks in the function.

Basic blocks

A basic block is a fundamental building block of SSA-based IRs. A basic block enters execution at the first instruction, and executes in a straight line fashion until a terminator instruction executes. A terminator instruction can be anything which transfers control like a direct branch, conditional branch, switch statement, returns, etc. If you know SPIR-V, this is nothing new. It’s the exact same concept.

Unlike SPIR-V, where we have explicit OpLabel opcodes to begin a new block, LLVM makes this implicit. When we observe a terminator, the next instruction will be added to the next basic block.
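A hedged sketch of how that can be handled while parsing; the helpers and members here are illustrative, not the actual dxil-spirv code:

bool add_instruction_to_current_bb(Instruction *inst)
{
    if (!current_bb)
        return false; // Ran out of declared basic blocks, the module is broken.

    current_bb->add_instruction(inst);

    if (inst->isTerminator())
    {
        // No OpLabel here; the next instruction implicitly belongs to the
        // next basic block declared by DECLAREBLOCKS.
        current_bb_index++;
        current_bb = current_bb_index < basic_blocks.size() ?
                     basic_blocks[current_bb_index] : nullptr;
    }
    return true;
}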

Context sensitive parsing

The opcodes in the IR closely match the type hierarchy of LLVM, so let’s look at parsing llvm::BinaryOperator. A binary operation is any c = op(a, b) kind of instruction, it’s not necessarily just and/or/xor, etc. This is a catch-all for FAdd, IMul, IAdd, Xor, etc.

case FunctionRecord::INST_BINOP:
{
   unsigned index = 0;
   auto lhs = get_value_and_type(entry.ops, index);
   if (!lhs.first)
      return false;
   auto *rhs = get_value(entry.ops, index, lhs.second);
   if (!lhs.first || !rhs)
      return false;
   if (index == entry.ops.size())
      return false;
   auto op = BinOp(entry.ops[index++]);
   auto *value = context->construct<BinaryOperator>(lhs.first, rhs, translate_binop(op, lhs.second));
   if (index < entry.ops.size())
   {
      // Only relevant for FP math,
      // but we only look at fast math state for
      // FP operations anyways.
      auto fast_math_flags = entry.ops[index];
      bool fast = (fast_math_flags &
          (FAST_MATH_UNSAFE_ALGEBRA_BIT |
           FAST_MATH_ALLOW_CONTRACT_BIT)) != 0;
      value->set_fast_math(fast);
   }
   if (!add_instruction(value))
      return false;
   break;
}

In SPIR-V a binary operation would be encoded as:

%id = OpMyBinOp %type %operand_a %operand_b

Very explicit and understandable. LLVM IR on the other hand is more clever, for better or worse.

First, the result %id is implicit, and is allocated linearly as new opcodes come in. The type of an instruction is context sensitive. First, we parse %operand_a. If we have seen this ID already, %type is deduced directly from the operand. If it is a forward reference, the type of %operand_a is encoded in the record explicitly.

IDs in most opcodes are encoded with a relative scheme. Since SSA requires that the declaration of an ID dominates all uses of it, the common case is that uses of an ID come after its declaration, so this is a decent compression scheme. The implementation of get_value_and_type() is something like:

std::pair<Value *, Type *> ModuleParseContext::get_value_and_type(
   const Vector<uint64_t> &ops, unsigned &index)
{
   if (index >= ops.size())
      return {};

   uint64_t op = ops[index++];
   // Context sensitive, for backwards compat mostly, but
   // modules can choose to use absolute or relative encoding.
   if (use_relative_id)
      op = uint32_t(values.size() - op);

   if (op < values.size())
   {
      // Normal reference.
      return { values[op], values[op]->getType() };
   }
   else
   {
      // Forward reference, the type is encoded in the next element.
      if (index >= ops.size())
         return {};

      auto *type = get_type(ops[index++]);
      auto *proxy = context->construct<ValueProxy>(type, *this, op);
      pending_forward_references.push_back(proxy);
      return { proxy, type };
   }
}

I had tons of bugs where I didn’t handle the possible forward references. Very awkward. I was of the impression that only PHI instructions could possibly have forward references, but of course, it’s never that simple.

Speaking of PHI, get_value() changes here to a signed-aware variant, where the relative value ID is encoded with sign_rotation, just like for INTEGER constants. This is because we expect that PHI inputs are forward referenced just as often as backwards referenced.
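
As far as I understand the LLVM bitcode encoding, decoding such a sign-rotated value just means peeling off the LSB, something like:

// Sign rotation: magnitude in the upper bits, sign in the LSB,
// so small positive and negative deltas both encode compactly.
static int64_t decode_sign_rotated(uint64_t v)
{
   return (v & 1) ? -int64_t(v >> 1) : int64_t(v >> 1);
}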

Overall, it’s just a grind to implement all relevant opcodes. DXIL only uses a subset, but it’s not well documented which subset of LLVM IR is actually used. We just have to implement new stuff as it comes in. DXIL.rst has a list of which LLVM instructions are supported, but this list cannot be trusted because in DXR, DXC emits various vector instructions (so much for being a scalar IR format) and the unreachable terminator which is missing from the table.

Metadata

Metadata lives in its own block hierarchy and has a completely different set of types, llvm::MDNode, llvm::MDOperand, llvm::ConstantAsMetadata, etc.

At the top of the hierarchy we can declare NamedMDNodes, which we see in the LLVM assembly as:

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.entryPoints = !{!7}

NamedMDNodes contain a list of MDNodes, which nest into more MDNodes, or terminate in constant values. These correspond to NAMED_NODE, NODE and VALUE record types; not too many surprises here.
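
Consuming these on our side is mostly a matter of looking up a named node and walking its operands. A hedged sketch, assuming LLVM-style accessors:

const MDNode *get_entry_point_node(const Module &module)
{
   // !dx.entryPoints = !{!11}
   auto *named = module.getNamedMetadata("dx.entryPoints");
   if (!named || named->getNumOperands() == 0)
      return nullptr;
   // Operand 0 is the MDNode holding the function pointer, name,
   // stage IO lists and resource lists.
   return named->getOperand(0);
}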

Emitting SPIR-V opcodes

After we get through the parsing step, LLVM IR and SPIR-V have many similarities, and translating opcodes isn’t particularly difficult. For each LLVM basic block, we emit a basic block in SPIR-V and translate the opcodes one by one. We preserve the SSA nature as-is. There are of course a lot of details here, but they’re not very interesting, and there are too many to enumerate. The two biggest problems we need to focus on are:

Resource access

Accessing textures, constant buffers, structured buffers and the weird and wonderful zoo of resource types is insanely intricate, and I’ll need to dedicate an entire blog post to this.

Control flow structurization

Another massive issue is the control flow. In LLVM, there is no structurization information whatsoever, and we’ll have to reconstruct this somehow. This is at least one more blog post. After emitting SPIR-V code into basic blocks, we need to rewrite the control flow and annotate the basic blocks with merge information, which then allows us to emit a final SPIR-V module, ready for driver consumption.

Conclusion

This was a rough overview of LLVM IR and how it is parsed from scratch. I think it’s safe to say it’s far more difficult to parse than SPIR-V, which is literally just a stream of [N x uint32_t] opcodes. Parsing IR is not the most exciting part of DXIL to SPIR-V conversion, but it had to be done. On the other hand, it might be useful starting knowledge for other projects.

For next post, we’ll look at how to translate the D3D12 binding model into Vulkan.

My personal hell of translating DXIL to SPIR-V – part 1

In vkd3d-proton, we’re translating D3D12 to Vulkan, and the single biggest piece I’ve contributed so far is translating shader model 6 to Vulkan. We already had DXBC support, but what is a graphics API without two completely incompatible IR formats, right? Getting working DXIL support in vkd3d-proton was the biggest target when I began working on vkd3d. The result of this work is a standalone library and tool, dxil-spirv.

Introduction

In this blog series I’d like to go through the DXIL format, explain the problems I’ve had to solve and hopefully serve as an introduction to basic compiler theory, explained by yours truly who has no idea what they’re talking about. What could go wrong!?

I never learned any of this formally, but through working on SPIRV-Cross and dxil-spirv, I just had to learn this by trying and failing. Most of this theory is locked behind academic and computer science jargon, because it is a domain where rigor is actually necessary, and the algorithms for most things have been well known since the 70s (60s?). This isn’t graphics programming where we can be “close enough” and hand wave problems away. The smallest inaccuracy will break anything you come up with. Mix this with recursive tree traversal algorithms and all hell breaks loose when you least expect it. I wonder where compiler engineers pick this stuff up. Is there a secret club I’m not invited to? :p

DXIL has been, and continues to be extremely painful to translate correctly, with new edge cases popping up every week it seems. The main reasons for this boil down to three core problems, which this blog series intends to address over its course.

Rewrite goto soup to strict structured control flow

Of the three core problems, unstructured control flow, aka goto soup, is the real difficult part, and solving this problem is one of the single most painful problems I’ve ever worked on. When I think I’ve solved it, the next AAA game we see finds a new way to break it. At the very least, it seems like it gets a little easier with every iteration, so we’re getting asymptotically correct at the very least!

It’s very likely that the existing solution in dxil-spirv is just garbage, and it needs to be rewritten from scratch at some point, but at least the current implementation can play some vidya, so, eh.

LLVM 3.7 IR

LLVM is another problem, since we need to consume a barely documented IR format which was never designed to be consumed as a standard format to begin with. The official DXC compiler is literally a fork of LLVM, forever frozen at the 3.7 version.

It is of course possible (and intended) to use the LLVM library directly, parse the IR that way, and then iterate over the llvm::Module to create SPIR-V, but this requires shipping yet another vendored LLVM copy, and we all love those don’t we … It is just not acceptable to ship that from a practical point of view. It clocks in at a nice 40 MB+ blob, compared to the 2 MB d3d12.dll binary vkd3d-proton compiles to.

The first iteration of dxil-spirv did indeed target LLVM APIs directly for practical reasons. Not having to spend months figuring out the arcane byte code format certainly helped! Later, I had to make an LLVM API replacement – with some help from RenderDoc’s low-level parser to get started – which does exactly what is required to parse DXIL. Overall, it has been very helpful to keep a native LLVM port alive through API replication, since I can always cross check the parsing results against the real deal with -DDXIL_SPIRV_NATIVE_LLVM=ON. It worked out very well in the end!

Virtualized resource hell

HLSL, and by extension DXIL, does not concern itself one bit with how resources are actually accessed by the implementation. Given similar code, there are about 8 ways to codegen it depending on context obtained from the global or local root signatures. Fun! On top of this, we need vendor specific hackery for perf and feature workarounds.

In this post

As this post is just an introduction to the format, no difficult problems are presented yet. 🙂

Anatomy of a shader blob

In our journey we begin at the output from DXC, a raw shader binary.

// $ dxc -Tps_6_0 -Fo test.dxil test.frag
float4 main(float4 a : A) : SV_Target
{
    return a;
}

When creating a PSO in D3D12, you hand the API this blob of data. This blob is basically the same as DXBC. The only real difference is the DXIL tag. Based on this, we know how to dispatch compilation in vkd3d-shader.

00000000 44 58 42 43 00 00 00 00 00 00 00 00 00 00 00 00 |DXBC............|
.......
00000140 b0 04 00 00 60 00 00 00 2c 01 00 00 44 58 49 4c |....`...,...DXIL|
....

For development, it is then useful to extract the raw LLVM IR part. We can do this with dxil-extract from dxil-spirv:

$ dxil-extract --output test.bc test.dxil # Extract raw LLVM bytecode
$ llvm-dis test.bc # Disassemble to LLVM assembly
$ cat test.ll
; ModuleID = 'test.bc'
source_filename = "test.bc"
target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

define void @main() {
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%2 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 2, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 3, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %2)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %3)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %4)
ret void
}

; Function Attrs: nounwind readnone
declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0

; Function Attrs: nounwind
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1

attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.viewIdState = !{!4}
!dx.entryPoints = !{!5}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{[6 x i32] [i32 4, i32 4, i32 1, i32 2, i32 4, i32 8]}
!5 = !{void ()* @main, !"main", !6, null, null}
!6 = !{!7, !11, null}
!7 = !{!8}
!8 = !{i32 0, !"A", i8 9, i8 0, !9, i8 2, i32 1, i8 4, i32 0, i8 0, !10}
!9 = !{i32 0}
!10 = !{i32 3, i32 15}
!11 = !{!12}
!12 = !{i32 0, !"SV_Target", i8 9, i8 16, !9, i8 0, i32 1, i8 4, i32 0, i8 0, !10}

Convenient! This is the same stuff we would see in dxc if we don’t write to a file. The next question is what all of this means. The first entry point into understanding what is going on is DXIL.rst. Unfortunately, this documentation is only good for the bring-up phase of DXIL compilation. The documentation just … stops eventually. At that point we’re on our own, but it’s not like it’s that hard to figure out what is going on. We have the source after all and we can correlate test HLSL with the output.

What is DXIL, really?

At a fundamental level, DXIL is just LLVM 3.7 with some sugar on top. The sugar adds whatever magic normal LLVM code cannot express on its own; LLVM does at least have generic mechanisms to encode things like this. The solutions feel very clunky at times, but it’s clearly a compromise to attempt to shoehorn a shading language into something very generic like LLVM.

Intrinsics

In this simple case, we’re starting to see code like:

declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1
%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)

If we cannot express features in raw LLVM, we use dx.op intrinsics, declared with external linkage. The first argument is a constant int, which encodes the actual opcode to use. The loadInput or storeOutput names are not significant here. There are over 200 opcodes like this, which is not surprising given how much junk we’ve accumulated in shading languages over the years. Some of the opcodes are documented, some are not. *shrug* 🙂
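
On the translation side this means the callee name mostly just tells us the overload type, and the real dispatch happens on that first constant argument. A simplified sketch (hypothetical names):

bool emit_dx_op(Converter &conv, const CallInst *call)
{
   // Operand 0 is always the constant DXIL opcode.
   auto opcode = DXOp(get_constant_u32(call->getOperand(0)));
   switch (opcode)
   {
   case DXOp::LoadInput:
      return emit_load_input(conv, call);
   case DXOp::StoreOutput:
      return emit_store_output(conv, call);
   // ... 200+ other opcodes ...
   default:
      return false;
   }
}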

Scalar code

In SPIR-V, there is explicit support for vectors and matrices, but DXIL is flat and scalar (except when it isn’t, of course). With modern GPU architectures, 32-bit arithmetic is scalar anyways, but 16-bit and 8-bit arithmetic ops are packed operations on all modern GPUs I know of, so it feels a bit weird to go full scalar. I guess backend compilers just need to suck it up and learn the dark art of re-vectorization. We also end up with quite bloated code, since any vector operation has to be unrolled. On the upside, there are fewer cases to implement and test in dxil-spirv. As an example, we have a shader:

Texture2D<float4> T;
SamplerState S;

float4 main(float2 uv : TEXCOORD) : SV_Target
{
    return T.Sample(S, uv);
}

In places where we cannot ignore vectors, DXIL sometimes reaches for structs instead of the far more natural vector type.

// The resource API is quite peculiar, for later ...
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
%2 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 3, i32 0, i32 0, i1 false)

// Load UV, one component at a time
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)

// Sample
%5 = call %dx.types.ResRet.f32 @dx.op.sample.f32(i32 60, %dx.types.Handle %1, %dx.types.Handle %2, float %3, float %4, float undef, float undef, i32 0, i32 0, i32 undef, float undef)

// Extract components one by one ...
%6 = extractvalue %dx.types.ResRet.f32 %5, 0
%7 = extractvalue %dx.types.ResRet.f32 %5, 1
%8 = extractvalue %dx.types.ResRet.f32 %5, 2
%9 = extractvalue %dx.types.ResRet.f32 %5, 3
// There is actually a 5th member! If that member is statically used
// the sample opcode is actually a sparse residency query.
// Isn't this fun? Welcome to my personal hell :3

// Store result one by one ...
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %6)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %7)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %8)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %9)

With a simple spot check, we get binary sizes:

  • DXC (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug): 2063 bytes
  • FXC (-Tps_5_0 test.frag -Qstrip_reflect -Qstrip_debug): 268 bytes
  • DXC SPIR-V (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug -spirv): 744 bytes

Fun … DXBC is very compact, at least it has that going for it.

Metadata

In SPIR-V, we have execution modes, decoration and various instructions to encode metadata. These instructions are instructions like any other opcode, but LLVM has a separate system that lives on the side, the metadata nodes. In the IR assembly, we see a spider web of weird data structures.

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.viewIdState = !{!10}
!dx.entryPoints = !{!11}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{!5, null, null, !8}
!5 = !{!6}
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}
!7 = !{i32 0, i32 9}
!8 = !{!9}
!9 = !{i32 0, %struct.SamplerState* undef, !"", i32 4, i32 3, i32 1, i32 0, null}
!10 = !{[4 x i32] [i32 2, i32 4, i32 15, i32 15]}
!11 = !{void ()* @main, !"main", !12, !4, null}
!12 = !{!13, !17, null}
!13 = !{!14}
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}
!15 = !{i32 0}
!16 = !{i32 3, i32 3}
!17 = !{!18}
!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}
!19 = !{i32 3, i32 15}

What does this even mean? These nodes encode the entry point(s), which resources exist, the stage IO variables, and everything like that. It is at least documented, just awkward to read.

!dx.entryPoints = !{!11}
!11 = !{
   void ()* @main, /* Pointer to function */
   !"main", /* Name of entry point. Not relevant before DXR. */
   !12 /* List of stage IO types */,
   !4 /* Resource lists */,
   null}

!12 = !{
   !13, /* Stage input */
   !17, /* Stage output */
   null /* Patch IO for tessellation */}

/* Encode things like semantic, register offsets, component offset, etc ... */
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}

!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}

!4 = !{
   !5, /* SRVs */
   null, null, /* UAV, CBV */
   !8 /* Samplers */}

/* Encode register space, register #t, component types, etc, etc ... */
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}

The instructions which interact with resources work very differently from SPIR-V. In SPIR-V, resources are represented as plain variables, but it is a bit more indirect in DXIL. For example:

%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)

Here we specify:

  • Refer to stage input by index (directly refer to metadata)
  • Refer to row (if the stage input is an array)
  • Refer to component (must be a compile time constant, handy!)

I kinda like this approach actually. It gives us a specific place to deal with stage IO translation, so we don’t have to analyze and track random memory instructions in LLVM, which would be kinda horrible. Especially builtin semantics can get pretty weird in translation and being able to handle all addressing logic in one go is a life saver. I think this is also done so that we avoid the problem of having to deal with vectors. (But again, as we shall see later, someone missed the memo when implementing DXR.)
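
To make that concrete, a hedged sketch (hypothetical names again) of what translating dx.op.loadInput can boil down to:

spv::Id emit_load_input(Converter &conv, const CallInst *call)
{
   // Operands: 0 = opcode, 1 = input index into the signature metadata,
   // 2 = row, 3 = column (always a compile time constant), 4 = vertex index.
   uint32_t input_index = get_constant_u32(call->getOperand(1));
   spv::Id row = conv.get_id(call->getOperand(2)); // May be dynamic for arrayed IO.
   spv::Id col = conv.get_id(call->getOperand(3)); // Compile time constant.

   spv::Id var = conv.stage_input_variables[input_index]; // Declared up front from the metadata.
   spv::Id ptr = conv.builder.make_access_chain(var, { row, col });
   return conv.builder.load(ptr);
}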

Resources work similarly:

%1 = call %dx.types.Handle @dx.op.createHandle(i32 57,
   i8 0, /* SRV */
   i32 0, /* Index into metadata */
   i32 0, /* Array offset */
   i1 false) /* NonUniformResourceIndex? */

Again, we request a handle where we refer directly to the metadata. In this case, we request SRV with metadata index #0. This isn’t bad. (But again, as we shall see, there is a completely separate system for accessing resources in DXR … ._. Why …)

As a convenient feature, dxil-spirv can emit SPIR-V -> GLSL through SPIRV-Cross’s API, so we can see the resulting codegen, which in this case closely resembles a 1:1 instruction conversion.

#version 460

layout(set = 1, binding = 2) uniform texture2D _8;
layout(set = 3, binding = 4) uniform sampler _11;

layout(location = 0) in vec2 TEXCOORD;
layout(location = 0) out vec4 SV_Target;

void main()
{
  vec4 _31 = texture(sampler2D(_8, _11), vec2(TEXCOORD.x, TEXCOORD.y));
  // le sigh <_<
  SV_Target.x = _31.x;
  SV_Target.y = _31.y;
  SV_Target.z = _31.z;
  SV_Target.w = _31.w;
}

I’ve certainly seen much worse (DXBC cross compilation comes to mind … *shudder*), but this is not the true codegen that vkd3d-proton would use, since we need to transform this into fully bindless code. That’s for later.

Conclusion

This is a very basic view of how the format is put together. Next time, I’ll probably have a deeper look into the raw LLVM IR and how that is parsed. Better bring bleach for your eyes!

Compressed GPU texture formats – a review and compute shader decoders – part 3/3

This is a long overdue blog post. Last year, I began a blog series where I explored the weird and wonderful world of texture compression formats. Unfortunately, I didn’t have time and motivation to finish it at the time, but it’s time I get back to finish these posts.

Up until now, I’ve covered the major families: DXT (BC 1-5), ETC2 and BPTC (BC 6-7) in part 1 and part 2. Complexity has been steadily increasing, but now we’re finally going to tackle the end boss of texture compression, ASTC.

ASTC

ASTC, or Adaptive Scalable Texture Compression is the result of a fever dream which attempts to answer the question of “how complex can we make a texture compression format and still trick IHVs to make it into silicon”. Complexity in codec design generally translates to better quality / bit at the cost of more expensive decode/encode. As hardware designs evolve, we can afford more complex codecs. Finding a sweet spot is incredibly hard, and ASTC cranked it to 11, for better or worse.

It began life on the Mali Midgard GPU series way back in the day, but has since been widely adopted across the mobile GPU ecosystem by all relevant vendors in the Khronos sphere, at least for the core LDR profile.

Profiles you say, in my texture formats? What is this, H.264? Well … we’ll get to that 🙂

Notably, it’s also supported on the Nintendo Switch (Tegra supports Android after all), causing all sorts of fun issues for emulators … There’s actually multiple independent compute shader decoders of ASTC out there in the wild!

ASTC (released in 2012) is still considered the state of the art, and I don’t think anyone else has made a serious attempt at introducing a new hardware accelerated texture compression format since. The state of the art in ASTC compression is still being developed with recent improvements from Pete Harris on astc-encoder, the official reference encoder, and that is to my knowledge the only encoder implementation of ASTC HDR/3D profiles.

Desktop IHVs other than Intel have so far refused to expose support for ASTC, which is annoying, since it seems we’re forever stuck with two different worlds of texture compression, BC variant zoo on desktop and ASTC on mobile.

The compute shader decoder

Last year, around the time I wrote blogs on BPTC, I also implemented an ASTC LDR/HDR decoder that produced bit-exact results compared to the reference implementation and real hardware in my validation suite. Compared to the BC6 and 7 shaders, the shader complexity is ridiculous, and somehow it works. It’s integrated in Granite such that I can load ASTC compressed scenes transparently, which is useful for testing in a “because I can!” sense.

The shader is certainly not optimized for speed. I focused on readability and debuggability, since it was hard enough to implement this as-is. It’s more than fast enough for my purposes at least.

Enter the horror here. Here be dragons.

One format to bind them all

ASTC has a goal of better quality / bit than every existing format at the same time. This is not easy, considering that BC went the route of many different specialized formats designed to solve specific texture compression needs, and you would expect each format to excel at its specific use case. As far as I know, ASTC succeeds here, given a sufficiently sophisticated encoder.

To supplant every format, you’d need to support these features at the very least:

  • R (BC4, EAC), RG (BC5, EAC), RGB (BC1, ETC2), RGBA (BC3, ETC2, BC7)
  • 4bpp (BC1, ETC2 RGB, BC4), 8bpp (BC3, BC5, ETC2 RGBA, BC6, BC7), and even lower (2bpp PVRTC?)
  • LDR vs HDR (BC6)
  • Decent multi-partition support
  • Decorrelated channel support

ASTC goes beyond these requirements in an attempt to future-proof the format. Bit-rate is fine grained, from sub 1bpp up to 8bpp in fine steps. 2D is boring, so ASTC also supports 3D blocks, although adoption of this is non-existent outside Arm as far as I know. For SDF rendering, 3D block compression seems useful in theory at least.

Codec concepts

The ASTC specification in the Khronos Data Format document is 35 pages (!). There’s a lot of ground to cover when implementing this format, but fortunately, the codec features are mostly orthogonal, i.e. the complexity is additive, not exponential, making it somewhat manageable.

Bit-rate scalability

The ASTC format, like every other format we’ve seen up until now, is block based, where each block consumes a fixed number of bits. ASTC chose 128 bits, but bit-rate scalability is achieved through different block sizes. No longer can we rely on nice 4×4 blocks :(. Instead, we have ridiculous things like 8×6 blocks or 5×5 blocks, all the way up to 12×12 blocks. Very little actually changes depending on block size, we just get more weights to decode.

In practice from what I have seen, encoders tend to focus on specific block sizes like 4×4, 6×6 and 8×8. In an ideal world, we’d be able to change the block size dynamically within a texture (think H.264 macro blocks being split into smaller blocks adaptively), but we need random access to stay sane.
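
Since every block is always 128 bits, the bit-rate follows directly from the block dimensions. A quick arithmetic check:

static double astc_bits_per_pixel(unsigned w, unsigned h)
{
   return 128.0 / double(w * h);
}

// 4x4   -> 8.00 bpp
// 6x6   -> 3.56 bpp
// 8x8   -> 2.00 bpp
// 12x12 -> 0.89 bpp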

Good old color endpoint + weight architecture

At a fundamental level, ASTC does not change how texture compression works. There’s still the concept of encoding color endpoints on a block level, and then per-texel weights are applied to interpolate between them. However, as we’ll see, a ton of complexity goes into efficiently encoding these color endpoints and weights.

If you’re familiar with video encoding, there are some fun parallels to draw here, since H.264 and beyond have just piled on more and more complexity on top of the basic motion compensated DCT blocks since the 80s. Texture compression is very similar. Keep piling codec features on top of the same fundamental architecture forever, what could go wrong!

Bits are boring, trits and quints is where it’s at!

Up until now, color endpoints and weights have been encoded with bits, but especially for weights, it’s very hard to achieve fine grained bit rate deltas. What if we could spend a fractional number of bits instead? ASTC achieves this through trinary and quintary (is this a word?) numbers, encoded into a binary representation.

Every encoded weight or endpoint is given N bits as LSBs, and optionally one trit or quint as the MSBs.

The binary encoding works by grouping values together in blocks. For trits, we’re aiming to encode 5 values together into a single number, which is 3^5 = 243 combinations. If we encode this into 8 bits, we’re pretty close to the theoretically optimal encoding.

This is where we start running into sprinkles of incomprehensible code snippets in the spec which is a mix of C and Verilog designed to be fast in HDL, but unreadable nonsense in software. I have no idea what this code is trying to do, so to decode, I just build a LUT which converts 8 bits into 5 trits and call it a day. Quints are similar, where we encode 3 values together, which is 5^3 = 125 combinations, and fits snugly into 7 bits.

As an example, here’s what I ended up with in GLSL:

// Trit-decoding.
// quant.x = number of bits per value
int block = idiv5_floor(index); // Classic cute trick: (v * 0x3334) >> 16
int offset = index - block * 5;
start_bit += block * (5 * quant.x + 8);

int t0_t1_offset = start_bit + (quant.x * 1 + 0);
int t2_t3_offset = start_bit + (quant.x * 2 + 2);
int t4_offset = start_bit + (quant.x * 3 + 4);
int t5_t6_offset = start_bit + (quant.x * 4 + 5);
int t7_offset = start_bit + (quant.x * 5 + 7);

// ;__;
int t = (extract_bits(payload, t0_t1_offset, 2) << 0) |
    (extract_bits(payload, t2_t3_offset, 2) << 2) |
    (extract_bits(payload, t4_offset, 1) << 4) |
    (extract_bits(payload, t5_t6_offset, 2) << 5) |
    (extract_bits(payload, t7_offset, 1) << 7);

// LUT magic
t = int(texelFetch(LUTTritQuintDecode, t).x);
t = (t >> (3 * offset)) & 7;

int m_offset = offset * quant.x;
m_offset += idiv5_ceil(offset * 8); // (1) Explanation below

if (quant.x != 0)
{
    int m = extract_bits(payload, m_offset + start_bit, quant.x);
    ret = (t << quant.x) | m;
}

… and similar garbage code for quints.

Note for (1): The reason the T value is scattered around is a special feature where the bit stream can be terminated early if the block is only partially filled. Every trit value requires less than 8/5 bits to encode, so after a value is emitted, we know that we must have emitted ceil(8 * count / 5) bits to encode the trits in binary.
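
Putting that together, the total size of an integer sequence falls out of the same observation. A small sketch of the bit counting as I understand the encoding (this matches the block * (5 * quant.x + 8) stride above):

// N values, each with 'bits' LSBs, plus packed trits (8 bits per 5 values)
// or quints (7 bits per 3 values), with the last group cut short.
static int trit_sequence_bits(int count, int bits)
{
   return count * bits + (8 * count + 4) / 5;
}

static int quint_sequence_bits(int count, int bits)
{
   return count * bits + (7 * count + 2) / 3;
}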

Complex weight un-quantization

For purposes of orthogonality, it’s generally desirable that one part of the decoding process does not affect other parts of the decoding process. Weights are one such case. The color interpolator shouldn’t have to care how many bits the weights were encoded with, and thus we decode weights to a fixed range. In our case, weights have a range of [0, 64].

This is very similar to BPTC. In BPTC, the codec defines un-quantization LUTs for 2bpp, 3bpp and 4bpp like this:

const int weight_table2[4] = int[](0, 21, 43, 64);
const int weight_table3[8] = int[](0, 9, 18, 27, 37, 46, 55, 64);
const int weight_table4[16] = int[](0, 4, 9, 13, 17, 21, 26, 30, 34, 38, 43, 47, 51, 55, 60, 64);

Why not just bit-replication you ask? Well, dividing by non-POT values after weight scale is a PITA, that’s why, and we cannot bit-replicate trits and quints either way. ASTC un-quantizes in a more roundabout way, but uses the same idea from BPTC where weight range is normalized to [0, 64].

Again, the un-quantization step is gnarly enough that I just made a LUT for that as well, because why not:

int decode_weight(uvec4 payload, int weight_index, ivec4 quant)
{
    int primary_weight = decode_integer_sequence(payload, 0,
        weight_index, quant.xyz);
    // quant.w is offset into unquant LUT.
    primary_weight = int(texelFetch(LUTWeightUnquantize,
        primary_weight + quant.w).x);
    return primary_weight;
}

Color endpoints

Color endpoints are weird in that there are multiple phases to it:

  • Decode N values of integer sequence; which N values to use depends on which partition the texel belongs to
  • Un-quantize N values to 8 bits in range [0, 0xff]
  • Interpret N 8-bit values in some magical way depending on the endpoint format (of which there are 16!)
  • Emit multiple RGBA values in UNORM8 (LDR) or UNORM12 (HDR)

void decode_endpoint(out ivec4 ep0, out ivec4 ep1, out int decode_mode,
    uvec4 payload, int bit_offset, ivec4 quant, int ep_mode,
    int base_endpoint_index, int num_endpoint_bits)
{
    num_endpoint_bits += bit_offset;
    // Need to explicitly terminate bitstream for color end points.
    payload &= build_bitmask(num_endpoint_bits);

    // Could of course use an array,
    // but that doesn't lower nicely to indexed registers on all GPUs.
    int v0, v1, v2, v3, v4, v5, v6, v7;

    // End point mode is designed so that the top 2 bits encode
    // how many value pairs are required.
#define DECODE_EP(i) \
    int(texelFetch(LUTEndpointUnquantize, quant.w + \
        decode_integer_sequence(payload, bit_offset, i + \
        base_endpoint_index, quant.xyz)).x)

    int hi_bits = ep_mode >> 2;
    v0 = DECODE_EP(0); v1 = DECODE_EP(1);
    if (hi_bits >= 1) { v2 = DECODE_EP(2); v3 = DECODE_EP(3); }
    if (hi_bits >= 2) { v4 = DECODE_EP(4); v5 = DECODE_EP(5); }
    if (hi_bits >= 3) { v6 = DECODE_EP(6); v7 = DECODE_EP(7); }

    switch (ep_mode) { ... }
}

The end point modes are defined in a big table in the spec.

Every endpoint mode has a code snippet explaining how to take the n values and turn them into RGBA values, which are to be fed to the weight interpolator stage. As expected, we just need a horrible switch statement which handles every case. :<

Each partition can even select its own encoding format, which is pretty wild. Poor encoder that has to consider all these scenarios …

LDR endpoints

The endpoint decoding process starts off with trivial cases like luminance endpoints.

Mode 0 (direct)

e0 = (v0, v0, v0, 0xFF);
e1 = (v1, v1, v1, 0xFF);

Since we have an explicit decode step here, ASTC also allows us to take advantage of redundancy between the endpoint values, e.g.:

Mode 1 (base + offset)

L0 = (v0 >> 2) | (v1 & 0xC0);
L1 = L0 + (v1 & 0x3F);
if (L1 > 0xFF) { L1 = 0xFF; }
e0 = (L0, L0, L0, 0xFF);
e1 = (L1, L1, L1, 0xFF);

This encoding scheme improves precision when L0 and L1 are sufficiently close, i.e. correlated. BC7 doesn’t really attempt to take advantage of correlation between end points, outside the minimal shared endpoint / subset bits. BC6 does however, through use of delta bits.

One question that immediately pops into my head is how this is supposed to work in practice for trits and quint encoding. Since bits are reinterpreted and shuffled around like this, any un-quantization that’s not just bit replication would probably create unexpected results in the MSBs.

Luminance + Alpha (Mode 4/5)

ASTC is a little awkward in how 2-component textures are encoded. The common case for 2-component textures is normal maps, which we normally encode as RG, ala BC5 or EAC. ASTC only supports luminance alpha, so in order to efficiently encode these formats, we have to pre-swizzle the texture into (R, R, R, G), and apply a (R, W, x, x) swizzle in the VkImageView.
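
For reference, the VkImageView swizzle in question might look something like this (the zero/one choices for the unused components are just my assumption):

VkImageViewCreateInfo view_info = { VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO };
view_info.components.r = VK_COMPONENT_SWIZZLE_R; // X from luminance
view_info.components.g = VK_COMPONENT_SWIZZLE_A; // Y from alpha
view_info.components.b = VK_COMPONENT_SWIZZLE_ZERO;
view_info.components.a = VK_COMPONENT_SWIZZLE_ONE;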

Similar to luminance-only, there is a direct mode, and a correlated mode.

RGB

Other highlights from the LDR formats are base + scale, where two endpoints are encoded as RGB and RGB * A. This seems very nice for luminance gradients, and quite compact! We also have direct modes, with a special feature which takes some ideas from YUV: blue contraction improves color accuracy close to gray. Of course, there’s a base + offset mode that we already saw for luma and luma + alpha formats.

RGBA

RGBA encoding is very similar to RGB, with two tacked on alpha values, nothing particularly interesting here.

HDR endpoint insanity

While the LDR encoding modes are pretty straight forward once you stare at them long enough, HDR is just incomprehensible. Where BC6 was fairly naive w.r.t. endpoints, ASTC is the complete opposite.

Similar to BC6, HDR is implemented in a way where we interpret a floating point format as UNORM. This means we’re interpolating in a pseudo-logarithmic domain, i.e. the perceptual domain.

While BC6 basically directly interpolates in U/SNORM16 with a direct conversion to FP16, ASTC is a little more … particular about it.

When decoding HDR end points, UNORM12 values are generated. It starts innocent enough in mode 2 where 8-bit inputs are just shifted in place:

int y0, y1;
if (v1 >= v0)
{
    y0 = v0 << 4;
    y1 = v1 << 4;
}
else
{
   // Oh hai thar BPTC shared bits +
   // BC1 symmetry exploitation
   y0 = (v1 << 4) + 8;
   y1 = (v0 << 4) - 8;
}

ep0 = ivec4(ivec3(y0), 0x780 /* 1.0 */);
ep1 = ivec4(ivec3(y1), 0x780 /* 1.0 */);

but eventually you just have to give up trying to reason about the spec once you hit the wall of code that greets you for the remaining modes.

This is the point where you yell WTF out loud, cry a little inside, contemplate your life decisions and copy paste the spec.

Partition hell

Like BPTC, ASTC supports partitions. As we’ve seen before, this feature is designed to deal with sharp transitions in the block. BPTC made it pretty simple where it was possible to spend 6 bits to select one of the 64 partition layouts. This works fine for BPTC since it’s locked to 4×4 blocks and using LUTs makes sense.

ASTC does not have fixed block sizes, so what to do? One LUT for every possible combination? No. ASTC went the route of procedurally generated partition assignments using a 10 bit seed. This works for any block size and partition count, which “solves” that problem. Again, the spec has a long, incomprehensible code snippet defining the process.

Screw this. As usual, look-up texture it is. ASTC can support up to 4 partitions which is pretty wild. No idea how useful this is in practice, as we’ll probably end up spending all bits just encoding color endpoints at that rate …

This seems like a nightmare for encoders. Most of the resulting partitions are worthless garbage, and exhaustively testing over 1000 partition combinations per block does not seem very practical …

Mode hell

Speaking of many combinations, there’s a lot of different modes. BC7 has only 8 modes, and BC6H has 14. From what I’ve seen, even BPTC encoders just focus on 2 or 3 modes at most. ASTC has several thousand modes if we follow the same logic! 😀

Most of the mode bits are spent on encoding things like:

  • Quantization mode of weight grid
  • Resolution of weight grid [see below]
  • Decorrelated colors
  • Void-extent [see below]
  • Reserved [see below]

Once we have looked at mode bits, we can compute how many bits are required to encode weights, and based on that result, we also know how many bits are required to encode color endpoints. The color endpoint quantization level is thus implicit (so much fun …). There are also tons of edge cases to account for, and error cases to handle …
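
Conceptually, the budgeting works out to something like this (a simplified sketch with hypothetical helpers, ignoring the various special cases):

int compute_color_endpoint_bits(const BlockMode &mode, int config_bits)
{
   // Whatever the weight grid doesn't consume out of the 128-bit block
   // (minus mode/partition/CEM bits) is left for color endpoints, and that
   // in turn dictates the endpoint quantization level.
   int weight_bits = ise_sequence_bits(mode.num_weights, mode.weight_quant);
   return 128 - config_bits - weight_bits;
}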

Weight grid interpolation

A special ASTC feature is that the weight grid isn’t actually 1:1 with texels like any other format we’ve seen so far. The weight grid can be much lower resolution than the block itself, and the real weights are reconstructed with bi-linear interpolation. This seems quite useful for scenarios where the block encodes a smooth gradient and the weights are highly predictable. Might as well spend more bits on encoding color endpoints instead. The spec outlines a specific fixed-point scheme that must be followed exactly, and finally the code snippets in the spec actually make some sense!
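
This is not the spec’s exact fixed-point scheme (that one must be followed bit-exactly), but a float sketch of the basic idea looks something like this:

int reconstruct_weight(const int *grid, int grid_w, int grid_h,
                       int block_w, int block_h, int x, int y)
{
   // Stretch the low resolution weight grid across the block and sample it
   // bilinearly at the texel position.
   float u = block_w > 1 ? float(x) * float(grid_w - 1) / float(block_w - 1) : 0.0f;
   float v = block_h > 1 ? float(y) * float(grid_h - 1) / float(block_h - 1) : 0.0f;

   int u0 = int(u), v0 = int(v);
   int u1 = u0 + 1 < grid_w ? u0 + 1 : u0;
   int v1 = v0 + 1 < grid_h ? v0 + 1 : v0;
   float fu = u - float(u0), fv = v - float(v0);

   float w0 = float(grid[v0 * grid_w + u0]) * (1.0f - fu) + float(grid[v0 * grid_w + u1]) * fu;
   float w1 = float(grid[v1 * grid_w + u0]) * (1.0f - fu) + float(grid[v1 * grid_w + u1]) * fu;
   return int(w0 * (1.0f - fv) + w1 * fv + 0.5f);
}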

This feature is a must-have for larger block sizes. There just isn’t room for a full weight grid when we start to hit big boy block sizes like 8×8. At this low bit-rate, the only tool we have to encode high frequency features is partitions …

Void extent

Encoding constant high-precision colors can actually be kind of awkward in most texture compression formats with limited endpoint precision, but ASTC has a special mode for it, where exact FP16 values can be encoded if desired. The feature even goes so far as to specify that a region outside the block also has the same color, which theoretically allows a texture filtering unit to short circuit. This doesn’t sound very practical, and I’m not sure if this feature is actually used.

Painful error handling

Another “feature” of ASTC is how errors are handled. Unlike other texture compression formats, there are many encodings which are illegal, and a correct decoder must detect all error scenarios and return the proper error color. This error color is either saturated magenta or NaN (ææææææææ!) depending on the implementation and whether or not HDR is supported.

Inconsistent decode formats

ASTC decoders will either decode to UNORM8 or FP16, depending on the profile supported by the GPU. This is kinda annoying, especially when we consider formats like SRGB8 where apparently the RGB portion is supposed to decode to UNORM8 viewed as SRGB, but alpha is still FP16? (._.) Not even the reference decoder seems to get that right, so I just decode alpha as UNORM8 in that case.

Interpolation and final encode

Finally, a breather. Once color endpoints and weights are fully decoded, we enter interpolation. It’s not all trivial though. That would be boring.

The first step is always to expand the UNORM8 or UNORM12 results into UNORM16. This is context sensitive on things like the endpoint format as well as how we’re encoding the value after interpolation. This final encode is the raw data that is fed into the texture filtering unit.

if (decode_mode == MODE_HDR)
{
    ep0 <<= 4; // Simple expansion 12 -> 16-bit
    ep1 <<= 4;
}
else if (decode_mode == MODE_HDR_LDR_ALPHA)
{
    ep0.rgb <<= 4;
    ep1.rgb <<= 4;
    ep0.a *= 0x101; // Bit replicate UNORM8 to UNORM16,
    ep1.a *= 0x101; // for FP16 conv later
}
else if (DECODE_8BIT)
{
    ep0 = (ep0 << 8) | ivec4(0x80); // Treat the data as 8.8 fixed point.
    ep1 = (ep1 << 8) | ivec4(0x80); // Also bake in 0.5 rounding factor.
}
else
{
    ep0 *= 0x101;
    ep1 *= 0x101;
}

// This is why weights have [0, 64] range and not [0, 63].
ivec4 color = (ep0 * (64 - weight) + ep1 * weight + 32) >> 6;

Now we have a complete UNORM16 color value that needs to be encoded.

Encode to UNORM8 / SRGB8

For 8-bit decode, we make use of the 8.8 fixed point expansion, and just shift the result down. Easy!

 imageStore(OutputImage, coord.xy, uvec4(final_color >> 8));

Encode LDR to FP16

This one is annoying. When interpolating, we did bit replication, and we need to convert the UNORM16 value to FP16. It’s not as simple as just dividing by 0xffff, of course. We explicitly need to treat 0xffff as 1.0, and for other values, we divide by 2^16 with round-to-zero semantics. We have to do this by hand of course. Soft-float is so much fun! <_<

// ASTC has a very peculiar way of converting the decoded result to FP16.
// 0xffff -> 1.0, and for everything else we get
// roundDownQuantizeFP16(vec4(c) / vec4(0x10000)).
ivec4 msb = findMSB(color);
ivec4 shamt = msb;
ivec4 m = ((color << 10) >> shamt) & 0x3ff;
ivec4 e = msb - 1;
uvec4 decoded = m | (e << 10);
uvec4 denorm_decode = color << 8;
decoded = mix(decoded, uvec4(denorm_decode), lessThan(e, ivec4(1)));
decoded = mix(decoded, uvec4(0x3c00), equal(color, ivec4(0xffff)));
return decoded;

Encode HDR to FP16

The intention is that the interpolation is logarithmic, but floats are just piecewise logarithmic. BC6 doesn’t care at all and just reinterprets the resulting UNORM16 interpolation as FP16, but ASTC tries to be a little smarter. The mantissa is tweaked a bit.

// Interpret the value as FP16, but with some extra fixups along the way to make the interpolation more
// logarithmic (apparently). From spec:
ivec4 e = color >> 11;
ivec4 m = color & 0x7ff;
ivec4 mt = 4 * m - 512;
mt = mix(mt, ivec4(3 * m), lessThan(m, ivec4(512)));
mt = mix(mt, ivec4(5 * m - 2048), greaterThanEqual(m, ivec4(1536)));

From what I can tell, there is no signed float support in ASTC, except for the void extent.

Decode format extensions

In Vulkan, there is a weird extension called VK_EXT_astc_decode_mode which allows a VkImageView to select which format an ASTC block should decode to. Why is this even a thing? One quirk of ASTC is that even UNORM blocks full of LDR data must be decoded to bit-exact FP16. This means that if a GPU architecture decodes compressed textures into L1 cache, we would end up consuming a lot more cache than strictly necessary. The whole point of the decode mode extension is to allow us to explicitly decode to a lower precision format to improve cache hit rates, presumably. (That makes me wonder what GPUs actually do in practice …)

In Granite, I setup the decode mode as:

// ...
if (format_is_srgb(create_info.format))
   return true;

if (format_is_compressed_hdr(create_info.format))
{
   auto &features = device->get_device_features();
   if (features.astc_decode_features.decodeModeSharedExponent)
      astc_info.decodeMode = VK_FORMAT_E5B9G9R9_UFLOAT_PACK32; // Nice
   else
      astc_info.decodeMode = VK_FORMAT_R16G16B16A16_SFLOAT;
}
else
{
   astc_info.decodeMode = VK_FORMAT_R8G8B8A8_UNORM;
}
// ...

Fortunately, there are now special SFLOAT format variants for ASTC in the VkFormat enum, which are tied to the ASTC HDR extension, and can be used to notify applications if the blocks likely contain HDR blocks. The awkward part of ASTC is that even when using the UNORM format, you can still decode HDR blocks, and it just happens to work on GPUs which support ASTC HDR. Profiles are fun, right? ._. With decode mode, we can at least clamp down on these shenanigans. HDR blocks fail to decode in UNORM8 mode.

Conclusion

So far, this is the endgame of texture compression, and I really don’t want to look at this again. 🙂 It’s been a year since I looked into this last time, so I might have missed some details.

There’s tons of smart ideas in this format, but I also feel it’s too clever for its own good. Writing a decoder was difficult enough, and I don’t even want to think about how painful an encoder would be. A format is only as good as its encoding ecosystem after all.

YUV sampling in Vulkan – a niche and complicated feature – VK_KHR_sampler_ycbcr_conversion

Sometimes I like to implement Vulkan extensions in Granite just because. This time around we’re looking at YUV.

For anyone who has worked with multimedia before, YUV (or YCbCr as it’s also referred to) is a nightmare. The act of converting YUV to RGB is very simple as it’s just a color space transform with a 3×3 matrix, but YUV means many things. There is no end to how overloaded YUV can be, and making sure you know exactly which YUV flavor you’re dealing with can be quite tricky. From a system integration point of view, YUV is a serious pain.

When dealing with video content in Vulkan, you will have to consume YUV somehow.

YUV has some characteristics which all make sense for video compression, but complicate things.

  • Planar: Each color component is often packed in different 2D images.
  • Luma/Chroma: Y refers to luminance, UV (or CbCr) refers to chrominance (color). This is critical for video compression.
  • Downsampled chroma. Human eyes have higher resolution for luminance vs color, so less bandwidth on color is an easy way to save space.

There is a lot of variance here between different YUV formats, as there is no obvious standard for these kinds of things.

  • How many planes? 2 or 3 are common (Y, U, V) vs (Y, packed UV). Packed single plane can be used in some other obscure scenarios. (YUYV and variations on that).
  • For packed representations, which component comes first?
  • How many bits per component? 8 is the most common, but 10-bit YUV content can be found in the wild.
  • How much is chroma downsampled? 2x horizontally and vertically is by far the most common, often referred to as YUV420p (the naming convention in YUV makes no sense, don’t try to find any).
  • Where is the texel center for the chroma samples? Common values are co-sited at every other luma sample, or in the mid-point between groups of 2×2 luma samples.
  • What is the exact color space conversion matrix from YUV to RGB?
  • How is chroma reconstructed to full resolution?

Dealing with YUV without fancy extensions

To render a YUV frame in RGB is not necessarily a difficult task, but depending on how many formats you need to deal with, shader variants can quickly get out of hand. You’ll typically do something like this:

layout(binding = 0) uniform sampler2D TexLuma;
layout(binding = 1) uniform sampler2D TexCb;
layout(binding = 2) uniform sampler2D TexCr;

layout(location = 0) out vec3 FragColor;
layout(location = 0) in vec2 TexCoord;

const mat3 yuv_to_rgb_matrix = mat3(....);

void main()
{
    float Luma = textureLod(TexLuma, TexCoord, 0.0).x;
    float Cb = textureLod(TexCb, TexCoord, 0.0).x; // For mid-point chroma
    float Cr = textureLod(TexCr, TexCoord, 0.0).x;
    vec3 yuv = vec3(Luma, Cb, Cr);
    // Possibly expand range here if using TV YUV range and not PC YUV range.
    yuv = rescale_yuv(yuv);
    FragColor = yuv_to_rgb_matrix * yuv;
}

This is fine, but there is some motivation to do better here. Since watching video is one of the most common operations a GPU does, it would be nice if we could do this more efficiently, especially on low-power mobile devices with batteries attached to them. It would also be really nice if we could sample a video texture in our shader and have the GPU just “deal with it”.

The planar texture formats

VK_KHR_sampler_ycbcr_conversion adds a lot of new texture formats to Vulkan. For YUV420p, we’re going to look at this format, VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM.

In planar formats, we see that 3 color components are spread out across 3 planes, with a “420” here to mean that the second and third components are half resolution. “GBR” is a bit strange, but G refers to Y, and B and R refer to Cb and Cr respectively. Green is the strongest contributor to the luma Y channel after all, so I guess it kinda makes sense like that …

With planar texture formats, the texture unit of the GPU is able to fetch all 3 planes in a single texturing operation, meaning we put a lot less stress on the GPU texturing unit. If we think of rendering video, this is a large part of the work done on the GPU. For lower-end mobile devices rendering 4K video without melting, this is actually really important.

The image aspects

In order to be able to refer to each plane separately when copying data in and out of the texture, we can use VK_IMAGE_ASPECT_PLANE_{0,1,2}_BIT. When copying to and from plane 1 and 2 in YUV420p, the resolutions of those planes are halved. VK_IMAGE_ASPECT_COLOR_BIT refers to the whole “GBR” as a whole, and it’s only useful when sampling the image.
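
As a sketch, uploading YUV420p data then looks like one copy per plane, where the chroma planes use halved extents (buffer offsets and variable names here are hypothetical):

VkBufferImageCopy regions[3] = {};
regions[0].imageSubresource = { VK_IMAGE_ASPECT_PLANE_0_BIT, 0, 0, 1 };
regions[0].imageExtent = { width, height, 1 };
regions[1].bufferOffset = luma_size;
regions[1].imageSubresource = { VK_IMAGE_ASPECT_PLANE_1_BIT, 0, 0, 1 };
regions[1].imageExtent = { width / 2, height / 2, 1 };
regions[2].bufferOffset = luma_size + chroma_size;
regions[2].imageSubresource = { VK_IMAGE_ASPECT_PLANE_2_BIT, 0, 0, 1 };
regions[2].imageExtent = { width / 2, height / 2, 1 };

vkCmdCopyBufferToImage(cmd, staging_buffer, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 3, regions);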

Disjoint image allocation

Since we have 3 image planes, it is not unreasonable to want to combine three separate images into one planar image. E.g. we can allocate three R8_UNORM images and treat them as the planes of a planar image later. This is supported, but we’ll have to create and allocate images in a particular way.

First, we create the R8_UNORM images with VK_IMAGE_CREATE_ALIAS_BIT. This means that we will alias the image meaningfully with another image, even when using OPTIMAL image tiling, and image layouts are shared across aliases! We’ll use this to alias with a plane inside the planar texture later.

Be aware of alignment requirements. I’ve found that planar textures can need larger alignment than the single-component textures might need, so either using standalone allocations per plane, or bumping alignment to something like 64k, works around that.

When we create the planar texture, we specify DISJOINT_BIT and ALIAS_BIT. For disjoint, it means we need to query allocation requirements and bind memory separately for each plane. To do this, we need to use vkGetImageMemoryRequirements2 and vkBindImageMemory2. Here, we just bind the same memory we used for our separate textures.
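
A rough sketch of what that query + bind sequence looks like (names for the memory objects and offsets are hypothetical):

const VkImageAspectFlagBits aspects[3] = {
   VK_IMAGE_ASPECT_PLANE_0_BIT, VK_IMAGE_ASPECT_PLANE_1_BIT, VK_IMAGE_ASPECT_PLANE_2_BIT };

VkBindImagePlaneMemoryInfo plane_infos[3] = {};
VkBindImageMemoryInfo binds[3] = {};

for (unsigned i = 0; i < 3; i++)
{
   VkImagePlaneMemoryRequirementsInfo plane_req =
       { VK_STRUCTURE_TYPE_IMAGE_PLANE_MEMORY_REQUIREMENTS_INFO };
   plane_req.planeAspect = aspects[i];
   VkImageMemoryRequirementsInfo2 req_info =
       { VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2, &plane_req };
   req_info.image = planar_image;
   VkMemoryRequirements2 reqs = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
   vkGetImageMemoryRequirements2(device, &req_info, &reqs);
   // Validate size/alignment against the standalone R8_UNORM allocations here.

   plane_infos[i].sType = VK_STRUCTURE_TYPE_BIND_IMAGE_PLANE_MEMORY_INFO;
   plane_infos[i].planeAspect = aspects[i];
   binds[i].sType = VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO;
   binds[i].pNext = &plane_infos[i];
   binds[i].image = planar_image;
   binds[i].memory = plane_memory[i]; // Same memory backing the separate R8 images.
   binds[i].memoryOffset = plane_offset[i];
}

vkBindImageMemory2(device, 3, binds);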

Setting up a sampler conversion object

The vkCreateSamplerYcbcrConversion function creates an object which encodes exactly how we will convert the planar components into RGB values.

VkSamplerYcbcrConversionCreateInfo info = { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_CREATE_INFO };

// Which 3x3 YUV to RGB matrix is used?
// 601 is generally used for SD content.
// 709 for HD content.
// 2020 for UHD content.
// Can also use IDENTITY which lets you sample the raw YUV and
// do the conversion in shader code.
// At least you don't have to hit the texture unit 3 times.
info.ycbcrModel = VK_SAMPLER_YCBCR_MODEL_CONVERSION_YCBCR_709;

// TV (NARROW) or PC (FULL) range for YUV?
// Usually, JPEG uses full range and broadcast content is narrow.
// If using narrow, the YUV components need to be
// rescaled before it can be converted.
info.ycbcrRange = VK_SAMPLER_YCBCR_RANGE_ITU_NARROW;

// Deal with order of components.
info.components = {
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
};

// With NEAREST, chroma is duplicated to a 2x2 block for YUV420p.
// In fancy video players, you might even get bicubic/sinc
// interpolation filters for chroma because why not ...
info.chromaFilter = VK_FILTER_LINEAR;

// COSITED or MIDPOINT? I think normal YUV420p content is MIDPOINT,
// but not quite sure ...
info.xChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;
info.yChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;

// Not sure what this is for.
info.forceExplicitReconstruction = VK_FALSE;

// For YUV420p.
info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;

VkSamplerYcbcrConversion conversion;
table->vkCreateSamplerYcbcrConversionKHR(device, &info, nullptr,
                                         &conversion);

Passing along to VkImageView and VkSampler

Both an image view and a sampler which are used with KHR_sampler_ycbcr_conversion must have this sampler conversion object passed in via pNext. This is because the information is split up between the two objects. Planar and swizzle information is likely part of the image view, while filtering and chroma siting is likely part of the sampler object.
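
A sketch of that plumbing, chaining the conversion object into both objects (planar_image is hypothetical):

VkSamplerYcbcrConversionInfo conv_info =
   { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_INFO };
conv_info.conversion = conversion;

VkImageViewCreateInfo view_info = { VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO };
view_info.pNext = &conv_info;
view_info.image = planar_image;
view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
view_info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;
view_info.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

VkSamplerCreateInfo sampler_info = { VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO };
sampler_info.pNext = &conv_info;
sampler_info.magFilter = VK_FILTER_LINEAR;
sampler_info.minFilter = VK_FILTER_LINEAR;
// Create the view and sampler as usual from here.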

Immutable sampler

Some information is also relevant for the shader compiler, so for YCbCr sampling there are some restrictions. You have to use a COMBINED_IMAGE_SAMPLER and the sampler must be immutable in the descriptor set layout. Since that immutable sampler contains the conversion object, this allows the shader compiler to see how to complete the transform where hardware support stops. This might just be performing the YUV to RGB conversion, or it lets the shader implement almost everything on its own.

In Granite, I pass along immutable sampler information in a rather crude way where I look at the declared name of a combined image sampler in the shader.

Putting it all together I can write shader code like this in Granite and sample YUV420p content directly:

#version 450

// "LinearYUV420P" guides reflection to use an immutable sampler.
// Not ideal since I need to know which YUV conversion to use ...
// But it works for now.
layout(set = 0, binding = 0) uniform sampler2D uSamplerLinearYUV420P;
layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;

void main()
{
    // It's just like sampling a regular texture :D
    FragColor = textureLod(uSamplerLinearYUV420P, vUV, 0.0);
}

Getting fancy with hardware video decoding

The natural way to extend this is to use hardware video decoding, export each plane as external memory, and import it in Vulkan. DISJOINT allocation comes in handy here since we don’t have to rely on a video decoder exporting a full planar image as one large unit. I haven’t looked into this yet though, but I suggest looking at MPV here. There seems to be some support for this kind of zero-copy flow with Vulkan.

Was all of this really all that useful?

Not really, but I have some ideas to add a “video texture material” in the future to Granite, so it might come in handy later. There might be some use for it in graphics programming where bandwidth is saved by encoding post-process render targets to YUV. That way, sampling the YUV post targets shouldn’t require a lot of manual hackery.

Emulating a fake retro GPU in Vulkan compute

RetroWarp – a fake retro GPU

Lately, I’ve been fiddling with a side project which ended up being quite interesting.

The goal of this side project was to prototype out a system which implements software rasterization in compute shaders using modern GPU features like Vulkan 1.1 subgroups and async compute to improve performance. Then, I wanted to apply this to emulation of retro GPUs, in particular, a more low-level approach.

I believe compute shader rasterization has some key advantages in the domain of low-level emulation. Chasing full accuracy means not being able to make use of the key fixed function aspects of the graphics pipeline on Vulkan GPUs, and most of the reasons to use fragment shaders go away. With compute, there is no fixed function baggage to grapple with, but it does mean a lot of the things we take for granted must be implemented in software.

I didn’t aim to dive straight into a concrete retro GPU with this prototype, but rather I designed a straightforward rasterizer which supports the basic features found in these kinds of old GPUs. My approach here is that this could be used as a starting point when going further and emulating a real legacy chip.

The repository is available on Github: https://github.com/Themaister/RetroWarp

The high level system

Rather than reiterate everything presented in the slide deck, I will link to it directly instead. However, it’s useful to discuss the system at a very high level.

Presentation slides

I presented this work at the Khronos Munich meetup in October 2019. You can find the presentation slides here.

Implementing Low Level GPU – Hans-Kristian – Munich 2019

Tile-based

Going tile-based is practically a necessity for any compute shader implementation. I implemented 8×8 and 16×16 tile modes. Smaller tiles are more suitable for lower resolutions like 320×240 and 640×480, while 16×16 was useful for 720p and up.

If you are familiar with tile deferred shading and friends, you know where I’m going with this.

Coarse-then-fine binning

To be tile-based, we need to assign primitives to tiles. This is quite an intensive process when the tile size is small, since the time scales with resolution times the number of primitives. To optimize this, I first bin at a low resolution (e.g. 64×64 tiles) and then refine the binning at full tile resolution.

Bitmap instead of primitive list

A common way to bin is to build an array of the primitives which affect a tile, so the renderer can just loop through that array of indices on a per-tile basis. This is problematic in the worst case: if a lot of primitives end up filling the entire screen, there simply might not be memory available to store all these lists. We cannot allocate memory arbitrarily on the GPU, and we really want to do tile binning on the GPU, not the CPU.

Instead, each tile gets a fixed array of u32 bitmasks, where 1 bit is used per primitive, and bit-scan loops are used instead of index lists. To speed up the case where there are many gaps in the bitmap (and there certainly are), there is a hierarchy: the coarse level sets a bit if any primitive in a group of 32 primitives is binned to the tile. If we find a bit set there, we descend to the fine level and loop some more. A more concrete example:

  • Maximum of 16k primitives (arbitrary limit we choose)
  • Bitmap is u32[16k / 32] to contain all state
  • Coarse bitmap is u32[16k / (32 * 32)]

If more than 16k primitives are used, we can just split this into multiple render passes. These old GPUs don’t exactly support indirect rendering, so that’s not really a problem.
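
As a rough CPU-side illustration of the hierarchy walk described above (the real thing is a compute shader using findLSB, and the names here are made up), the per-tile loop can look something like this:

#include <cstdint>

// coarse[16]: 1 bit per group of 32 primitives, fine[512]: 1 bit per primitive.
// __builtin_ctz is the GCC/Clang counterpart of GLSL's findLSB.
template <typename Callback>
static void scan_tile(const uint32_t coarse[16], const uint32_t fine[512], Callback &&shade_primitive)
{
    for (uint32_t c = 0; c < 16; c++)
    {
        uint32_t coarse_mask = coarse[c];
        while (coarse_mask)
        {
            uint32_t group = c * 32u + uint32_t(__builtin_ctz(coarse_mask));
            coarse_mask &= coarse_mask - 1u; // clear lowest set bit

            uint32_t fine_mask = fine[group];
            while (fine_mask)
            {
                uint32_t prim = group * 32u + uint32_t(__builtin_ctz(fine_mask));
                fine_mask &= fine_mask - 1u;
                shade_primitive(prim);
            }
        }
    }
}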

Ubershader vs. split shader architecture

After binning, we could simply implement an ubershader of doom where we deal with every possible render state the GPU supports in one 5000+ line monster. This is very problematic for performance, particularly for register pressure and occupancy on the shader cores.

One of my key deviations from the norm here was to implement a split shader architecture. Rather than relying on ubershaders, the depth/blending stage can consume pre-shaded tiles which contain the color/depth/coverage information necessary to run those stages.

To create the color/depth/coverage information, we generate indirect dispatches and use specialization constants to carve out only the code paths we need to run. This keeps register pressure down. The key downside of this approach is that we need to allocate memory and bandwidth for the pre-shaded data.
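
To make the specialization constant part concrete, here is a hypothetical snippet (not RetroWarp’s actual code) which bakes a single variant index into a compute pipeline; shading_module, pipeline_layout and device are assumed to already exist:

// Bake a "render state variant" into a compute pipeline via a specialization
// constant so dead code paths are compiled out.
uint32_t variant_index = 3; // e.g. which combiner/texture mode this dispatch needs

VkSpecializationMapEntry entry = {};
entry.constantID = 0;               // layout(constant_id = 0) in the shader
entry.offset = 0;
entry.size = sizeof(uint32_t);

VkSpecializationInfo spec = {};
spec.mapEntryCount = 1;
spec.pMapEntries = &entry;
spec.dataSize = sizeof(uint32_t);
spec.pData = &variant_index;

VkComputePipelineCreateInfo pipeline_info = { VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO };
pipeline_info.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
pipeline_info.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;
pipeline_info.stage.module = shading_module;
pipeline_info.stage.pName = "main";
pipeline_info.stage.pSpecializationInfo = &spec;
pipeline_info.layout = pipeline_layout;

VkPipeline variant_pipeline;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipeline_info, nullptr, &variant_pipeline);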

Async compute + graphics queue compute

Depth/blending is the only stage which needs to happen in order. We can happily do binning and shading ahead of time and feed the results to that final stage. I run everything except depth/blending in the async compute queue, while depth/blending runs in the graphics queue.

Performance uplifts

See presentation slides for more detailed results.

Subgroup optimizations gave a solid ~20% uplift on AMD/NV/Intel. Async compute gave further 10-20% uplift on AMD/NV. Overall, I’m quite happy with this.

Recreating the tone filter from NieR:Automata

Audio tech in games is rarely particularly interesting. Sadly, most of it seems to have become a commodity over the last couple of decades. Once in a while, though, a game ships some very clever audio tech, and in this case it was NieR:Automata’s tone filter which caught my attention. The blog post explaining this tech is found here. If you haven’t played the game, I highly recommend reading the post and watching the videos to understand what it’s doing; that saves me a lot of explanation. Their blog is very sparse on technical implementation details, but I wanted to try recreating it, as there was just enough high-level detail in there to get me started.

Outside graphics, I’ve done a fair bit of audio programming in the past. It’s been too long since I did any significant audio DSP programming.

In short, the goal of the filter is to attempt to turn normal high-fidelity soundtracks into something with an 8-bit feel on-demand. Being able to introduce dynamic aspects to the music with a pure filter is interesting.

The tone filter as explained

The goal of the filter is to extract musical notes, and emphasize them. By having a few notes playing with a classic waveform like square or saw waves, we can recreate a retro 8-bit feel.

The blog describes 48 filters, spanning 4 octaves. Each octave in the (western music) scale is divided into 12 tones. What I deduced from this is that the 48 filters should be 48 very sharp bandpass filters.
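
Concretely, assuming standard 12-tone equal temperament and that the lowest band sits at A3 = 220 Hz (which matches the visualization described further down), the 48 center frequencies would be:

#include <cmath>

// 12-TET: each semitone step multiplies the frequency by 2^(1/12).
// band 0 -> 220 Hz (A3), band 12 -> 440 Hz (A4), band 47 -> ~3322 Hz.
static double band_frequency(int band) // band in [0, 48)
{
    return 220.0 * std::pow(2.0, band / 12.0);
}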

Whiteboard description from blog

A theoretically perfect bandpass filter will output a pure sine wave if the tone exists, and nothing if it doesn’t.

The distortion is tough to do well, and I spent a lot of time fiddling with this. We want to try making the sine wave become something like a square wave. I tried many variants but I ended up with something very simple like

https://www.desmos.com/calculator/qvymx5qf8t

If you’ve done HDR tone-mapping, this formula will look very familiar to you. Sometimes cross-domain knowledge comes in handy.

After distorting, there is the levelling stage, which I spent a lot of time fiddling with. Basically, if we run this system as is, we end up with a ton of noise in the signal with all the chromatic tones playing over each other. Needless to say, this sounded absolutely terrible.

There is little to no detail on how this should be implemented, so I tried a very crude model which seems to work reasonably well. Basically, each of the 48 channels has a running power estimate which is computed right after its filter. We compare that against the running power estimate of the unfiltered audio, which gives us an idea of how much of the audio energy is concentrated in each individual tone. If the energy is low enough, the gain falls off in a power-of-4 fashion to avoid leaking in audio from completely unrelated tones. Percussion will generally have energy across almost the entire spectrum, and we need to filter that out as well as we can. If the ratio is too high, we just cap it. This is the fiddly part: there are a lot of magic constants to tweak to get it sounding pleasing.
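
As a very rough sketch of how I read that description (the constants and helper names here are invented; the real tuning work is the hard part):

// One channel's levelling step. band_power is this channel's running power
// estimate, total_power is the running power estimate of the unfiltered audio.
static float level_channel(float filtered, float &band_power, float total_power)
{
    const float power_lerp = 0.0005f; // smoothing factor for the power estimate (made up)
    const float threshold = 0.1f;     // how concentrated the energy must be (made up)

    // Running power estimate, computed right after the bandpass filter.
    band_power += power_lerp * (filtered * filtered - band_power);

    // How much of the total energy is concentrated in this one tone?
    float ratio = band_power / (total_power + 1e-12f);

    float gain;
    if (ratio < threshold)
    {
        // Below threshold: fall off in a power-of-4 fashion so unrelated tones
        // and broadband percussion get muted hard.
        float t = ratio / threshold;
        gain = t * t * t * t;
    }
    else
    {
        // Above threshold: just cap it.
        gain = 1.0f;
    }

    return filtered * gain;
}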

At the end we mix our mono output signal back into the original audio, and when it works well, it gives a nice harmonic edge. I believe it’s reasonably close to the original game now. Here’s an example from the NieR:Automata OST. I’m visualizing all the 48 bands, and the colors are:

  • Blue: Below threshold, severely muted.
  • Green: Over threshold, should be heard.
  • Red: Saturated, hitting max threshold.

The tones in one octave form one row, and the four octaves are stacked on top of each other. The top-left starts at A3 – 220 Hz. If you know some music theory, maybe you can figure out which key the tune is in? 🙂

Implementation

First we mix stereo down to mono. This is kind of trivial. Just take the average of left and right channels.

Ultra-sharp bandpass, resonance filters

I went through a few failed iterations to get here. My first attempt was to do all of this in the frequency domain with FFTs, but that plan failed very quickly. What I ended up with was a simple biquad resonance filter. In DSP parlance this filter has two zeroes and two poles, i.e. both an FIR (finite impulse response) and an IIR (infinite impulse response) part. In code, it looks something like this:

y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]

In the Z-domain, this looks like

H(z) = (n0 + n1 * z^-1 + n2 * z^-2) / (1 + d0 * z^-1 + d1 * z^-2)

The zeroes and poles are simply the roots of the numerator and denominator polynomials respectively. Basically, I designed the filter by deliberately placing zeroes and poles in the Z-domain, multiplying the factors out, and converting the result back to the normal FIR + IIR form above.

I placed zeroes at DC (w = 0) and the Nyquist frequency (w = pi). The poles were placed very close to the unit circle at w = +/- 2 * pi * freq / samplerate, with radius 0.9999. Then I evaluated the filter response at the resonance frequency and scaled the FIR portion of the filter so that we get approximately unit gain at the resonance frequency.
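
As a worked sketch of that design process (my own reconstruction, not the exact Granite code): the zeroes give the FIR part 1 - z^-2, the conjugate pole pair gives the IIR part 1 - 2*r*cos(w)*z^-1 + r^2*z^-2, and the FIR coefficients are then scaled by the inverse of the measured gain at the resonance frequency.

#include <cmath>
#include <complex>

struct Biquad { float n0, n1, n2, d0, d1; };

static Biquad design_resonator(double freq, double samplerate)
{
    const double pi = 3.14159265358979323846;
    const double r = 0.9999; // pole radius, just inside the unit circle
    const double w = 2.0 * pi * freq / samplerate;

    // Zeroes at z = +1 (DC) and z = -1 (Nyquist): numerator 1 - z^-2.
    double n0 = 1.0, n1 = 0.0, n2 = -1.0;
    // Conjugate poles at r * e^(+/- jw): denominator 1 - 2r*cos(w)*z^-1 + r^2*z^-2.
    double d0 = -2.0 * r * std::cos(w);
    double d1 = r * r;

    // Evaluate |H| at the resonance frequency and normalize the FIR part for unit gain.
    std::complex<double> z1 = std::exp(std::complex<double>(0.0, -w)); // z^-1
    std::complex<double> z2 = z1 * z1;                                 // z^-2
    std::complex<double> H = (n0 + n1 * z1 + n2 * z2) / (1.0 + d0 * z1 + d1 * z2);
    double inv_gain = 1.0 / std::abs(H);

    return { float(n0 * inv_gain), float(n1 * inv_gain), float(n2 * inv_gain),
             float(d0), float(d1) };
}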

Basically, the frequency response at the resonance frequency will be very close to dividing by zero, so near-infinite response, but not quite. Numerical stability can easily throw off the filter if we’re not careful. This is one of the major issues with IIR filters in general. I initially tried an 8-pole filter but it was impossible to get this stable even in FP64, so I just gave up and tried a simple biquad instead which worked just fine.

SIMD

Since we’re doing 48 IIR filters in parallel, this was a perfect case for SIMD optimizations. I made everything into a struct-of-arrays (SoA) form, and just vectorized the scalar IIR filter directly. Normally, small IIR filters are tricky to vectorize since there are inter-dependencies between samples, but not here.
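
A simplified sketch of what the SSE path can look like with an SoA layout (illustrative names; the real filter also folds in the distortion and levelling per lane):

#include <xmmintrin.h>

// 48 channels processed 4 at a time. The input history is shared since every
// band filters the same mono signal; the output history is per channel.
struct ToneFilterSoA
{
    alignas(16) float n0[48], n1[48], n2[48];
    alignas(16) float d0[48], d1[48];
    alignas(16) float y1[48], y2[48];
    float x1, x2;
};

static void filter_step(ToneFilterSoA &f, float input, float *out48)
{
    __m128 x = _mm_set1_ps(input);
    __m128 x1 = _mm_set1_ps(f.x1);
    __m128 x2 = _mm_set1_ps(f.x2);

    for (int c = 0; c < 48; c += 4)
    {
        __m128 y = _mm_mul_ps(_mm_load_ps(f.n0 + c), x);
        y = _mm_add_ps(y, _mm_mul_ps(_mm_load_ps(f.n1 + c), x1));
        y = _mm_add_ps(y, _mm_mul_ps(_mm_load_ps(f.n2 + c), x2));
        y = _mm_sub_ps(y, _mm_mul_ps(_mm_load_ps(f.d0 + c), _mm_load_ps(f.y1 + c)));
        y = _mm_sub_ps(y, _mm_mul_ps(_mm_load_ps(f.d1 + c), _mm_load_ps(f.y2 + c)));

        _mm_store_ps(f.y2 + c, _mm_load_ps(f.y1 + c));
        _mm_store_ps(f.y1 + c, y);
        _mm_storeu_ps(out48 + c, y);
    }

    f.x2 = f.x1;
    f.x1 = input;
}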

I optimized the filter in NEON, SSE1 and AVX and got a very nice performance boost, more on that later.

This would have been a great case for ISPC, but I considered it a too large dependency for something simple like this.

Distortion

The distortion function must be nicely SIMD-friendly and not too expensive. I landed on the classic x/(1+abs(x)) operator. The divide can be done fast with reciprocal estimations. We didn’t need high accuracy.

Slight low-pass

After we have mixed together the 48 distorted streams, we run a weak low-pass filter on top to remove some of the harshest harmonics. This is done with a trivial 1-pole IIR filter.

Performance

I tested performance on a Ryzen 7 1800X @ 3.8 GHz as well as a high-end phone (Galaxy S9, Exynos variant) to measure NEON performance. The benchmark pushes 20 million white noise samples through the filter and times the result. The test doesn’t take that long, so these numbers should be considered absolute peak performance without any thermal/power throttling. The results below are given in samples processed per second. Normal audio clips are 44.1 kHz, so 0.0441 M/s corresponds to 1x real-time performance. The C++ version is written without any intrinsics and compiled with -O3 -ffast-math. The SIMD versions are written with the standard intrinsics.

Chip                     | C++     | SSE     | AVX      | AArch64 NEON
Samsung Exynos 9810      | 1.8 M/s | -       | -        | 6.8 M/s
Ryzen 7 1800X @ 3.8 GHz  | 3.6 M/s | 7.1 M/s | 11.5 M/s | -

Basically, we’re at 100x real-time performance here, even on a mobile CPU, nice. I’m surprised how close SSE and NEON ended up. I didn’t see any auto-vectorization kick in for the C++ variant, so I wonder what is going on with just ~2x scaling in SSE; I got similar results on MSVC and GCC for what it’s worth. NEON gets close to the ideal 4x scaling though, nice.

This uses quite a bit of processing power, so we can’t run wild with effects like this right now. But I look forward to being able to take advantage of systems like this for even more precise operations in the future.


The original implementation probably does more work on more gimped CPU hardware (AMD Jaguar consoles), but 100x real-time is pretty fast in my book. 😉

Source

The implementation is out there, but don’t expect to be able to use it as-is. This is a hobby project after all.

https://github.com/Themaister/Granite/blob/master/audio/dsp/tone_filter.cpp

https://github.com/Themaister/Granite/blob/master/tests/tone_filter_bench.cpp

VST plugin

I implemented a simple VST plugin with builds for Windows and macOS, both 64-bit. Feel free to try it out. It’s ultra bare bones.

Windows 64-bit

macOS 64-bit

An unusual recompiler experiment – MIPS to LLVM IR – Part 4

This is the final part in my blog series on my adventure recompiling MIPS to LLVM IR. If you’re new to this series you can read:

  • Part 1 – Explains the goals, MIPS ELF format, etc.
  • Part 2 – Explains how to generate code using the LLVM APIs.
  • Part 3 – Explains how we recompile MIPS code to LLVM.

In this post, I’m going to test performance on some applications and get a feel for how the various different codegen options we have can affect performance.

Due to our lack of extensive syscall support, there is a limit to what we can test without going out of our way to port stuff, so I’ll be focusing on some tests which don’t require much beyond simple stdio.

STB PNG read + write

This test is based on the STB library’s PNG implementation. The test will load a PNG file from disk and compress it again.

#include "stb_image.h"
#include "stb_image_write.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
	for (int i = 0; i < 20; i++)
	{
		int x, y, chan;
		stbi_uc *data = stbi_load("/tmp/test.png", &x, &y, &chan, 4);
		if (!data)
			return 1;

		if (!stbi_write_png("/tmp/output.png", x, y, 4, data, 4 * x))
			return 2;
	}

	return 0;
}

Native performance (32-bit)

To make the comparison a bit more fair, we’ll compile this using 32-bit x86 targeting i486 with -O3.

Time: 20.6 s

For reference, the target matters quite a lot: on x86-64, we get 15.38 s. I will use the i486 result as the baseline, since both i486 and MIPS I are ancient ISAs from roughly the same era of computing.

MIPS on-demand JIT (baseline)

To begin our benchmarking, we’re going to test fully on-line JIT-ing. This is what needs to happen at least the first time we’re running an application. The results here will be affected by a balance between optimization in run-time and having to do less work while JIT-ing.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf

In this first test, we will apply the following options:

  • Function calls will link directly to their targets. This increases JIT workload significantly, since we need to JIT all possible call paths to be able to link code directly. However, runtime should be faster once we have JIT-ed.
  • No IR optimizations are enabled.

Time: 71.43 s

The up-front cost of JIT-ing is quite high, but overall, ~3.5x slower than native isn’t terrible. Let’s see if we can do better.

On-demand JIT with optimizations

The JIT-er can perform some in-place optimizations. We’ll see if it helps here.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --optimize

Time: 75.43 s

It seems like the optimization passes made it a bit slower.

On-demand JIT with thunked calls

Rather than aggressively JIT-ing all possible call paths, we can try JIT-ing only the functions we actually call. All direct calls are translated into indirect calls, and every call requires a lookup. This should reduce the JIT overhead a lot, but potentially give worse runtime performance. Without --optimize, this should be the most efficient option if we want to minimize JIT overhead.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls

Time: 70.0 s

Interesting. This might be the sweet spot for on-demand JIT.

On-demand JIT with thunked all the things

We can also use thunked load-store operations, rather than emit IR code to translate addresses for every memory operation. This should reduce code bloat, and might help when we’re doing on-demand JIT.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls --thunk-load-store

Time: 90.4 s

Ouch.

Assuming well behaved calls and returns?

Unfortunately, my assumption that GCC would generate expected code for returns was wrong, or my implementation was buggy. I couldn’t get it to work for non-trivial test cases, so I can’t test performance here.

Ahead-of-time recompiled IR

Now we’re starting to get into interesting territory which I haven’t seen much of in the past.

We need to run the application here, dump IR code to disk, and recompile into a dynamic library. For this case, we should be able to generate pretty good code and avoid any run-time recompilation. This is the ideal scenario if we can deduce all known call-paths.

Let’s start with the optimal case. No thunking.

~/git/jitter/cmake-build-release/mipsvm --dump-llvm /tmp/llvm stb-test.elf

This dumps out a whopping 68 MB of LLVM IR. Time to turn this ball of mud into a dynamic library.

#!/bin/bash

OUTPUT="$1"
LLDIR="$2"

echo "== Linking LLVM IR =="
llvm-link -o __llvm_linked.bc "$LLDIR"/*.ll
echo "== Optimizing offline LLVM IR =="
opt -O3 -o __llvm_opt.bc __llvm_linked.bc -disable-inlining
#cp __llvm_linked.bc __llvm_opt.bc
echo "== Compiling static library to object file with LLC =="
llc -relocation-model=pic -filetype obj -o __linked.o __llvm_opt.bc -O3
echo "== Linking shared library =="
gcc -o "$OUTPUT" -shared __linked.o

rm -f __llvm_linked.bc
rm -f __llvm_opt.bc
rm -f __linked.o

This operation takes 53.4 seconds and generates a 3.4 MB binary. The original binary is 792 kB due to the statically linked glibc.

This should yield us the absolute best performance we can hope for. So let’s try it.

~/git/jitter/cmake-build-release/mipsvm test --static-lib ~/git/jitter/test_linked.so --static-symbols /tmp/llvm/addr.bin

Time: 46.2 s

That’s a pretty great improvement compared to the 20.6 s for a native 32-bit/i486 binary. It also starts up basically instantly since no recompilation is necessary; we only need to recompile if we find new code we haven’t looked at yet.

From here, we can get a better idea of the runtime cost by removing optimizations and adding thunking. First, let’s see if opt -O3 helps at all by skipping it and going straight to llc.

Building the native binary just takes 24 seconds now.

Time: 57.0 s

opt -O3 is clearly doing something useful. Let’s add the optimization back and use thunked calls instead. With thunked calls, the IR dump is just 21 MB; clearly we were JIT-ing a lot of useless code we never had to actually run.

The binary is 984 kB now.

~/git/jitter/cmake-build-release/mipsvm stb-test.elf --static-lib ~/git/jitter/test_unlinked.so --static-symbols /tmp/llvm-nolink/addr.bin

Time: 60.0 s

The win from linking directly is nothing to sneeze at: 46.2 s vs. 60.0 s. Let’s also thunk the load-store operations and see where we land.

Time: 92.7 s

Yup. Clearly, we can get a ~2x speedup just by inlining the load-store code and calling functions directly rather than relying on thunking. In the best case we’re not that far from a 2x differential against native code!

Best of both worlds codegen?

If we’re dumping code with thunking to disk to improve JIT overhead, we can imagine that we can optimize the thunked calls to direct code off-line if we write our own LLVM optimization pass. Just an idea …

We need to go one level deeper

Let’s try something silly. We will recompile a cross-compiled cross-compiler. What? Well, I’ve built SPIRV-Cross for MIPS big-endian this time around. This was actually a useful exercise, because I can now verify that SPIRV-Cross works for both MIPS and big-endian at the same time 😛 Nice. SPIRV-Cross uses C++11, a fair bit of STL and exceptions. Can we host libstdc++ properly? Let’s see. With a statically linked libstdc++, the binary is 3.2 MB.

Let’s dump some LLVM …

~/git/jitter/cmake-build-release/mipsvm spirv-cross --dump-llvm /tmp/llvm-spirv -- spirv-cross /tmp/test.spv

I tried running this with a test shader from the SPIRV-Cross repository. It makes use of FP64, so we can check that doubles are supported. Here we also see that we can pass arguments through to argv. We end up with 168 MB of LLVM IR, which sure is intense. Let’s recompile it. This process takes over 2 minutes and creates an 8.6 MB binary.

~/git/jitter/cmake-build-release/mipsvm spirv-cross --static-lib ~/git/jitter/test_spirv.so --static-symbols /tmp/llvm-spirv/addr.bin -- spirv-cross /tmp/test.spv

Now it runs almost instantly and correctly.

Conclusion

This has been a fun little side project. The overhead of JIT-ing is rather high, as we would expect, but the peak runtime performance is surprisingly good: ahead-of-time recompiled code lands in the 2-3x-slower-than-native ballpark. I haven’t tested a lot of code out there, but STB’s PNG implementation, SPIRV-Cross, glibc and libstdc++ should represent reasonably complex and varied code.

Release

I’ve released this project on GitHub under an MIT license. Please read the disclaimer.

An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Strap in, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top-level code fundamentally needs to translate the program counter (PC for short) to an executable function pointer. We can use either a hash map (large address space) or a flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.
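
A minimal sketch of that top-level dispatch could look like this (the hash map variant; recompile_function_at is an assumed helper standing in for the actual analysis + LLVM codegen):

#include <cstdint>
#include <unordered_map>

struct MIPSState;
using JITFunction = void (*)(MIPSState *);

// Assumed helper: analyzes the code starting at pc, builds and JITs an LLVM
// module, and returns the entry point.
JITFunction recompile_function_at(uint32_t pc);

static std::unordered_map<uint32_t, JITFunction> block_cache;

static JITFunction resolve(uint32_t pc)
{
    auto itr = block_cache.find(pc);
    if (itr != block_cache.end())
        return itr->second;

    // First time we see this PC: recompile and remember the entry point.
    JITFunction func = recompile_function_at(pc);
    block_cache.emplace(pc, func);
    return func;
}

static void run(MIPSState &state, uint32_t entry_pc)
{
    resolve(entry_pc)(&state);
}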

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to function pointer and do some analysis of code blocks we don’t have a function pointer for yet. Any i-cache invalidations will clear out the relevant function pointers which triggers hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.

For our purpose, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code should be rarer and rarer the more modern hardware we’re dealing with.

Recompiling a function

So, given a PC to execute, we do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks are in LLVM.

Basic block

Basic blocks are represented as a starting PC and an ending PC, between which the execution flow is linear. A basic block ends when we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions; normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.

Branch delay slots

A very important part of MIPS is the branch delay slot. It is a rather unique design aspect of the architecture, and it is considered a design flaw today because it hard-codes an assumption that helped one very specific micro-architecture. Exposing micro-architectural details like this should be considered bad taste. Whether a branch is conditional or not, the instruction right after the branch is always executed. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}
00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is: what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that, then congratulations, have a cookie. This is explicitly banned in the MIPS ISA, because it is nonsensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken: if any exception or hardware interrupt happens in the middle of such a sequence, it is impossible to recover from it. MIPS interrupt handlers already have to deal with delay slots and fix up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is itself a branch instruction, we know it cannot legally be taken, and branch instructions have no side effects (except for jal/jalr, but those always take the branch, so back-to-back is illegal anyway), so we just skip it.

Load delay slots

MIPS I also has a delay slot for loads: we cannot use the target register in the instruction immediately following the load. However, for recompilation purposes we can ignore this. While clever code might try to abuse the fact that the target register of a load hasn’t been updated yet, this is also unsafe in the real world: if an interrupt triggers in the middle of such a sequence, the register will be updated anyway, breaking the clever code’s assumption. Thus, we simply ignore the existence of the load delay slot, because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are

  • BEQ
  • BNE
  • BLEZ
  • BLTZ
  • BGEZ
  • BGTZ
  • BC1T (floating point compare)
  • BC1F (floating point compare)

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

  • Branch to basic block
  • Tail call to an unrelated function

If we mistakenly treat a J as a branch to a basic block where it should have been a tail call, we will end up inlining huge functions into our own where we should have just “called” them instead. Too much inlining bloats the JIT and makes recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}
00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with is: if the J target is also referenced as a basic block by conditional branches elsewhere, we assume the J is a branch to a basic block, à la if/else or switch blocks. If not, we assume it’s a tail call.

There are other static branches we can find in MIPS. The conditional branches become static branches when the $0 register is used. This seems mostly useful for position-independent code, since we can branch to an address relative to the PC rather than to the fixed address a J encodes. We should try to detect these “static” branches as well; there is no need to analyze code which can never be executed.

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR
uint32_t target_pc = registers[instr.register];
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    return_prediction_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

  • Returning from a function, almost always using “jr $ra”.
  • Jump tables
  • Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}
 00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, recompile each block directly, and link them together with BranchInst. This way of analyzing and recompiling is fairly ISA-agnostic, and it’s not that hard to swap MIPS for something else once the basic structure is in place. The recompiler core which sets all of this up is actually completely MIPS-agnostic; it only asks, “given a start PC, where does the basic block end, and what kind of basic block is it?”

Register allocation and branches

While we’re working on registers, we ideally want the MIPS registers to be reflected by our native hardware registers. It’s obvious a 1:1 mapping is not possible. MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not really have to care about register allocation when using LLVM. We just need to make sure we emit CreateStore/CreateLoad as rarely as possible, and LLVM should take care of the rest. Within a basic block this is very easy, since the control flow is linear and we always know which SSA value a register currently refers to. I implemented a simple RegisterTracker class which translates registers to SSA values: if we haven’t used a register yet, load it from memory; if we modify a register, just replace the SSA value and remember that we eventually have to write it back to memory (the register bank) later.
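
A skeleton of what such a tracker can look like (my own sketch, not the actual class; the bank helpers would emit the CreateGEP/CreateLoad/CreateStore calls against the register bank in the state struct):

#include <cstdint>
#include <llvm/IR/IRBuilder.h>

class RegisterTracker
{
public:
    RegisterTracker(llvm::IRBuilder<> &builder_, llvm::Value *register_bank_)
        : builder(builder_), register_bank(register_bank_) {}

    llvm::Value *read(unsigned reg)
    {
        // First use in this block: load the register from the bank in memory.
        if (!values[reg])
            values[reg] = load_from_bank(reg);
        return values[reg];
    }

    void write(unsigned reg, llvm::Value *value)
    {
        // Just replace the SSA value and remember to write it back later.
        values[reg] = value;
        dirty |= 1ull << reg;
    }

    void flush()
    {
        // Called at the end of a basic block or before a call instruction.
        for (unsigned reg = 0; reg < 64; reg++)
            if (dirty & (1ull << reg))
                store_to_bank(reg, values[reg]);
        dirty = 0;
    }

private:
    llvm::IRBuilder<> &builder;
    llvm::Value *register_bank;
    llvm::Value *values[64] = {};
    uint64_t dirty = 0;

    llvm::Value *load_from_bank(unsigned reg);            // CreateGEP + CreateLoad
    void store_to_bank(unsigned reg, llvm::Value *value); // CreateGEP + CreateStore
};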

The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we will have a real hard time handling register dirty tracking. If we have say a basic block C which can be entered from basic block A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every possible register all predecessors of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb. Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block or before a function call, all dirty registers are flushed to memory. On entry of a basic block, we will have to load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register starved architecture to begin with and moving data to and from L1 cache is very common.

Call instructions

MIPS has several ways of “calling” functions. These functions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

  • JAL
  • JALR
  • BLTZAL
  • BGEZAL
  • J (deduced tail call)
  • BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPUs branch predictor. The return instruction is one of the best prediction methods we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch unless we were the subject of a stack smash attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we set a reasonable upper limit. In the worst case, where the return stack somehow grows too large anyway, we have a nuclear option in our arsenal: longjmp! The top-level code uses setjmp, and if we ever reach a hopeless situation, longjmp unwinds the entire stack at once and we re-enter with the new PC. However, this is pretty terrible for performance, since all return instructions will then fail to optimize to a simple return, and we might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.
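
For illustration, the link-instruction side of this might look roughly like the following (hypothetical names and plumbing, not the actual implementation; the JR side was sketched earlier):

#include <csetjmp>
#include <cstdint>
#include <stack>

struct VMState
{
    std::stack<uint32_t> return_prediction_stack;
    jmp_buf top_level; // set up with setjmp() in the top-level run loop
};

enum { MaxPredictionDepth = 1024 };

// Called whenever we execute a link instruction (JAL, JALR, BLTZAL, BGEZAL).
static void predict_return(VMState &state, uint32_t link_address)
{
    // The link instruction writes $pc + 8 to $ra, so link_address is where a
    // well-behaved "jr $ra" will return to later.
    if (state.return_prediction_stack.size() >= MaxPredictionDepth)
        std::longjmp(state.top_level, 1); // nuclear option: unwind everything and re-enter
    state.return_prediction_stack.push(link_address);
}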

To thunk or not to thunk

While indirect calls must have a lookup to determine what we are actually calling in runtime, it’s possible for direct call instructions to directly call another function in LLVM. In this case, we avoid any runtime lookups. We risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline anyways later 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}
004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

It doesn’t get much simpler than this test. main calls foo through a tail call; let’s see what the LLVM IR looks like completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state which we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple other hidden registers), 64 FP registers (32 + a couple extra), and finally, the page table. We inline it in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo()

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call to our externally defined function in the VM to check if return stack prediction worked. Either we tail call or just simply return. $ra in this case will be 0, and we just end execution here.

The registers are dumped at the end to read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.

Loads and stores

Let’s have a look what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}
004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to stack, but we’ll see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

Good news is that this is very straightforward code, so the CPU should churn through most of it like butter, unless we miss in L1 on the page table reads. It will be interesting to benchmark this code against natively compiled C code later.

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}
00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop
define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here. What happens if we run the LLVM IR through opt first, though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq 

I suspect some of the issues are related to the lack of noalias attributes. LLVM might think that stores to virtual memory can alias the register bank, and therefore generates very conservative code. Something to have a look at later.
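
If it turns out to matter, the first thing to try would probably be marking the VM state pointer as noalias on the generated functions. This is only a sketch of where such an experiment would start, not something the recompiler does today; per-instruction alias metadata might be what is actually needed here.

#include "llvm/IR/Function.h"

// Untested sketch: argument 0 of every generated function is the pointer to the
// VM state (registers + page table), so promise LLVM it doesn't alias anything else.
static void mark_vm_state_noalias(llvm::Function *function)
{
    function->addParamAttr(0, llvm::Attribute::NoAlias);
}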

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.
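
In code-gen terms the idea looks something like the sketch below. None of these names come from the actual implementation; it's just the decision we'd make when emitting a jr instruction.

// Sketch only: 'options', 'REG_RA' and the emit helpers are made-up names.
void Recompiler::emit_jump_register(llvm::IRBuilder<> &builder, unsigned rs)
{
    if (options.assume_well_behaved_calls && rs == REG_RA)
    {
        // jr $ra in a well-behaved program simply means "return to caller",
        // so we can skip the __recompiler_jump_indirect thunk entirely.
        builder.CreateRetVoid();
        return;
    }

    // Otherwise, fall back to the generic indirect jump path.
    emit_indirect_jump_thunk(builder, rs);
}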

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move values to and from the integer registers, and fiddle with the control register, which among other things governs rounding modes. I haven't bothered emulating correct rounding modes for now, but the control register is also used for floating point conditional branches, so it needs to be emulated at least. Just like SSE, the actual data type of an FP register can vary depending on the instruction, so we will need a lot of bitcasts; fortunately, bitcasting is a native construct in LLVM.
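
To make that concrete, reading an FP register could look roughly like this. The GEP indices assume the second [64 x i32] array in the VM state struct holds the CP1 registers; that layout is an assumption made for the sake of the example.

// Sketch: load $f<freg> as raw bits from the register bank and bitcast to float.
llvm::Value *load_fp32(llvm::IRBuilder<> &builder, llvm::Value *vm_state, unsigned freg)
{
    llvm::LLVMContext &context = builder.getContext();
    llvm::Type *int32 = llvm::Type::getInt32Ty(context);

    llvm::Value *indices[] = {
        llvm::ConstantInt::get(int32, 0),
        llvm::ConstantInt::get(int32, 1),    // assumed: FP register bank in the VM state
        llvm::ConstantInt::get(int32, freg),
    };
    llvm::Value *ptr = builder.CreateInBoundsGEP(vm_state, indices, "fregptr");
    llvm::Value *raw = builder.CreateLoad(ptr, "freg_raw");
    return builder.CreateBitCast(raw, llvm::Type::getFloatTy(context), "freg_f32");
}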

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}
004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things and keeping the register bank as integers at all times. Otherwise, the code-gen to LLVM IR is reasonably straightforward. In the generated x86-64 we end up seeing the magic instructions we want to see, buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...

Syscalls

To end this post on a less intense note, let's write hello world without any libc setup and run it in our VM. Unfortunately, we have to write this in assembly, as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra

Running this through the recompiler yields the following module:

; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we create a hook into the VM host and handle them there. The syscall number goes in the $v0 register, and the arguments follow the normal calling convention. To implement the write syscall, we just copy the data out of our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}
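
The host-side hook that __recompiler_syscall lands in can then dispatch on $v0, roughly like below. syscall_exit is a hypothetical helper here, and the numbers are the Linux o32 ABI ones (4000 + n).

#include <cerrno>

// Sketch of the dispatcher; only what this post needs.
void MIPS::syscall()
{
	switch (scalar_registers[REG_V0])
	{
	case 4001: syscall_exit(); break;   // hypothetical helper
	case 4004: syscall_write(); break;
	default:
		// Unknown syscall: fail in the same style as syscall_write() above.
		scalar_registers[REG_V0] = -1;
		scalar_registers[REG_A3] = ENOSYS;
		break;
	}
}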

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.
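
To give an idea of the shape of it, brk() ends up being a small wrapper over the page allocator. brk_end and reserve_pages are hypothetical names in this sketch; the real implementation has more edge cases to worry about.

// Sketch of brk(); glibc essentially only needs "grow, or tell me you can't".
void MIPS::syscall_brk()
{
	Address new_brk = scalar_registers[REG_A0];

	if (new_brk && new_brk >= brk_end &&
	    addr_space.reserve_pages(brk_end, new_brk - brk_end))
	{
		brk_end = new_brk;
	}

	// Return the (possibly unchanged) break. glibc compares this against the
	// request, and once it stops growing, it falls back to mmap().
	scalar_registers[REG_V0] = brk_end;
	scalar_registers[REG_A3] = 0;
}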

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we take MIPS code and turn it into running code through LLVM. In the next post we will bring up a fully fledged C application and even a C++ application, and do some benchmarking to compare native applications against recompiled MIPS applications. Stay tuned!

An unusual recompiler experiment – MIPS to LLVM IR – Part 2

In part 1 we parsed a MIPS ELF file and set up a stack; now we should consider how to translate the MIPS machine code to LLVM IR. Before we get there, we need to learn how to use the LLVM APIs and how code-gen to LLVM IR works.

If you know LLVM well already, this post is not for you, wait for part 3 when we start considering the design of the recompiler. If you don’t, this post should give you all understanding needed to be able to generate some code in LLVM yourself and understand the implementation.

This turned out rather dry, but hey, there's a limit to how exciting writing C++ to generate IR can be. 😀 At least it's very educational.

Which LLVM version to target?

The LLVM JIT APIs are unfortunately rather unstable. The APIs appear to break between major versions. A common workaround to this seems to be to simply stick to a version and depend on that unless you’re ready to update. I got it working on the current LLVM 7, and also the upcoming LLVM 8, which seems to shuffle a lot of internal APIs around. For the sample here, we’re going to consider LLVM 7, which is the latest stable major version. The API changes as far as I know are only related to the JIT runtime, not codegen, so we can hide away the ugliness behind a clean interface at least.

LLVM IR basics

To get us started, let’s make a trivial function, which adds two arguments together and returns it. In C, we would represent this as:

int LLVMAdd(int a, int b);

To do this, we need to create an LLVMContext, an LLVM module (sort of like a translation unit), and add a function to that module. We'll need to create some types as well.

// We'll need a ton of headers for later, might as well just add it now.
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/Core.h"
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"

void test()
{
  llvm::LLVMContext context;
  llvm::Module module("mymodule", context);

  llvm::Type *int32 = llvm::Type::getInt32Ty(context);
  llvm::Type *result_type = int32;
  llvm::Type *argument_types[] = { int32, int32 };
  bool vararg = false;

  llvm::FunctionType *function_type =
    llvm::FunctionType::get(result_type, argument_types, vararg);
  llvm::Function *function =
    llvm::Function::Create(function_type,
                           llvm::Function::ExternalLinkage,
                           "LLVMAdd", module);

  llvm::errs() << module << "\n";
}
; ModuleID = 'mymodule'
source_filename = "mymodule"

declare i32 @LLVMAdd(i32, i32)

For now, we have simply declared the function. If we do nothing more, this is equivalent to using “extern” in C. We can call it, but the linker needs to find the function elsewhere. Let’s expand it with some code.

Basic blocks

Basic blocks are the foundation of LLVM: a function consists of any number of basic blocks, and each basic block represents linear control flow within the function. At the end of a basic block, the block must be terminated. Here, and only here, can we do things like:

  • Branch to another basic block, conditionally or not
  • Return
  • And some other special cases

For our first test, we can get away with one basic block.
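
For reference, the common terminators correspond to IRBuilder calls like the ones below; builder, the blocks and value come from surrounding code, and each block gets exactly one terminator.

builder.CreateBr(next_block);                          // unconditional branch
builder.CreateCondBr(condition, yes_block, no_block);  // conditional branch
builder.CreateRet(value);                              // return a value
builder.CreateRetVoid();                               // return from a void function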

Static single assignment (SSA)

Another core principle of LLVM is the use of static single assignment. In short, it means that once a value is assigned, it can never change. If a variable is modified multiple times, each assignment must produce a new SSA value. You will see this when we start generating some code. This has various benefits for optimization and code-gen later in the pipeline, but that's LLVM's problem.
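
From the IRBuilder's point of view it looks like this, with a, b and c being llvm::Value pointers from elsewhere:

llvm::Value *v = builder.CreateAdd(a, b, "v");     // the first version of "v"
v = builder.CreateMul(v, c, "v_times_c");          // "modifying" v just creates a new SSA value

With that in mind, let's implement the add function.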

llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
llvm::IRBuilder<> builder(bb);
llvm::Value *arg0 = &function->arg_begin()[0];
llvm::Value *arg1 = &function->arg_begin()[1];
llvm::Value *value = builder.CreateAdd(arg0, arg1, "added_value");
builder.CreateRet(value);

Here we add a basic block to our function; the first basic block is where the function starts executing. The IRBuilder is the class which lets you build actual code. We create an add instruction and return its result, which terminates the basic block, and we're done. Not too bad! If you print the module using the existing line of code, you'll now see:

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32) {
entry:
  %added_value = add i32 %0, %1
  ret i32 %added_value
}

Compiling IR to native code

We could just dump this directly into the JIT engine now, but let’s try compiling this as a dynamic library, using the offline tools. Copy this IR code into a file, e.g. test.ll. Let’s compile this to native code.

$ llc -relocation-model=pic -filetype obj -o test.o test.ll -O2
$ ld -shared -o test.so test.o
$ objdump -d test.so

0000000000001000 <LLVMAdd>:
    1000:       8d 04 37                lea    (%rdi,%rsi,1),%eax
    1003:       c3                      retq

LLVM can usually target other architectures with one binary, so try adding -mtriple=mips to llc, disassemble with the MIPS binutils, and you’ll get:

00000000 <LLVMAdd>:
   0:   03e00008        jr      ra
   4:   00851021        addu    v0,a0,a1

This code might confuse you: is it returning before the add? We'll get to that later; it's one of the weirder parts of MIPS, the branch delay slot.

Branches

We’ll need some control flow. Let’s try implementing this function:

int foo(int a, int b, int c)
{
    if (a > 0)
        return b * c;
    else
        return b + c;
}
using namespace llvm;
// ...
// Remember to update this to take 3 arguments!
Type *argument_types[] = { int32, int32, int32 };
// ...
BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *cmp = builder.CreateICmpSGT(
    arg0,
    ConstantInt::get(Type::getInt32Ty(context), 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *mul = builder.CreateMul(arg1, arg2, "multiplied");
builder.CreateRet(mul);

builder.SetInsertPoint(false_path);
Value *add = builder.CreateAdd(arg1, arg2, "added");
builder.CreateRet(add);

Notice that the compare instruction is "signed greater than". LLVM integer types carry no sign; signedness is only expressed in the operations where it actually matters, such as comparisons, division and right shifts. To branch, we create a branch instruction, which terminates the first basic block.
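
A few of the sign-aware builder calls for reference, with a and b as i32 llvm::Value placeholders:

Value *sgt  = builder.CreateICmpSGT(a, b, "sgt");   // signed greater-than
Value *ugt  = builder.CreateICmpUGT(a, b, "ugt");   // unsigned greater-than
Value *sdiv = builder.CreateSDiv(a, b, "sdiv");     // signed division, vs. CreateUDiv
Value *asr  = builder.CreateAShr(a, b, "asr");      // arithmetic shift, vs. CreateLShr

Printing the module for our branchy function gives: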

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %multiplied = mul i32 %1, %2
  ret i32 %multiplied

false:                                            ; preds = %entry
  %added = add i32 %1, %2
  ret i32 %added
}

And this becomes in x86-64 (sorry for the AT&T syntax):

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 06                   jle    a <LLVMAdd+0xa>
   4:   0f af f2                imul   %edx,%esi
   7:   89 f0                   mov    %esi,%eax
   9:   c3                      retq   
   a:   01 d6                   add    %edx,%esi
   c:   89 f0                   mov    %esi,%eax
   e:   c3                      retq 

Memory and pointers

We can’t always use pure SSA values; sometimes we have to deal with good old memory. LLVM has an “interesting” approach to pointers, but let’s start easy. Loading and storing memory in LLVM is usually done in two stages:

  • Compute address with “get element pointer”
  • Load/store with CreateLoad/CreateStore

The backend is responsible for lowering this into whatever addressing modes the CPU supports. Get element pointer (GEP) translates very naturally to C-like pointer arithmetic: rather than computing raw byte offsets and all that gunk, we address array elements and struct members directly.
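
For a plain array access, computing &base[index] and loading it looks like this, with base and index being llvm::Value pointers from surrounding code:

Value *indices[] = { index };
Value *elem_ptr = builder.CreateInBoundsGEP(base, indices, "elem_ptr");
Value *elem = builder.CreateLoad(elem_ptr, "elem");

Now, let's try an example with structs.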

struct Foo
{
    int a;
    float b;
    int c;
    float d;
};

int LLVMAdd(Foo *foo)
{
    return foo[1].c += 10;
}
Type *int32 = Type::getInt32Ty(context);
Type *float32 = Type::getFloatTy(context);

Type *struct_types[] = {
    int32,
    float32,
    int32,
    float32
};

Type *struct_type = StructType::get(context, struct_types, false);
Type *p_struct_type = PointerType::get(struct_type, 0);

Type *result_type = int32;
Type *argument_types[] = { p_struct_type };

// ...
// ...

BasicBlock *bb = BasicBlock::Create(context, "entry", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *indices[] = {
    ConstantInt::get(int32, 1), // pointer -> array index
    ConstantInt::get(int32, 2), // struct -> member index
};
Value *ptr = builder.CreateInBoundsGEP(arg0, indices, "ptr");
Value *loaded = builder.CreateLoad(ptr, "loaded");
Value *added = builder.CreateAdd(
    loaded,
    ConstantInt::get(int32, 10), "added");
builder.CreateStore(added, ptr);
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd({ i32, float, i32, float }*) {
entry:
  %ptr = getelementptr inbounds { i32, float, i32, float }, { i32, float, i32, float }* %0, i32 1, i32 2
  %loaded = load i32, i32* %ptr
  %added = add i32 %loaded, 10
  store i32 %added, i32* %ptr
  ret i32 %added
}
0000000000000000 <LLVMAdd>:
   0:   8b 47 18                mov    0x18(%rdi),%eax
   3:   83 c0 0a                add    $0xa,%eax
   6:   89 47 18                mov    %eax,0x18(%rdi)
   9:   c3                      retq   

The mystical PHI node

I can’t really talk about SSA without mentioning PHI nodes. SSA values do not live in memory, so what happens when a value can arrive from more than one control flow path? Let’s look at an example function:

int foo(int a, int b, int c)
{
    int v;
    if (a > 0)
        v = b + c;
    else
        v = b - c;

    return v + a;
}

If we translate this directly to IR, we’ll see that v suddenly has two versions, but SSA demands a *single* assignment per value. How is this resolved? With a PHI node. The purpose of this node is to pull together multiple versions of a value into one, and for each basic block that can branch into yours, you specify which value to take. This is rather bizarre, since it’s kind of like an inverse goto.

Of course, we could sidestep all of this by allocating v on the stack, storing to it, and letting LLVM optimize it. For completeness, let’s try that first.

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *v = builder.CreateAlloca(int32, ConstantInt::get(int32, 1),
    "alloca");
Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
builder.CreateStore(bc_add, v);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
builder.CreateStore(bc_sub, v);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
Value *loaded = builder.CreateLoad(v, "loaded");
Value *added = builder.CreateAdd(loaded, arg0, "added");
builder.CreateRet(added);
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %alloca = alloca i32
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  store i32 %b_plus_c, i32* %alloca
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  store i32 %b_minus_c, i32* %alloca
  br label %merge

merge:                                            ; preds = %false, %true
  %loaded = load i32, i32* %alloca
  %added = add i32 %loaded, %0
  ret i32 %added
}

If we just give this to llc we see that it is sadly using stack space.

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 0d                   jle    11 <LLVMAdd+0x11>
   4:   01 d6                   add    %edx,%esi
   6:   89 74 24 fc             mov    %esi,-0x4(%rsp)
   a:   03 7c 24 fc             add    -0x4(%rsp),%edi
   e:   89 f8                   mov    %edi,%eax
  10:   c3                      retq   
  11:   29 d6                   sub    %edx,%esi
  13:   89 74 24 fc             mov    %esi,-0x4(%rsp)
  17:   03 7c 24 fc             add    -0x4(%rsp),%edi
  1b:   89 f8                   mov    %edi,%eax
  1d:   c3                      retq

What’s going on? We even passed an optimization level to llc! Well, we didn’t optimize the IR first. This problem is extremely common when compiling C and C++ code of course, so we can optimize in the IR domain. The classic fix here is the “mem2reg” optimization, which promotes stack slots to SSA values, something the backend has a much easier time with.
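
The recompiler will eventually want to run passes like this in-process instead of shelling out to opt. A minimal sketch using the legacy pass manager could look like the following; note that the header declaring createPromoteMemoryToRegisterPass has moved around between LLVM versions.

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Utils.h"   // createPromoteMemoryToRegisterPass in LLVM 7

// Untested sketch: promote allocas to SSA values for every function in the module.
static void run_mem2reg(llvm::Module &module)
{
    llvm::legacy::FunctionPassManager fpm(&module);
    fpm.add(llvm::createPromoteMemoryToRegisterPass());
    fpm.doInitialization();
    for (llvm::Function &function : module)
        fpm.run(function);
    fpm.doFinalization();
}

For this example, though, the offline tools get the job done: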

$ opt -O3 -o test.bc test.ll
$ llvm-dis test.bc -o optimized.ll
$ cat optimized.ll

; ModuleID = 'test.bc'
source_filename = "mymodule"

; Function Attrs: norecurse nounwind readnone
define i32 @LLVMAdd(i32, i32, i32) local_unnamed_addr #0 {
entry:
  %cmp = icmp sgt i32 %0, 0
  %3 = sub i32 0, %2
  %alloca.0.p = select i1 %cmp, i32 %2, i32 %3
  %alloca.0 = add i32 %1, %0
  %added = add i32 %alloca.0, %alloca.0.p
  ret i32 %added
}

attributes #0 = { norecurse nounwind readnone }

0000000000000000 <LLVMAdd>:
   0:   89 d1                   mov    %edx,%ecx
   2:   f7 d9                   neg    %ecx
   4:   85 ff                   test   %edi,%edi
   6:   0f 4f ca                cmovg  %edx,%ecx
   9:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   c:   01 c8                   add    %ecx,%eax
   e:   c3                      retq 

That’s pretty nifty, and it even replaced the branch with a select, but it also rewrote the entire function because it found some other optimizations along the way. Let’s skip opt this time and rewrite the function by hand without stack space, using a PHI node.

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(bc_add, true_path);
phi->addIncoming(bc_sub, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %b_plus_c, %true ], [ %b_minus_c, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 07                   jle    b <LLVMAdd+0xb>
   4:   01 d6                   add    %edx,%esi
   6:   01 fe                   add    %edi,%esi
   8:   89 f0                   mov    %esi,%eax
   a:   c3                      retq   
   b:   29 d6                   sub    %edx,%esi
   d:   01 fe                   add    %edi,%esi
   f:   89 f0                   mov    %esi,%eax
  11:   c3                      retq

This is a very pure translation of our IR, nice.

Calling functions

Fortunately, this is quite straightforward. Let’s replace the two branches with actual function calls. You can probably tell by now what the C code is supposed to look like.

FunctionType *function_type = FunctionType::get(result_type, argument_types, vararg);
Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    module);
Function *true_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMTrue",
    module);

Function *false_function = Function::Create(
    function_type, 
    Function::ExternalLinkage,
    "LLVMFalse", module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *call_args[] = { arg0, arg1, arg2 };

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *true_call = builder.CreateCall(true_function, call_args);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *false_call = builder.CreateCall(false_function, call_args);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(true_call, true_path);
phi->addIncoming(false_call, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %3 = call i32 @LLVMTrue(i32 %0, i32 %1, i32 %2)
  br label %merge

false:                                            ; preds = %entry
  %4 = call i32 @LLVMFalse(i32 %0, i32 %1, i32 %2)
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %3, %true ], [ %4, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

declare i32 @LLVMTrue(i32, i32, i32)

declare i32 @LLVMFalse(i32, i32, i32)

Here we’re trying to call functions which haven’t been defined in the module. It is up to the linker to resolve this later. Let’s see what happens if we build this as a dynamic library.

0000000000001000 <.plt>:
    1000:       ff 35 02 30 00 00       pushq  0x3002(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1006:       ff 25 04 30 00 00       jmpq   *0x3004(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    100c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000001010 <LLVMFalse@plt>:
    1010:       ff 25 02 30 00 00       jmpq   *0x3002(%rip)        # 4018 <LLVMFalse>
    1016:       68 00 00 00 00          pushq  $0x0
    101b:       e9 e0 ff ff ff          jmpq   1000 <.plt>

0000000000001020 <LLVMTrue@plt>:
    1020:       ff 25 fa 2f 00 00       jmpq   *0x2ffa(%rip)        # 4020 <LLVMTrue>
    1026:       68 01 00 00 00          pushq  $0x1
    102b:       e9 d0 ff ff ff          jmpq   1000 <.plt>

Disassembly of section .text:

0000000000001030 <LLVMAdd>:
    1030:       53                      push   %rbx
    1031:       89 fb                   mov    %edi,%ebx
    1033:       85 ff                   test   %edi,%edi
    1035:       7e 0b                   jle    1042 <LLVMAdd+0x12>
    1037:       89 df                   mov    %ebx,%edi
    1039:       e8 e2 ff ff ff          callq  1020 <LLVMTrue@plt>
    103e:       01 d8                   add    %ebx,%eax
    1040:       5b                      pop    %rbx
    1041:       c3                      retq   
    1042:       89 df                   mov    %ebx,%edi
    1044:       e8 c7 ff ff ff          callq  1010 <LLVMFalse@plt>
    1049:       01 d8                   add    %ebx,%eax
    104b:       5b                      pop    %rbx
    104c:       c3                      retq  

The noise before the function is dynamic library gunk. If we had linked with --no-undefined, we would see:

ld: test.o: in function `LLVMAdd':
mymodule:(.text+0xa): undefined reference to `LLVMTrue'
ld: mymodule:(.text+0x15): undefined reference to `LLVMFalse'

This is basically how we will call into our emulator host code to do various things: handle syscalls, deal with calls into other emulated code, and so on. The possibilities are endless now. When we JIT, we can pass function pointers to our own host code into the symbol resolver, and the JIT patches calls to them straight into the generated code. Nifty!

At this point, with some searching around, you should be able to figure out how to generate the IR you want. We’ve covered the basics I think.

JIT compilation

JIT-ing these llvm::Modules is in theory quite straightforward; there’s just a lot of API noise to go through. We create:

  • llvm::LLVMContext
  • llvm::orc::ExecutionSession
  • llvm::orc::RTDyldObjectLinkingLayer
  • llvm::orc::IRCompileLayer<>
  • llvm::TargetMachine
  • llvm::orc::MangleAndInterner
  • llvm::DataLayout

There is too much code to go through here, so just have a look at the gists I made instead for now 😀 They should be reusable if you want to play around with this.

For LLVM 7, you’ll want to define JITTER_LLVM_VERSION_LEGACY. For LLVM 8, don’t.

The API usage should be fairly simple.

int my_add(int a, int b) { return a + b; }

// ...

using namespace llvm;
JITTIR::Jitter jitter;

// Let JIT link against this symbol.
jitter.add_external_symbol("my_add", my_add);

// Allocate a module.
auto module = jitter.create_module("mymodule");
auto &context = module->getContext();

// Build the IR
Type *int32 = Type::getInt32Ty(context);
Type *result_type = int32;
Type *argument_types[] = { int32, int32 };
bool vararg = false;

FunctionType *function_type = FunctionType::get(
    result_type,
    argument_types,
    vararg);

Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    *module);

Function *my_add_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "my_add",
    *module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *args[] = { arg0, arg1 };
Value *added = builder.CreateCall(my_add_function, args, "added");
builder.CreateRet(added);

// Add the module to the JIT, it's immediately compiled.
jitter.add_module(std::move(module));

// Get function pointer to symbol and execute it.
auto *fn_ptr =
    reinterpret_cast<int (*)(int, int)>(jitter.get_symbol_address("LLVMAdd"));
printf("10 + 20 = %d\n", fn_ptr(10, 20));

Which should print 30.

(gdb) disas fn_ptr,fn_ptr+20
Dump of assembler code from 0x7fb9d7bcc000 to 0x7fb9d7bcc014:
   0x00007fb9d7bcc000:	push   %rax
   0x00007fb9d7bcc001:	movabs $0x55d0584c33ea,%rax
   0x00007fb9d7bcc00b:	callq  *%rax
   0x00007fb9d7bcc00d:	pop    %rcx
   0x00007fb9d7bcc00e:	retq   
   0x00007fb9d7bcc00f:	add    %al,(%rax) <- zero memory
   0x00007fb9d7bcc011:	add    %al,(%rax)
   0x00007fb9d7bcc013:	add    %al,(%rax)

This is just the bare minimum, simplest approach we can take to JIT compilation in LLVM, but it works. The calling sequence for external functions looks a bit odd (load an absolute address into a register, then call it indirectly), but it’s probably done this way so other call addresses can easily be patched in later; I’m not sure.

Conclusion

Hopefully this was informative at least. In part 3, we will put these tools to use and bring up the MIPS recompiler itself.