In vkd3d-proton, we’re translating D3D12 to Vulkan, and the single biggest piece I’ve contributed so far is the translation of shader model 6 to Vulkan. We already had DXBC support, but what is a graphics API without two completely incompatible IR formats, right? Getting working DXIL support in vkd3d-proton was the biggest target when I began working on vkd3d. The result of this work is a standalone library and tool, dxil-spirv.
Introduction
In this blog series I’d like to go through the DXIL format, explain the problems I’ve had to solve and hopefully serve as an introduction to basic compiler theory, explained by yours truly who has no idea what they’re talking about. What could go wrong!?
I never learned any of this formally; through working on SPIRV-Cross and dxil-spirv, I just had to pick it up by trying and failing. Most of this theory is locked behind academic and computer science jargon, because it is a domain where rigor is actually necessary, and the algorithms for most of it have been well known since the 70s (60s?). This isn’t graphics programming where we can be “close enough” and hand-wave problems away. The smallest inaccuracy will break anything you come up with. Mix this with recursive tree traversal algorithms and all hell breaks loose when you least expect it. I wonder where compiler engineers pick this stuff up. Is there a secret club I’m not invited to? :p
DXIL has been, and continues to be, extremely painful to translate correctly, with new edge cases popping up every week it seems. The pain boils down to three core problems, which I intend to address over the course of this blog series.
Rewrite goto soup to strict structured control flow
Of the three core problems, unstructured control flow, aka goto soup, is the really difficult one, and solving it is one of the most painful problems I’ve ever worked on. Every time I think I’ve solved it, the next AAA game finds a new way to break it. At the very least, it seems to get a little easier with every iteration, so we’re getting asymptotically correct!
It’s very likely that the existing solution in dxil-spirv is just garbage, and it needs to be rewritten from scratch at some point, but at least the current implementation can play some vidya, so, eh.
LLVM 3.7 IR
LLVM is another problem, since we need to consume a barely documented IR format which was never designed to be consumed as a standard interchange format to begin with. The official DXC compiler is literally a fork of LLVM, forever frozen at version 3.7.
It is of course possible (and intended) to use the LLVM library directly, parse the IR that way, and then iterate over the llvm::Module to create SPIR-V, but this requires shipping yet another vendored LLVM copy, and we all love those, don’t we … From a practical point of view, shipping that is just not acceptable: it clocks in at a nice 40+ MB blob, compared to the 2 MB d3d12.dll binary vkd3d-proton compiles to.
The first iteration of dxil-spirv did indeed target the LLVM APIs directly for practical reasons. Not having to spend months figuring out the arcane bytecode format certainly helped! Later, I had to write an LLVM API replacement – with some help from RenderDoc’s low-level parser to get started – which does exactly what is required to parse DXIL. Overall, it has been very helpful to keep a native LLVM path alive through API replication, since I can always cross-check the parsing results against the real deal with -DDXIL_SPIRV_NATIVE_LLVM=ON. It worked out very well in the end!
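To make the approach concrete, the “just link against LLVM” route looks roughly like the sketch below. This targets the modern upstream C++ API rather than the 3.7 one (which spells some headers and return types differently), and it is not the actual dxil-spirv code, just a minimal illustration of parsing a .bc file and walking the module.

// Minimal sketch of the "link against LLVM and walk llvm::Module" approach.
// Modern upstream C++ API; LLVM 3.7 differs in some headers and return types.
#include <llvm/Bitcode/BitcodeReader.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/Error.h>
#include <llvm/Support/MemoryBuffer.h>
#include <cstdio>

int main()
{
    llvm::LLVMContext context;
    auto buffer = llvm::MemoryBuffer::getFile("test.bc");
    if (!buffer)
        return 1;

    auto module = llvm::parseBitcodeFile((*buffer)->getMemBufferRef(), context);
    if (!module)
    {
        llvm::consumeError(module.takeError());
        return 1;
    }

    // Walk every instruction and pick out the dx.op.* intrinsic calls,
    // which is where a SPIR-V emitter would hook in.
    for (llvm::Function &func : **module)
        for (llvm::BasicBlock &bb : func)
            for (llvm::Instruction &inst : bb)
                if (auto *call = llvm::dyn_cast<llvm::CallInst>(&inst))
                    if (const llvm::Function *callee = call->getCalledFunction())
                        std::printf("%s\n", callee->getName().str().c_str());
}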
Virtualized resource hell
HLSL, and by extension DXIL, does not concern itself one bit with how resources are actually accessed by the implementation. Given similar code, there are about 8 ways to codegen it depending on context obtained from the global or local root signatures. Fun! On top of this, we need vendor-specific hackery for perf and feature workarounds.
In this post
As this post is just an introduction to the format, no difficult problems are presented yet. 🙂
Anatomy of a shader blob
Our journey begins at the output from DXC, a raw shader binary:
// $ dxc -Tps_6_0 -Fo test.dxil test.frag
float4 main(float4 a : A) : SV_Target
{
    return a;
}
When creating a PSO in D3D12, you hand the API this blob of data. The container format is basically the same as DXBC; the only real difference is the chunk tagged DXIL. Based on this, we know how to dispatch compilation in vkd3d-shader.
00000000  44 58 42 43 00 00 00 00  00 00 00 00 00 00 00 00  |DXBC............|
.......
00000140  b0 04 00 00 60 00 00 00  2c 01 00 00 44 58 49 4c  |....`...,...DXIL|
....
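As a rough sketch of what dispatching on that tag amounts to, the snippet below scans the container for a DXIL chunk. The offsets follow the publicly known DXBC/DXIL container layout (“DXBC” magic, 16-byte hash, version, total size, chunk count, then per-chunk offsets); it is not the actual vkd3d-shader code.

// Rough sketch: find the DXIL chunk inside a DXBC/DXIL container blob.
// Each chunk starts with its own FourCC followed by its size.
#include <cstddef>
#include <cstdint>
#include <cstring>

bool blob_contains_dxil(const uint8_t *data, size_t size)
{
    if (size < 32 || std::memcmp(data, "DXBC", 4) != 0)
        return false;

    uint32_t chunk_count;
    std::memcpy(&chunk_count, data + 28, sizeof(chunk_count));
    if (size < 32 + 4 * size_t(chunk_count))
        return false;

    for (uint32_t i = 0; i < chunk_count; i++)
    {
        uint32_t offset;
        std::memcpy(&offset, data + 32 + 4 * i, sizeof(offset));
        if (offset + 8 <= size && std::memcmp(data + offset, "DXIL", 4) == 0)
            return true;
    }
    return false;
}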
For development, it is then useful to extract the raw LLVM IR part. We can do this with dxil-extract from dxil-spirv:
$ dxil-extract --output test.bc test.dxil   # Extract raw LLVM bytecode
$ llvm-dis test.bc                          # Disassemble to LLVM assembly
$ cat test.ll
; ModuleID = 'test.bc'
source_filename = "test.bc"
target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

define void @main() {
  %1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
  %2 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)
  %3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 2, i32 undef)
  %4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 3, i32 undef)
  call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)
  call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %2)
  call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %3)
  call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %4)
  ret void
}

; Function Attrs: nounwind readnone
declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0

; Function Attrs: nounwind
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1

attributes #0 = { nounwind readnone }
attributes #1 = { nounwind }

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.viewIdState = !{!4}
!dx.entryPoints = !{!5}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{[6 x i32] [i32 4, i32 4, i32 1, i32 2, i32 4, i32 8]}
!5 = !{void ()* @main, !"main", !6, null, null}
!6 = !{!7, !11, null}
!7 = !{!8}
!8 = !{i32 0, !"A", i8 9, i8 0, !9, i8 2, i32 1, i8 4, i32 0, i8 0, !10}
!9 = !{i32 0}
!10 = !{i32 3, i32 15}
!11 = !{!12}
!12 = !{i32 0, !"SV_Target", i8 9, i8 16, !9, i8 0, i32 1, i8 4, i32 0, i8 0, !10}
Convenient! This is the same output we would get from dxc if we don’t write to a file. The next question is what all of this means. The first entry point into understanding what is going on is DXIL.rst. Unfortunately, this documentation is only good for the bring-up phase of DXIL compilation. The documentation just … stops eventually. At that point we’re on our own, but it’s not like it’s that hard to figure out what is going on. We have the source after all, and we can correlate test HLSL with the output.
What is DXIL, really?
At a fundamental level, DXIL is just LLVM 3.7 with some sugar on top. The sugar adds the magic that normal LLVM code cannot express directly, but LLVM has generic mechanisms for exactly this kind of extension. The solutions feel very clunky at times, but that is clearly the compromise you accept when shoehorning a shading language into something as generic as LLVM.
Intrinsics
In this simple case, we’re starting to see code like:
declare float @dx.op.loadInput.f32(i32, i32, i32, i8, i32) #0
declare void @dx.op.storeOutput.f32(i32, i32, i32, i8, float) #1

%1 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %1)
If we cannot express a feature in raw LLVM, it becomes a dx.op intrinsic, declared with external linkage. The first argument is a constant int which encodes the actual opcode to use; the loadInput and storeOutput names are not significant here. There are over 200 opcodes like this, which is not surprising given how much junk we’ve accumulated in shading languages over the years. Some of the opcodes are documented, some are not. *shrug* 🙂
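In practice, a consumer just pulls out that first operand as a constant and switches on it. The sketch below uses the opcode values visible in the listings in this post (loadInput = 4, storeOutput = 5, sample = 60); everything inside the cases is hand-waved and not how dxil-spirv is actually structured.

// Sketch of dispatching a dx.op.* call on its opcode operand (operand 0).
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <cstdint>

enum class DXOp : uint32_t
{
    LoadInput = 4,
    StoreOutput = 5,
    Sample = 60,
};

bool translate_dx_op(const llvm::CallInst &call)
{
    auto *opcode = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(0));
    if (!opcode)
        return false;

    switch (DXOp(opcode->getZExtValue()))
    {
    case DXOp::LoadInput:
        // ... emit a load from the stage input referenced by operands 1..3 ...
        return true;
    case DXOp::StoreOutput:
        // ... emit a store to the matching stage output ...
        return true;
    case DXOp::Sample:
        // ... emit an image sample using the handle operands ...
        return true;
    default:
        return false;
    }
}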
Scalar code
In SPIR-V, there is explicit support for vectors and matrices, but DXIL is flat and scalar (except when it isn’t, of course). With modern GPU architectures, 32-bit arithmetic is scalar anyway, but 16-bit and 8-bit arithmetic ops are packed operations on all modern GPUs I know of, so it feels a bit weird to go full scalar. I guess backend compilers just need to suck it up and learn the dark art of re-vectorization. We also end up with quite bloated code, since every vector operation has to be unrolled. On the upside, there are fewer cases to implement and test in dxil-spirv. As an example, take this shader:
Texture2D<float4> T;
SamplerState S;

float4 main(float2 uv : TEXCOORD) : SV_Target
{
    return T.Sample(S, uv);
}
In places where we cannot ignore vectors, DXIL sometimes reaches for structs instead of the far more natural vector type.
// The resource API is quite peculiar, for later ...
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false)
%2 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 3, i32 0, i32 0, i1 false)

// Load UV, one component at a time
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
%4 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 1, i32 undef)

// Sample
%5 = call %dx.types.ResRet.f32 @dx.op.sample.f32(i32 60, %dx.types.Handle %1, %dx.types.Handle %2, float %3, float %4, float undef, float undef, i32 0, i32 0, i32 undef, float undef)

// Extract components one by one ...
%6 = extractvalue %dx.types.ResRet.f32 %5, 0
%7 = extractvalue %dx.types.ResRet.f32 %5, 1
%8 = extractvalue %dx.types.ResRet.f32 %5, 2
%9 = extractvalue %dx.types.ResRet.f32 %5, 3

// There is actually a 5th member! If that member is statically used,
// the sample opcode is actually a sparse residency query.
// Isn't this fun? Welcome to my personal hell :3

// Store result one by one ...
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %6)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 1, float %7)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 2, float %8)
call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 3, float %9)
With a simple spot check, we get binary sizes:
- DXC (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug): 2063 bytes
- FXC (-Tps_5_0 test.frag -Qstrip_reflect -Qstrip_debug): 268 bytes
- DXC SPIR-V (-Tps_6_0 test.frag -Qstrip_reflect -Qstrip_debug -spirv): 744 bytes
Fun … DXBC is very compact, at least it has that going for it.
Metadata
In SPIR-V, we have execution modes, decorations and various instructions to encode metadata. These are instructions like any other opcode, but LLVM has a separate system that lives on the side: the metadata nodes. In the IR assembly, we see a spider web of weird data structures.
!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.viewIdState = !{!10}
!dx.entryPoints = !{!11}

!0 = !{!"clang version 3.7 (tags/RELEASE_370/final)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 6}
!3 = !{!"ps", i32 6, i32 0}
!4 = !{!5, null, null, !8}
!5 = !{!6}
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}
!7 = !{i32 0, i32 9}
!8 = !{!9}
!9 = !{i32 0, %struct.SamplerState* undef, !"", i32 4, i32 3, i32 1, i32 0, null}
!10 = !{[4 x i32] [i32 2, i32 4, i32 15, i32 15]}
!11 = !{void ()* @main, !"main", !12, !4, null}
!12 = !{!13, !17, null}
!13 = !{!14}
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}
!15 = !{i32 0}
!16 = !{i32 3, i32 3}
!17 = !{!18}
!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}
!19 = !{i32 3, i32 15}
What does this even mean? These nodes encode the entry point(s), which resources exist, the stage IO variables, and everything like that. It is at least documented, just awkward to read.
!dx.entryPoints = !{!11}
!11 = !{
    void ()* @main, /* Pointer to function */
    !"main",        /* Name of entry point. Not relevant before DXR. */
    !12,            /* List of stage IO types */
    !4,             /* Resource lists */
    null}

!12 = !{
    !13,  /* Stage input */
    !17,  /* Stage output */
    null} /* Patch IO for tessellation */

/* Encode things like semantic, register offsets, component offset, etc ... */
!14 = !{i32 0, !"TEXCOORD", i8 9, i8 0, !15, i8 2, i32 1, i8 2, i32 0, i8 0, !16}
!18 = !{i32 0, !"SV_Target", i8 9, i8 16, !15, i8 0, i32 1, i8 4, i32 0, i8 0, !19}

!4 = !{
    !5,         /* SRVs */
    null, null, /* UAV, CBV */
    !8}         /* Samplers */

/* Encode register space, register #t, component types, etc, etc ... */
!6 = !{i32 0, %"class.Texture2D<vector<float, 4> >"* undef, !"", i32 2, i32 1, i32 1, i32 2, i32 0, !7}
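For reference, digging these nodes out with the LLVM C++ API looks roughly like the sketch below. It only touches the top level of !dx.entryPoints; the nested stage IO and resource lists are MDNodes that get unpacked the same way.

// Sketch: read the top level of !dx.entryPoints with the LLVM C++ API.
#include <llvm/IR/Metadata.h>
#include <llvm/IR/Module.h>
#include <cstdio>

void dump_entry_points(const llvm::Module &module)
{
    const llvm::NamedMDNode *entry_points = module.getNamedMetadata("dx.entryPoints");
    if (!entry_points)
        return;

    for (const llvm::MDNode *node : entry_points->operands())
    {
        if (node->getNumOperands() < 5)
            continue;

        // Operand 0: pointer to the entry function, wrapped in ValueAsMetadata.
        if (auto *func = llvm::dyn_cast_or_null<llvm::ValueAsMetadata>(node->getOperand(0).get()))
            std::printf("entry function: %s\n", func->getValue()->getName().str().c_str());

        // Operand 1: the entry point name as an MDString.
        if (auto *name = llvm::dyn_cast_or_null<llvm::MDString>(node->getOperand(1).get()))
            std::printf("entry name: %s\n", name->getString().str().c_str());

        // Operands 2 and 3 are the stage IO and resource lists, themselves MDNodes.
    }
}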
The instructions which interact with resources and stage IO work very differently from SPIR-V. In SPIR-V, these are represented as plain variables, but DXIL is a bit more indirect. For example:
%3 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)
Here we specify:
- Refer to stage input by index (directly refer to metadata)
- Refer to row (if the stage input is an array)
- Refer to component (must be a compile time constant, handy!)
I kinda like this approach actually. It gives us a specific place to deal with stage IO translation, so we don’t have to analyze and track random memory instructions in LLVM, which would be kinda horrible. Builtin semantics especially can get pretty weird in translation, and being able to handle all the addressing logic in one go is a life saver. I think this is also done to avoid the problem of having to deal with vectors. (But again, as we shall see later, someone missed the memo when implementing DXR.)
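Concretely, decoding a loadInput call is just a matter of pulling out those operands; the row may be dynamic while the component must be constant. A rough sketch (the StageInputRef struct is made up for illustration, not a dxil-spirv type):

// Sketch of decoding the loadInput operands described above.
// Operand 0 is the dx.op opcode itself.
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <cstdint>

struct StageInputRef
{
    uint32_t input_index; // index into the stage input metadata list
    llvm::Value *row;     // row; may be dynamic if the input is arrayed
    uint32_t component;   // always a compile time constant
};

bool decode_load_input(const llvm::CallInst &call, StageInputRef &ref)
{
    auto *index = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(1));
    auto *component = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(3));
    if (!index || !component)
        return false;

    ref.input_index = uint32_t(index->getZExtValue());
    ref.row = call.getArgOperand(2);
    ref.component = uint32_t(component->getZExtValue());
    return true;
}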
Resources work similarly:
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57,
    i8 0,     /* SRV */
    i32 0,    /* Index into metadata */
    i32 0,    /* Array offset */
    i1 false) /* NonUniformResourceIndex? */
Again, we request a handle where we refer directly to the metadata. In this case, we request SRV with metadata index #0. This isn’t bad. (But again, as we shall see, there is a completely separate system for accessing resources in DXR … ._. Why …)
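The decode is the same story as loadInput. In the sketch below, SRV = 0 and Sampler = 3 match the listings above; UAV = 1 and CBV = 2 are my assumption about the rest of the enum, and HandleRef is made up for illustration.

// Sketch of decoding a createHandle call into the pieces annotated above.
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <cstdint>

enum class ResourceClass : uint32_t { SRV = 0, UAV = 1, CBV = 2, Sampler = 3 };

struct HandleRef
{
    ResourceClass resource_class;
    uint32_t metadata_index;   // which entry in the matching resource list
    llvm::Value *array_offset; // may be dynamic for resource arrays
    bool non_uniform;
};

bool decode_create_handle(const llvm::CallInst &call, HandleRef &ref)
{
    auto *klass = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(1));
    auto *index = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(2));
    auto *non_uniform = llvm::dyn_cast<llvm::ConstantInt>(call.getArgOperand(4));
    if (!klass || !index || !non_uniform)
        return false;

    ref.resource_class = ResourceClass(klass->getZExtValue());
    ref.metadata_index = uint32_t(index->getZExtValue());
    ref.array_offset = call.getArgOperand(3);
    ref.non_uniform = non_uniform->getZExtValue() != 0;
    return true;
}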
As a convenient feature, dxil-spirv can emit GLSL from its SPIR-V output through SPIRV-Cross’s API, so we can inspect the resulting codegen, which in this case closely resembles a 1:1 instruction conversion.
#version 460

layout(set = 1, binding = 2) uniform texture2D _8;
layout(set = 3, binding = 4) uniform sampler _11;

layout(location = 0) in vec2 TEXCOORD;
layout(location = 0) out vec4 SV_Target;

void main()
{
    vec4 _31 = texture(sampler2D(_8, _11), vec2(TEXCOORD.x, TEXCOORD.y)); // le sigh <_<
    SV_Target.x = _31.x;
    SV_Target.y = _31.y;
    SV_Target.z = _31.z;
    SV_Target.w = _31.w;
}
I’ve certainly seen much worse (DXBC cross compilation comes to mind … *shudder*), but this is not the true codegen that vkd3d-proton would use, since we need to transform this into fully bindless code. That’s for later.
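For reference, the SPIR-V to GLSL step through SPIRV-Cross’s public C++ API is roughly the sketch below; this shows the public API, not what dxil-spirv’s tooling does internally.

// Rough sketch of the SPIR-V -> GLSL step using SPIRV-Cross's public C++ API.
#include <spirv_glsl.hpp>
#include <cstdint>
#include <string>
#include <vector>

std::string spirv_to_glsl(std::vector<uint32_t> spirv_words)
{
    spirv_cross::CompilerGLSL compiler(std::move(spirv_words));

    spirv_cross::CompilerGLSL::Options options;
    options.version = 460;
    options.vulkan_semantics = true; // keep set/binding decorations intact
    compiler.set_common_options(options);

    return compiler.compile();
}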
Conclusion
This is a very basic view of how the format is put together. Next time, I’ll probably have a deeper look into the raw LLVM IR and how that is parsed. Better bring bleach for your eyes!