Mip-mapping is hard – importance of keeping your quads full
Sampling textures with mip-mapping is ancient, but apparently it’s still hard. Implicit LOD calculation is the first instance where we poke a hole into the “single threaded” abstraction of high level shading languages like GLSL and HLSL and dive into the maddening world of warps, waves, quads and everything in-between. For fragment shading, at least a group of 2×2 threads (a quad) needs to run side by side so we can have gradient information over the screen. On any modern GPU, these 2×2 threads actually run in lock-step, as we’ll see when looking at GPU ISA later …
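To get a feel for why the quad matters, here is a rough sketch of what the implicit LOD boils down to in GLSL terms (a simplified take on the spec’s LOD equation, ignoring anisotropy, bias and clamping, and assuming a sampler2D uSampler and an interpolated vec2 uv):

```
// dFdx/dFdy are computed by differencing uv between neighboring lanes in the
// 2x2 quad, which is why all four lanes need valid texture coordinates.
vec2 dx = dFdx(uv) * vec2(textureSize(uSampler, 0));
vec2 dy = dFdy(uv) * vec2(textureSize(uSampler, 0));
// Roughly the spec's scale factor: take the larger footprint and log2 it.
float lod = 0.5 * log2(max(dot(dx, dx), dot(dy, dy)));
```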
The Vulkan/GL/GLES ecosystem has always specified that implicit LOD instructions must happen in dynamically uniform control flow. Dynamically uniform just means that either all threads execute the texture() instruction, or no threads do. This ensures that there are always 4 valid texture coordinates from which to compute derivatives. The easiest way to ensure that this guarantee holds is simply to never sample in control flow, but that’s not really practical in more interesting shaders.
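In code, that trivially safe pattern looks something like this (a minimal sketch with placeholder names; from_somewhere(), Texture, weight and sum are stand-ins): the sample stays in quad-uniform control flow, and only the use of the result diverges.

```
vec2 uv = from_somewhere();
// Sample unconditionally so the implicit LOD always sees a full quad ...
vec4 tex = texture(Texture, uv);
// ... and let only the use of the result diverge.
if (weight > 0.0)
    sum += weight * tex;
```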
If you’re sampling in control flow, you’d better make sure you uphold the guarantees of the spec.
Having to be dynamically uniform over an entire draw call is a bit silly, so the Vulkan specification recently narrowed the scope: if subgroupSize >= 4, you only need to be dynamically uniform at per-quad granularity. This makes sense. We only need correct derivatives within the quad; we shouldn’t have to care if some unrelated quad, or even another triangle, is diverging.
An interesting case came up recently where developers apparently expect that you actually can sample with implicit LOD in divergent control flow. HLSL supposedly “defines” this kind of control flow to be valid code.
```
vec2 uv = from_somewhere();
if (weight > 0.0)
    sum += weight * texture(Texture, uv);
```
The idea is that we shouldn’t have to sample the texture unless we’re actually going to use it, but it’s still nice to provide the UV for LOD purposes. Unfortunately, there is no obvious way to express this optimization in high level languages. The UV is well defined in the outer scope, which is dynamically uniform, so that’s something … Intuitively, this code makes sense, but it gets really murky once we dig deeper.
With subgroup ops, we can probably get a good approximation on the HLL side.
```
bool quadAny(bool value)
{
    // Perhaps this can be translated into s_wqm on AMD
    // if compiler checks this pattern?
    return subgroupClusteredOr(int(value), 4) != 0;
}

vec2 uv = from_somewhere();
// Hoist texture sampling out of branch and force quad uniformity.
vec4 tex;
if (quadAny(weight > 0.0))
    tex = texture(Texture, uv);
if (weight > 0.0)
    sum += weight * tex;
```
Querying gradients up front and then sampling with those inside the branch works as well, but it is slow, and not really a fix; at best a workaround.
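For reference, that workaround looks roughly like this (a sketch using the same placeholder names as above): the gradients are computed while control flow is still quad-uniform, so the sample inside the branch no longer relies on implicit LOD.

```
vec2 uv = from_somewhere();
// Compute gradients up front, while control flow is still quad-uniform.
vec2 dx = dFdx(uv);
vec2 dy = dFdy(uv);
if (weight > 0.0)
{
    // Explicit gradients are well defined in divergent control flow,
    // but typically slower than a plain implicit-LOD texture() call.
    sum += weight * textureGrad(Texture, uv, dx, dy);
}
```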
HLSL seems a bit murky about whether this kind of code is legal; it’s all “that one app did this thing that one time and now we’re screwed”. From my understanding, compilers can do some heroics here to work around this in applications.
I wanted to try this kind of code on all Vulkan devices I have available to see what happens. We’re in undefined territory as far as LOD goes, so anything can happen. There are three outcomes I’m looking for which seem like plausible HW behavior:
- It just happens to work. This is kinda scary, since it’ll probably break in 5 years anyways.
- The LOD computed is garbage.
- The LOD is forced to some value on divergence.
Here’s the concrete shader I’m using, from https://github.com/Themaister/Granite/blob/master/tests/assets/shaders/divergent_lod.frag. A test to run the shader is https://github.com/Themaister/Granite/blob/master/tests/divergent_lod_test.cpp
```
#version 450
layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;
layout(set = 0, binding = 1) uniform sampler2D uSampler;
layout(set = 0, binding = 0, std140) uniform Weights
{
    vec4 weights[4];
};

void main()
{
    vec3 tex = vec3(0.0);
    float lod = -10.0;
    vec2 uv = vUV;
    if (weights[int(gl_FragCoord.x) + 2 * int(gl_FragCoord.y)].x > 0.0)
    {
        tex = texture(uSampler, uv).rgb;
        lod = textureQueryLod(uSampler, uv).y;
    }
    FragColor = vec4(tex, lod);
}
```
I render this to a 2×2 framebuffer with a full-screen “expanded triangle” so I don’t get any helper lane shenanigans. Let’s run this across a wide range of hardware and see what happens. NOTE: any result here is equally valid in Vulkan; this is intentionally going out of spec.
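For context, the usual “expanded triangle” trick looks roughly like this (a sketch, not necessarily the exact vertex shader the test uses): a single oversized triangle covers the whole viewport, so all four fragments of the 2×2 framebuffer are real invocations and no helper lanes are needed to fill out the quad.

```
#version 450
layout(location = 0) out vec2 vUV;

void main()
{
    // One triangle spanning [-1, 3] in clip space, generated from gl_VertexIndex,
    // covers the entire render target with no primitive edge crossing it.
    vUV = vec2((gl_VertexIndex << 1) & 2, gl_VertexIndex & 2);
    gl_Position = vec4(vUV * 2.0 - 1.0, 0.0, 1.0);
}
```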
AMD
I tested this on a Navi card. The RDNA ISA seems similar enough to GCN … We effectively have four driver stacks to test for AMD cards now.
RADV (LLVM 10)
Garbage LOD
```
main:
BB16_0:
	s_mov_b64 s[0:1], exec
	s_wqm_b64 exec, exec
	v_cvt_i32_f32_e32 v3, v3
	s_mov_b32 s6, s3
	s_movk_i32 s7, 0x8000
	v_cvt_i32_f32_e32 v2, v2
	v_mov_b32_e32 v5, 0xc1200000
	s_load_dwordx4 s[8:11], s[6:7], 0x0
	v_lshlrev_b32_e32 v3, 1, v3
	v_add_lshl_u32 v2, v3, v2, 4
	s_waitcnt lgkmcnt(0)
	buffer_load_dword v4, v2, s[8:11], 0 offen
	v_mov_b32_e32 v2, 0
	v_mov_b32_e32 v3, v2
	s_waitcnt vmcnt(0)
	v_cmp_lt_f32_e32 vcc, 0, v4 ; against 0
	v_mov_b32_e32 v4, v2
	s_and_saveexec_b64 s[8:9], vcc
	s_cbranch_execz BB16_2
BB16_1:
	s_mov_b32 s3, s7
	s_add_i32 s6, s2, 0x50
	s_mov_b32 m0, s4
	s_load_dwordx8 s[12:19], s[2:3], 0x0
	s_load_dwordx4 s[20:23], s[6:7], 0x0
	v_interp_p1_f32_e32 v7, v0, attr0.x
	v_interp_p1_f32_e32 v8, v0, attr0.y
	v_interp_p2_f32_e32 v7, v1, attr0.x
	v_interp_p2_f32_e32 v8, v1, attr0.y
	s_waitcnt lgkmcnt(0)
	image_sample v[2:4], v[7:8], s[12:19], s[20:23] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v5, v[7:8], s[12:19], s[20:23] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB16_2:
	v_nop
	s_or_b64 exec, exec, s[8:9]
	s_and_b64 exec, exec, s[0:1]
	s_waitcnt vmcnt(0)
	exp mrt0 v2, v3, v4, v5 done vm
	s_endpgm
```
We see that v7 and v8 hold the UV coordinates, but they are actually only computed inside the branch (v_interp). The optimizer is allowed to place UV computation inside the branch here. If there is divergence in a quad, the disabled lanes won’t get correct values for v7 and v8 (since execution is masked), and LOD becomes garbage.
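In shader terms, the generated code behaves roughly as if the interpolation had been sunk into the branch, something like this (a purely hypothetical GLSL rendering of the ISA above; interpolate_input() is a made-up stand-in for the v_interp pair):

```
if (weights[int(gl_FragCoord.x) + 2 * int(gl_FragCoord.y)].x > 0.0)
{
    // Hypothetical: interpolation only runs for active lanes, so masked-off
    // lanes in the quad feed stale/undefined UVs into the derivative math.
    vec2 uv = interpolate_input();
    tex = texture(uSampler, uv).rgb;
    lod = textureQueryLod(uSampler, uv).y;
}
```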
RADV (ACO)
Coupled with Navi cards, this is probably the most bleeding edge setup you can run. It’s a completely new compiler backend for AMD cards, not based on LLVM.
Just happens to work
```
BB0:
	s_wqm_b64 exec, exec
	s_mov_b32 s0, s3
	s_movk_i32 s1, 0x8000
	s_load_dwordx4 s[8:11], s[0:1], 0x0
	s_mov_b32 m0, s4
	v_interp_p1_f32_e32 v4, v0, attr0.y
	v_cvt_i32_f32_e32 v2, v2
	v_cvt_i32_f32_e32 v3, v3
	v_lshl_add_u32 v2, v3, 1, v2
	v_lshlrev_b32_e32 v2, 4, v2
	s_waitcnt lgkmcnt(0)
	buffer_load_dword v2, v2, s[8:11], 0 offen
	v_interp_p2_f32_e32 v4, v1, attr0.y
	v_interp_p1_f32_e32 v0, v0, attr0.x
	v_interp_p2_f32_e32 v0, v1, attr0.x
	v_mov_b32_e32 v1, v4
	s_waitcnt vmcnt(0)
	v_cmp_lt_f32_e32 vcc, 0, v2
	s_and_saveexec_b64 s[0:1], vcc
	s_cbranch_execz BB3
BB1:
	s_movk_i32 s3, 0x8000
	s_load_dwordx8 s[4:11], s[2:3], 0x0
	s_load_dwordx4 s[12:15], s[2:3], 0x50
	s_waitcnt lgkmcnt(0)
	image_sample v[2:4], v[0:1], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v0, v[0:1], s[4:11], s[12:15] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB3:
	s_andn2_b64 exec, s[0:1], exec
	s_cbranch_execz BB6
BB4:
	v_mov_b32_e32 v0, 0xc1200000
	v_mov_b32_e32 v2, 0
	v_mov_b32_e32 v3, 0
	v_mov_b32_e32 v4, 0
BB6:
	s_mov_b64 exec, s[0:1]
	s_waitcnt vmcnt(0)
	exp mrt0 v2, v3, v4, v0 done vm
	s_endpgm
```
This time, UV is interpolated outside the branch, so sampling in divergent control flow ends up working after all. The registers are well defined as they enter the branch. For AMD, it seems to come down to whether the lanes hold correct values and haven’t been clobbered by the time we get around to sampling. There don’t seem to be any hardware-level checks for divergence.
AMDVLK
Garbage LOD
AMDVLK uses the same LLVM stack that RADV LLVM uses, so no surprise: same result, and basically the exact same ISA is generated.
Windows
Also just happens to work
I guess it’s the exact same case as the ACO compiler here. No need to paste disassembly.
Intel
Tested on UHD 620 (8th gen mobile CPU I think).
Anvil (Mesa)
The Mesa compiler can spit out assembly, which is nice.
Just happens to work
ISA (a little too wide to embed): https://gist.github.com/Themaister/7c5b011cde3c7585459b089f80f897e2
From what I can make out of the ISA, the UV is interpolated outside control flow, and only the sampling takes place in control flow. It seems like Intel behaves similarly to AMD here: as long as the registers hold valid values, divergent sampling “works”.
Windows
Just happens to work
There doesn’t seem to be a way to get ISA out of the Windows driver, but I suppose it’s the same story as ANV here.
Nvidia
Tested on a Turing GPU on Linux. Didn’t bother testing on Windows as well considering the driver stack is basically the same.
LOD is clamped to 0, textureQueryLod returns -32.0.
Now we start seeing interesting behavior. Unfortunately, there is no public ISA to look at. The -32.0 LOD might look weird, but it is kind of expected: this is apparently the smallest representable LOD on this GPU. LOD is usually represented in some kind of fixed-point format, and log2(0) = -inf after all, so it presumably saturates to the most negative representable value (a sign bit plus 5 integer bits would bottom out at exactly -32.0, for instance).
I confirmed it worked as expected when using non-divergent execution as a sanity check.
Arm
Tested on Mali-G72.
LOD is clamped to 0, textureQueryLod returns -128.0.
Very similar behavior to Nvidia here, except the LOD clamps to -128.0 rather than -32.0, presumably just a different fixed-point range. I confirmed it worked as expected when using non-divergent execution as a sanity check.
QCOM
Tested on Adreno 506.
Garbage LOD
Again, no ISA to look at. I confirmed it worked as expected when using non-divergent execution as a sanity check.
Conclusion
Never ever rely on LOD behavior with divergent quads (EDIT: at least the way it’s specced out and implemented on Vulkan drivers right now). You’d be contributing to the pain and suffering of compiler engineers the world over. Staying quad-uniform is fine, though.