Hardcore Vulkan debugging – Digging deep on Linux + AMDGPU

Everyone battle-hardened in the world of Vulkan and D3D12 knows that debugging is ridiculously hard once we enter the domain of crashes and hangs. No one wants to do it, and seeing a random GPU crash show up is enough to make you want to quit graphics programming and take up farming on a remote island. Someone has to do it though, and given how questionable a lot of D3D12 content is w.r.t. correctness, this comes up a lot more often than we’d like in vkd3d-proton land.

The end goal of this post is to demonstrate the magical UMR tool on Linux, which I would argue is the only reasonable post-mortem debugging method currently available on PC. Before we go that deep, though, we need to look at the current state of crash debugging on PC and the bespoke tooling we have in vkd3d-proton to deal with crashes.

Eating just crumbs makes for a miserable meal

Breadcrumbs are a common technique that most modern APIs implement in some form. The goal of breadcrumbs is simply to narrow down which draws or dispatches caused the failure. This information is extremely limited, but it can sometimes be enough to figure out a crash if you’re very lucky and you have knowledge about the application’s intentions with every shader (from vkd3d-proton’s point of view, we obviously don’t).

Depending on the competency of the breadcrumb tool, you’d get this information:

  • A range of draws or dispatches which could potentially have caused failure.
    • Ideally, exactly the draw or dispatch which caused failure.
  • On a page fault, which address caused the failure?
    • Which resource corresponds to that address? It is also possible that the address does not correspond to any resource at all; causing true OOB on D3D12 and Vulkan is very easy. (A sketch of such a lookup follows right after this list.)
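
For illustration, mapping a fault address back to a resource can be as simple as a linear scan over the VA ranges the breadcrumb tool knows about. This is a minimal C sketch; the struct and helper names are made up for this post and are not vkd3d-proton code.

/* Hypothetical sketch: map a faulting GPU VA back to a resource.
 * None of these names exist in vkd3d-proton; this only illustrates the idea. */
struct resource_va_range
{
    uint64_t begin_va;
    uint64_t end_va;
    const char *debug_name;
};

static const struct resource_va_range *find_resource_for_fault(
        const struct resource_va_range *ranges, size_t count, uint64_t fault_va)
{
    size_t i;
    for (i = 0; i < count; i++)
    {
        if (fault_va >= ranges[i].begin_va && fault_va < ranges[i].end_va)
            return &ranges[i];
    }

    /* NULL means the address does not belong to any live resource,
     * i.e. a true OOB access or a use-after-free. */
    return NULL;
}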

As far as I know, this is where D3D12 on Windows ends, with two standard alternatives:

  • WriteBufferImmediate (Basically VK_AMD_buffer_marker)
  • DRED

There are vendor tools at least from NVIDIA and AMD which should make this neater, but I don’t have direct experience with any of these tools in D3D12, so let’s move on to the Vulkan side of things.

VK_AMD_buffer_marker breadcrumbs

Buffer markers are the simplest possible way to implement breadcrumbs. The basic idea is that a value is written to memory either before the GPU processes a command, or after work is done. On a device lost, the counters can be inspected. The user will have to instrument the code somehow, either through a layer or directly. In vkd3d-proton, we can enable debug code which automatically does this for all D3D12 commands with VKD3D_CONFIG=breadcrumbs (not available in release builds).
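
To make the idea concrete, here is a rough sketch of what the instrumentation boils down to if you do it by hand with the vendor extension rather than through the vkd3d-proton macros. The buffer, offsets and counter are assumptions for the sake of the example, and the layout is simplified compared to what vkd3d-proton actually records.

/* Sketch: bracket a dispatch with buffer markers by hand.
 * marker_buffer, the offsets and marker_counter are illustrative names only. */
uint32_t marker_value = ++marker_counter;

/* Written as soon as the command processor reaches this point. */
vkCmdWriteBufferMarkerAMD(cmd_buffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
        marker_buffer, 0 /* TOP slot */, marker_value);

vkCmdDispatch(cmd_buffer, x, y, z);

/* Written only once all preceding work has completed. */
vkCmdWriteBufferMarkerAMD(cmd_buffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
        marker_buffer, 4 /* BOTTOM slot */, marker_value);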

For example, from our dispatch implementation:

VK_CALL(vkCmdDispatch(list->vk_command_buffer, x, y, z));
VKD3D_BREADCRUMB_AUX32(x);
VKD3D_BREADCRUMB_AUX32(y);
VKD3D_BREADCRUMB_AUX32(z);
VKD3D_BREADCRUMB_COMMAND(DISPATCH);

Then it’s a matter of writing the breadcrumb somewhere:

cmd.type = VKD3D_BREADCRUMB_COMMAND_SET_BOTTOM_MARKER;
cmd.count = trace->counter;
vkd3d_breadcrumb_tracer_add_command(list, &cmd);

VK_CALL(vkCmdWriteBufferMarkerAMD(list->vk_command_buffer,
  VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
  host_buffer, ...));

trace->counter++;

cmd.type = VKD3D_BREADCRUMB_COMMAND_SET_TOP_MARKER;
cmd.count = trace->counter;
vkd3d_breadcrumb_tracer_add_command(list, &cmd);

VK_CALL(vkCmdWriteBufferMarkerAMD(list->vk_command_buffer,
  VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
  host_buffer, ...));

We’ll also record commands and their parameters in a side-band buffer so that we can display the faulting command buffers.

Another thing to consider is that the buffer we write to must be coherent with the host. When a device lost happens randomly in the middle of a command, we won’t have the opportunity to perform host memory barriers and signal a fence properly, so we must make sure the memory punches straight through to VRAM. On AMD, we can do this with:

memory_props = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
               VK_MEMORY_PROPERTY_HOST_COHERENT_BIT |
               VK_MEMORY_PROPERTY_HOST_CACHED_BIT;

if (is_supported(VK_AMD_device_coherent_memory))
{
   memory_props |= VK_MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD |
                   VK_MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD;
}

On fault, we scan through the host buffer, and if we observe a command list whose TOP and BOTTOM markers are neither 0 (never executed) nor UINT32_MAX (done), we report the range of commands that could have failed.
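
A minimal sketch of that scan, assuming one TOP/BOTTOM marker pair per command list context in the host-visible buffer (the struct and helper names are invented for illustration):

/* Sketch: find in-flight command list contexts after a device lost.
 * breadcrumb_markers, host_markers, NUM_TRACES and report_crash_region
 * are illustrative names only. */
struct breadcrumb_markers
{
    uint32_t top;    /* last command the command processor started */
    uint32_t bottom; /* last command known to have fully completed */
};

for (i = 0; i < NUM_TRACES; i++)
{
    const struct breadcrumb_markers *m = &host_markers[i];

    /* 0 means the context never started executing, UINT32_MAX means it ran
     * to completion. Anything in between was in flight when the GPU died. */
    if (m->top == 0 || m->bottom == UINT32_MAX)
        continue;

    /* Commands in (bottom, top] are the crash suspects. */
    report_crash_region(i, m->bottom, m->top);
}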

RADV speciality, making buffer markers actually useful

GPUs execute commands concurrently unless barriers are emitted. This means that there is a large range of potential draws or dispatches in flight at any one time. RADV_DEBUG=syncshaders adds barriers in between every command so that we’re guaranteed a hang will narrow down to a single command. No other Vulkan driver supports this, and it makes RADV the only practical driver for breadcrumb techniques, at least on Vulkan. Sure, it is possible to add barriers yourself between every command to emulate this, but for render passes, this becomes extremely annoying since you have to consider restarting the render pass for every draw call …
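
For reference, emulating the compute side of syncshaders yourself is just a full barrier after every dispatch. A minimal sketch, assuming core Vulkan 1.0 barriers (this is not what RADV does internally, just the brute-force user-side equivalent):

/* Sketch: brute-force serialization of compute work, so at most one
 * dispatch can be in flight and a hang pinpoints a single command. */
static void dispatch_serialized(VkCommandBuffer cmd, uint32_t x, uint32_t y, uint32_t z)
{
    const VkMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT,
    };

    vkCmdDispatch(cmd, x, y, z);
    vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
            0, 1, &barrier, 0, NULL, 0, NULL);
}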

As a simple example, I’ve hacked up one of the vkd3d-proton tests to write a bogus root descriptor address, which is a great way to crash GPUs in D3D12 and Vulkan.

When running with just breadcrumbs, it’s useless:

Device lost observed, analyzing breadcrumbs ...
Found pending command list context 1 in executable state, TOP_OF_PIPE marker 44, BOTTOM_OF_PIPE marker 0.
===== Potential crash region BEGIN (make sure RADV_DEBUG=syncshaders is used for maximum accuracy) =====
Command: top_marker
marker: 1
Command: set_shader_hash
hash: db5d68a6143611ad, stage: 20
Set arg: 0 (#0)
Set arg: 18446603340520357888 (#ffff800100400000)
Command: root_desc
Set arg: 0 (#0)
Set arg: 18446603340788793344 (#ffff800110400000)
Command: root_desc
Tag: ExecuteIndirect [MaxCommandCount, ArgBuffer cookie, ArgBuffer offset, Count cookie, Count offset]
Set arg: 1 (#1)
Set arg: 7 (#7)
Set arg: 16 (#10)
Set arg: 0 (#0)
Set arg: 0 (#0)
Set arg: 0 (#0)
Command: execute_indirect_unroll_compute
Command: bottom_marker
marker: 1
Command: top_marker
marker: 2
Command: execute_indirect
Command: bottom_marker

... A ton of commands

Command: barrier
Command: bottom_marker
marker: 44
===== Potential crash region END =====

Instead, with syncshaders, it becomes:

===== Potential crash region BEGIN (make sure RADV_DEBUG=syncshaders is used for maximum accuracy) =====
Command: top_marker
marker: 1
Command: set_shader_hash
hash: db5d68a6143611ad, stage: 20
Set arg: 0 (#0)
Set arg: 18446603340520357888 (#ffff800100400000)
Command: root_desc
Set arg: 0 (#0)
Set arg: 18446603340788793344 (#ffff800110400000) <-- bogus pointer
Command: root_desc
Tag: ExecuteIndirect [MaxCommandCount, ArgBuffer cookie, ArgBuffer offset, Count cookie, Count offset]
Set arg: 1 (#1)
Set arg: 7 (#7)
Set arg: 16 (#10)
Set arg: 0 (#0)
Set arg: 0 (#0)
Set arg: 0 (#0)
Command: execute_indirect_unroll_compute
Command: bottom_marker
marker: 1
===== Potential crash region END =====

That’s actionable.

It’s more widely supported than you’d expect

A lot of drivers actually support the buffer marker vendor extension, at least in Mesa land, and even NVIDIA does (although on NVIDIA we use another extension for breadcrumb purposes …).

Narrow down multiple queues

With async compute, it’s possible that multiple command streams are in flight, and with breadcrumbs, it’s not possible to determine which queue actually faulted. To aid this, we have VKD3D_CONFIG=single_queue which serializes everything into one VkQueue.

VK_NV_device_diagnostic_checkpoints

The NV vendor extension simplifies things a fair bit. Rather than allocating memory in sysmem and manually writing out markers, one call is made after every command:

VK_CALL(vkCmdSetCheckpointNV(list->vk_command_buffer,
        NV_ENCODE_CHECKPOINT(context, trace->counter)));

The argument is a void pointer where you can place whatever you want, so we encode the command list index and a counter there. On device fault, you can then query checkpoints from your queues.
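
The encoding itself can be anything that fits in a pointer. A hedged sketch of what such packing might look like, assuming a 64-bit target (the bit layout is my assumption for illustration, not the actual NV_ENCODE_CHECKPOINT definition):

/* Sketch: pack a command list context index and a per-command counter into
 * the void * checkpoint marker. The 32/32 bit split is illustrative only. */
static inline void *encode_checkpoint(uint32_t context_index, uint32_t counter)
{
    return (void *)(((uintptr_t)context_index << 32) | counter);
}

static inline uint32_t checkpoint_context(const void *marker)
{
    return (uint32_t)((uintptr_t)marker >> 32);
}

static inline uint32_t checkpoint_counter(const void *marker)
{
    return (uint32_t)((uintptr_t)marker & 0xffffffffu);
}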

VK_CALL(vkGetQueueCheckpointDataNV(vk_queue, &checkpoint_count, NULL));
if (checkpoint_count == 0)
  return;

checkpoints = vkd3d_calloc(checkpoint_count, sizeof(VkCheckpointDataNV));
for (i = 0; i < checkpoint_count; i++)
  checkpoints[i].sType = VK_STRUCTURE_TYPE_CHECKPOINT_DATA_NV;
VK_CALL(vkGetQueueCheckpointDataNV(vk_queue,
        &checkpoint_count, checkpoints));

From there, start looking for TOP_OF_PIPE and BOTTOM_OF_PIPE pipeline stages to get a potential range of commands. A BOTTOM_OF_PIPE checkpoint means we know for sure that all commands before it completed execution, while a TOP_OF_PIPE checkpoint means the command processor might have started executing all commands up to that point.
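
A sketch of how that interpretation might look, reusing the decode helpers from the packing sketch above (illustrative, not the exact vkd3d-proton logic):

/* Sketch: derive a crash-suspect range from the queried checkpoints. */
uint32_t completed = 0; /* highest counter seen at BOTTOM_OF_PIPE */
uint32_t started = 0;   /* highest counter seen at TOP_OF_PIPE */

for (i = 0; i < checkpoint_count; i++)
{
    uint32_t counter = checkpoint_counter(checkpoints[i].pCheckpointMarker);

    if (checkpoints[i].stage == VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT && counter > completed)
        completed = counter;
    else if (checkpoints[i].stage == VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT && counter > started)
        started = counter;
}

/* Everything up to `completed` is known good; commands in (completed, started]
 * may have started executing and are the crash suspects. */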

The main flaw with this extension is there is no easy way to narrow down the range of commands. With RADV we can enforce sync with syncshaders as a (very useful) hack, but there is no such method on NV unless we do it ourselves 🙁

Drilling into hard rock

If we can narrow down a breadcrumb to a specific shader – and it’s reproducible – it might be time to perform the dark art of shader replacement and GPU debug printing. We know where it crashes, but why is still a mystery.

Shader replacement is a special kind of magic that vkd3d-proton needs to consider since we have no means of modifying the original game shaders directly. We get (incomprehensible) DXIL which we then translate into SPIR-V. vkd3d-proton supports bypassing the translation step, and using our own SPIR-V instead. This SPIR-V can be instrumented with debug code which lets us drill down. This is a horribly slow and tedious process, but it’s the only thing we have to inspect shader execution in real time.

Dumping DXIL and SPIR-V

First, we have to dump all shaders used by a game.

mkdir /tmp/shaders
VKD3D_SHADER_DUMP_PATH=/tmp/shaders %command%

From a breadcrumb trace, we’ll hopefully know the shader hashes which are relevant to look at, and we can round-trip them to Vulkan GLSL using SPIRV-Cross.

mkdir /tmp/override
cp /tmp/shaders/$hash.* /tmp/override
cd /tmp/override
python3 ~/git/dxil-spirv/roundtrip_shaders.py --input . --output . --spirv-cross $PATH_TO_SPIRV_CROSS

From here, we can modify the shader as we please.

$ cat /tmp/override/db5d68a6143611ad.comp
#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout(buffer_reference) buffer _7;
layout(buffer_reference, buffer_reference_align = 4, std430) buffer _7
{
   uint _m0[];
};

layout(push_constant, std430) uniform push_cb
{
   uvec2 va0;
} _13;

void main()
{
  uint _22 = uint(0) * 1u;
  uint _28 = atomicAdd(_7(_13.va0)._m0[_22 + (uint(0) >> 2u)], 1u);
}

It’s pretty obvious where it crashes here, but to demonstrate …

#version 460
#extension GL_EXT_buffer_reference : require
#extension GL_EXT_buffer_reference_uvec2 : require
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

#include "debug_channel.h"

layout(buffer_reference) buffer _7;
layout(buffer_reference, buffer_reference_align = 4, std430) buffer _7
{
  uint _m0[];
};

layout(push_constant, std430) uniform push_cb
{
  uvec2 va0; // Root descriptor raw address.
} _13;

void main()
{
  DEBUG_CHANNEL_INIT(gl_GlobalInvocationID);
  uint _22 = uint(0) * 1u;

  DEBUG_CHANNEL_MSG(1);
  DEBUG_CHANNEL_MSG(_13.va0.x, _13.va0.y);
  uint _28 = atomicAdd(_7(_13.va0)._m0[_22 + (uint(0) >> 2u)], 1u);
  DEBUG_CHANNEL_MSG(2);
}

cd /tmp/override
make M=$(pwd) -C ~/git/vkd3d-proton/include/shader-debug
make: Entering directory '/home/maister/git/vkd3d-proton/include/shader-debug'
glslc -o /tmp/override/db5d68a6143611ad.spv /tmp/override/db5d68a6143611ad.comp -I/home/maister/git/vkd3d-proton/include/shader-debug --target-env=vulkan1.1 --target-spv=spv1.4 
make: Leaving directory '/home/maister/git/vkd3d-proton/include/shader-debug'

Now we can run again with:

VKD3D_SHADER_OVERRIDE=/tmp/override VKD3D_SHADER_DEBUG_RING_SIZE_LOG2=30 VKD3D_CONFIG=breadcrumbs %command%
25af:info:vkd3d_shader_debug_ring_print_message: Shader: db5d68a6143611ad: Instance 0000000000, ID (0, 0, 0): 1
25af:info:vkd3d_shader_debug_ring_print_message: Shader: db5d68a6143611ad: Instance 0000000000, ID (0, 0, 0): #50400000, #ffff8001

As expected, the shader did not reach 2, because it crashed. The address also correlates with dmesg:

[ 2551.614976] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:56 vmid:1 pasid:32771, for process d3d12 pid 9633 thread d3d12 pid 9633)
[ 2551.614985] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800150400000 from client 0x1b (UTCL2)
[ 2551.614988] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x001C1070
[ 2551.614990] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 2551.614992] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 2551.614994] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 2551.614996] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x7
[ 2551.614998] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 2551.614999] amdgpu 0000:0b:00.0: amdgpu: RW: 0x1
[ 2611.649035] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.2.0 timeout, signaled seq=35, emitted seq=36

As you can imagine, this is not a fun process when debugging games with 3 ksloc+ long shaders with tons of resource access. To aid this process, we really need UMR …

To make debug print work in crash scenarios, we need to use the same trick as buffer markers, i.e., make the print ring buffer device coherent and uncached.

The holy grail – UMR wave dumps

If all else fails, we have a trump card. This tool is unique to AMD + Linux as far as I know, and it lets us inspect all waves which were executing on the GPU at the time of the crash. It’s developed by AMD on freedesktop; alternatively, just install umr-git from the AUR if you’re on Arch Linux. Among other things, this gives us:

  • ISA dumps
  • Corresponding SPIR-V dump
  • SGPR / VGPR register dumps
  • Which instruction was being executed in every wave
  • GPU disassembly around the crash site

Now this is the real deal. RADV can invoke UMR on crashes and dump out a bunch of useful information. The UMR tool is standalone and should work with AMDVLK or amdgpu-pro as well. Nothing stops you from invoking the same CLI commands that RADV does while the device is hung.

Some preparation

UMR needs to do pretty deep kernel poking, so permissions will be required. First, we have to expose the debug interfaces. Run this as super user after every boot:

#!/bin/bash

chmod 777 /sys/kernel/debug
chmod -R 777 /sys/kernel/debug/dri

If you’re on a multi-GPU system (a good idea if you’re debugging hangs all day every day), it’s possible that the AMD GPU won’t be DRI instance 0.

ls /sys/kernel/debug/dri/*/amdgpu_gfxoff
/sys/kernel/debug/dri/1/amdgpu_gfxoff

If the instance is not 0, RADV currently does not forward the proper arguments to UMR, so check out this Mesa MR for now and add

RADV_UMR_EXTRA_ARGS="--instance 1"

Hopefully a more automatic solution will present itself soon.

A note about Steam Linux Runtime

If trying to debug games from within Steam, Proton or native, the pressure-vessel container will sandbox away /usr/bin/umr, so you’ll need to bypass it somehow. Details are left elsewhere.

Graphics queue only

Currently, RADV only knows how to dump the GFX ring, so we need to ensure only that queue is used if crashes happen in async compute. In vkd3d-proton, we have VKD3D_CONFIG=single_queue for that purpose.

Bracing for impact

RADV_DEBUG=hang VKD3D_CONFIG=single_queue %command%

In RADV, this does a few things:

  • syncshaders is implied
  • After every queue submission, RADV waits for idle. If it times out, it is considered a hang
  • Dump a bunch of internal state, disassemblies, etc
  • Invoke UMR to provide wave dumps

It’s also possible to add this debug option to make page faults a little nicer to debug, but usually not needed:

ACO_DEBUG=force-waitcnt

Note that while in hang debug mode, games will usually run at less than half performance due to the aggressive synchronization.

****************************************************************
* WARNING: RADV_DEBUG=hang is costly and should only be used for debugging! *
****************************************************************
radv: GPU hang detected...
radv: GPU hang report will be saved to '/home/maister/radv_dumps_22769_2023.08.20_12.09.28'!
dmesg: read kernel buffer failed: Operation not permitted
radv: GPU hang report saved successfully!

RADV will try to dump dmesg too, but you probably won’t have permissions; it’s not a big deal.

There’s a lot of useful information here, time to bring out your text editor.

-rw-r--r-- 1 maister maister 780 Aug 20 12:09 1919677f221dcd2d53a149d71db43d179e35dac3.spv
-rw-r--r-- 1 maister maister 152 Aug 20 12:09 app_info.log
-rw-r--r-- 1 maister maister 5429 Aug 20 12:09 bo_history.log
-rw-r--r-- 1 maister maister 56 Aug 20 12:09 bo_ranges.log
-rw-r--r-- 1 maister maister 27 Aug 20 12:09 dmesg.log
-rw-r--r-- 1 maister maister 4927 Aug 20 12:09 gpu_info.log
-rw-r--r-- 1 maister maister 2403 Aug 20 12:09 pipeline.log
-rw-r--r-- 1 maister maister 11231 Aug 20 12:09 registers.log
-rw-r--r-- 1 maister maister 106478 Aug 20 12:09 trace.log
-rw-r--r-- 1 maister maister 826454 Aug 20 12:09 umr_ring.log
-rw-r--r-- 1 maister maister 11388 Aug 20 12:09 umr_waves.log

First, the SPIR-V for the faulting PSO is dumped. In vkd3d-proton, we emit the DXBC/DXIL hash inside the SPIR-V so we can correlate back to breadcrumbs or shader dumps.

 %29 = OpString "db5d68a6143611ad.dxbc"

pipeline.log

This is a straight-up dump of the shader as it goes through NIR -> ACO -> final AMD ISA.

DISASM:
BB0:
v_mov_b32_e32 v0, 1 ; 7e000281
v_mov_b32_e32 v1, 0 ; 7e020280
global_atomic_add v1, v0, s[2:3] ; dcc88000 00020001
s_endpgm ; bf810000

bo_history.log

This logs all allocations and frees. It’s intended to be parsed with e.g.

python ~/git/mesa/src/amd/vulkan/radv_check_va.py bo_history.log 0x0000800110400000

This is mostly useful to prove application use-after-free or similar.

umr_waves.log

This is the most important dump of all. An entry is made for every active wave. It’s very verbose, but the important parts I think are:

Main Registers:
pc_hi: 00008000 | pc_lo: 00071d10 | wave_inst_dw0: bf810000 | exec_hi: 00000000 | 
exec_lo: 00000001 | m0: f97b2781 | ib_dbg1: 01000000 |
SGPRS:
[ 0.. 3] = { f34d388c, a09e6289, 10400000, ffff8001 }
[ 4.. 7] = { 0000fff0, 00000000, 0000003b, 00000000 }
[ 8.. 11] = { b5cc397c, 650307c8, 138f6ec2, e0c2bfd0 }
[ 12.. 15] = { 3a19d354, 491b2107, 9b54b044, 40a40138 }
...
[ 104.. 107] = { 29bea6e2, ba67090a, 577b6776, 2b42adc6 }
VGPRS: t00 (t01) (t02) (t03) (t04) (t05) (t06) (t07) (t08) (t09) (t10) (t11) (t12) (t13) (t14) (t15) (t16) (t17) (t18) (t19) (t20) (t21) (t22) (t23) (t24) (t25) (t26) (t27) (t28) (t29) (t30) (t31) 
[ 0] = { 00000001 2a50a935 abb6dbce 7ff6fd9b 44024271 3182e80b afb57f2e 156403ff fc122e1a fe904906 13d03a6d 03dc3aa9 834eac40 044c0cf3 467e8798 15aec7d3 a4fd12c5 d1018c88 e77eea3b ce77e044 8e0af2f4 308960e9 7551b5b5 c97c7bee 3b030401 e81d4060 0b52c8cc d6377322 3bc1ac00 0901eb09 b4256737 b1fe11f2 }
[ 1] = { 00000000 09434004 46793b03 67fc758a 0000cd69 a1080140 bdd45dd9 252155aa 082c15d0 00475606 76027c6d 54ba72a2 92741152 06345667 bb634172 bd9c7f6a 8a701309 b3340805 aa2772f1 2f05cdee 7b478c94 80020493 93da61e0 f9052bce 6414435f 80971172 48c70d29 8033f3d8 928489a9 25521422 52dd60ac 4ee699be }
...
 [ 15] = { 6c84febd fff7d1f8 0036b433 922dddef d3d7197e ddfffb89 62247e7e 61af41c7 4ff21fb3 6d1d41b4 880a4918 8cde3ed6 3c72df4f ff76fef5 abe16b69 000b7ac1 d4b325ae 37fedc5d ed3f9301 870dfa3b abfa3757 d30dede1 42016b8e 785bb894 d9e3ba1c 9b7bbdad 6220918b 35729313 bcfffbdc b84f64ed 20d775d6 52135245 }
PGM_MEM:

pgm[7@0x800000071cf0 + 0x0 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x4 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x8 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0xc ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x10 ] = 0x7e000281 v_mov_b32_e32 v0, 1 
pgm[7@0x800000071cf0 + 0x14 ] = 0x7e020280 v_mov_b32_e32 v1, 0 
pgm[7@0x800000071cf0 + 0x18 ] = 0xdcc88000 global_atomic_add v1, v0, s[2:3] 
pgm[7@0x800000071cf0 + 0x1c ] = 0x00020001 ;; 
* pgm[7@0x800000071cf0 + 0x20 ] = 0xbf810000 s_endpgm 
pgm[7@0x800000071cf0 + 0x24 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x28 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x2c ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x30 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x34 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x38 ] = 0xbf9f0000 s_code_end 
pgm[7@0x800000071cf0 + 0x3c ] = 0xbf9f0000 s_code_end 
End of disassembly.
TRAPSTS[50000100]:
excp: 256 | illegal_inst: 0 | buffer_oob: 0 | excp_cycle: 0 | 
excp_wave64hi: 0 | dp_rate: 2 | excp_group_mask: 0 | utc_error: 1 |

Here we can see that the faulting instruction is the global_atomic_add. It’s using the address in SGPRs 2 and 3, which we can see is … 10400000, ffff8001; assembled from those two dwords, that’s 0xffff8001’10400000. Only the lower 48 bits are relevant, and if we look at the page fault, the address matches.
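
As a tiny worked example of that arithmetic (the SGPR values are copied from the wave dump above; the variables are just for illustration):

/* Reassemble the 64-bit VA from the SGPR pair and mask to the 48 bits
 * the GPU actually translates. */
uint32_t sgpr2 = 0x10400000u; /* low dword  */
uint32_t sgpr3 = 0xffff8001u; /* high dword */

uint64_t va = ((uint64_t)sgpr3 << 32) | sgpr2;    /* 0xffff800110400000 */
uint64_t translated = va & ((1ull << 48) - 1);    /* 0x0000800110400000 */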

If the fault happened in a descriptor-based instruction, we can inspect the descriptor as well, since on AMD, descriptors are consumed in the instruction itself. It’s really convenient in situations like these. 🙂

Correlating fault site back to HLSL or GLSL needs to be done manually, but it is not particularly difficult.

trace.log

This can be used to inspect the PM4 stream. I rarely find it actionable, but it’s nice to have regardless. It adds breadcrumbs, but at the lowest level possible. The most useful thing (for me at least) is that we can inspect the user registers being set.

 Trace point ID: 2
!!!!! This is the last trace point that was reached by the CP !!!!!
... blablabla
10400000 COMPUTE_USER_DATA_2 <- 0x10400000
ffff8001 COMPUTE_USER_DATA_3 <- 0xffff8001
... blablabla
!!!!! This trace point was NOT reached by the CP !!!!!

Real-world debug scenario – Street Fighter 6 GPU hangs

After the last Rashid update (of course this dropped while I was on vacation, hnnnnng), users were observing random GPU hangs during online play, which is the worst kind of GPU hang. After some online play in custom rooms with people on Discord ready to join the breadcrumb crusade, I was able to reproduce the hang at a ~10% rate. What followed was a wild goose chase.

Identify a faulting shader

The breadcrumb trace always pointed to the same shader, which is a good start. We observed a page fault, so it could have been anything; use-after-free or OOB access is the usual cause here. The address did not correspond to any resource, however, which is always a fun start.

Shader replacement?

Replacing shaders seemed to mask the bug, which was rather puzzling. Usually this points to a Vulkan driver bug, but that lead got us nowhere. When dealing with low repro rate random GPU hangs, this is always the worst, since we don’t know if we were just very unlucky with repros, or if the change actually made a difference …

UMR to the rescue

(I didn’t have the old crash dumps lying around, so please excuse the lack of ISA spam. I didn’t feel like spending my Sunday reproducing a bug that I already fixed :v)

Sometimes RADV_DEBUG=hang masks bugs as well due to extra sync, but fortunately we got a wave dump eventually. The failure was in a scalar load from a raw pointer. Normally, this means an out-of-bounds root CBV descriptor access.

First hint was that this was loading 8 dwords in one go, i.e. an image descriptor. I correlated the ISA with the Vulkan GLSL disassembly and it pointed to this code:

vec4 _1132 = textureLod(nonuniformEXT(samplerCube(_32[_1129], _56[_146])), vec3(_1097, _1098, _1099), _1123 * 4.0);

It was also bindless. Normally, my spider sense would immediately suspect an out-of-bounds descriptor heap access.

The descriptor index was computed as root table offset + dynamic offset. Studying the ISA I realized that it was not actually the dynamic offset that was the culprit, but rather the root table offset. Figuring this out would have taken an eternity without SGPR dumps.

From the PM4 trace, I was then able to confirm that the SGPR root table offset correlated with vkCmdPushConstants on our end.

This was rather puzzling … I had a theory that maybe our root parameter flushing code had a bug, so I added extra instrumentation to our breadcrumbs …

Another crash later, I was able to prove that on a GPU fault:

  • Root table #10 was never set in the command list
  • The shader in question accessed a descriptor array which maps to root table #10
  • The root table #10 access therefore read an undefined offset

Game bug, oops! When I wrote some tests to study this UB scenario, it turned out that it does not trigger an error in the D3D12 validation layers either (yaaaay <_<). It’s possible to trigger GPU hangs on the native AMD D3D12 driver this way. Maybe they app-opt it on their end for RE Engine; we’ll never know 🙂

Our workaround was to emit offset 0 for every unset root table access and the crash went away. …

Conclusion

For hardcore GPU debugging on PC, I think RADV currently provides the best experience by far. If you’re stuck debugging hangs in D3D12 on Windows, maybe give RADV + vkd3d-proton a shot. GPU hang recovery on amdgpu is sadly still questionable on Linux, but I have a good time as long as the AMD GPU is not driving the desktop session. I suggest multi-GPU here for an enjoyable experience. I’m also hoping this functionality can be added to the newly released RGD by AMD.