Emulating a fake retro GPU in Vulkan compute

RetroWarp – a fake retro GPU

Lately, I’ve been fiddling with a side project which ended up being quite interesting.

The goal of this side project was to prototype out a system which implements software rasterization in compute shaders using modern GPU features like Vulkan 1.1 subgroups and async compute to improve performance. Then, I wanted to apply this to emulation of retro GPUs, in particular, a more low-level approach.

I believe compute shader rasterization has some key advantages in the domain of low-level emulation. Chasing full accuracy means not being able to make use of the key fixed function aspects of the graphics pipeline on Vulkan GPUs, and most of the reasons to use fragment shaders go away. With compute, there is no fixed function baggage to grapple with, but it does mean a lot of the things we take for granted must be implemented in software.

I didn’t aim to dive straight into a concrete retro GPU with this prototype, but rather I designed a straightforward rasterizer which supports the basic features found in certain old GPUs. My approach here is that this could be used as a starting point when going further and emulating a real legacy chip.

The repository is available on GitHub: https://github.com/Themaister/RetroWarp

The high level system

Rather than reiterate everything presented in the slide deck, I will link to it directly instead. However, it’s useful to discuss the system at a very high level.

Presentation slides

I presented this work at the Khronos Munich meetup in October 2019. You can find the presentation slides here.

Implementing Low Level GPU – Hans-Kristian – Munich 2019

Tile-based

Going tile-based is practically necessary for any compute shader implementation. I implemented 8×8 and 16×16 tile modes. Smaller tiles are more suitable for lower resolution like 320×240 and 640×480, but 16×16 was useful for 720p and up.

If you are familiar with tile deferred shading and friends, you know where I’m going with this.

Coarse-then-fine binning

To be tile-based, we need to assign primitives to tiles. This is quite an intensive process when the tile size is small, as the cost scales with the number of tiles (i.e. resolution) times the number of primitives. To optimize, I bin at a low resolution (e.g. 64×64 tiles) and then refine the binning at full tile resolution.
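To make the two-pass idea concrete, here is a minimal CPU-side sketch in C. It is not taken from RetroWarp: the screen size, the `BBox` type, the limit of 32 primitives and the use of bounding boxes instead of a real triangle/tile overlap test are all assumptions made to keep the example short. The function returns how many fine-level tests were needed, so the saving over brute force is visible.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_W 128
#define SCREEN_H 128
#define COARSE 64 /* coarse tile size in pixels */
#define FINE 16   /* fine tile size in pixels */
#define COARSE_X (SCREEN_W / COARSE)
#define COARSE_Y (SCREEN_H / COARSE)
#define FINE_X (SCREEN_W / FINE)
#define FINE_Y (SCREEN_H / FINE)

/* Inclusive pixel bounding box of a primitive. This is a conservative,
 * hypothetical stand-in for a real triangle/tile overlap test. */
typedef struct { int x0, y0, x1, y1; } BBox;

static int overlaps(BBox b, int tx, int ty, int size)
{
    int px0 = tx * size, py0 = ty * size;
    return b.x0 < px0 + size && b.x1 >= px0 &&
           b.y0 < py0 + size && b.y1 >= py0;
}

/* Bins up to 32 primitives. fine_masks[ty][tx] gets one bit per primitive.
 * Returns the number of fine-level overlap tests performed. */
static unsigned bin_coarse_then_fine(const BBox *prims, unsigned count,
                                     uint32_t fine_masks[FINE_Y][FINE_X])
{
    uint32_t coarse_masks[COARSE_Y][COARSE_X];
    unsigned fine_tests = 0;

    assert(count <= 32);
    memset(coarse_masks, 0, sizeof(coarse_masks));

    /* Pass 1: cheap binning at low resolution. */
    for (int cy = 0; cy < COARSE_Y; cy++)
        for (int cx = 0; cx < COARSE_X; cx++)
            for (unsigned i = 0; i < count; i++)
                if (overlaps(prims[i], cx, cy, COARSE))
                    coarse_masks[cy][cx] |= 1u << i;

    /* Pass 2: refine, testing only primitives that survived pass 1. */
    for (int ty = 0; ty < FINE_Y; ty++) {
        for (int tx = 0; tx < FINE_X; tx++) {
            uint32_t candidates =
                coarse_masks[ty * FINE / COARSE][tx * FINE / COARSE];
            fine_masks[ty][tx] = 0;
            for (unsigned i = 0; i < count; i++) {
                if (!(candidates & (1u << i)))
                    continue;
                fine_tests++;
                if (overlaps(prims[i], tx, ty, FINE))
                    fine_masks[ty][tx] |= 1u << i;
            }
        }
    }
    return fine_tests;
}
```

With two small primitives in opposite corners of this 128×128 screen, the fine pass only tests each primitive against the 16 fine tiles under its coarse tile, rather than against all 64 fine tiles.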

Bitmap instead of primitive list

A common way to bin is to build an array of primitives which affect a tile, and then the renderer can just loop through that array of indices on a per-tile basis. This is problematic in the worst case where a lot of primitives end up filling the entire screen: there simply might not be memory available to store all these lists. We cannot allocate memory arbitrarily on the GPU, and we really want to do tile binning on the GPU and not the CPU.

Instead, each tile gets a fixed array of u32 bitmasks, where 1 bit is used per primitive. Bit-scan loops are used instead. To speed up the process where there are many gaps in the bitmap (there certainly are), there is a hierarchy, where the first level of the hierarchy sets a bit if any primitive in a group of 32 primitives is binned. If we find a bit set here, we go down the hierarchy and loop some more. A more concrete example here is:

  • Maximum of 16k primitives (arbitrary limit we choose)
  • Bitmap is u32[16k / 32] to contain all state
  • Coarse bitmap is u32[16k / (32 * 32)]

If more than 16k primitives are used, we can just split this into multiple render passes. These old GPUs don’t exactly support indirect rendering, so that’s not really a problem.
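A minimal C sketch of such a two-level bitmap and its bit-scan loop could look like the following. The type and function names are hypothetical, and `lowest_bit` is a portable stand-in for what would be `findLSB()` in a shader; the actual RetroWarp shaders may be organized differently.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PRIMITIVES (16 * 1024)
#define FINE_WORDS (MAX_PRIMITIVES / 32)          /* 512 words */
#define COARSE_WORDS (MAX_PRIMITIVES / (32 * 32)) /* 16 words */

typedef struct {
    uint32_t coarse[COARSE_WORDS]; /* bit w set => fine word w is non-zero */
    uint32_t fine[FINE_WORDS];     /* bit p set => primitive p is binned */
} TileBitmap;

/* Index of the lowest set bit. */
static unsigned lowest_bit(uint32_t v)
{
    unsigned i = 0;
    assert(v != 0);
    while (!(v & 1u)) {
        v >>= 1;
        i++;
    }
    return i;
}

/* Marks a primitive in both levels of the hierarchy. */
static void bin_primitive(TileBitmap *tile, unsigned prim)
{
    assert(prim < MAX_PRIMITIVES);
    unsigned word = prim / 32;
    tile->fine[word] |= 1u << (prim % 32);
    tile->coarse[word / 32] |= 1u << (word % 32);
}

/* Visits binned primitives in ascending order and stores at most `cap`
 * indices in `out`. Returns the total number of binned primitives. */
static unsigned collect_primitives(const TileBitmap *tile,
                                   unsigned *out, unsigned cap)
{
    unsigned n = 0;
    for (unsigned c = 0; c < COARSE_WORDS; c++) {
        uint32_t coarse = tile->coarse[c];
        while (coarse) {
            unsigned word = c * 32 + lowest_bit(coarse);
            coarse &= coarse - 1; /* clear the lowest set bit */
            uint32_t fine = tile->fine[word];
            while (fine) {
                unsigned prim = word * 32 + lowest_bit(fine);
                fine &= fine - 1;
                if (n < cap)
                    out[n] = prim;
                n++;
            }
        }
    }
    return n;
}
```

The outer loop only touches 16 coarse words per tile, and empty groups of 32 primitives are skipped without ever reading their fine word.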

Ubershader vs. split shader architecture

After binning, we could simply implement an ubershader from doom where we deal with any possible render state the GPU supports in one 5000+ line monster. This is very problematic for performance, particularly with register pressure/occupancy on the shader cores.

One of my key deviations from the norm here was to implement a split shader architecture. Rather than rely on ubershaders, it is possible for the depth/blending stage to consume pre-shaded tiles which contain the color/depth/coverage information necessary to run these stages.

To create color/depth/coverage information, we can generate indirect dispatches and use specialization constants to carve out the code paths we need to run instead. This keeps register pressure down. The key downside of this approach is that we need to allocate memory and bandwidth for the pre-shaded data.

Async compute + graphics queue compute

Depth/blending is the only stage which needs to happen in-order. We can happily do binning and shading and feed the results to the final depth/blending stage. I run everything except for depth/blending in the async compute queue, and depth/blending can run in the graphics queue.

Performance uplifts

See presentation slides for more detailed results.

Subgroup optimizations gave a solid ~20% uplift on AMD/NV/Intel. Async compute gave a further 10-20% uplift on AMD/NV. Overall, I’m quite happy with this.

The weird world of shader divergence and LOD

Mip-mapping is hard – importance of keeping your quads full

Sampling textures with mip-mapping is ancient, but it’s still hard apparently. Implicit LOD calculation is the first instance where we poke a hole into the “single threaded” abstraction of high level shading languages like GLSL and HLSL and dive into the maddening world of warps, waves, quads and everything in-between. For fragment shading, at least a group of 2×2 threads (a quad) needs to run side by side so we can have gradient information over the screen. On any modern GPU, these 2×2 threads are actually running in lock-step as we’ll see when looking at GPU ISA later …

The Vulkan/GL/GLES ecosystem has always specified that implicit LOD instructions must happen in dynamically uniform control flow. Dynamically uniform just means that either all threads have to execute a texture() instruction, or no threads do. This ensures that there are always 4 valid texture coordinates from which to compute derivatives. The easiest way to ensure that this guarantee holds is simply to never sample in control flow, but that’s not really practical in more interesting shaders.

If you’re sampling in control flow you better make sure you uphold the guarantees of the spec.

Having to be dynamically uniform over an entire draw call is a bit silly, so the Vulkan specification recently tightened the scope such that if you have subgroupSize >= 4, you only need to be dynamically uniform on a per-quad granularity. This makes sense. We only need correct derivatives in the quad, we shouldn’t have to care if some unrelated quad or even triangle is diverging.

An interesting case came up recently where apparently developers expect that you actually can sample with implicit LOD in diverging control flow. Apparently HLSL “defines” this control flow to be valid code.

vec2 uv = from_somewhere();
if (weight > 0.0)
    sum += weight * texture(Texture, uv);

The idea is that we shouldn’t have to sample the texture unless we’re going to use it, but it’s still nice to provide UV for LOD purposes. Unfortunately, there is no obvious way to express this optimization in high level languages. UV is well defined in the outer scope which is dynamically uniform, so that’s something … Intuitively, this code makes sense, but it gets really murky once we dig deeper.

With subgroup ops, we can probably get a good approximation on the HLL side.

bool quadAny(bool value)
{
    // Perhaps this can be translated into s_wqm on AMD
    // if compiler checks this pattern?
    return subgroupClusteredOr(int(value), 4) != 0;
}

vec2 uv = from_somewhere();
// Hoist texture sampling out of branch and force quad uniformity.
vec4 tex;
if (quadAny(weight > 0.0))
    tex = texture(Texture, uv);
if (weight > 0.0)
    sum += weight * tex;

Querying gradients and then sampling with that in the branch is fine as well, but it is slow, and not really a fix, at best a workaround.

HLSL seems a bit murky about whether this kind of code is legal; it’s all “that one app did this thing that one time and now we’re screwed”. From my understanding, compilers can do some heroics here to work around this in applications.

I wanted to try this kind of code on all Vulkan devices I have available to see what happens. We’re in undefined territory as far as LOD goes, so anything can happen. There are three outcomes I’m looking for which seem like plausible HW behavior:

  • It just happens to work. This is kinda scary, since it’ll probably break in 5 years anyways.
  • The LOD computed is garbage.
  • The LOD is forced to some value on divergence.

Here’s the concrete shader I’m using, from https://github.com/Themaister/Granite/blob/master/tests/assets/shaders/divergent_lod.frag. A test to run the shader is https://github.com/Themaister/Granite/blob/master/tests/divergent_lod_test.cpp

#version 450

layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;
layout(set = 0, binding = 1) uniform sampler2D uSampler;
layout(set = 0, binding = 0, std140) uniform Weights
{
    vec4 weights[4];
};

void main()
{
    vec3 tex = vec3(0.0);
    float lod = -10.0;
    vec2 uv = vUV;
    if (weights[int(gl_FragCoord.x) + 2 * int(gl_FragCoord.y)].x > 0.0)
    {
        tex = texture(uSampler, uv).rgb;
        lod = textureQueryLod(uSampler, uv).y;
    }

    FragColor = vec4(tex, lod);
}

I render this on a 2×2 frame buffer with a full-screen “expanded triangle” to not get any helper lane shenanigans. Let’s try to run this across a wide range of hardware and see what happens. NOTE: any result here is equally valid in Vulkan, this is intentionally going out of spec.

AMD

I tested this on a Navi card. RDNA ISA seems similar enough to GCN … We effectively have 4 driver stacks for AMD cards now to test.

RADV (LLVM 10)

Garbage LOD

main:
BB16_0:
	s_mov_b64 s[0:1], exec    
	s_wqm_b64 exec, exec
	v_cvt_i32_f32_e32 v3, v3
	s_mov_b32 s6, s3
	s_movk_i32 s7, 0x8000          
	v_cvt_i32_f32_e32 v2, v2     
	v_mov_b32_e32 v5, 0xc1200000    
	s_load_dwordx4 s[8:11], s[6:7], 0x0    
	v_lshlrev_b32_e32 v3, 1, v3  
	v_add_lshl_u32 v2, v3, v2, 4   
	s_waitcnt lgkmcnt(0)    
	buffer_load_dword v4, v2, s[8:11], 0 offen    
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, v2      
	s_waitcnt vmcnt(0)        
	v_cmp_lt_f32_e32 vcc, 0, v4
	v_mov_b32_e32 v4, v2     
	s_and_saveexec_b64 s[8:9], vcc
	s_cbranch_execz BB16_2
BB16_1:
	s_mov_b32 s3, s7      
	s_add_i32 s6, s2, 0x50           
	s_mov_b32 m0, s4           
	s_load_dwordx8 s[12:19], s[2:3], 0x0       
	s_load_dwordx4 s[20:23], s[6:7], 0x0          
	v_interp_p1_f32_e32 v7, v0, attr0.x
	v_interp_p1_f32_e32 v8, v0, attr0.y      
	v_interp_p2_f32_e32 v7, v1, attr0.x          
	v_interp_p2_f32_e32 v8, v1, attr0.y   
	s_waitcnt lgkmcnt(0)     
	image_sample v[2:4], v[7:8], s[12:19], s[20:23] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v5, v[7:8], s[12:19], s[20:23] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB16_2:
	v_nop                 
	s_or_b64 exec, exec, s[8:9]              
	s_and_b64 exec, exec, s[0:1]         
	s_waitcnt vmcnt(0)                
	exp mrt0 v2, v3, v4, v5 done vm          
	s_endpgm               

We see that v7 and v8 hold the UV coordinates, but they are actually only computed inside the branch (v_interp). The optimizer is allowed to place UV computation inside the branch here. If there is divergence in a quad, the disabled lanes won’t get correct values for v7 and v8 (since execution is masked), and LOD becomes garbage.

RADV (ACO)

Coupled with Navi cards, this is probably the most bleeding edge setup you can run. It’s a completely new compiler backend for AMD cards, not based on LLVM.

Just happens to work

BB0:
	s_wqm_b64 exec, exec 
	s_mov_b32 s0, s3    
	s_movk_i32 s1, 0x8000  
	s_load_dwordx4 s[8:11], s[0:1], 0x0 
	s_mov_b32 m0, s4 
	v_interp_p1_f32_e32 v4, v0, attr0.y 
	v_cvt_i32_f32_e32 v2, v2   
	v_cvt_i32_f32_e32 v3, v3   
	v_lshl_add_u32 v2, v3, 1, v2 
	v_lshlrev_b32_e32 v2, 4, v2 
	s_waitcnt lgkmcnt(0)   
	buffer_load_dword v2, v2, s[8:11], 0 offen 
	v_interp_p2_f32_e32 v4, v1, attr0.y 
	v_interp_p1_f32_e32 v0, v0, attr0.x  
	v_interp_p2_f32_e32 v0, v1, attr0.x 
	v_mov_b32_e32 v1, v4  
	s_waitcnt vmcnt(0)   
	v_cmp_lt_f32_e32 vcc, 0, v2  
	s_and_saveexec_b64 s[0:1], vcc     
	s_cbranch_execz BB3  
BB1:
	s_movk_i32 s3, 0x8000 
	s_load_dwordx8 s[4:11], s[2:3], 0x0  
	s_load_dwordx4 s[12:15], s[2:3], 0x50 
	s_waitcnt lgkmcnt(0)   
	image_sample v[2:4], v[0:1], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v0, v[0:1], s[4:11], s[12:15] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB3:
	s_andn2_b64 exec, s[0:1], exec 
	s_cbranch_execz BB6  
BB4:
	v_mov_b32_e32 v0, 0xc1200000 
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, 0   
	v_mov_b32_e32 v4, 0  
BB6:
	s_mov_b64 exec, s[0:1] 
	s_waitcnt vmcnt(0)   
	exp mrt0 v2, v3, v4, v0 done vm 
	s_endpgm

This time, UV is interpolated outside the branch, so sampling in divergent control flow ends up working after all. The registers are well defined as they enter the branch. For AMD, it seems like it just comes down to whether or not the lanes have correct values placed in them and not having them be clobbered by the time we get around to sampling. There don’t seem to be any hardware-level checks for divergence.

AMDVLK

Garbage LOD

AMDVLK uses the same LLVM stack that RADV LLVM uses, so it is no surprise that we get the same result, with basically the exact same ISA generated.

Windows

Also just happens to work

I guess it’s the exact same case as the ACO compiler here. No need to paste disassembly.

Intel

Tested on UHD 620 (8th gen mobile CPU I think).

Anvil (Mesa)

The Mesa compiler can spit out assembly, which is nice.

Just happens to work

ISA (a little too wide to embed): https://gist.github.com/Themaister/7c5b011cde3c7585459b089f80f897e2

From what I can make out of the ISA, the UV is interpolated outside control flow, and then only the sampling takes place in control flow. It seems like Intel has similar behavior to AMD here, in that just as long as the registers are valid, divergent sampling “works”.

Windows

Just happens to work

Doesn’t seem to be a way to get ISA from Windows driver, but I suppose it’s same as ANV here.

Nvidia

Tested on a Turing GPU on Linux. Didn’t bother testing on Windows as well considering the driver stack is basically the same.

LOD is clamped to 0, textureQueryLod returns -32.0.

Apparently, now we start seeing interesting behavior. Unfortunately, there is no public ISA to look at. The -32.0 LOD might look weird, but this is kind of expected. This is apparently the smallest possible representable LOD on this GPU. LOD is usually represented in some kind of fixed point, log2(0) = -inf after all.

I confirmed it worked as expected when using non-divergent execution as a sanity check.

Arm

Tested on Mali-G72.

LOD is clamped to 0, textureQueryLod returns -128.0.

Very similar behavior to Nvidia here, except the LOD is -128.0 rather than -32.0. I confirmed it worked as expected when using non-divergent execution as a sanity check.

QCOM

Tested on Adreno 506.

Garbage LOD

Again, no ISA to look at. I confirmed it worked as expected when using non-divergent execution as a sanity check.

Conclusion

Never ever rely on LOD behavior with divergent quads (EDIT: at least the way it’s specced out and implemented on Vulkan drivers right now). You’d be contributing to the pain and suffering of compiler engineers the world over. Staying quad-uniform is fine though.

Yet another blog explaining Vulkan synchronization

After playing Fire Emblem: Three Houses for an ungodly 160 hours over the past weeks, I guess it’s time to put on my professor hat on the internet instead.

One topic I’ve been meaning to write about for a long time is synchronization in Vulkan. It’s a large hurdle to overcome when learning the API, and rather than mechanically explaining how it works, my goal here is to instill a mental model in the reader. Despite its reputation for maddening complexity, it is actually understandable and quite logical once you get over the initial hurdles.

Where appropriate, I will use terms which match the Vulkan specification.

The Vulkan queue

For this part of the discussion we will only consider a single VkQueue. There is a lot to consider for single-queue synchronization, and dealing with multiple queues is a small extension on top of single-queue synchronization, which is covered at the end when discussing semaphores.

The Vulkan queue is simply an abstraction where command buffers are submitted and the GPU churns through commands. Let’s get some common beginner mistakes out of the way first.

Command buffer misconceptions

Many developers seem to think that command buffer boundaries are somehow special in Vulkan. It is very important to clarify that for purposes of synchronization, everything submitted to a queue is simply a linear stream of commands. Any synchronization applies globally to a VkQueue; there is no concept of an only-inside-this-command-buffer synchronization.

Command overlap

The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits. This makes sense, considering that Vulkan only sees a linear stream of commands once you submit. It is a pitfall to assume that splitting command buffers or submits adds some magic synchronization for you.

NOTE: Unlike Vulkan, I do believe D3D12 disables any overlap across queue submits, but don’t quote me on that. Might be something to consider if you’re coming from D3D-land.

NOTE: Frame buffer operations inside a render pass happen in API-order, of course. This is a special exception which the spec calls out.

Pipeline stages

Every command you submit to Vulkan goes through a set of stages. These stages are represented in the VK_PIPELINE_STAGE enum. See chapter 6.1.2 in spec. When we synchronize work in Vulkan, we synchronize work happening in these pipeline stages as a whole, and not individual commands of work.

Draw calls, copy commands and compute dispatches all go through pipeline stages one by one.

The mysterious TOP_OF_PIPE and BOTTOM_OF_PIPE stages

A common stumbling block is the TOP_OF_PIPE and BOTTOM_OF_PIPE stages. These are essentially “helper” stages, which do no actual work, but serve some important purposes. Every command will first execute the TOP_OF_PIPE stage. This is basically the command processor on the GPU parsing the command. BOTTOM_OF_PIPE is where commands retire after all work has been done. TOP_OF_PIPE and BOTTOM_OF_PIPE are useful in specific scenarios. Keep them in mind for later, as they are a little tricky and beginners make many mistakes with these.

In-queue execution barriers

Before we tackle memory barriers, we must fully understand execution barriers, as they are a subset of memory barriers. The primary mechanism in Vulkan to introduce execution barriers is the pipeline barrier.

To introduce the simplest form of an execution dependency we use a pipeline barrier:

void vkCmdPipelineBarrier(
    VkCommandBuffer                             commandBuffer,
    VkPipelineStageFlags                        srcStageMask,
    VkPipelineStageFlags                        dstStageMask,
    VkDependencyFlags                           dependencyFlags,
    uint32_t                                    memoryBarrierCount,
    const VkMemoryBarrier*                      pMemoryBarriers,
    uint32_t                                    bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier*                pBufferMemoryBarriers,
    uint32_t                                    imageMemoryBarrierCount,
    const VkImageMemoryBarrier*                 pImageMemoryBarriers);

If we ignore the memory barriers and flags here, we’re essentially left with two arguments, srcStageMask and dstStageMask. This represents the heart of the Vulkan synchronization model. We’re splitting the command stream in two with a barrier, where we consider “everything before” the barrier, and “everything after” the barrier, and these two halves are synchronized in some way.

Section 6.1 lays this out in rather obtuse language, but we boil it down to:

srcStageMask

This represents what we are waiting for. Vulkan does not let you add fine-grained dependencies between individual commands. Instead you get to look at all work which happens in certain pipeline stages. For example, if we were to submit this series of commands starting off a fresh VkDevice:

  • vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
  • vkCmdCopyBuffer (VK_PIPELINE_STAGE_TRANSFER_BIT)
  • vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
  • vkCmdPipelineBarrier (srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)

We would be referring to the two vkCmdDispatch commands, as they perform their work in the COMPUTE stage. Even if we split these 4 commands into 4 different vkQueueSubmits, we would still consider the same commands for synchronization. Essentially, the work we are waiting for is all commands which have ever been submitted to the queue including any previous commands in the command buffer we’re recording. srcStageMask then restricts the scope of what we are waiting for. Only work happening in COMPUTE_SHADER_BIT stage is relevant in this example. srcStageMask is a bit-mask as the name suggests, so it’s perfectly fine to wait for both COMPUTE and TRANSFER work.
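As a rough illustration of how srcStageMask selects the “before” set, here is a small C sketch. The `STAGE_*` constants and the `Command` type are hypothetical stand-ins with arbitrary values, not the real Vulkan enums, and each command is simplified to a single mask of the stages it does work in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for pipeline stage bits; values are arbitrary. */
enum { STAGE_COMPUTE = 1 << 0, STAGE_TRANSFER = 1 << 1 };

typedef struct {
    const char *name;
    uint32_t stages; /* stages in which this command performs work */
} Command;

/* Returns a bitmask where bit i is set if command i, recorded before the
 * barrier, does work in any stage named by src_stage_mask. Command buffer
 * and submit boundaries play no role, so they are not modelled at all. */
static uint32_t before_set(const Command *cmds, unsigned count,
                           uint32_t src_stage_mask)
{
    uint32_t set = 0;
    assert(count <= 32);
    for (unsigned i = 0; i < count; i++)
        if (cmds[i].stages & src_stage_mask)
            set |= 1u << i;
    return set;
}
```

For the dispatch, copy, dispatch sequence above, a mask of `STAGE_COMPUTE` selects commands 0 and 2, while `STAGE_COMPUTE | STAGE_TRANSFER` selects all three.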

There are also flags to refer to “all commands”, ALL_COMMANDS_BIT, which basically drains the entire queue for work. ALL_GRAPHICS_BIT is the same, but only for render passes.

NOTE: Here we will find a potential use case for TOP_OF_PIPE. srcStageMask of TOP_OF_PIPE is basically saying “wait for nothing”, or to be more precise, we’re waiting for the GPU to parse all commands, which is a complete noop. We had to parse all commands before getting to the pipeline barrier command to begin with. When we get to memory barriers, this can be very useful.

dstStageMask

This represents the second half of the barrier. Any work submitted after this barrier will need to wait for the work represented by srcStageMask before it can execute. Only work in the specified stages is affected. For example, if dstStageMask is FRAGMENT_SHADER_BIT, vertex shading for future commands can begin executing early; we only need to wait once FRAGMENT_SHADER_BIT is reached.

NOTE: As an analog to srcStageMask with TOP_OF_PIPE, for dstStageMask, using BOTTOM_OF_PIPE can be kind of useful. This basically translates to “block the last stage of execution in the pipeline”. Basically, we translate this to mean “no work after this barrier is going to wait for us”. This might seem meaningless, but it will be useful when we discuss semaphores and memory barriers later.

A crude example

Let’s assume we record and submit some commands on a fresh VkDevice:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdDispatch
  4. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = COMPUTE)
  5. vkCmdDispatch
  6. vkCmdDispatch
  7. vkCmdDispatch

With this barrier, the “before” set is commands {1, 2, 3}. The “after” set is {5, 6, 7}. A possible execution order here could be:

  • #3
  • #2
  • #1
  • #7
  • #6
  • #5

{1, 2, 3} can execute out-of-order, and so can {5, 6, 7}, but these two sets of commands cannot interleave execution. In spec lingo {1, 2, 3} happens-before {5, 6, 7}.
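The rule can be expressed as a tiny checker in C. This is only a sketch under simplifying assumptions: every command is treated as one atomic unit of COMPUTE work, commands are identified by their position in the list above, and `order_respects_barrier` is a made-up helper, not anything from Vulkan.

```c
#include <assert.h>
#include <stdbool.h>

/* `order` lists command numbers in the order they execute. Commands
 * numbered below `barrier` form the "before" set, commands numbered above
 * it form the "after" set. The order is valid if no "after" command runs
 * ahead of a "before" command. */
static bool order_respects_barrier(const int *order, unsigned count,
                                   int barrier)
{
    bool seen_after = false;
    assert(order || count == 0);
    for (unsigned i = 0; i < count; i++) {
        if (order[i] > barrier)
            seen_after = true;
        else if (order[i] < barrier && seen_after)
            return false; /* a "before" command ran too late */
    }
    return true;
}
```

The order 3, 2, 1, 7, 6, 5 passes this check, whereas 1, 2, 5, 3, 6, 7 fails because command 5 runs before command 3 has executed.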

https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples has some examples of how these stages are used in practice.

Events aka. split barriers

Vulkan provides a way to get overlapping work in-between barriers. The idea of VkEvent is to get some unrelated commands in-between the “before” and “after” set of commands, e.g.:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdSetEvent(event, srcStageMask = COMPUTE)
  4. vkCmdDispatch
  5. vkCmdWaitEvents(event, dstStageMask = COMPUTE)
  6. vkCmdDispatch
  7. vkCmdDispatch

The “before” set is now {1, 2}, and the “after” set is {6, 7}. Command 4 is not affected by any synchronization, and it can fill in the parallelism “bubble” we get when draining the GPU of work from 1 and 2. For advanced compute, this is a very important thing to know about, but not all GPUs and drivers can take advantage of this feature.

Execution dependency chain

This is a subtle – but very important – point which I don’t think is well enough understood. The general gist of it is that when we use dstStageMask to block stages, the dependencies in srcStageMask are carried forward into the blocked stages. Waiting for dstStageMask later will also wait for any dependencies dstStageMask had. It is easier to show an example here:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
  4. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
  5. vkCmdDispatch
  6. vkCmdDispatch

In this example we actually get a dependency between {1, 2} and {5, 6}. This is because we created a chain of dependencies between COMPUTE -> TRANSFER -> COMPUTE. When we wait for TRANSFER in 4. we must also wait for anything which is currently blocking TRANSFER. This might seem confusing, but it makes sense if we consider a slightly modified example.

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
  4. vkCmdMagicDummyTransferOperation
  5. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
  6. vkCmdDispatch
  7. vkCmdDispatch

In this scenario, it’s clear that {4} must wait for {1, 2}. And {6, 7} must wait for {4}. So, we have created a chain where {1, 2} -> {4} -> {6, 7}, and as {4} is a noop, {1, 2} -> {6, 7} is achieved. That’s essentially the chain.
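One way to picture the chain is to track, per stage, which earlier commands any future work in that stage is blocked on. The C sketch below does exactly that. It is a toy model with hypothetical names, restricted to one source and one destination stage per barrier, and it is not how the spec formally defines dependency chains.

```c
#include <assert.h>
#include <stdint.h>

enum { STAGE_COMPUTE, STAGE_TRANSFER, STAGE_COUNT };

typedef struct {
    uint32_t submitted[STAGE_COUNT];  /* commands recorded so far, per stage */
    uint32_t blocked_on[STAGE_COUNT]; /* commands future work in a stage waits for */
    unsigned next_id;
} Queue;

/* Records a command doing work in `stage`. Returns a bitmask of the
 * earlier command ids it has to wait for. */
static uint32_t record_command(Queue *q, int stage)
{
    assert(q->next_id < 32);
    q->submitted[stage] |= 1u << q->next_id;
    q->next_id++;
    return q->blocked_on[stage];
}

/* Records a barrier. The destination stage inherits everything submitted
 * in the source stage, plus whatever the source stage was itself blocked
 * on. The second term is what forms the chain. */
static void record_barrier(Queue *q, int src_stage, int dst_stage)
{
    q->blocked_on[dst_stage] |=
        q->submitted[src_stage] | q->blocked_on[src_stage];
}
```

Replaying the first example (two dispatches, COMPUTE -> TRANSFER, TRANSFER -> COMPUTE, then another dispatch) makes the last dispatch wait on both earlier dispatches, even though no transfer command was ever recorded.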

This has some uses when you want to “link up” barriers for whatever reason. I kinda wish Vulkan had some special “scoreboard” pipeline stages just for this use case …

Pipeline stages and render passes

COMPUTE and TRANSFER work is very simple when it comes to pipeline stages. The only stages they execute are:

  • TOP_OF_PIPE
  • DRAW_INDIRECT (for indirect compute only)
  • COMPUTE / TRANSFER
  • BOTTOM_OF_PIPE

Render passes are a bit more intricate, and it’s very easy to confuse which pipeline stages do what.

In render passes there are two “families” of pipeline stages, those which concern themselves with geometry processing, and the fragment family, which does rasterization / frame buffer operations.

Aside from TOP_OF_PIPE/BOTTOM_OF_PIPE, we have

Geometry

  • DRAW_INDIRECT – Parses indirect buffers
  • VERTEX_INPUT – Consumes fixed function VBOs and IBOs
  • VERTEX_SHADER – Actual vertex shader
  • TESSELLATION_CONTROL_SHADER
  • TESSELLATION_EVALUATION_SHADER
  • GEOMETRY_SHADER

Fragment

  • EARLY_FRAGMENT_TESTS
  • FRAGMENT_SHADER
  • LATE_FRAGMENT_TESTS
  • COLOR_ATTACHMENT_OUTPUT

For the most part, it’s the Fragment stages which are a bit confusing. Each of them has its own use cases.

EARLY_FRAGMENT_TESTS

This is the stage where early depth/stencil tests happen. This stage isn’t all that useful or meaningful except in some very obscure scenarios with frame buffer self-dependencies (aka, GL_ARB_texture_barrier). This is also where a render pass performs its loadOp of a depth/stencil attachment.

LATE_FRAGMENT_TESTS

This is where late depth-stencil tests take place, and also where depth-stencil attachments are stored with storeOp when a render pass is done.

Helpful tip on fragment test stages

It’s somewhat confusing to have two stages which basically do the same thing. When you’re waiting for a depth map to have been rendered in an earlier render pass, you should use srcStageMask = LATE_FRAGMENT_TESTS_BIT, as that will wait for the storeOp to finish its work.

When blocking a render pass with dstStageMask, just use a mask of EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS.

NOTE: dstStageMask = EARLY_FRAGMENT_TESTS alone might work since that will block loadOp, but there might be shenanigans with memory barriers if you are 100% pedantic about any memory access happening in LATE_FRAGMENT_TESTS. If you’re blocking an early stage, it never hurts to block a later stage as well.

COLOR_ATTACHMENT_OUTPUT

This one is where loadOp, storeOp, MSAA resolves and the frame buffer blend stage take place, basically anything which touches a color attachment in a render pass in some way. If you’re waiting for a render pass which uses color to be complete, use srcStageMask = COLOR_ATTACHMENT_OUTPUT, and similar for dstStageMask when blocking render passes from execution.

Memory barriers

Now that we have the basics for execution barriers, we can kick it up a notch and consider memory barriers.

Execution order and memory order are two different things. GPUs are notorious for having multiple, incoherent caches which all need to be carefully managed to avoid glitched out rendering. This means that just synchronizing execution alone is not enough to ensure that different units on the GPU can transfer data between themselves.

If you are familiar with how C++11 introduced memory order and atomics, it is a good start, but the C++11 memory model does not consider that memory access can be incoherent to my knowledge. All CPU memory is assumed to be coherent, but memory order is weak on basically anything non-x86. Vulkan expands on this concept.

The two concepts in the Vulkan specification we need to understand are memory being available and memory being visible. This is an abstraction over the fact that GPUs have incoherent caches. To explain this I will describe a mental model of a hypothetical GPU design which should make sense if you are familiar with how caches work.

NOTE: There is a formal Vulkan memory model now which covers all of this in extreme detail. I admit I have not studied it enough to make references to it here, but developers really don’t need to know that level of detail to use Vulkan correctly.

The L2 cache / main memory

We will let the last level of the cache hierarchy represent the “master” memory controller which all caches are connected to. This cache is connected to all other L1 caches, and external DDR memory. The GPU DDR memory is connected to the CPU memory controller in some way (PCI-e or UMA).

When our L2 cache contains the most up-to-date data there is, we can say that memory is available, because L1 caches connected to L2 can pull in the most up-to-date data there is.

Incoherent L1 caches

Vulkan specifies a bunch of flags in the VK_ACCESS_ series of enums. These flags represent memory access which can be performed. Each pipeline stage can perform certain memory access, and thus we take the combination of pipeline stage + access mask and we get potentially a very large number of incoherent caches on the system. Each GPU core has its own set of L1 caches as well.

Of course, real GPUs will only have a fraction of the possible caches here, but as long as we are explicit about this in the API, any GPU driver can simplify this as needed.

Under section 6.1.3, table 4 in the Vulkan spec you can see a list of all possible access masks which can be used with a pipeline stage. These access masks either read from a cache, or write to an L1 cache in our mental model.

We say that memory is visible to a particular stage + access combination if memory has been made available and we then make that memory visible to the relevant stage + access mask.

Once a shader stage writes to memory, the L2 cache no longer has the most up-to-date data there is, so that memory is no longer considered available. If other caches try to read from L2, they will see undefined data. Whatever wrote that data must make those writes available before the data can be made visible again.

Cache flush and invalidate

To be clear, we can say that “making memory available” is all about flushing caches, and “making memory visible” is invalidating caches. This should make it more obvious what is going on. I will use the spec terminology however.

VkMemoryBarrier

If we revisit vkCmdPipelineBarrier, we can pass in a list of global memory barriers.

typedef struct VkMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
} VkMemoryBarrier;

A global memory barrier deals with access to any resource, and it’s the simplest form of a memory barrier. This means that in vkCmdPipelineBarrier, we are specifying 4 things to happen in order:

  • Wait for srcStageMask to complete
  • Make all writes performed in possible combinations of srcStageMask + srcAccessMask available
  • Make available memory visible to possible combinations of dstStageMask + dstAccessMask.
  • Unblock work in dstStageMask.

A common misconception I see is that _READ flags are passed into srcAccessMask, but this is redundant. It does not make sense to make reads available, i.e. you don’t flush caches when you’re done reading data.
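
As a rough illustration of this mental model, here is a toy C++ simulation. It is not Vulkan code: the ToyGpu type, the L1Cache enum and all helper names are made up for this sketch, and it tracks a single integer instead of real memory. It only models the two memory steps of a barrier (make available, then make visible), not the execution dependency.

```cpp
#include <array>
#include <cassert>

// Hypothetical L1 caches, one per stage + access combination we care about.
enum L1Cache { ComputeShaderWrite, ComputeShaderRead, TransferRead, NumL1Caches };

struct ToyGpu {
    int l2 = 0;                             // the "master" copy in L2
    std::array<int, NumL1Caches> l1{};      // per-cache copies, possibly stale
    std::array<bool, NumL1Caches> dirty{};  // cache holds a write not yet flushed

    void write(L1Cache c, int value) { l1[c] = value; dirty[c] = true; }
    int read(L1Cache c) const { return l1[c]; }

    // "Make available": flush pending writes from the source cache to L2.
    // A cache that was only read from is never dirty, so this does nothing.
    void make_available(L1Cache src) {
        if (dirty[src]) { l2 = l1[src]; dirty[src] = false; }
    }

    // "Make visible": invalidate the destination cache so it refetches from L2.
    // Toy restriction: we never invalidate a cache with unflushed writes.
    void make_visible(L1Cache dst) {
        assert(!dirty[dst]);
        l1[dst] = l2;
    }

    // The memory half of a pipeline barrier: srcAccessMask, then dstAccessMask.
    void memory_barrier(L1Cache src, L1Cache dst) {
        make_available(src);
        make_visible(dst);
    }
};

// Writes 42 in one dispatch and reads it in the next, with or without a barrier.
int read_after_write(bool with_barrier) {
    ToyGpu gpu;
    gpu.write(ComputeShaderWrite, 42);
    if (with_barrier)
        gpu.memory_barrier(ComputeShaderWrite, ComputeShaderRead);
    return gpu.read(ComputeShaderRead);
}
```

In this model, read_after_write(false) sees the stale value because the write never left its L1 cache. Passing a read cache as the source of memory_barrier flushes nothing, which is the toy equivalent of why _READ flags in srcAccessMask are redundant.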

Memory access and TOP_OF_PIPE/BOTTOM_OF_PIPE

Never use AccessMask != 0 with these stages. These stages do not perform memory accesses, so any srcAccessMask and dstAccessMask combination with either stage will be meaningless, and the spec disallows this. TOP_OF_PIPE and BOTTOM_OF_PIPE are purely there for the sake of execution barriers, not memory barriers.

Split memory barriers

A very important point here is that it’s perfectly possible to split up the “make available” and “make visible” operations.  This is similar to the execution dependency chain discussed earlier.

We can do something silly like:

  • vkCmdDispatch – writes to an SSBO, VK_ACCESS_SHADER_WRITE_BIT
  • vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER, srcAccessMask = SHADER_WRITE_BIT, dstAccessMask = 0)
  • vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE, srcAccessMask = 0, dstAccessMask = SHADER_READ_BIT)
  • vkCmdDispatch – read from the same SSBO, VK_ACCESS_SHADER_READ_BIT

While StageMask cannot be 0, AccessMask can be 0.

VkBufferMemoryBarrier

This is not very interesting; we’re just restricting memory availability and visibility to a specific buffer. No GPU I know of actually cares, so I think it makes more sense to just use VkMemoryBarrier rather than bothering with buffer barriers.

VkImageMemoryBarrier

Unlike VkBufferMemoryBarrier, this one is critical. You have to change image layouts at some point and this is done as part of an image memory barrier.

typedef struct VkImageMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkImageLayout oldLayout;
    VkImageLayout newLayout;
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    VkImage image;
    VkImageSubresourceRange subresourceRange;
} VkImageMemoryBarrier;

The interesting bits are oldLayout and newLayout. The layout transition happens in-between the make available and make visible stages of a memory barrier. The layout transition itself is considered a read/write operation, and the rules are basically that memory for the image must be available before the layout transition takes place. After a layout transition, that memory is automatically made available (but not visible!). Basically, think of the layout transition as some kind of in-place data munging which happens in L2 cache somehow.

A practical TOP_OF_PIPE example

Now we can actually make a practical example with TOP_OF_PIPE. If we just allocated an image and want to start using it, what we want to do is to just perform a layout transition, but we don’t need to wait for anything in order to do this transition. This is where TOP_OF_PIPE is useful. Let’s say that we’re allocating a fresh image, and we’re going to use it in a compute shader as a storage image. The pipeline barrier looks like:

  • srcStageMask = TOP_OF_PIPE – Wait for nothing
  • dstStageMask = COMPUTE – Unblock compute after the layout transition is done
  • srcAccessMask = 0 – This is key, there are no pending writes to flush out. This is the only way to use TOP_OF_PIPE in a memory barrier. It’s important to note that freshly allocated memory in Vulkan is always considered available and visible to all stages and access types. You cannot have stale caches when the memory was never accessed … What about recycled/aliased memory you ask? Excellent question, we’ll cover that too later.
  • oldLayout = UNDEFINED – Input is garbage
  • newLayout = GENERAL – Storage image compatible layout
  • dstAccessMask = SHADER_READ | SHADER_WRITE

A practical BOTTOM_OF_PIPE example

My favourite example here is swapchain images. We have to transition them into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before passing the image over to the presentation engine.

After transitioning into this PRESENT layout, we’re not going to touch the image again until we reacquire the image, so dstStageMask = BOTTOM_OF_PIPE is appropriate.

  • srcStageMask = COLOR_ATTACHMENT_OUTPUT (assuming we rendered to swapchain in a render pass)
  • srcAccessMask = COLOR_ATTACHMENT_WRITE
  • oldLayout = COLOR_ATTACHMENT_OPTIMAL
  • newLayout = PRESENT_SRC_KHR
  • dstStageMask = BOTTOM_OF_PIPE
  • dstAccessMask = 0

Having dstStageMask = BOTTOM_OF_PIPE and access mask being 0 is perfectly fine. We don’t care about making this memory visible to any stage beyond this point. We will use semaphores to synchronize with the presentation engine anyways.

Implicit memory ordering – semaphores and fences

Semaphores and fences are quite similar things in Vulkan, but serve different purposes. Semaphores facilitate GPU <-> GPU synchronization across Vulkan queues, and fences facilitate GPU -> CPU synchronization.

These objects are signaled as part of a vkQueueSubmit. However, one very important thing to note about semaphores and fences is how they interact with memory. To signal a semaphore or fence, all previously submitted commands to the queue must complete. If this were a regular pipeline barrier, we would have srcStageMask = ALL_COMMANDS_BIT. However, we also get a full memory barrier, in the sense that all pending writes are made available. Essentially, srcAccessMask = MEMORY_WRITE_BIT.

Implicit memory guarantees on vkQueueSubmit

While signaling a fence or semaphore works like a full cache flush, submitting commands to the Vulkan queue makes all memory accesses performed by the host visible to all stages and access masks. Basically, submitting a batch issues a cache invalidation on host visible memory. A common mistake is to think that you need to do this invalidation manually when the CPU is writing into staging buffers or similar. Something like:

  • srcStageMask = HOST
  • srcAccessMask = HOST_WRITE_BIT
  • dstStageMask = TRANSFER
  • dstAccessMask = TRANSFER_READ

If the write happened before vkQueueSubmit, this is automatically done for you.

NOTE: This kind of barrier is necessary if you are using vkCmdWaitEvents where you wait for host to signal the event with vkSetEvent. In that case, you might be writing the necessary host data after vkQueueSubmit was called, which means you need a pipeline barrier like this. This is not exactly a common use case, but it’s important to understand when these API constructs are useful.

Implicit memory guarantees when waiting for a semaphore

While signalling a semaphore makes all memory available, waiting for a semaphore makes memory visible. This basically means you do not need a memory barrier if you use synchronization with semaphores, since signal/wait pairs of semaphores work like a full memory barrier. Let’s see an example where queue 1 writes to an SSBO in compute, and queue 2 consumes that buffer as a UBO in a fragment shader. We’re going to assume the buffer was created with VK_SHARING_MODE_CONCURRENT.

NOTE: If you need to transfer ownership to a different queue family, you need memory barriers, one in each queue to release/acquire ownership.

Queue 1

  • vkCmdDispatch
  • vkQueueSubmit(signal = my_semaphore)

There is no pipeline barrier needed here. Signalling the semaphore waits for all commands, and all writes in the dispatch are made available to the device before the semaphore is actually signaled.

Queue 2

  • vkCmdBeginRenderPass
  • vkCmdDraw
  • vkCmdEndRenderPass
  • vkQueueSubmit(wait = my_semaphore, pWaitDstStageMask = FRAGMENT_SHADER)

When we wait for the semaphore, we specify which stages should wait for this semaphore, in this case the FRAGMENT_SHADER stage. All relevant memory access is automatically made visible, so we can safely access UNIFORM_READ_BIT in FRAGMENT_SHADER stage, without having extra barriers. The semaphores take care of this automatically, nice!

Execution dependency chain with semaphore

Just like pipeline barriers having execution dependency chains, we can create execution dependency chains with semaphores as well. pWaitDstStageMask in vkQueueSubmit blocks certain stages from executing.

If we create a pipeline barrier with srcStageMask targeting one of the stages in the wait stage mask, we also wait for the semaphore to be signaled. This is extremely useful for doing image layout transitions on swapchain images. We need to wait for the image to be acquired, and only then can we perform a layout transition. The best way to do this is to use pWaitDstStageMask = COLOR_ATTACHMENT_OUTPUT_BIT, and then use srcStageMask = COLOR_ATTACHMENT_OUTPUT_BIT in a pipeline barrier which transitions the swapchain image after the semaphore is signaled.

Host memory reads

While signalling a fence makes all memory available, it does not make it available to the CPU, just within the device. This is where dstStageMask = PIPELINE_STAGE_HOST and dstAccessMask = ACCESS_HOST_READ_BIT flags come in. If you intend to read back data to the CPU, you must issue a pipeline barrier which makes memory available to the HOST as well. In our mental model, we can think of this as flushing the GPU L2 cache out to GPU main memory, so that the CPU can access it over some bus interface.

Safely recycling memory and aliasing memory

We earlier had an example with creating a fresh VkImage and transitioning it from UNDEFINED, and waiting for TOP_OF_PIPE. As explained, we did not need to specify any srcAccessMask since we knew that memory was guaranteed to be available. The reason for this is because of the implied guarantee of signalling a fence. In order to recycle memory, we must have observed that the GPU was done using the image with a fence. In order to signal that fence, any pending writes to that memory must have been made available, so even recycled memory can be safely reused without a memory barrier. This point is kind of subtle, but it really helps your sanity not having to inject memory barriers everywhere.

However, what if we want to alias memory inside a command buffer? The rule here is that in order to safely alias, all memory access from the active alias must be made available before a new alias can take its place. Here’s an example for a case where we have two VkImages which are used in two render passes, and they alias memory. When one image alias is written to, all other images immediately become “undefined”. There are some exceptions in the specification for when multiple aliases can be valid at the same time, but for now we assume that is not the case.

  • vkCmdPipelineBarrier(image = image1, oldLayout = UNDEFINED, newLayout = COLOR_ATTACHMENT_OPTIMAL, srcStageMask = COLOR_ATTACHMENT_OUTPUT, srcAccessMask = COLOR_ATTACHMENT_WRITE, dstStageMask = COLOR_ATTACHMENT_OUTPUT, dstAccessMask = COLOR_ATTACHMENT_WRITE|READ)

image1 will contain garbage here so we need to transition away from UNDEFINED. We need to make any pending writes to COLOR_ATTACHMENT_WRITE available before the layout transition takes place, assuming that we’re running these commands every frame. The following render pass will wait for this transition to take place using dstStageMask/dstAccessMask.

  • vkCmdBeginRenderPass/EndRenderPass
  • vkCmdPipelineBarrier(image = image2,  …)
  • vkCmdBeginRenderPass/EndRenderPass

image1 was written to, so image2 was invalidated. Similar to the pipeline barrier for image1, we need to transition away from UNDEFINED. We need to make sure any write to image1 is made available before we can perform the transition. Next frame, image1 needs to take ownership again, and so on.

External subpass dependencies

Render passes in Vulkan have a concept of EXTERNAL subpass dependencies. This is arguably the most misunderstood aspect of Vulkan sync. I’d like to dedicate a section to this, because too many developers are lured into using it in cases where it’s not terribly useful and very likely to cause bugs.

The main purpose of external subpass dependencies is to deal with initialLayout and finalLayout of an attachment description. If initialLayout != layout used in the first subpass, the render pass is forced to perform a layout transition.

If you don’t specify anything else, that layout transition will wait for nothing before it performs the transition. Or rather, the driver will inject a dummy subpass dependency for you with srcStageMask = TOP_OF_PIPE_BIT. This is not what you want since it’s almost certainly going to be a race condition. You can set up a subpass dependency with the appropriate srcStageMask and srcAccessMask. The external subpass dependency is basically just a vkCmdPipelineBarrier injected for you by the driver. The whole premise here is that it’s theoretically better to do it this way because the driver has more information, but this is questionable, at least on current hardware and drivers.

There is a very similar external subpass dependency setup for finalLayout. If finalLayout differs from the last use in a subpass, the driver will transition into the final layout automatically. Here you get to change dstStageMask/dstAccessMask. If you do nothing here, you get BOTTOM_OF_PIPE/0, which can actually be just fine. A prime use case here is swapchain images which have finalLayout = PRESENT_SRC_KHR.

Essentially, you can ignore external subpass dependencies. Their added complexity gives very little gain. Render pass compatibility rules also imply that if you change even minor things like which stages to wait for, you need to create new pipelines! This is dumb, and will hopefully be fixed at some point in the spec.

However, while the usefulness of external subpass dependencies is questionable, they have some convenient use cases I’d like to go over:

Automatically transitioning TRANSIENT_ATTACHMENT images

If you’re on mobile, you should be using transient images where possible. When using these attachments in a render pass, it makes sense to always have them as initialLayout = UNDEFINED. Since we know that these images can only ever be used in COLOR_ATTACHMENT_OUTPUT or EARLY/LATE_FRAGMENT_TESTS stages depending on their image format, the external subpass dependency writes itself, and we can just use transient attachments without having to think too hard about how to synchronize them. This is what I do in my Granite engine, and it’s quite useful. Of course, we could just inject a pipeline barrier for this exact same purpose, but that’s more boilerplate.

Automatically transitioning swapchain images

Typically, swapchain images are always just used once per frame, and we can deal with all synchronization using external subpass dependencies. We want initialLayout = UNDEFINED, and finalLayout = PRESENT_SRC_KHR.

srcStageMask is COLOR_ATTACHMENT_OUTPUT which lets us link up with the swapchain acquire semaphore. For this case, we will need an external subpass dependency. For the finalLayout transition after the render pass, we are fine with BOTTOM_OF_PIPE being used. We’re going to use semaphores here anyways.

I also do this in Granite.

Conclusion

I hope this was useful. The post got a bit more mechanical than I hoped for, but it should be a more distilled version of the specification.

A tour of Granite’s Vulkan backend – Part 6

Pipelines – what is your pain tolerance?

A lot of thought goes into pipelines. Eager or lazy creation, dynamic or static render state. Forget about one size fits all. How close will you approach the volcano? Make sure there is no lava under your feet when you’re done.

My pain tolerance is kinda low, I’d rather watch it on TV. Granite is a bit similar, it prefers to be cooled off magma instead.

The ideal case

Vulkan is designed to let you forget about filthy, filthy render state management and work exclusively with pristine VkPipeline objects. These objects encode every possible choice you can make when flipping the fixed-function bits and bobs on the GPU.

Getting to a point where you only think in terms of VkPipelines, and all pipelines are compiled up front at load time, is a holy grail of modern graphics API implementation. Gone are the stutters, the hitches, the sad 100 ms glitches which throw you off guard when you peek around the wall.

To get there, you must sacrifice all notions of flexibility, no last minute decisions, everything must be planned out in detail ahead of time. There is a lot of state which is pulled together to form a VkPipeline, an all-star cast of colorful characters and a plot with a lot of depth.

… ahem, that got a bit weird.

Shader modules

Obviously, the core part of a pipeline is the shader modules, the Vulkan::Program in Granite. From the program we automatically know the VkPipelineLayout because of reflection, so no problems there.

Render pass

We also need to know the render pass (and subpass index!) in order to create a pipeline. This one can be really counter-intuitive. The shader compiler often needs to know which render target formats are in use in order to generate final ISA. This is where we start running into problems. There is no obvious reason to combine a render pass and shader modules together. In my mental model these two should not know about each other, but drivers would really like that to be the case. For example, if I were to render a scene it would look something like:

  • Start rendering to some attachments (VkRenderPass is known here)
  • Set up the default rendering state appropriate for the pass. There are different “default” states for depth-only, opaque, lighting, and transparency rendering. Part of the render state vector is determined here.
  • Ask the renderer to render some list of visible objects which survived culling. Shader modules are known at this level, and some render state might be per-material, like two-sided rendering, etc.

There are a few ways to make this work, but somewhere you must have higher-level knowledge of which shader modules are used in which render passes. If an application has a baking step during build, that might be a nice place to do it, but not all graphics API use cases work this way. Emulation comes to mind where you cannot know what an application will do until you execute it. User scripting could be a nightmare as well …

Render passes also have a lot of combinatorial explosion. If we just change from MSAA 2x to MSAA 4x, that means new render passes, and new pipelines which are compatible with those render passes. Clearly we see that something trivial like changing a setting in the options menu of most games will trickle down into a completely different set of pipelines for all materials. This kind of coupling isn’t what I call clean, but sometimes sanity must be sacrificed for performance. I’d prefer to keep my sanity.

Fixed-function vertex bindings

This consists of attributes, bindings, strides and input rates. This one is usually not a problem if you control the asset pipeline. You can decide on a “standard” vertex buffer layout and forget about it. There is some slight annoyance here if we want to support glTF or other scene transmission formats unless we’re prepared to rewrite all vertex buffers to match the standard layout.

Shader compilers like to know about this information since some ISAs need to fetch vertices in software, and therefore need to be able to compute correct offsets based on VertexIndex/InstanceIndex.

10 – Fixed-function render state

When rendering triangles in Vulkan, there is still a ton of state to deal with. Vulkan takes all the gunk you’d set in glEnable/glDisable and various other functions and bundles it together into one massive struct. I wrote up a sample which demonstrates how render state is set, saved and restored.

I have to admit I kinda like the old-school way of setting state individually. Isolating render state to a command buffer avoids almost all the horrifying issues with state management in OpenGL. In GL, the state is global, and leaked between modules and render passes. This is really scary, and you’re basically forced to make a custom state tracker on top of GL to keep yourself sane. There was also no good way of “saving” just the state you cared about and restoring it without writing a lot of custom code. I like the idea of setting some “standard” state which clears out any possible leakage of state. Overall, Granite’s model is maximum convenience.

A concept I’ve seen in other projects is the idea of creating big structures on the user side which mimic a pipeline, but I don’t think this is very useful unless it’s basically a full VkGraphicsPipelineCreateInfo with all the bells and whistles. If we don’t, we still don’t have the information we need to create a pipeline anyways, like render pass information for example, and we’re back to hashing with lazy creation.

Even just render state tends to be split in two halves for me. Some state tends to be “global” in nature and some tends to be “local”. The global state is set by the higher level renderer, which thinks in terms of:

  • Opaque pass vs transparent pass (alpha blending)
  • Depth-only? (depth write enable, depth bias?, equal test?)
  • Lighting pass? (additive blending?)
  • Stencil? (for deferred)

This state is saved and restored as necessary, then we have the objects which are rendered in a render pass which typically think in terms of:

  • Two sided mesh? (face culling)
  • Primitive restart?
  • Topology?
  • Shader program?
  • Vertex attributes?

I don’t like to couple these parts of the renderer together, so a tightly packed blob of state in Vulkan::CommandBuffer does the job for me. At the end of the day, the only real cost of this flexibility is some extra hashing cost. It doesn’t light up in the profile for me.

Overall, I like the “immediate” nature of the CommandBuffer interface. There’s always a hybrid solution if that is ever needed where I would set the state I’m interested in, then pull out a persistent VkPipeline handle which can be used later and bypasses any hashing of state when bound.

Avoiding stutters

The real problem with lazily creating pipelines is vkCreateGraphicsPipelines in my opinion. Doing this at the last minute is almost a guaranteed hitch, and it should be avoided at all cost. Avoiding last minute pipeline compilation is the real reason we should know all state combinations up front, not because we get to bind VkPipelines directly and avoid some small hashing cost.

My strategy for dealing with this problem is pre-warming the hashmaps with previously seen data. Granite integrates the Fossilize project to solve the problem of serializing all information needed to create pipelines in a GPU and driver independent way. In theory, I would be able to ship a Fossilize database as part of an application and use that to pre-warm all historically observed pipelines and their dependent objects at Vulkan::Device creation time.

To my knowledge, this is basically how all GL and D3D11 drivers behave. Cache all the things.
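
To make the hashing and pre-warming idea concrete, here is a minimal C++ sketch. It is not Granite’s implementation: RenderState, Pipeline, LazyPipelineMap and hash_state are hypothetical names, the state vector is heavily reduced, and the “compile” step is a counter standing in for vkCreateGraphicsPipelines. Serialization (the part Fossilize deals with) is not modeled; the recorded vector stands in for whatever a database would replay.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, heavily reduced render state. A real state vector also covers
// vertex layout, blend factors, stencil ops and much more.
struct RenderState {
    uint32_t program_id;
    uint32_t render_pass_id;
    bool depth_test;
    bool alpha_blend;
    bool two_sided;
};

// FNV-1a over the individual fields, so struct padding is never hashed.
inline uint64_t hash_state(const RenderState &s) {
    uint64_t h = 0xcbf29ce484222325ull;
    auto mix = [&h](uint64_t v) {
        for (int i = 0; i < 8; i++) {
            h ^= (v >> (8 * i)) & 0xff;
            h *= 0x100000001b3ull;
        }
    };
    mix(s.program_id);
    mix(s.render_pass_id);
    mix(s.depth_test);
    mix(s.alpha_blend);
    mix(s.two_sided);
    return h;
}

struct Pipeline { uint64_t id; };  // stand-in for a VkPipeline handle

class LazyPipelineMap {
public:
    // Draw time: look up by hash, compile only on a miss.
    Pipeline request(const RenderState &s) {
        const uint64_t key = hash_state(s);
        auto itr = pipelines.find(key);
        if (itr != pipelines.end())
            return itr->second;
        compile_count++;  // a miss here is where a hitch would happen
        Pipeline p{key};
        pipelines.emplace(key, p);
        return p;
    }

    // Load time: replay every state combination observed in earlier runs.
    void prewarm(const std::vector<RenderState> &recorded) {
        for (const auto &s : recorded)
            request(s);
    }

    unsigned compile_count = 0;

private:
    std::unordered_map<uint64_t, Pipeline> pipelines;
};
```

After prewarm() has seen a state, a later request() for the same state is only a hash and a lookup. Only state combinations that were never recorded still pay the compile cost at draw time.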

Conclusion

Granite’s render state management is old-school, but I like it. Pre-warming the various hashmaps in Vulkan::Device is the strategy I used to avoid any pipeline compilation stutters.

There are many alternatives for any graphics API abstraction. There are things I like in legacy APIs, and things I hate. I wanted to keep the parts I liked, and avoid the parts I disliked.

… that’s all folks!

I think this is the end of this series for now. I’ve gone over the Vulkan backend in broad strokes, and I hope it was interesting and useful.

A tour of Granite’s Vulkan backend – Part 5

Render passes and synchronization

This is part 5 in the tour of Granite’s Vulkan backend. We’re going to get knee-deep in the aspects of Vulkan which are the most difficult to learn in my opinion, and mastering these topics of Vulkan is the real hurdle towards mastery. This level of understanding is something high-level APIs will prevent you from reaching.

This post isn’t intended to be a tutorial on Vulkan synchronization, so I’ll assume some basic level of knowledge.

Render passes

Render passes are a new fundamental part of Vulkan which does not exist in any of the legacy APIs. In older APIs you can freewheel how you render to frame buffers, but that approach was always terrible on tile-based GPUs, and these days with hybrid tilers, it’s probably terrible on desktop as well. Vulkan forces you to think about rendering all you need in one go to a frame buffer and then proceed to the next.

In Granite, I wanted to make sure most of the flexibility and explicitness of Vulkan render passes could be expressed with minimal boilerplate. Most projects don’t seem to pay attention to this part except that it’s something you just have to do, and very few see the benefits they bring. That is probably a reasonable stance for 2019 if you do not care about mobile performance. If you need to target mobile though, it is worth the extra work. As of writing, the feature is quite mobile-centric, but desktop GPUs seem to be inching towards tile-based architectures, so it will be interesting to see if this view on render passes will shift in the future. Even D3D12 recently got render passes too, albeit in a simplified form.

In the most basic form, render passes in Vulkan can be rather daunting to set up, and it’s one of the many battles you have to fight to get hello triangle on screen. If we take a render pass with just one sub-pass (the case we care about 99% of the time), we need to specify up front:

  • How many attachments?
  • Which formats are used?
  • How many MSAA samples?
  • initialLayout and finalLayout
  • Which image layouts to use while rendering?
  • Do we load from memory or clear the attachment on render pass begin?
  • Do we bother storing the attachments to memory?

Most of this information is boilerplate we can automate, but things like load/clear/store we cannot deduce in the backend before it is too late. Knowing this kind of information up-front can be very beneficial for bandwidth consumption, at least on mobile.

The ugly framebuffer objects

An ugly aspect of Vulkan is the use of VkFramebuffer. I want an API where I just say “start a render pass where we render to these attachments”. Creating “FBOs” up front was really ugly in GL, and I think it’s a bad abstraction to have API users carry around ownership of objects which represent little to no useful work. FBOs are empty husks which might as well just be an array of image views.

We could just create VkFramebuffers every render pass we begin and defer the deletion of it right away, but creating these objects has some cost. There’s a handle allocation in the driver at minimum, and probably a little more on certain drivers. Here I just reuse the temporary hashmap allocator which I introduced in the descriptor set model post. VkFramebuffer objects can be reused over multiple frames, but if they aren’t used for a while, they are just deleted since VkFramebuffer objects are immutable.
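
The reuse-then-expire behaviour can be sketched roughly as follows. This is an assumption-laden toy, not the actual temporary hashmap allocator in Granite: Framebuffer, TemporaryFramebufferMap, max_age and begin_frame are hypothetical names, the key is assumed to be a hash of the render pass and attachment views, and deferring the real deletion until the GPU is done is not modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a VkFramebuffer.
struct Framebuffer { uint64_t key; };

class TemporaryFramebufferMap {
public:
    explicit TemporaryFramebufferMap(unsigned max_age_) : max_age(max_age_) {}

    // Look up by hash (assumed to cover render pass + attachment views).
    // Creates the object on a miss and marks it as used this frame.
    Framebuffer &request(uint64_t key) {
        auto itr = entries.find(key);
        if (itr == entries.end()) {
            itr = entries.emplace(key, Entry{Framebuffer{key}, 0}).first;
            created++;
        }
        itr->second.age = 0;
        return itr->second.fb;
    }

    // Call once per frame: age every entry and drop those left unused for
    // more than max_age frames. A real backend would defer the deletion.
    void begin_frame() {
        for (auto itr = entries.begin(); itr != entries.end();) {
            if (++itr->second.age > max_age)
                itr = entries.erase(itr);
            else
                ++itr;
        }
    }

    size_t size() const { return entries.size(); }
    unsigned created = 0;  // how many times we paid the creation cost

private:
    struct Entry {
        Framebuffer fb;
        unsigned age;
    };
    std::unordered_map<uint64_t, Entry> entries;
    unsigned max_age;
};
```

A render target used every frame keeps hitting the same entry, so the creation cost is paid once. An entry which stops being requested ages out after a few frames without the user ever owning a framebuffer object.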

Automating VkRenderPass creation

This topic is actually quite complicated when we start diving into the deep end of Vulkan render passes, but we can start with the trivial cases. The API in Granite looks something like:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 1;
rp.color_attachments[0] = &rt->get_view();
rp.store_attachments = 1 << 0;
rp.clear_attachments = 1 << 0;

rp.clear_color[0].float32[0] = 1.0f;
rp.clear_color[0].float32[1] = 0.0f;
rp.clear_color[0].float32[2] = 1.0f;
rp.clear_color[0].float32[3] = 0.0f;

cmd->begin_render_pass(rp);

This is an immediate way of starting a render pass, no reason to create frame buffers up front and all that. VkRenderPass can be created lazily on-demand like everything else.

Formats / MSAA sample counts

Render passes need to know formats and sample counts, and since we pass the concrete attachments directly in begin_render_pass(), we have the information we need right here.

Image layouts and VK_SUBPASS_EXTERNAL dependencies

There are three kinds of attachments in Granite:

  • User-created. These attachments are render targets which are created with Device::create_image() and the backend does not own the resource or know anything about how long the resource will live. Common case for user-created render targets.
  • WSI images. These images are special since they came from VkSwapchainKHR or some equivalent mechanism. We know that these images are only used for rendering and they are only consumed by the presentation engine, or some other mechanism.
  • Transient images. Images with transient usage flags only live inside render passes. They cannot be sampled from, their memory does not necessarily exist except in page tables which point to /dev/null. We don’t care what happens to these images once the render pass is done.

To deduce image layouts for a render pass we have a few different paths.

WSI images

I don’t care about preserving WSI images over multiple frames, and I don’t care about sampling from WSI images or any such weird use case after rendering, so the flow of image layouts is:

  • initialLayout = UNDEFINED (discard)
  • VkAttachmentReference -> COLOR_ATTACHMENT_OPTIMAL or whatever is required for the subpass
  • finalLayout = PRESENT_SRC_KHR or whatever layout we need when using external WSI. For something like libretro, this will be SHADER_READ_ONLY_OPTIMAL since the image will be handed off to some other render pass which we don’t know or care about. For headless PNG/video dumping, it might be TRANSFER_SRC_OPTIMAL.

When initialLayout != the layout used in the first subpass, vkCmdBeginRenderPass will actually need to perform a memory barrier implicitly to make this work. The big question is when this memory barrier takes place, and the answer is “as soon as possible” (TOP_OF_PIPE_BIT) if we don’t specify it anywhere. For this case, Granite will add a subpass dependency which waits for VK_SUBPASS_EXTERNAL in the COLOR_ATTACHMENT_OUTPUT stage. This latches onto the WSI acquire semaphore, more on that later.

finalLayout != the last layout used, so we get a transition at the end of the render pass, but we don’t need to care about external subpass dependencies here. The automatically generated one is perfect, and we’re going to use the WSI release semaphore to properly synchronize this image anyways.

When we see a WSI image in a render pass, we can trivially mark this command buffer as “touching WSI”. This will affect command buffer submission later. This is indeed the kind of tracking which I have been arguing against in earlier posts, but it’s so trivial that it’s a no-brainer to me in this case.

Transient images

For transient images, we automate it just like WSI images, except that finalLayout will match last used layout in the render pass. This way we avoid some useless image layout transition at the end of the render pass. Next time we use the image, it’s going to be discarded anyways.

Because we deal with transitions automatically, users can freely pull images from Vulkan::Device with get_transient_attachment(), render to it, and forget about it. This is super convenient for things like MSAA rendering where the multi-sampled attachment just needs to exist temporarily for purposes of being resolved into the single-sampled attachment we care about. Having to care about synchronization for resources we don’t own is weird I think.
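
The initialLayout/finalLayout rules can be summed up in a small C++ sketch. The AttachmentKind and Layout enums, LayoutPlan and plan_layouts are hypothetical names invented here, not Granite’s API. The rule for user-created images, covered in the next section, is included for completeness, and wsi_handoff is an assumed parameter for the layout the presentation side expects.

```cpp
// Reduced set of layouts, named after their VK_IMAGE_LAYOUT_* counterparts.
enum class Layout {
    Undefined,
    ColorAttachmentOptimal,
    ShaderReadOnlyOptimal,
    TransferSrcOptimal,
    PresentSrc
};

enum class AttachmentKind { UserCreated, WSI, Transient };

struct LayoutPlan {
    Layout initial_layout;
    Layout final_layout;
};

// first_use / last_use are the layouts of the attachment in the first and
// last subpass which touches it.
LayoutPlan plan_layouts(AttachmentKind kind, Layout first_use, Layout last_use,
                        Layout wsi_handoff = Layout::PresentSrc) {
    switch (kind) {
    case AttachmentKind::WSI:
        // Discard old contents, end in whatever the consumer needs.
        return {Layout::Undefined, wsi_handoff};
    case AttachmentKind::Transient:
        // Discard old contents, skip the useless transition at the end.
        return {Layout::Undefined, last_use};
    case AttachmentKind::UserCreated:
        break;
    }
    // User-created: no implicit transitions at all.
    return {first_use, last_use};
}
```

For a libretro-style handoff, the caller would pass Layout::ShaderReadOnlyOptimal as wsi_handoff, and Layout::TransferSrcOptimal for headless dumping.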

Other images

For any other image, we need to avoid any implicit layout transition, so we simply force initialLayout to match the first use in the render pass, and finalLayout will match the last use. In our small sample, it’s all going to be COLOR_ATTACHMENT_OPTIMAL. It’s up to the API user to know what layouts a render pass will expect, but it’s straightforward to map a render pass to the expected layout. Color attachments are COLOR_ATTACHMENT_OPTIMAL, depth-stencil is DEPTH_STENCIL_ATTACHMENT_OPTIMAL or DEPTH_STENCIL_READ_ONLY_OPTIMAL based on the read/write mode for depth, input attachments are SHADER_READ_ONLY_OPTIMAL, etc. It’s possible to use an attachment for multiple purposes in a subpass, and Granite supports that as well. Some examples:

  • Color attachment + input attachment: Feedback loop ala GL_ARB_texture_barrier (super useful for certain emulators) -> GENERAL
  • RW Depth attachment + input attachment (some weird decal algorithm?) -> GENERAL
  • Read-Only depth + input attachment (deferred shading use case) -> DEPTH_STENCIL_READ_ONLY_OPTIMAL

All of this is analyzed when a newly observed VkRenderPass is created, and subpass dependencies are set up automatically. Anything which happens outside the render pass is the user’s responsibility.
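The mapping from attachment usage to layout described above can be sketched roughly like this. This is a minimal illustration and not Granite’s actual code: the Layout enum, the UsageBits flags and pick_subpass_layout() are hypothetical names made up for this example, standing in for the real VkImageLayout values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the VkImageLayout values mentioned in the text.
enum class Layout
{
    ColorAttachmentOptimal,
    DepthStencilAttachmentOptimal,
    DepthStencilReadOnlyOptimal,
    ShaderReadOnlyOptimal,
    General
};

// Hypothetical flags describing how one attachment is used in one subpass.
enum UsageBits : uint32_t
{
    USE_COLOR = 1u << 0,    // written as color attachment
    USE_DEPTH_RW = 1u << 1, // depth-stencil, read-write
    USE_DEPTH_RO = 1u << 2, // depth-stencil, read-only
    USE_INPUT = 1u << 3     // read as input attachment
};

inline Layout pick_subpass_layout(uint32_t usage)
{
    assert(usage != 0 && "attachment must be used in some way");
    const bool input = (usage & USE_INPUT) != 0;

    // Writable attachment which is also read as input attachment: feedback loop, needs GENERAL.
    if (usage & USE_COLOR)
        return input ? Layout::General : Layout::ColorAttachmentOptimal;
    if (usage & USE_DEPTH_RW)
        return input ? Layout::General : Layout::DepthStencilAttachmentOptimal;

    // Read-only depth can be depth-tested and read as input attachment at the same time.
    if (usage & USE_DEPTH_RO)
        return Layout::DepthStencilReadOnlyOptimal;

    // Pure input attachment.
    return Layout::ShaderReadOnlyOptimal;
}
```

The three combined cases from the list fall out of the two "input" checks, everything else is a plain single-purpose layout.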

08 – Bare-bones “deferred rendering” sample

I made a cut-down sample to show how the API expresses multi-pass in the context of deferred rendering with transient gbuffer + depth. The meat of it is:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 3;
rp.color_attachments[0] = &device.get_swapchain_view();

rp.color_attachments[1] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		0);
rp.color_attachments[2] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		1);

rp.depth_stencil = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		device.get_default_depth_format());

rp.store_attachments = 1 << 0;
rp.clear_attachments = (1 << 0) | (1 << 1) | (1 << 2);
rp.op_flags = Vulkan::RENDER_PASS_OP_CLEAR_DEPTH_STENCIL_BIT;

Vulkan::RenderPassInfo::Subpass subpasses[2];
rp.num_subpasses = 2;
rp.subpasses = subpasses;

rp.clear_depth_stencil.depth = 1.0f;

subpasses[0].num_color_attachments = 2;
subpasses[0].color_attachments[0] = 1;
subpasses[0].color_attachments[1] = 2;

subpasses[0].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadWrite;

subpasses[1].num_color_attachments = 1;
subpasses[1].color_attachments[0] = 0;
subpasses[1].num_input_attachments = 3;
subpasses[1].input_attachments[0] = 1;
subpasses[1].input_attachments[1] = 2;
subpasses[1].input_attachments[2] = 3;  // Depth attachment
subpasses[1].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadOnly;

cmd->begin_render_pass(rp);
// Do work
cmd->next_subpass();
// Do work
cmd->end_render_pass();

See code comments in sample for more detail. To write this kind of sample in raw Vulkan would be almost a full day’s project.

Synchronization

Unlike many aspects of Granite which are reasonably high-level, synchronization in Granite is almost 100% explicit. The general philosophy of Granite is that excessive tracking of resource use is a no-no, unless it is trivial to do so (e.g. WSI images). Synchronization is a case where you need a lot of tracking to do a good job, and it is impossible to do a perfect job since you end up relying on heuristics, at least if you are to implement automatic synchronization on top of Vulkan. GL and D3D11 drivers have an advantage here since they can tap into GPU-specific synchronization mechanisms which might be better suited for implicit synchronization. A good example here is the i915 driver stack in the Linux DRM stack which can do automatic synchronization in kernel space somehow. I’m sure that simplifies the Mesa GL driver a lot, but I don’t know the details.

Let’s go through a thought experiment where we look at the big problems we run into if we are to implement a fully automatic barrier system. (I have tried :p)

Problem #1 – We cannot rewind time

When touching a resource, we must ask ourselves: “When and where was this resource accessed last?” We have three potential solutions to resolve a hazard:

  • Pipeline barrier (was used just now)
  • Event (was used some time ago in this queue)
  • Semaphore (was used in a different queue)

Ideally, we need to inject a barrier at the exact point where a resource was last used, but we cannot inject new commands in the middle of a command buffer which has already been recorded.

There is no winning this fight, either we eagerly inject barriers after every command in the hope that some future command will need to synchronize against it (VkEvent is nice here), or we inject barriers too late, stalling the GPU needlessly.

Eagerly injecting barriers is pure insanity if we take into account that the resource might be used on a different queue in the future. That means signalling a heavy semaphore after every render pass or command. We could simply ignore supporting multiple queues, but that’s a huge compromise to make.

Problem #2 – Redundant tracking of read-only resources

A problem I found while trying to implement automatic barrier tracking was that static resources might be written in the future, so we need to keep track of them. This is a waste of CPU time, but it might be possible to explicitly mark these resources as “do not track, they’re truly static, pinky promise”, but I feel this is bolting on hacks.

Problem #3 – Multi-threading

The question of “where was this resource touched last” might not actually be possible to answer in a multi-threaded scenario. If we are recording command buffers in parallel, the backend has no idea what is going on until we serialize execution in vkQueueSubmit. A common solution I have seen for this problem is to resolve synchronization internally in command buffers as they are recorded, and on command buffer submission time, we can look at all used resources and resolve any cross-command buffer synchronization points right before submitting the command buffers in vkQueueSubmit. The complexity starts to shoot through the roof though. That’s a good sign we need to rethink.

I think this is the kind of solution you end up with when you have no choice but to port some old legacy API to Vulkan, and breaking the abstraction API is not an option.

Render graphs

A Vulkan backend which solves synchronization can only look back in time and deal with hazards at the last minute, but that is only because we do not have any context about what the application is doing. At a higher level, we can know what is going to happen in the future, and we can make automated decisions at that level, where we actually have context about what is going on. This is another reason why I do not want to have automatic synchronization in a Vulkan backend. Either we get a sub-optimal solution, or we try to close the gap with heuristics, but now run-time behavior is completely non-deterministic. We should aim for something better.

I believe the synchronization problem should be solved at a higher level, like a render graph. I wrote a blog some time ago about this topic: http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

Signalling fences

Granite’s way of signalling fences is very similar to plain Vulkan.

Vulkan::Fence fence;
device.submit(cmd, &fence);

// Somewhere else, potentially in a different thread.
fence->wait();
// fence ref-count goes to 0, queued up for recycling.

There is a pool of VkFence objects which can be reused. Signalling a fence forces a vkQueueSubmit. Once the lifetime of a Vulkan::Fence ends it is recycled back to the frame context. Nothing out of the ordinary here.

Semaphores

Semaphores work very similarly to fences and are requested in Device::submit() like fences. Like Vulkan, they have a restriction that they can only be waited on once. Semaphores can be waited on in other queues with Device::add_wait_semaphore() in a particular queue and pipeline stage. This matches pWaitDstStageMask in VkSubmitInfo. Semaphores are also recycled like fences.

Events

Events can be signalled and later waited on in the same queue. Again, we have a pool of VkEvent objects, CommandBuffer::signal_event() requests a fresh event, signals it with vkCmdSetEvent and hands it to the user. VkEvents are recycled using the frame context. There is a similar CommandBuffer::wait_event() which maps 1:1 to vkCmdWaitEvents.

Barriers

Granite has many different methods to inject pipeline barriers, the most common ones are:

cmd->barrier(srcStage, srcAccess, dstStage, dstAccess);

which maps to a vkCmdPipelineBarrier with a VkMemoryBarrier, and image barriers which act on all subresources:

cmd->image_barrier(image, oldLayout, newLayout, srcStage, srcAccess, dstStage, dstAccess);

There are cases where we want to batch barriers or otherwise use more complicated commands than this, so there are also 1:1 interfaces to vkCmdPipelineBarrier where the full structures are passed in, but these are only really used by the render graph since writing out full structures is super painful.

The automatic barriers in Granite

There are a few instances where I think having automatic barriers makes sense. These are cases where it’s convenient to do so, and there is no tracking required, so we can resolve all hazards right away and forget about it. Some of them we’ve seen already, like WSI images and transient images in render passes.

The other major case is static read-only resources. If you pass in initial_data to Device::create_buffer() or Device::create_image(), we generally have a desire to upload some data, and never touch it again.

The general gist of it is that we can upload data with a staging buffer over the transfer queue and inject semaphores which block all possible pipeline stages (based on bufferUsage/imageUsage flags). The downside is that we might end up creating too many submissions if we somehow want to upload a ridiculous amount of buffers or images in one go, but we can opt-out of this automatic behavior by simply not passing initial_data and do all the batching and synchronization work ourselves.

The end goal is that we should be able to call create_buffer or create_image and just use the static resource right away without having to think about synchronization at all.
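To make "block all possible pipeline stages based on usage flags" a bit more concrete, here is a rough sketch of how such a mapping could look for buffers. The BufferUsage and PipelineStage enums and possible_consumer_stages() are hypothetical stand-ins I made up for this example (the values are arbitrary and do not match the Vulkan headers), and the real backend may well cover more usages and stages than this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for VkBufferUsageFlagBits. Values are arbitrary.
enum BufferUsage : uint32_t
{
    USAGE_VERTEX_BUFFER = 1u << 0,
    USAGE_INDEX_BUFFER = 1u << 1,
    USAGE_UNIFORM_BUFFER = 1u << 2,
    USAGE_STORAGE_BUFFER = 1u << 3,
    USAGE_INDIRECT_BUFFER = 1u << 4,
    USAGE_TRANSFER_SRC = 1u << 5
};

// Hypothetical stand-ins for VkPipelineStageFlagBits. Values are arbitrary.
enum PipelineStage : uint32_t
{
    STAGE_DRAW_INDIRECT = 1u << 0,
    STAGE_VERTEX_INPUT = 1u << 1,
    STAGE_VERTEX_SHADER = 1u << 2,
    STAGE_FRAGMENT_SHADER = 1u << 3,
    STAGE_COMPUTE_SHADER = 1u << 4,
    STAGE_TRANSFER = 1u << 5
};

// Returns every stage which could legally consume a buffer created with these usage flags.
// A semaphore wait using this mask is conservative, but needs no further tracking.
inline uint32_t possible_consumer_stages(uint32_t usage)
{
    assert(usage != 0 && "a buffer without usage flags cannot be consumed");
    const uint32_t shader_stages =
        STAGE_VERTEX_SHADER | STAGE_FRAGMENT_SHADER | STAGE_COMPUTE_SHADER;

    uint32_t stages = 0;
    if (usage & (USAGE_VERTEX_BUFFER | USAGE_INDEX_BUFFER))
        stages |= STAGE_VERTEX_INPUT;
    if (usage & (USAGE_UNIFORM_BUFFER | USAGE_STORAGE_BUFFER))
        stages |= shader_stages;
    if (usage & USAGE_INDIRECT_BUFFER)
        stages |= STAGE_DRAW_INDIRECT;
    if (usage & USAGE_TRANSFER_SRC)
        stages |= STAGE_TRANSFER;
    return stages;
}
```

Since the resource is never written again after the upload, resolving the hazard once against this mask is enough.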

09 – Rendering to image and reading it back to CPU on transfer queue

I wrote a sample which flexes most of the synchronization APIs. It renders a small 4×4 texture on the graphics queue, synchronizes that with the transfer queue with a semaphore and reads it back to a CACHED host buffer. We spawn threads which wait on a fence, map the buffer and read the results.

Conclusion

In these parts of the backend, the low-level explicit nature of Vulkan shines through. I think we have to be fairly low-level, or we inherit most of the problems with the older APIs.

… up next!

In the next installment, we’ll have a look at pipeline creation.

A tour of Granite’s Vulkan backend – Part 4

Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. It’s not just uniform buffers, it’s fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL 4.5’s GL_ARB_buffer_storage, which supports persistently mapped buffers like Vulkan, but relying on GL 4.5 (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite’s main codebase normally uses GLFW, so to get a less redundant sample working, I wrote this sample to use SDL2’s Vulkan support, which is very similar to GLFW’s support starting with SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, the WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us and owns a VkSwapchainKHR. The only thing it cannot do on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2’s built in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially and automatically handling synchronization for WSI resources is straight-forward to the point that there is no point in exposing that to the user. (Synchronization in Granite is in general very explicit, but WSI is one of few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
device.submit(cmd);
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI code is a must to abstract in Vulkan, and I’m happy with the flexibility and simplicity in use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, we’re getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. This is the perfect allocator for when we want to allocate a lot of small blobs, and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset to 0 again, i.e. all memory in one go is just “winked away”.
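To show just how little there is to it, here is a minimal CPU-side sketch of the idea. It is not Granite’s implementation: the LinearAllocator class and its method names are made up for illustration, and it hands out plain host memory rather than offsets into a VkBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator over a fixed block of memory.
class LinearAllocator
{
public:
    explicit LinearAllocator(size_t capacity) : storage(capacity) {}

    // Allocation is just "align the offset, then bump it".
    // The alignment applies to the offset within the block and must be a power of two.
    // Returns nullptr when the block is exhausted.
    void *allocate(size_t size, size_t alignment = 16)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned > storage.size() || size > storage.size() - aligned)
            return nullptr;
        offset = aligned + size;
        return storage.data() + aligned;
    }

    // Freeing everything in one go: all earlier allocations are "winked away".
    void reset() { offset = 0; }

    size_t used() const { return offset; }

private:
    std::vector<uint8_t> storage;
    size_t offset = 0;
};
```

There is no per-allocation bookkeeping at all, which is why both allocating and freeing are essentially free.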

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight and once exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious limit to use (very content and scene dependent). I opted for a recycled pool of smaller buffers instead since Granite’s descriptor binding model is very flexible as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.
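The recycling scheme could be sketched along these lines. Everything here is hypothetical (Block, BlockPool and the method names are invented for the example), and it assumes the caller has already waited for the GPU to finish with a frame context before next_frame() rotates back to it, which is what the frame context is responsible for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one small buffer with a linear allocator on top.
struct Block
{
    size_t capacity = 0;
    size_t offset = 0;
};

class BlockPool
{
public:
    BlockPool(size_t block_size_, unsigned frames_in_flight)
        : block_size(block_size_), pending(frames_in_flight)
    {
        assert(frames_in_flight > 0);
    }

    // Command buffers pull a block from the "ready" pool, or a new one is created on demand.
    Block request_block()
    {
        if (!ready.empty())
        {
            Block block = ready.back();
            ready.pop_back();
            block.offset = 0; // Wink away the old contents.
            return block;
        }
        ++total_created;
        return Block{ block_size, 0 };
    }

    // Called when a block is exhausted or the command buffer is submitted.
    // The GPU might still read from the block, so it is parked in the current frame context.
    void recycle_block(Block block) { pending[frame_index].push_back(block); }

    // Assumption: the GPU is done with the frame context we rotate to.
    void next_frame()
    {
        frame_index = (frame_index + 1) % pending.size();
        for (const Block &block : pending[frame_index])
            ready.push_back(block);
        pending[frame_index].clear();
    }

    size_t created() const { return total_created; }

private:
    size_t block_size;
    std::vector<std::vector<Block>> pending;
    std::vector<Block> ready;
    size_t frame_index = 0;
    size_t total_created = 0;
};
```

After a couple of frames the pool reaches a steady state where no new buffers need to be created, as long as the per-frame load is roughly stable.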

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory can be accessed over PCI-e at a far slower rate. For staging data like vertex, index and uniform buffers, it might be reasonable to assume that we should copy the data from the CPU side to the GPU side and let the GPU consume the streamed data in fast VRAM. Granite supports two modes: one where we let the GPU read the data directly from HOST_VISIBLE memory, and one where we automatically perform staging buffer copies from the CPU buffer over to the GPU.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on persistently allocated buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.
cmd->draw_indexed();

// Pointers are now invalidated.
device.submit(cmd);

The allocation functions are just light wrappers which allocate, and bind the buffer at the appropriate offset. It’s perfectly possible to roll your own linear allocation system, e.g. you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.

Conclusion

I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.

A tour of Granite’s Vulkan backend – Part 3

Shaders and descriptor sets

This is part 3 of a blog series I’m writing on Granite‘s Vulkan backend. In this episode we are looking at how we deal with shaders and descriptor sets. At this point in our design process, there are many, many choices to make. Especially descriptor sets need to be carefully considered.

Hash all the things

A theme we start to see now is hashmaps and lazy creation of objects. One thing you run into with Vulkan’s pipeline-related types is how much work it is to be explicit all the time. The amount of information we need to provide is staggering. I believe it is not healthy for mind and soul to work at low levels here except in special cases, and we should aggressively hide away detail where we can. There is naturally a clock cycles vs. sanity tradeoff to be made here.

You can argue that the lines between high-level GL/D3D11-style design and Granite’s model are quite blurred. The (mental) price to pay to be explicit is just not worth it in my opinion. I will try to explore the obvious alternatives here and provide more context why the design is the way it is.

04 – Shaders and pipeline layouts

The first step in creating a pipeline is of course, to create a VkShaderModule from our SPIR-V code. This is a no-brainer, but next we need a pipeline layout, which in turn requires VkDescriptorSetLayouts. The sample is here https://github.com/Themaister/Granite-MicroSamples/blob/master/04_shaders_and_programs.cpp.

Rather than manually declaring the pipeline layout like a caveman I think using reflection to automatically generate layouts is a good idea. There is no reason for users to copy information which exists in the shaders already. For the reflection, I use SPIRV-Cross. If we never need to compile SPIR-V in runtime (game engine scenario), there is no reason why we cannot shift the reflection step to off-line as well and just pass the side-band data along to remove a runtime dependency. I never got as far as building a nice off-line SPIR-V baking pipeline, so I just compile GLSL on the fly with shaderc. However, the interface in the Vulkan backend just consumes raw SPIR-V.

A common mistake beginners tend to make is to think that names are important in binding interfaces. This is a mistake carried over from the GL and D3D11 days. The only things we should care about are descriptor sets, bindings and location decorations as well as push constant use. This is the only semantic information we need to create binding interfaces, i.e. pipeline layouts and pipelines.

A pipeline layout in Vulkan needs to know all shader stages a-la GL programs, so we also need a step to combine shaders into a Vulkan::Program. Here we take the union of reflection information and request handles for Vulkan::DescriptorSetAllocator and Vulkan::PipelineLayout. This is hashed, but there is no performance concern here since we should do all of this work in load time when possible anyways. These handles are all owned internally in Vulkan::Device, and there is no reason to worry about object lifetime for these objects.
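Taking the union of reflection information boils down to OR-ing bitmasks together. A rough sketch, with made-up types (StageReflection, CombinedLayout and merge_stage() are hypothetical and far smaller than what a real reflection result would contain):

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned NumDescriptorSets = 4;

// Hypothetical per-shader reflection result: one bit per binding in each set.
struct StageReflection
{
    uint32_t sampled_image_mask[NumDescriptorSets] = {};
    uint32_t uniform_buffer_mask[NumDescriptorSets] = {};
    uint32_t push_constant_size = 0;
};

// Hypothetical combined layout for the whole program.
struct CombinedLayout
{
    uint32_t sampled_image_mask[NumDescriptorSets] = {};
    uint32_t uniform_buffer_mask[NumDescriptorSets] = {};
    uint32_t descriptor_set_mask = 0;
    uint32_t push_constant_size = 0;
};

inline void merge_stage(CombinedLayout &layout, const StageReflection &stage)
{
    for (unsigned set = 0; set < NumDescriptorSets; set++)
    {
        layout.sampled_image_mask[set] |= stage.sampled_image_mask[set];
        layout.uniform_buffer_mask[set] |= stage.uniform_buffer_mask[set];

        // Stages must agree on the descriptor type for a given (set, binding).
        assert((layout.sampled_image_mask[set] & layout.uniform_buffer_mask[set]) == 0);

        if (layout.sampled_image_mask[set] | layout.uniform_buffer_mask[set])
            layout.descriptor_set_mask |= 1u << set;
    }

    // The push constant range must cover the largest use in any stage.
    if (stage.push_constant_size > layout.push_constant_size)
        layout.push_constant_size = stage.push_constant_size;
}
```

Note that no names are involved anywhere, only sets, bindings and sizes. The combined structure is also trivial to hash, which is what makes the lazy lookup of layouts practical.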

I don’t think there is a reason to deviate far from this design unless you have a very specific scheme in mind with descriptor set allocation. As I’ll explore later, using bindless descriptors extensions or explicit descriptor set allocation could motivate use of a “standard” pipeline layout, in which case reflection gets kind of meaningless anyways.

05 – The binding model – embracing laziness

I never really had a problem with the old-school way of binding resources to binding slots. It just isn’t the part of the old APIs I felt were lacking, so Granite is kind of old school here, but it does have full consideration for descriptor sets and I removed any impedance mismatch with Vulkan (i.e. no translation needed to bridge between Granite and Vulkan). E.g.:

cmd->set_storage_buffer(set, binding, *resource);
cmd->set_texture(set, binding, resource->get_view(), Vulkan::StockSampler::LinearClamp);

The old binding models in GL/D3D11 have flat binding spaces with no separation of per-frame, per-material, or per-draw bindings. In Granite I wanted to take full advantage of the descriptor set feature where we can assign some kind of “frequency” and relation between bindings. Here is an example to illustrate how it is used: https://github.com/Themaister/Granite-MicroSamples/blob/master/05_descriptor_sets_and_binding_model.cpp.

In draw time, we can use the current pipeline layout and pull in the binding points which are active and make sure we bind descriptor sets with the correct resources. This is actually hot code, so I spent time designing a nice system here which tries to be as optimal as possible, given these restrictions.

Because of mobile, we need some conservative limits. I use 4 descriptor sets and 16 (dense) binding points per set (minimum spec of Vulkan). This allows for fairly compact pipeline layout descriptions, and we can loop over bitsets to look at resources. This is also just fine for my use cases.
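With at most 16 bindings per set, the active bindings of a set fit comfortably in one 32-bit mask, and walking them is a tiny loop. A sketch of the idea, where for_each_bit() and active_bindings() are helper names made up for this example (a real implementation would probably use a count-trailing-zeros intrinsic rather than shifting one bit at a time):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned NumBindings = 16;

// Calls func(bit) for every set bit in mask, from lowest to highest.
template <typename Func>
inline void for_each_bit(uint32_t mask, const Func &func)
{
    for (unsigned bit = 0; mask != 0; ++bit, mask >>= 1)
    {
        if (mask & 1u)
            func(bit);
    }
}

// Collects the binding indices which are active in a descriptor set.
inline std::vector<unsigned> active_bindings(uint32_t binding_mask)
{
    assert((binding_mask >> NumBindings) == 0 && "only 16 bindings per set");
    std::vector<unsigned> result;
    for_each_bit(binding_mask, [&](unsigned bit) { result.push_back(bit); });
    return result;
}
```

In draw time, the callback would look at the resource bound to that slot instead of pushing to a vector, so only the bindings the shaders actually use are ever touched.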

When it comes to allocation of descriptor sets themselves, I think I have a very different approach to most. A Vulkan::DescriptorSetAllocator is represented as:

  • The VkDescriptorSetLayout
  • A bunch of VkDescriptorPools which can only allocate VkDescriptorSets of this set layout. Pools are added on-demand.
  • A pool of unused VkDescriptorSets which are already allocated and can be freely updated.
  • A temporary hashmap which keeps track of which descriptor sets have been requested recently. This allows us to reuse descriptor sets directly. In the ideal case, we almost never actually need to call vkUpdateDescriptorSets. We end up with hash -> get VkDescriptorSet -> vkCmdBindDescriptorSets. When a descriptor set has not been used for a couple of frames (8), we assume that it is no longer relevant, and the set is recycled, and some other descriptor set can reuse it and just call vkUpdateDescriptorSets. We definitely do not want to keep track of when any buffer or image resource is destroyed, and recycle early. That’s tracking hell which slows everything down.

The temporary hashmap is a data structure I’m quite happy with. It’s used for a few other resources as well. See https://github.com/Themaister/Granite/blob/master/util/temporary_hashmap.hpp for the implementation.
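For readers who don’t want to dig through the real implementation, the behavior can be approximated with a standard hashmap plus a "last used" frame counter. This is a simplified sketch and not the code in the linked header: TemporaryCache and its methods are hypothetical names, and std::unordered_map is swapped in for the custom data structure.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cache where entries expire after not being requested for max_age frames.
template <typename T>
class TemporaryCache
{
public:
    explicit TemporaryCache(unsigned max_age_frames) : max_age(max_age_frames)
    {
        assert(max_age_frames > 0);
    }

    // Hot path: look up by hash and mark the entry as used this frame.
    T *request(uint64_t hash)
    {
        auto itr = entries.find(hash);
        if (itr == entries.end())
            return nullptr;
        itr->second.last_used_frame = current_frame;
        return &itr->second.value;
    }

    // Cold path: register a freshly updated object under this hash.
    T *emplace(uint64_t hash, T value)
    {
        Entry &entry = entries[hash];
        entry.value = std::move(value);
        entry.last_used_frame = current_frame;
        return &entry.value;
    }

    // Once per frame: move stale entries to the vacant list, ready to be reused.
    void begin_frame()
    {
        ++current_frame;
        for (auto itr = entries.begin(); itr != entries.end();)
        {
            if (current_frame - itr->second.last_used_frame >= max_age)
            {
                vacant.push_back(std::move(itr->second.value));
                itr = entries.erase(itr);
            }
            else
                ++itr;
        }
    }

    // Grab a recycled object (e.g. a descriptor set which can be updated again).
    bool pop_vacant(T &out)
    {
        if (vacant.empty())
            return false;
        out = std::move(vacant.back());
        vacant.pop_back();
        return true;
    }

private:
    struct Entry
    {
        T value{};
        uint64_t last_used_frame = 0;
    };
    std::unordered_map<uint64_t, Entry> entries;
    std::vector<T> vacant;
    uint64_t current_frame = 0;
    uint64_t max_age;
};
```

For descriptor sets, T would be the VkDescriptorSet handle and max_age would be 8. On a miss, we pop a vacant set (or allocate a new one from the pools), update it, and emplace it under the new hash.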

On certain GPUs, allocating descriptor sets is, or at least used to be, very costly. The descriptor pools might not be implemented as true pools (sigh …), so every vkAllocateDescriptorSets would mean a global heap allocation, absolutely horrible for performance. This is the reason I’m not a big fan of the “one large pool” design. In this model, we just allocate a massive VkDescriptorPool, and we just allocate from that, for any kind of descriptor set. This means recycling VkDescriptorSet handles over many frames is impractical. The intended use pattern is to call vkResetDescriptorPool and allocate new descriptor sets which are only valid for one frame at a time, just like command buffers. There is also the problem of knowing how to balance the descriptor load for these massive pools, what’s the ratio of image descriptors vs uniform buffer descriptors, etc. With per-descriptor set layout allocators, there is zero guess work involved.

Alternative design – Bindless

Bindless is all the rage right now. The only real complaint I have is that it’s only supported on desktop and requires an EXT extension. It also means writing shaders in a very specific way. I don’t really need it for my use cases, but bindless enables certain complex algorithms which benefit from accessing a huge set of resources dynamically.

Alternative design – persistent explicit VkDescriptorSets

An alternative is exposing descriptor sets directly and only allow users to bind descriptor sets rather than individual resources. The API user would need to build the sets manually. While this is an idea, I think there are too many hurdles to make it practical.

  • We need to know and declare the target imageLayout of textures up front. This is obvious 99% of the time (e.g. a group of material textures which are SHADER_READ_ONLY_OPTIMAL), but in certain cases, especially with depth textures, things can get rather ambiguous. This does seem to me like an API design fault. It is unclear why this information is needed.
  • Some resources are completely transient in nature and it does not make sense to place them in persistent descriptor sets. The perfect example here is uniform buffers. In later samples, we’ll look at the linear allocator system for transient data.
  • Some resources depend on the frame buffer, i.e. input attachments. Baking descriptor sets for these resources is not obvious, since we need to know the combination pipeline layout + frame buffer, which should have nothing to do with each other.
  • We need to know the descriptor set layout (and by extension, the shaders as well) up-front. This is problematic if resources are to be used in more than one shader. The common fix here is to settle on a “standard” pipeline layout so we can decouple shaders and resources. This usually means a lot of padding and redundant descriptor allocations instead. We have a limited amount of descriptor sets when targeting mobile (4). We do not have the luxury of splitting every individual “group” of resources into their own sets, some combinatorial effects are inevitable, making persistent descriptor sets less practical. On desktop, 8 sets is the norm, so that might be something to consider.
  • Hybrid solutions are possible, but complexity is increased for little obvious gain.

Conclusion

I’m happy with my design. It’s very easy to use, but there is a CPU price I’m willing to pay and I honestly never saw it in the profiler. I think resource binding models are cases where shaving overhead away will shave your sanity away as well, at least if you want to be compatible with a wide range of hardware. It’s much easier if you only cater to high-end desktop where bindless can be deployed.

… up next!

Next up we will explore the linear allocators for uniform, vertex, index and staging data.

A tour of Granite’s Vulkan backend – Part 2

The life and death of objects

This is part 2 in a series where I explore Granite‘s Vulkan backend. See part 1 for an introduction. In this blog entry we will dive into code, and we will start with the basics. Our focus in this entry will be to discuss object lifetimes and how Granite deals with the Vulkan rule that you cannot delete objects which are in use by the GPU.

Sample code structure

I will be referring to concrete code samples from here on out. I have started a small code repository which contains all the samples. See README.md for how to build, but you won’t need to run the samples to understand where I’m going with these samples. Stepping through the debugger can be rather helpful however.

Sample 01 – Create a Vulkan device

Before we can do anything, we must create a VkDevice. This aspect of Vulkan is quite dull and full of boilerplate, as is the setup code for any graphics API. There is not a lot to cover from an API design perspective, but there are a few things to mention. The sample code for this part is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/01_device_creation.cpp

The API for this is pretty straight forward. I decided to split up how we load the Vulkan loader library, since there are two main use cases here:

  • User wants Granite to load libvulkan.so/dll/dylib from standard locations and bootstrap from there.
  • User wants to load an already provided function pointer to vkGetInstanceProcAddr. This is actually the common case, since GLFW loads the Vulkan loader for you dynamically and you can just use the GLFW provided glfwGetInstanceProcAddr to bootstrap yourself. The volk loader has support for this.

To create the instance and device, we need to do the usual song and dance of creating a VkInstance and VkDevice:

  • Set up Vulkan debug callbacks
  • Identify and enable relevant extensions
  • Enable Vulkan validation layers in debug builds
  • Find appropriate VkQueues to cover graphics, async compute, transfer

Vulkan::Context and Vulkan::Device

The Context owns the VkInstance and VkDevice, and Vulkan::Device borrows a VkDevice and manages the big objects which are created from a VkDevice. It is possible to have multiple Vulkan::Device on top of a VkDevice, but we end up sharing the VkQueues and the global heaps for that device. This is a very nice property of Vulkan, since it allows frontend/backend systems like RetroArch/libretro to share a VkDevice without hidden global state leaking across the API boundary, which is a huge problem with legacy APIs like GL and D3D11.

Note that this sample, and all other samples in this chapter are completely headless. There is no WSI involved. Vulkan is really nice in that we don’t need to create window system contexts to do any GPU work.

02 – Creating objects

Creating new resources in a graphics API should be very easy, and here I spent a lot of time on convenience. Creating images and uploading data to them in raw Vulkan is a lot of work, since there are so many things you have to think about: creating staging buffers, copying from them, deferring deletion of the staging buffer, deciding whether or not to copy on the transfer queue, emitting semaphores to transfer ownership to the graphics queue, creating image views, and many other things which are very painful to write. Just creating an image in a solid way is several hundred lines of code. Fortunately, this kind of code is very easy to wrap in an API. See sample: https://github.com/Themaister/Granite-MicroSamples/blob/master/02_object_creation.cpp, where we create a buffer and image. I think the API is about as simple as you can make it while keeping a reasonable amount of flexibility.

Memory management

When we allocate resources, we allocate them from Granite’s heap allocator for Vulkan. If I were starting Granite today, I would just use AMD’s Vulkan Memory Allocator, but it did not exist at the time I designed my allocator, and I’m pretty happy with my design as it stands. Maybe if I need de-fragmentation in the future or some really complex memory management strategy, I’ll have to rethink and use a mature library.

To get a gist of the algorithm, Granite allocates 64 MB chunks, which are split into 32 chunks. Those 32 chunks can then be subdivided into 32 smaller chunks, etc., all the way down to little 256-byte chunks. I made a cute little algorithm to allocate efficiently from these blocks with CTZ (count trailing zeros) operations and friends. It’s a classic buddy allocator, but you have 32 buddies.
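
To illustrate the CTZ idea, here is a minimal sketch of one level of such a 32-way allocator. It is not Granite’s actual code: the Block32 type and its method names are hypothetical, and a real allocator would nest these blocks and track sizes and offsets as well.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single level of a 32-way allocator.
// Bit i set in free_mask means sub-block i is free.
struct Block32
{
    uint32_t free_mask = ~0u;

    // Portable count-trailing-zeros. The input must be non-zero.
    static int ctz(uint32_t v)
    {
        int n = 0;
        while ((v & 1u) == 0)
        {
            v >>= 1;
            n++;
        }
        return n;
    }

    // Returns the index of the lowest free sub-block, or -1 if the block is full.
    int allocate()
    {
        if (free_mask == 0)
            return -1;
        int index = ctz(free_mask);
        free_mask &= free_mask - 1; // Clears the lowest set bit.
        return index;
    }

    // Marks a previously allocated sub-block as free again.
    void release(int index)
    {
        assert(index >= 0 && index < 32);
        assert((free_mask & (1u << index)) == 0);
        free_mask |= 1u << index;
    }
};
```

Finding a free slot is a single bit-scan rather than a search, which is what makes this kind of structure cheap. In practice the loop in ctz() would be replaced with a compiler intrinsic.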

There are also dedicated allocations. I use VK_KHR_dedicated_allocation to query if an image should be allocated with a separate vkAllocateMemory rather than being allocated from the heap. This is generally useful when allocating large frame buffers on certain architectures. Also, for allocations which exceed 64 MB, dedicated allocations are used.

Memory domains

A nice abstraction I made is that rather than dealing with memory types like DEVICE_LOCAL, HOST_VISIBLE, and the combination of all the possible types, I declare up-front where I like my buffers and images to reside. For buffers, there are 4 use cases:

  • Vulkan::BufferDomain::Device – Must reside on DEVICE_LOCAL_BIT memory. May or may not be host visible (discrete vs integrated GPUs).
  • Vulkan::BufferDomain::Host – Must be HOST_VISIBLE, prefer not CACHED. This is for uploads to GPU.
  • Vulkan::BufferDomain::CachedHost – Must be HOST_VISIBLE and CACHED. Falls back to non-cached, but should never happen. Might not be COHERENT. Used for readbacks from GPU.
  • Vulkan::BufferDomain::LinkedDeviceHost – HOST_VISIBLE and DEVICE_LOCAL. This maps to AMD’s pinned PCI mapping, which is restricted to 256 MB. I don’t think I’ve ever actually used it, but it’s a niche option if I ever need it.
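
The mapping from a domain to a concrete memory type could be sketched roughly like this. This is an illustration only, not Granite’s implementation: pick_memory_type and find_type are hypothetical helpers, and the flag constants simply mirror the values of VkMemoryPropertyFlagBits so the sketch compiles without the Vulkan headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same values as the corresponding VK_MEMORY_PROPERTY_*_BIT flags.
constexpr uint32_t DEVICE_LOCAL = 0x1;
constexpr uint32_t HOST_VISIBLE = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED = 0x8;

enum class BufferDomain { Device, Host, CachedHost, LinkedDeviceHost };

// Returns the first memory type allowed by type_bits which has all
// `required` flags and none of the `avoid` flags, or -1 if none matches.
inline int find_type(const std::vector<uint32_t> &types, uint32_t type_bits,
                     uint32_t required, uint32_t avoid)
{
    for (size_t i = 0; i < types.size(); i++)
    {
        bool allowed = (type_bits & (1u << i)) != 0;
        if (allowed && (types[i] & required) == required && (types[i] & avoid) == 0)
            return int(i);
    }
    return -1;
}

// Hypothetical domain-to-memory-type policy following the list above.
inline int pick_memory_type(const std::vector<uint32_t> &types, uint32_t type_bits,
                            BufferDomain domain)
{
    int index = -1;
    switch (domain)
    {
    case BufferDomain::Device:
        return find_type(types, type_bits, DEVICE_LOCAL, 0);
    case BufferDomain::Host:
        // Prefer non-cached memory, fall back to any host visible type.
        index = find_type(types, type_bits, HOST_VISIBLE, HOST_CACHED);
        return index >= 0 ? index : find_type(types, type_bits, HOST_VISIBLE, 0);
    case BufferDomain::CachedHost:
        // Prefer cached memory, fall back to any host visible type.
        index = find_type(types, type_bits, HOST_VISIBLE | HOST_CACHED, 0);
        return index >= 0 ? index : find_type(types, type_bits, HOST_VISIBLE, 0);
    case BufferDomain::LinkedDeviceHost:
        return find_type(types, type_bits, HOST_VISIBLE | DEVICE_LOCAL, 0);
    }
    return -1;
}
```

Here `types` stands in for the propertyFlags of each memory type reported by the device, and `type_bits` for the memoryTypeBits of the resource being allocated.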

When uploading initial data to a buffer, and Device is used, we can take advantage of integrated GPUs which share memory with the CPU. In this case, we can avoid any staging buffer, and just memcpy data straight into the new DEVICE_LOCAL memory. Don’t just blindly use staging buffers when you don’t need them. Integrated GPUs will generally have memory types which are both DEVICE_LOCAL and HOST_VISIBLE.

Mapping host memory

While not present in the sample, it makes sense to discuss how we map Vulkan memory to the CPU. A good rule of thumb in general is to keep host memory persistently mapped. vkMapMemory and vkUnmapMemory are quite expensive, especially on mobile, and we can only have one mapping of a VkDeviceMemory (64 MB with tons of suballocations!) active at any time. Rather than calling vkMapMemory/vkUnmapMemory all the time, we implement map/unmap in Vulkan::Device by checking if we need to perform cache maintenance, with no extra CPU cost. On map(), for example, we need to call vkInvalidateMappedMemoryRanges if the memory type is not COHERENT, and on unmap(), we call vkFlushMappedMemoryRanges if the memory is not COHERENT. This is fairly common on mobile when doing readbacks from GPU, since we need CACHED, but we might not get COHERENT. Granite’s backend abstracts all of this away.

Physical and transient image memory

A very powerful feature of Vulkan is the support for TRANSIENT images. These images do not have to be backed by physical memory in Vulkan, and are very nice on tile-based mobile renderers.

In Granite I fully support transient images where I can pass in two different domains for images, Physical and Transient. Since Transient images are generally used for throw-away scenarios, there is a convenient method in Vulkan::Device::get_transient_attachment() to simply request a transient image with a format and resolution for rendering. Transient images are generally never created manually since they are so easy to manage internally.

Handle types

There are many ways to abstract handle types in general, but I went for my own “smart pointer” variant, the trusty intrusive ref-counted pointer. It can basically be thought of as a std::shared_ptr, but simpler, and we can pool the allocations of handles very nicely. How we design these handle types is not really important for Vulkan though, but I figured this point would generate some questions, so I’m addressing it here. See https://github.com/Themaister/Granite/blob/master/util/intrusive.hpp for details.
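
For readers who have not seen the pattern, a bare-bones intrusive pointer looks roughly like this. This is a simplified, single-threaded sketch and not the code in intrusive.hpp: the RefCounted, IntrusivePtr and Buffer names are made up for illustration, and object pooling and atomics are left out.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical base class: the reference count lives inside the object itself,
// so no separate control block is needed.
class RefCounted
{
public:
    void add_ref() { ++count_; }
    // Returns true when the last reference was dropped.
    bool release_ref() { return --count_ == 0; }
    size_t ref_count() const { return count_; }

private:
    size_t count_ = 0;
};

template <typename T>
class IntrusivePtr
{
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(T *ptr) : ptr_(ptr) { if (ptr_) ptr_->add_ref(); }
    IntrusivePtr(const IntrusivePtr &other) : ptr_(other.ptr_) { if (ptr_) ptr_->add_ref(); }
    IntrusivePtr(IntrusivePtr &&other) noexcept : ptr_(other.ptr_) { other.ptr_ = nullptr; }
    // Taking the argument by value covers both copy and move assignment.
    IntrusivePtr &operator=(IntrusivePtr other) noexcept
    {
        std::swap(ptr_, other.ptr_);
        return *this;
    }
    ~IntrusivePtr() { reset(); }

    // Drops our reference and deletes the object if we held the last one.
    void reset()
    {
        if (ptr_ && ptr_->release_ref())
            delete ptr_;
        ptr_ = nullptr;
    }

    T *operator->() const { return ptr_; }
    explicit operator bool() const { return ptr_ != nullptr; }

private:
    T *ptr_ = nullptr;
};

// Example payload type, purely for illustration.
struct Buffer : RefCounted
{
    explicit Buffer(int size_) : size(size_) {}
    int size;
};
```

Because the count is embedded in the object, the handle is a single pointer wide, and the final `delete` is the natural place to hook in a pool allocator instead.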

03 – Deferring deletions of GPU resources

Now we’re getting into topics where there can be significant design differences between Vulkan backends. My design philosophy for a middle-level abstraction is convenient, deterministic and good enough at the cost of a theoretical optimal solution.

A common theme you’ll find in Granite is the use of RAII. Once lifetimes of objects end, we automatically clean up resources. This is nothing new to C++ programmers, but the big problem in Vulkan is we’re not managing just memory on CPU with new/delete. We actually need to carefully control when things are deleted, since the GPU might be using the resources we are freeing. The strategy here will be to defer any deletions. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/03_frame_contexts.cpp

The frame context

In order to handle object lifetimes in Granite, I have a concept of a frame context. The frame context is responsible for holding all resources which belong to a “frame” of work. Normally this corresponds to a frame of work between AcquireNextImage and QueuePresent, but it is not tightly coupled. The Device has an array of frame contexts, usually 2 of them to allow double-buffering between CPU and GPU (and 3 on Android, because TBDR GPUs are a bit more pipelined and tend to prefer a little more buffering). The frame context is basically a huge data structure which holds data like:

  • Which VkFences must be waited on to make sure that all GPU work associated with this queue is done. This is the gatekeeper which holds all our recycling and deletions back.
  • Command pools for each worker thread and queue types.
  • VkBuffers, VkImages, etc, to be deleted once the fences signal.
  • Memory allocations from heap allocator to be freed.
  • … and various other resources.

Basically, we have a central place to chuck any things which need to happen “later”, when the GPU is guaranteed to be done with this frame.
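
Stripped of all Vulkan details, the idea can be modelled like this. It is a toy sketch, not Granite’s code: this Device is a stand-in for Vulkan::Device, defer() is a hypothetical name, and the fence waits are only indicated by comments.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Everything which must happen "later" for one frame of work.
struct FrameContext
{
    std::vector<std::function<void()>> deferred;

    // Runs and clears all pending destruction callbacks.
    void flush()
    {
        for (auto &fn : deferred)
            fn();
        deferred.clear();
    }
};

// Toy stand-in for the device, cycling through N frame contexts.
class Device
{
public:
    explicit Device(size_t num_frames) : frames_(num_frames) {}

    // Queue a destruction callback on the current frame context.
    void defer(std::function<void()> destroy)
    {
        frames_[index_].deferred.push_back(std::move(destroy));
    }

    // Move to the next context and reclaim whatever it held from last time around.
    void next_frame_context()
    {
        index_ = (index_ + 1) % frames_.size();
        // A real backend waits for this context's VkFences here.
        frames_[index_].flush();
    }

    // The GPU is idle after this, so every context can be cleaned up at once.
    void wait_idle()
    {
        // A real backend waits for the GPU to go idle here.
        for (auto &frame : frames_)
            frame.flush();
    }

private:
    std::vector<FrameContext> frames_;
    size_t index_ = 0;
};
```

With two contexts, a resource queued for deletion in frame N is only destroyed when the CPU comes back around to the same context two next_frame_context() calls later, by which point the fences guarantee the GPU is done with it.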

As a special consideration, the big fat “make it go slow” call Device::wait_idle() will automatically clean up everything in one go since it knows at this instant the GPU is not doing anything.

Command buffer lifetime compromise

To make the frame based cleanup work in practice, we need to simplify our notion of what command buffers can do. In Vulkan, we have the flexibility to record command buffers once and reuse them at will at any time. This creates some complications. First of all, it throws the idea of a per-frame command pool out of the window. We can never reset the command pool in that case, since there will be free-floating command buffers out there which might be used later. Command pools work their best in Vulkan when you don’t allow individual command buffers to be freed.

If we have reusable command buffers, we also have the problem of object lifetimes. We end up with a painful situation where GPU resources must be retained until all command buffers which reference them are discarded. This leads to a really difficult situation where you have two options – deep reference-counting per command buffer or just pray all of this works out and make sure objects are kept alive as long as necessary. The former option is very costly and bug-prone, and the latter is juggling with razor blades too much for my taste where a large, meaningless burden is placed on the user.

I generally don’t think reusable command buffers are a worthwhile idea, at least not for interactive applications where we’re not submitting a static workload to the GPU over and over. There just aren’t many reasonable use-cases where this gives you anything meaningful. The avenues where you can submit the same calls over and over are maybe restricted to post-processing, but recording a few draw calls which render a few full-screen quads (or compute dispatches for the cool kids) is not exactly where your draw call overhead is going to matter.

I find that beginners obsess over the idea of aggressive reuse a little too much. In the end I feel it is misguided, and there are many better places to spend your time. Recording command buffers itself in Vulkan is super efficient.

My ideal use for command buffers is where they are light-weight handles which all source their memory linearly from a common command pool. No reuse, so we use ONE_TIME_SUBMIT_BIT when beginning the command buffer and TRANSIENT_BIT on the pool.

In Granite, I greatly simplified the idea of command buffers into transient handles which are requested, recorded and submitted. They must be recorded and submitted in the same frame context you requested the command buffer. This way we remove the whole need for keeping track of objects per-command buffers. Instead, we just tie the resource destruction to a frame context, and that’s it. No need for complicated tracking, it’s very efficient, but we risk destroying the object a little later than is theoretically optimal. This could potentially increase memory pressure in certain situations, but I think the trade-off I made is good. If needed, I can always add explicit “delete this resource now, I know it’s safe” methods, but I haven’t found any need for this. This would only be important if we are truly memory bound.

A design decision I made was that I would never want to do internal ref-counts for resources like images and buffers, and the design would be forced to not rely on fine-grained tracking which you would typically find in legacy API implementations. Any ref-counted operations should be immediately visible to API users and never be hidden behind API implementations. In fact, command buffer arguments for binding resources are plain references or pointers, not ref-counted types.

The memory pressure of very large frames

The main flaw of this design is that there might be cases where there is one spurious frame context that has extreme use of creation and deletions of resources. A prime example here is loading screens or similar. Since Vulkan resources are not freed until the frame context itself is complete, we cannot recycle memory within a frame unless we explicitly iterate the frame context with Device::next_frame_context(). This tradeoff means that the Granite backend does not have to heuristically stall the GPU in order to reclaim memory at suitable times, which adds a lot of complexity and ruins the determinism of Granite’s behavior.

… up next!

In the next episode of Granite shenanigans we will look at the shader pipeline where we discuss VkShaderModule, VkDescriptorSetLayout, VkPipelineLayout and VkPipeline objects.

A tour of Granite’s Vulkan backend – Part 1

Introduction

Since January 2017, I’ve been working on my little engine project, which I call Granite. It’s on Github here. Like many others, I felt I needed to write a little engine for myself to fully learn Vulkan and I needed a test bed to implement various graphics techniques. I’ve been steadily working on it since then and used it as the backbone for many side-projects, but I think for others its value right now is for teaching Vulkan concepts by example.

A while back I wrote a blog about my render graph implementation. The render graph sits on top of the Vulkan implementation, but in this series I would like to focus on the Vulkan layer itself.

The motivation for a useful mid-level abstraction

One thing I’ve noticed in the Twitter-sphere and various panel discussions over the last years is the idea of the mid-level abstraction. Where GL and D3D11 are too high-level and inflexible for our needs in non-trivial applications, Vulkan and D3D12 tend to overshoot in low-level complexity, with the goal of being as low-level and explicit as possible while staying GPU architecture/OS-portable. I think everyone agrees that having a good mid-level abstraction is important, but the problem we always have when designing these layers is where to make the right trade-offs. There will always be those who chase maximum possible performance, even if complexity when using the abstraction shoots through the roof.

For Granite I always wanted to promote convenience, while avoiding the worst penalties in performance. The good old 80/20 rule basically. There are many, many opportunities in Vulkan to not do redundant CPU work, but at what cost? Is it worth architecting yourself into a diamond – a super solid, but in the end, inflexible implementation? I’m noticing a lot of angst in general around this topic, especially among beginners. A general fear of not chasing every last possible performance optimization because it “might be really important later” is probably why we haven’t seen a standard, mid-level graphics API yet in wide use.

I feel that the benefits you gain by designing for maximum possible CPU performance are more theoretical design exercises than practical ones. Even naive, straightforward, single-threaded Vulkan blows GL/GLES out of the water in CPU overhead in my experience, simply because we can pick and choose the extra work we need to do, but legacy driver stacks have built up cruft over a decade or more to support all kinds of weird use cases and heuristics. Add multi-threading on top of that, and then you can think about micro-optimizing API overhead, if you actually need it. I suspect you won’t even need multi-threaded Vulkan. I believe the real challenge with the modern APIs right now is optimizing GPU performance, not CPU.

Metal is getting a lot of praise for its successful trade-off in performance and usability. While I don’t know the API itself in detail to make a judgement (I know all the horrors of Metal Shading Language though cough), I get the impression that the mid-level abstraction is the abstraction level we should be working in 99% of the time.

I think Granite is one such implementation. I am not trying to propose that Granite is the solution, but it is one of them. The design space is massive. There just cannot possibly be a one true graphics API for all users. Rather than suggest you go out and use it directly, I will try to explain how I designed a Vulkan interface which is quite convenient to use and runs well on both desktop and mobile (very few projects consider both), at least for my use cases. Ideally, you should be inspired to make the mid-level abstraction that is right for you and your project. I have gone through a couple of iterations to get where I am now with the design, and used it for various projects, so I think it’s a good starting point at least.

The 3D-accelerated emulation use case

How Granite got started was actually the Vulkan backend in the Beetle PSX HW renderer. I wrote up a Vulkan backend, and emulators need very immediate and flexible ways of using graphics APIs. Information is generally known only at the last minute. Being able to implement such projects guided Granite’s initial design process quite a lot. This is also a case where legacy APIs are really painful since you really need the flexibility of modern APIs to do a good job with performance. There are a lot of state changes and draw calls on top of the CPU cost of emulation itself. Creating resources and modifying data on the GPU in weird ways is a common case in emulation, and many drivers simply don’t understand these usage patterns and we hit painful slow-paths everywhere. With Vulkan there is little to no magic, we just implement things how we want, and performance ends up far more predictable.

I think many forget that Vulkan is not just for big (AAA) game engines. We can successfully use it for all kinds of things. We just need the right abstractions and knowledge.

How the design and implementation will be explored

To start off, we will explore the design through commented code samples, which use only the Vulkan portion of Granite as a library. We will write concrete samples of code, and then go through how all of this works, and then discuss how things could be designed differently.

… up next!

I haven’t written up any samples yet, so it makes sense to stop here. Next time, we’ll start with some samples.

Recreating the tone filter from NieR:Automata

Audio tech in games is rarely particularly interesting. Sadly, most of it seems to have been made into a commodity over the last couple of decades. Once in a while, some games have very clever audio tech, and in this case it was NieR:Automata’s tone filter which caught my attention. The blog post explaining this tech is found here. If you haven’t played the game, it is highly recommended to read the post and watch the videos to understand what it’s doing. That saves me a lot of explanation. Their blog is very sparse on technical implementation details, but I wanted to try recreating it as there was just enough high-level detail in there to get me started.

Outside graphics, I’ve done a fair bit of audio programming in the past. It’s been too long since I did any significant audio DSP programming.

In short, the goal of the filter is to attempt to turn normal high-fidelity soundtracks into something with an 8-bit feel on-demand. Being able to introduce dynamic aspects to the music with a pure filter is interesting.

The tone filter as explained

The goal of the filter is to extract musical notes, and emphasize them. By having a few notes playing with a classic waveform like square or saw waves, we can recreate a retro 8-bit feel.

The blog describes 48 filters, spanning 4 octaves. Each octave in the (western music) scale is divided into 12 tones. What I deduced from this is that the 48 filters should be 48 very sharp bandpass filters.

Whiteboard description from blog

A theoretically perfect bandpass filter will output a pure sine wave if the tone exists, and nothing if it doesn’t.

The distortion is tough to do well, and I spent a lot of time fiddling with this. We want to try making the sine wave become something like a square wave. I tried many variants but I ended up with something very simple like x / (1 + abs(x)):

https://www.desmos.com/calculator/qvymx5qf8t

If you’ve done HDR tone-mapping, this formula will look very familiar to you. Sometimes cross-domain knowledge comes in handy.

After distorting, there is the levelling stage, which I spent a lot of time fiddling with. Basically, if we run this system as is, we end up with a ton of noise in the signal with all the chromatic tones playing over each other. Needless to say, this sounded absolutely terrible.

There are little to no details on how this should be implemented, so I tried a very crude model which seems to work reasonably well. Basically, each of the 48 channels has a running power estimate which is computed right after the filter. We can compare that against the running power estimate of the unfiltered audio. This lets us get an idea how much of the audio energy is concentrated into each individual tone. If the energy is low enough, it falls off in a power-of-4 fashion to avoid leaking in audio from completely unrelated tones. Percussion sounds will generally have energy in almost the entire audio spectrum, and we need to filter that out as well as we can. If the ratio is too high, we just cap it. This is the fiddly part. There are a lot of magic constants to tweak to get it sounding pleasing.
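
To make the shape of this levelling concrete, here is one way it could be sketched. Everything specific in it is an assumption for illustration: the RunningPower type, the band_level function, the smoothing scheme and the two threshold constants are hypothetical and are not the tuned values from my implementation.

```cpp
// Exponentially smoothed estimate of signal power (mean of x^2).
// The smoothing factor alpha is a hypothetical tuning constant.
struct RunningPower
{
    explicit RunningPower(float alpha_) : alpha(alpha_) {}

    float update(float x)
    {
        power += alpha * (x * x - power);
        return power;
    }

    float alpha;
    float power = 0.0f;
};

// Maps the share of total energy found in one band to a level for that band.
// `low` and `high` are made-up thresholds.
inline float band_level(float band_power, float total_power,
                        float low = 0.05f, float high = 0.25f)
{
    if (total_power <= 0.0f)
        return 0.0f;

    float ratio = band_power / total_power;
    if (ratio >= high)
        return high; // Saturated, capped at the max threshold.
    if (ratio >= low)
        return ratio; // Over threshold, should be heard.

    // Below threshold: power-of-4 falloff, continuous at ratio == low.
    float t = ratio / low;
    return low * t * t * t * t;
}
```

A band holding half of the threshold share ends up 16 times quieter than a band sitting exactly at the threshold, which is what keeps unrelated tones and broadband percussion from leaking through.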

At the end we mix our mono output signal back into the original audio, and when it works well, it gives a nice harmonic edge. I believe it’s reasonably close to the original game now. Here’s an example from the NieR:Automata OST. I’m visualizing all the 48 bands, and the colors are:

  • Blue: Below threshold, severely muted.
  • Green: Over threshold, should be heard.
  • Red: Saturated, hitting max threshold.

The tones in one octave form one row, and the four octaves are stacked on top of each other. The top-left starts at A3 – 220 Hz. If you know some music theory, maybe you can figure out which key the tune is in? 🙂

Implementation

First we mix stereo down to mono. This is kind of trivial. Just take the average of left and right channels.

Ultra-sharp bandpass, resonance filters

I went through a few failed iterations to get here. My first attempts were to do all of this in the frequency domain with FFTs, but that plan failed very quickly. What I ended up with in the end was a simple biquad resonance filter. This filter is characterized by having two zeroes and two poles in DSP parlance, or in other words, FIR (finite impulse response) and IIR (infinite impulse response). In code, this would look something like:

y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]

In the Z-domain, this looks like

H(z) = n0 * (1 + z^-1 * n1/n0 + z^-2 * n2/n0) / (1 + z^-1 * d0 + z^-2 * d1)

The zeroes and poles occur at the roots of the numerator and denominator polynomials respectively. Basically, I designed the filter by deliberately placing zeroes and poles in the Z-domain, multiplying out the factored expressions and converting the result back to the normal FIR and IIR form.

I placed zeroes at DC (w = 0) and the Nyquist frequency (w = pi). The poles were placed very close to the unit circle at w = +/- 2 * pi * freq / samplerate, with amplitude 0.9999. Then I evaluated the filter response at the resonance frequency and adjusted the FIR portion of the filter so that we got an estimated unit gain at the resonance frequency.
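
Worked through in code, that design procedure could look something like the following. This is a sketch under my reading of the steps above rather than a copy of tone_filter.cpp: the Biquad struct and the function names are hypothetical, and it uses double precision for clarity.

```cpp
#include <cmath>
#include <complex>

// Coefficients in the same convention as the difference equation:
// y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]
struct Biquad
{
    double n0, n1, n2, d0, d1;
};

// Magnitude of H(z) evaluated on the unit circle at the given frequency.
inline double response_magnitude(const Biquad &f, double freq, double sample_rate)
{
    const double pi = 3.14159265358979323846;
    double w = 2.0 * pi * freq / sample_rate;
    std::complex<double> z1 = std::polar(1.0, -w); // z^-1
    std::complex<double> z2 = z1 * z1;             // z^-2
    std::complex<double> num = f.n0 + f.n1 * z1 + f.n2 * z2;
    std::complex<double> den = 1.0 + f.d0 * z1 + f.d1 * z2;
    return std::abs(num / den);
}

inline Biquad design_resonator(double freq, double sample_rate, double radius = 0.9999)
{
    const double pi = 3.14159265358979323846;
    double w = 2.0 * pi * freq / sample_rate;

    // Zeroes at z = 1 (DC) and z = -1 (Nyquist):
    // (1 - z^-1) * (1 + z^-1) = 1 - z^-2
    Biquad f = { 1.0, 0.0, -1.0, 0.0, 0.0 };

    // Conjugate poles at radius * exp(+/- j * w):
    // (1 - p * z^-1) * (1 - conj(p) * z^-1) = 1 - 2 * r * cos(w) * z^-1 + r^2 * z^-2
    f.d0 = -2.0 * radius * std::cos(w);
    f.d1 = radius * radius;

    // Scale the FIR part so the response at the resonance frequency is unity.
    double gain = response_magnitude(f, freq, sample_rate);
    f.n0 /= gain;
    f.n2 /= gain;
    return f;
}
```

For a 220 Hz band at 44.1 kHz, evaluating response_magnitude() one semitone up (about 233 Hz) already gives a response well below 0.1, which shows how sharp these filters are.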

Basically, the frequency response at the resonance frequency will be very close to dividing by zero, so near-infinite response, but not quite. Numerical stability can easily throw off the filter if we’re not careful. This is one of the major issues with IIR filters in general. I initially tried an 8-pole filter but it was impossible to get this stable even in FP64, so I just gave up and tried a simple biquad instead which worked just fine.

SIMD

Since we’re doing 48 IIR filters in parallel, this was a perfect case for SIMD optimizations. I made everything into a struct-of-arrays (SoA) form, and just vectorized the scalar IIR filter directly. Normally, small IIR filters are tricky to vectorize since there are inter-dependencies between samples, but not here.
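
The struct-of-arrays layout can be sketched in plain scalar C++ as follows. The FilterBank type and its member names are hypothetical; the real implementation uses intrinsics, but the data layout idea is the same.

```cpp
#include <array>
#include <cstddef>

constexpr size_t NumBands = 48;

// Struct-of-arrays layout: one array per coefficient and per state variable,
// so band i and band i + 1 sit next to each other in memory.
struct FilterBank
{
    std::array<float, NumBands> n0{}, n1{}, n2{}, d0{}, d1{};
    std::array<float, NumBands> y1{}, y2{}; // y[t - 1] and y[t - 2] per band.
    float x1 = 0.0f, x2 = 0.0f;             // Input history is shared by all bands.

    // Pushes one mono input sample through all bands.
    void process_sample(float x, std::array<float, NumBands> &out)
    {
        // Every iteration is independent of the others, so this loop maps
        // directly onto 4-wide or 8-wide SIMD lanes.
        for (size_t i = 0; i < NumBands; i++)
        {
            float y = n0[i] * x + n1[i] * x1 + n2[i] * x2 - d0[i] * y1[i] - d1[i] * y2[i];
            y2[i] = y1[i];
            y1[i] = y;
            out[i] = y;
        }
        x2 = x1;
        x1 = x;
    }
};
```

The dependency which normally makes IIR filters awkward to vectorize is between consecutive samples of one filter. Here we vectorize across the 48 filters instead, so that dependency never crosses SIMD lanes.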

I optimized the filter in NEON, SSE1 and AVX and got a very nice performance boost, more on that later.

This would have been a great case for ISPC, but I considered it too large a dependency for something simple like this.

Distortion

The distortion function must be nicely SIMD-friendly and not too expensive. I landed on the classic x/(1+abs(x)) operator. The divide can be done fast with reciprocal estimations. We didn’t need high accuracy.
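
In scalar form the operator is a one-liner. The drive parameter below is a hypothetical pre-gain added for illustration: the harder the input is driven, the closer a sine gets to a square wave.

```cpp
#include <cmath>

// Soft clipper: output is always inside (-1, 1) and approaches +/- 1 for large inputs.
// `drive` is a hypothetical pre-gain applied before the clipping.
inline float distort(float x, float drive = 1.0f)
{
    float v = x * drive;
    return v / (1.0f + std::fabs(v));
}
```

A SIMD version replaces the divide with a reciprocal estimate and a multiply, trading a little accuracy for speed.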

Slight low-pass

After we have mixed together the 48 distorted streams, we run a weak low-pass filter on top to remove some of the harshest harmonics. This is done with a trivial 1-pole IIR filter.
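
A 1-pole low-pass of this kind can be written as below. The OnePoleLowPass name and the choice of coefficient are assumptions for illustration, not the values I used.

```cpp
// y[t] = y[t - 1] + a * (x[t] - y[t - 1]), with 0 < a <= 1.
// Smaller values of `a` give a lower cutoff frequency.
struct OnePoleLowPass
{
    explicit OnePoleLowPass(float a_) : a(a_) {}

    float process(float x)
    {
        y += a * (x - y);
        return y;
    }

    float a;
    float y = 0.0f;
};
```

With a = 0.5, a unit step input produces 0.5, 0.75, 0.875 and so on, converging towards 1.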

Performance

I tested performance on a Ryzen 7 1800x @ 3.8 GHz as well as a high-end phone (Galaxy S9 Exynos) to measure NEON performance. The benchmark pushes 20 million white noise samples through the filter and then times the result. The test doesn’t take that long, so this should be assumed to be absolute peak performance without any thermal / power consideration. The results below are given in samples processed per second. Normal audio clips are 44.1 kHz, so 0.0441 M/s should correspond to 1x real-time performance. The C++ version is written without any intrinsics with -O3 -ffast-math. The SIMD versions are written with the standard intrinsics.

Chip                      C++        SSE        AVX         AArch64 NEON
Samsung Exynos 9810       1.8 M/s    -          -           6.8 M/s
Ryzen 7 1800x @ 3.8 GHz   3.6 M/s    7.1 M/s    11.5 M/s    -

Basically, we’re 100x realtime performance here, even on a mobile CPU, nice. I’m surprised how close the performance ended up when comparing SSE and NEON. I didn’t see any auto-vectorization activate in the C++ variant, so I wonder what is going on with just 2x scaling in SSE. I got similar results on MSVC and GCC for what it’s worth … NEON gets close to ideal 4x scaling though, nice.

This uses quite a bit of processing power, so we can’t run wild with effects like this right now. But I look forward to being able to take advantage of systems like this for even more precise operations in the future.


The original implementation probably does more work on more gimped CPU hardware (AMD Jaguar consoles), but 100x real-time is pretty fast in my book. 😉

Source

The implementation is out there, but don’t expect to be able to use it as-is. This is a hobby project after all.

https://github.com/Themaister/Granite/blob/master/audio/dsp/tone_filter.cpp

https://github.com/Themaister/Granite/blob/master/tests/tone_filter_bench.cpp

VST plugin

I implemented a simple VST plugin with builds for Windows and macOS, both 64-bit. Feel free to try it out. It’s ultra bare bones.

Windows 64-bit

macOS 64-bit