Modernizing Granite’s mesh rendering

Granite’s renderer is currently quite old school. It was written with 2017 mobile hardware in mind after all. Little to no indirect drawing, a bindful material system, etc. We’ve all seen that a hundred times before. Granite’s niche ended up exploring esoteric use cases, not high-end rendering, so it was never a big priority for me to fix that.

Now that mesh shading has started shipping and is somewhat proven in the wild, with several games shipping UE5 Nanite and Alan Wake II (which all rely on mesh shaders to not run horribly slow), it was time to make a more serious push towards rewriting the entire renderer in Granite. This has been a slow burn project that’s been haunting me for almost half a year at this point. I haven’t really had the energy to rewrite a ton of code like this in my spare time, but as always, holidays tend to give me some energy for these things. Video shenanigans have also kept me distracted this fall.

I’m still not done with this rewrite, but enough things have fallen into place that I think it’s time to write down my findings so far.

Design requirements

Reasonable fallbacks

I had some goals for this new method. Unlike UE5 Nanite and Alan Wake II, I don’t want to hard-require actual VK_EXT_mesh_shader support to run acceptably. Just thinking in terms of meshlets should benefit us in plain multi-draw-indirect (MDI) as well. For various mobile hardware that doesn’t support MDI well (or at all …), I’d also like a fallback path that ends up using regular direct draws. That fallback path is necessary to evaluate performance uplift as well.

What Nanite does to fallback

This is something to avoid. Nanite relies heavily on rendering primitive IDs to a visibility buffer, where attributes are resolved later. In the primary compute software rasterizer, this becomes a 64-bit atomic, and in the mesh shader fallback, a single primitive ID is exported to fragment stage as a per-primitive varying, where fragment shader just does the atomic (no render targets, super fun to debug …). The problem here is that per-primitive varyings don’t exist in the classic vertex -> fragment pipeline. There are two obvious alternatives to work around this:

Geometry shaders. Pass-through mode can potentially be used if all the stars align on supported hardware, but using geometry shaders should revoke your graphics programmer’s license.
Unroll a meshlet into a non-indexed draw. Duplicate primitive ID into 3 vertices. Use flat shading to pull in the primitive ID.

From my spelunking in various shipped titles, Nanite does the latter, and fallback rendering performance is halved as a result (!). Depending on the game, meshlet fallbacks are either very common or very rare, so real world impact is scene and resolution dependent, but Immortals of Aveum lost 5-15% FPS when I tested it.

The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading suggests rendering out a visibility G-Buffer using InstanceID (fed through some mechanism) and SV_PrimitiveID, which might be worth exploring at some point. I’m not sure why Nanite did not go that route. It seems like it would have avoided the duplicated vertices.

Alan Wake II?

Mesh shaders are basically a hard requirement for this game. It will technically boot without mesh shader support, but the game gives you a stern warning about performance, and they are not kidding. I haven’t dug into what the fallback is doing, but I’ve seen people posting videos demonstrating sub-10 FPS on a 1080 Ti. Given the abysmal performance, I wouldn’t be surprised if they just disabled all culling and draw everything in the fallback.

A compressed runtime meshlet format

While studying https://github.com/zeux/meshoptimizer I found support for compressed meshes, a format that was turned into a glTF EXT. It seems to be designed for decompressing on CPU (completely serial algorithm), which was not all that exciting for me, but this sparked an idea. What if I could decompress meshlets on the GPU instead? There are two ways this can be useful:

Would it be fast enough to decompress inline inside the mesh shader? This can potentially save a lot of read bandwidth during rendering and save precious VRAM.
Bandwidth amplifier on asset loading time. Only the compressed meshlet format needs to go over PCI-e wire, and we decompress directly into VRAM. Similar idea to GDeflate and other compression formats, except I should be able to come up with something that is way faster than a general purpose algorithm and also give decent compression ratios.

I haven’t seen any ready-to-go implementation of this yet, so I figured this would be my starting point for the renderer. Always nice to have an excuse to write some cursed compute shaders.

Adapting to implementations

One annoying problem with mesh shading is that different vendors have very different fast paths through their hardware. There is no single implementation that fits all. I’ve spent some time testing various parameters and observing what makes NV and AMD go fast w.r.t. mesh shaders, with questionable results. I believe this is the number 1 reason mesh shaders are still considered a niche feature.

Since we’re baking meshlets offline, the format itself must be able to adapt to implementations that prefer 32/64/128/256 primitive meshlets. It must also adapt nicely to MultiDrawIndirect-style rendering.

Random-access

It should be efficient to decode meshlets in parallel, and in complete isolation.

The format

I went through some (read: way too many) design iterations before landing on this design.

256 vert/prim meshlets

Going wide means we get lower culling overhead and emitting larger MDI calls avoids us getting completely bottlenecked on command stream frontend churn. I tried going lower than 256, but performance suffered greatly. 256 seemed like a good compromise. With 256 prim/verts, we can use 8-bit index buffers as well, which saves quite a lot of memory.

Sublets – 8×32 grouping

Very few hardware implementations will be happy with full, fat 256 primitive meshlets. To remedy this, the encoding is grouped in units of 32 (a “sublet”), where we can shade the 8 groups independently, or have larger workgroups that shade multiple sublets together. This flexibility is key to being performance portable. At runtime we can specialize our shaders to fit whatever hardware we’re targeting.

Using grouping of 32 is core to the format as well, since we can exploit NV warps being 32-wide and force Wave32 on RDNA hardware to get subgroup accelerated mesh shading.

Format header

// Can point to mmap-ed file.
struct MeshView
{
    const FormatHeader *format_header;
    const Bound *bounds;
    const Bound *bounds_256; // Used to cull in units of 256 prims
    const Stream *streams;
    const uint32_t *payload;
    uint32_t total_primitives;
    uint32_t total_vertices;
    uint32_t num_bounds;
    uint32_t num_bounds_256;
};

struct FormatHeader
{
    MeshStyle style;
    uint32_t stream_count;
    uint32_t meshlet_count;
    uint32_t payload_size_words;
};

The style signals type of mesh. This is naturally engine specific.

Wireframe: A pure position + index buffer
Textured: Adds UV + Normal + Tangent
Skinned: Adds bone indices and weights on top

A stream is 32 values encoded in some way.

enum class StreamType
{
    Primitive = 0,
    Position,
    NormalTangentOct8,
    UV,
    BoneIndices,
    BoneWeights,
};

Each meshlet has stream_count number of Stream headers. The indexing is trivial:

streams[RuntimeHeader::stream_offset + int(StreamType)]

// 16 bytes
struct Stream
{
    union
    {
       uint32_t base_value[2];
       struct { uint32_t prim_count; uint32_t vert_count; } counts;
    } u;
    uint32_t bits;
    uint32_t offset_in_words;
};

This is where things get a bit more interesting. I ended up supporting some encoding styles that are tailored for the various attribute formats.

Encoding attributes

There are two parts to this problem. First is to decide on some N-bit fixed point values, and then find the most efficient way to pull those bits from a buffer. I went through several iterations on the actual bit-stuffing.

Base value + DELTA encoding

A base value is encoded in Stream::base_value, and the decoded bits are an offset from the base. To start approaching speed-of-light decoding, this is about as fancy as we can do it.

I went through various iterations of this model. The first idea had a predictive encoding between neighbor values, where subgroup scan operations were used to complete the decode, but it was too slow in practice, and didn’t really improve bit rates at all.
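As a rough illustration of the base + delta idea, here is a minimal CPU-side sketch in C++. It is not the actual Granite encoder: the struct DeltaEncoding and both function names are made up for this example, and it handles a single scalar component, whereas the real streams carry multi-component values with the base stored in Stream::base_value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result of analyzing one stream of values.
struct DeltaEncoding
{
    uint32_t base_value;          // Would end up in Stream::base_value.
    uint32_t bits;                // Bits needed per delta.
    std::vector<uint32_t> deltas; // value - base_value, each < (1 << bits).
};

// Use the minimum as base so every delta is non-negative,
// then size the bit width for the largest delta.
inline DeltaEncoding encode_base_delta(const std::vector<uint32_t> &values)
{
    assert(!values.empty());
    DeltaEncoding enc = {};
    enc.base_value = *std::min_element(values.begin(), values.end());

    uint32_t max_delta = 0;
    for (uint32_t v : values)
    {
        enc.deltas.push_back(v - enc.base_value);
        max_delta = std::max(max_delta, v - enc.base_value);
    }

    while (enc.bits < 32 && (max_delta >> enc.bits) != 0)
        enc.bits++;
    return enc;
}

// Decoding is a single add per value.
inline uint32_t decode_base_delta(const DeltaEncoding &enc, size_t index)
{
    assert(index < enc.deltas.size());
    return enc.base_value + enc.deltas[index];
}
```

For the values {1000, 1003, 1015, 1000}, this picks a base of 1000 and needs 4 bits per delta. A stream where every value is identical needs 0 bits.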

Index buffer

Since the sublet is just 32-wide, we can encode with 5-bit indices. 15 bit / primitive. There is no real reason to use delta encode here, so instead of storing base values in the stream header, I opted to use those bits to encode vertex/index counts.

Position

This is decoded to 3×16-bit SINT. The shared exponent is stored in top 16 bits of Stream::bits.

vec3 position = ldexp(vec3(i16vec3(decoded)), exponent);

This facilitates arbitrary quantization as well.
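To make the shared exponent concrete, here is a small C++ sketch of how an encoder might pick the exponent and quantize one coordinate. The helper names and the lower bound of -24 are assumptions for illustration; only the dequantize step mirrors the shader line above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: smallest exponent e (starting from an arbitrary
// lower bound) such that max_abs / 2^e still fits in a signed 16-bit integer.
inline int choose_shared_exponent(float max_abs)
{
    assert(std::isfinite(max_abs) && max_abs >= 0.0f);
    int e = -24;
    while (std::ldexp(max_abs, -e) > 32767.0f)
        e++;
    return e;
}

// Snap a coordinate to the grid implied by the exponent.
// Assumes |value| <= the max_abs used to choose the exponent.
inline int16_t quantize(float value, int exponent)
{
    return int16_t(std::lround(std::ldexp(value, -exponent)));
}

// Mirrors the shader: ldexp(float(q), exponent).
inline float dequantize(int16_t q, int exponent)
{
    return std::ldexp(float(q), exponent);
}
```

With a largest coordinate magnitude of 10.0, this sketch lands on an exponent of -11, i.e. a grid spacing of 1/2048. Deliberately choosing a larger exponent gives a coarser grid and smaller deltas.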

UV

Similar idea as position, but 2×16-bit SINT. After decoding similar to position, a simple fixup is made to cater to typical UVs which lie in range of [0, +1], not [-1, +1].

vec2 uv = 0.5 * ldexp(vec2(i16vec2(decoded)), exponent) + 0.5;
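The matching encode step is not shown in the post, so the following C++ sketch is only a guess at what it could look like: remap [0, 1] to [-1, 1], quantize with the shared exponent, and clamp to the 16-bit range. The function names are hypothetical; the decode mirrors the shader line above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical encoder for one UV component.
inline int16_t encode_uv_component(float uv, int exponent)
{
    assert(std::isfinite(uv));
    float remapped = 2.0f * uv - 1.0f; // [0, 1] -> [-1, 1]
    long q = std::lround(std::ldexp(remapped, -exponent));
    // +1.0 with exponent -15 would be 32768, which does not fit, so clamp.
    return int16_t(std::min(32767L, std::max(-32768L, q)));
}

// Mirrors the shader: 0.5 * ldexp(float(q), exponent) + 0.5.
inline float decode_uv_component(int16_t q, int exponent)
{
    return 0.5f * std::ldexp(float(q), exponent) + 0.5f;
}
```

With an exponent of -15, UVs of 0.0, 0.25 and 0.5 round-trip exactly in this sketch, while 1.0 comes back one step short because of the clamp.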

Normal/Tangent

Encoded as 4×8-bit SNORM. Normal (XY) and Tangent (ZW) are encoded with Octahedral encoding from meshoptimizer library.
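For readers who have not seen octahedral encoding before, here is a generic C++ sketch of the mapping for one unit vector into two 8-bit SNORM values. This is the textbook formulation, not meshoptimizer’s exact filter, and the type and function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };
struct Oct8 { int8_t x, y; };

// Assumes v is in [-1, 1].
inline int8_t to_snorm8(float v)
{
    return int8_t(std::lround(v * 127.0f));
}

// Project onto the octahedron |x| + |y| + |z| = 1, then fold the
// lower hemisphere over the diagonals so everything lands in [-1, 1]^2.
inline Oct8 oct_encode(Vec3 n)
{
    float l1 = std::fabs(n.x) + std::fabs(n.y) + std::fabs(n.z);
    assert(l1 > 0.0f);
    float u = n.x / l1;
    float v = n.y / l1;
    if (n.z < 0.0f)
    {
        float fu = (1.0f - std::fabs(v)) * (u >= 0.0f ? 1.0f : -1.0f);
        float fv = (1.0f - std::fabs(u)) * (v >= 0.0f ? 1.0f : -1.0f);
        u = fu;
        v = fv;
    }
    return { to_snorm8(u), to_snorm8(v) };
}

// Undo the fold and renormalize.
inline Vec3 oct_decode(Oct8 e)
{
    float x = float(e.x) / 127.0f;
    float y = float(e.y) / 127.0f;
    float z = 1.0f - std::fabs(x) - std::fabs(y);
    float t = std::max(-z, 0.0f);
    x += x >= 0.0f ? -t : t;
    y += y >= 0.0f ? -t : t;
    float len = std::sqrt(x * x + y * y + z * z);
    return { x / len, y / len, z / len };
}
```

In the stream described here, the normal would occupy the first two components and the tangent the last two, each encoded this way.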

To encode the sign of tangent, Stream::bits stores 2 bits, which signals one of three modes:

Uniform W = -1
Uniform W = +1
LSB of decoded W encodes tangent W. Tangent’s second component loses 1 bit of precision.

Bone index / weight

Basically same as Normal/Tangent, but ignore tangent sign handling.

First (failed?) idea – bitplane encoding

For a long time, I was pursuing bitplane encoding, which is one of the simplest ways to encode variable bitrates. We can encode 1 bit for 32 values by packing them in one u32. To speed up decoding further, I aimed to pack everything into 128-bit aligned loads. This avoids having to wait for tiny, dependent 32-bit loads.

For example, for index buffers:

uint meshlet_decode_index_buffer(
   uint stream_index, uint chunk_index,
   int lane_index)
{
    uint offset_in_b128 =
      meshlet_streams.data[stream_index].offset_in_b128;

    // Fixed 5-bit encoding.
    offset_in_b128 += 4 * chunk_index;

    // Scalar load. 64 bytes in one go.
    uvec4 p0 = payload.data[offset_in_b128 + 0];
    uvec4 p1 = payload.data[offset_in_b128 + 1];
    uvec4 p2 = payload.data[offset_in_b128 + 2];
    uvec4 p3 = payload.data[offset_in_b128 + 3];

    uint indices = 0;

    indices |= bitfieldExtract(p0.x, lane_index, 1) << 0u;
    indices |= bitfieldExtract(p0.y, lane_index, 1) << 1u;
    indices |= bitfieldExtract(p0.z, lane_index, 1) << 2u;
    indices |= bitfieldExtract(p0.w, lane_index, 1) << 3u;

    indices |= bitfieldExtract(p1.x, lane_index, 1) << 8u;
    indices |= bitfieldExtract(p1.y, lane_index, 1) << 9u;
    indices |= bitfieldExtract(p1.z, lane_index, 1) << 10u;
    indices |= bitfieldExtract(p1.w, lane_index, 1) << 11u;

    indices |= bitfieldExtract(p2.x, lane_index, 1) << 16u;
    indices |= bitfieldExtract(p2.y, lane_index, 1) << 17u;
    indices |= bitfieldExtract(p2.z, lane_index, 1) << 18u;
    indices |= bitfieldExtract(p2.w, lane_index, 1) << 19u;

    indices |= bitfieldExtract(p3.x, lane_index, 1) << 4u;
    indices |= bitfieldExtract(p3.y, lane_index, 1) << 12u;
    indices |= bitfieldExtract(p3.z, lane_index, 1) << 20u;

    return indices;
}

On Deck, this ends up looking like

s_buffer_load_dwordx4 x 4
v_bfe_u32 x 15
v_lshl_or_b32 x 15

Thinking about ALU and loads in terms of scalar and vectors can greatly help AMD performance when done right, so this approach felt natural.

For variable bit rates, I’d have code like:

if (bits & 4) { unroll_4_bits_bit_plane(); }
if (bits & 2) { unroll_2_bits_bit_plane(); }
if (bits & 1) { unroll_1_bit_bit_plane(); }

However, I abandoned this idea, since while favoring SMEM so heavily, the VALU with all the bitfield ops wasn’t exactly amazing for perf. I’m still just clocking out one bit per operation here. AMD performance was quite alright compared to what I ended up with in the end, but NVIDIA performance was abysmal, so I went back to the drawing board, and ended up with the absolute simplest solution that would work.
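For reference, the bitplane layout itself is easy to express on the CPU. The following C++ sketch encodes 32 values into one u32 per bit and decodes a single lane. The function names are made up here, and the sketch ignores the 128-bit alignment and the per-component packing used in the shader above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plane b holds bit b of all 32 values; lane i owns bit i of every plane.
inline std::vector<uint32_t> bitplane_encode(const std::array<uint32_t, 32> &values,
                                             unsigned bits)
{
    assert(bits <= 32);
    std::vector<uint32_t> planes(bits, 0u);
    for (unsigned lane = 0; lane < 32; lane++)
        for (unsigned b = 0; b < bits; b++)
            planes[b] |= ((values[lane] >> b) & 1u) << lane;
    return planes;
}

// Every lane reads the same words (hence the scalar loads on AMD),
// but only extracts one bit per plane.
inline uint32_t bitplane_decode(const std::vector<uint32_t> &planes, unsigned lane)
{
    assert(lane < 32);
    uint32_t value = 0;
    for (size_t b = 0; b < planes.size(); b++)
        value |= ((planes[b] >> lane) & 1u) << b;
    return value;
}
```

The inner loop of the decoder makes the cost visible: one extract and one OR per bit of output, which matches the "one bit per operation" complaint above.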

Tightly packed bits

This idea is to just literally pack bits together, clearly a revolutionary idea that no one has ever done before. A VMEM load or two per thread, then some shifts should be all that is needed to move the components into place.

E.g. for index buffers:

uvec3 meshlet_decode_index_buffer(uint stream_index,
   uint chunk_index,
   int lane_index)
{
  uint offset_in_words = 
    meshlet_streams.data[stream_index].offset_in_words;
  return meshlet_decode3(offset_in_words, lane_index, 5);
}

For the actual decode I figured it would be pretty fast if all the shifts could be done in 64-bit. At least AMD has native instructions for that.

uvec3 meshlet_decode3(uint offset_in_words,
   uint index,
   uint bit_count)
{
    const uint num_components = 3;
    uint start_bit = index * bit_count * num_components;
    uint start_word = offset_in_words + start_bit / 32u;
    start_bit &= 31u;
    uint word0 = payload.data[start_word];
    uint word1 = payload.data[start_word + 1u];
    uvec3 v;

    uint64_t word = packUint2x32(uvec2(word0, word1));
    v.x = uint(word >> start_bit);
    start_bit += bit_count;
    v.y = uint(word >> start_bit);
    start_bit += bit_count;
    v.z = uint(word >> start_bit);
    return bitfieldExtract(v, 0, int(bit_count));
}

There is one detail here. The decode only loads two u32 words, i.e. 64 bits. For 13, 14 and 15 bit components with uvec3 decode, an element may straddle more than two u32 words, so in this case, the encoder must choose 16 bit. (16-bit works due to alignment.) This only comes up in position encode, and the encoder can easily ensure that 12-bit deltas are enough, quantizing a bit more as necessary.
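The two-word limit can be checked on the CPU. Below is a C++ sketch with a hypothetical packer (pack3), a straight port of meshlet_decode3 (decode3), and a helper that tests whether every element of a 32-element stream stays within the 64 bits that get loaded. The packer and the one-word padding at the end are assumptions of this sketch, not a description of the real encoder.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct UVec3 { uint32_t x, y, z; };

// Hypothetical encoder: appends 3 * bit_count bits per element, LSB first.
inline std::vector<uint32_t> pack3(const std::vector<UVec3> &elems, uint32_t bit_count)
{
    assert(bit_count >= 1 && bit_count <= 16);
    // One padding word so the decoder's unconditional second load stays in bounds.
    std::vector<uint32_t> words((elems.size() * 3 * bit_count + 31) / 32 + 1, 0u);
    uint64_t bit = 0;
    auto put = [&](uint32_t v) {
        assert((v >> bit_count) == 0);
        for (uint32_t i = 0; i < bit_count; i++, bit++)
            words[bit / 32] |= ((v >> i) & 1u) << (bit % 32);
    };
    for (const UVec3 &e : elems)
    {
        put(e.x);
        put(e.y);
        put(e.z);
    }
    return words;
}

// CPU port of meshlet_decode3: two 32-bit loads, then 64-bit shifts.
inline UVec3 decode3(const std::vector<uint32_t> &payload, uint32_t offset_in_words,
                     uint32_t index, uint32_t bit_count)
{
    uint32_t start_bit = index * bit_count * 3;
    uint32_t start_word = offset_in_words + start_bit / 32;
    start_bit &= 31;
    uint64_t word = uint64_t(payload[start_word]) |
                    (uint64_t(payload[start_word + 1]) << 32);
    uint32_t mask = (1u << bit_count) - 1u;
    UVec3 v;
    v.x = uint32_t(word >> start_bit) & mask;
    v.y = uint32_t(word >> (start_bit + bit_count)) & mask;
    v.z = uint32_t(word >> (start_bit + 2 * bit_count)) & mask;
    return v;
}

// True if no element of a 32-element stream needs a third word.
inline bool fits_in_two_words(uint32_t bit_count)
{
    for (uint32_t index = 0; index < 32; index++)
        if (((index * bit_count * 3) & 31u) + 3 * bit_count > 64)
            return false;
    return true;
}
```

Running fits_in_two_words over the candidate widths agrees with the rule above: 12 and 16 pass, while 13, 14 and 15 fail.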

Mapping to MDI

Every 256-wide meshlet can turn into an indexed draw call with VK_INDEX_TYPE_UINT8_EXT, which is nice for saving VRAM. The “task shader” becomes a compute shader that dumps out a big multi-draw indirect buffer. The DrawIndex builtin in Vulkan ends up replacing WorkGroupID in mesh shader for pulling in per-meshlet data.
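A CPU reference for what that compute shader produces might look like the C++ sketch below. The command struct mirrors the field layout of VkDrawIndexedIndirectCommand so the example builds without the Vulkan headers; MeshletDraw and build_mdi are hypothetical names, and the visibility flag stands in for whatever culling ran before.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand
{
    uint32_t index_count;
    uint32_t instance_count;
    uint32_t first_index;
    int32_t vertex_offset;
    uint32_t first_instance;
};

// Hypothetical per-meshlet record; the real runtime layout is engine specific.
struct MeshletDraw
{
    uint32_t primitive_count; // At most 256.
    uint32_t first_index;     // Offset into the 8-bit index buffer.
    int32_t vertex_offset;    // Base vertex of the meshlet.
    bool visible;             // Result of culling.
};

// One indexed draw per surviving meshlet, written back to back.
inline std::vector<DrawIndexedIndirectCommand> build_mdi(const std::vector<MeshletDraw> &meshlets)
{
    std::vector<DrawIndexedIndirectCommand> draws;
    for (const MeshletDraw &m : meshlets)
    {
        if (!m.visible)
            continue;
        assert(m.primitive_count <= 256);
        draws.push_back({ 3 * m.primitive_count, 1, m.first_index, m.vertex_offset, 0 });
    }
    return draws;
}
```

Since the draws are compacted, the position of a command in the output array is what the vertex shader sees as its draw index, so any per-meshlet data it looks up has to be written in the same compacted order.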

Performance sanity check

Before going further with mesh shading fun, it’s important to validate performance. I needed at least a ballpark idea of how many primitives could be pumped through the GPU with a good old vkCmdDrawIndexed and the MDI method where one draw call is one meshlet. This was then to be compared against a straightforward mesh shader.

Zeux’s Niagara renderer helpfully has a simple OBJ for us to play with.

When exported to the new meshlet format it looks like:

[INFO]: Stream 0: 54332 bytes. (Index) 15 bits / primitive
[INFO]: Stream 1: 75060 bytes. (Position) ~25 bits / pos
[INFO]: Stream 2: 70668 bytes. (Normal/Tangent) ~23.8 bits / N + T + sign
[INFO]: Total encoded vertices: 23738 // Vertex duplication :(
[INFO]: Average radius 0.037 (908 bounds) // 32-wide meshlet
[INFO]: Average cutoff 0.253 (908 bounds)
[INFO]: Average radius 0.114 (114 bounds) // 256-wide meshlet
[INFO]: Average cutoff 0.697 (114 bounds)
// Backface cone culling isn't amazing for larger meshlets.
[INFO]: Exported meshlet:
[INFO]: 908 meshlets
[INFO]: 200060 payload bytes
[INFO]: 86832 total indices
[INFO]: 14856 total attributes
[INFO]: 703872 uncompressed bytes

One annoying thing about meshlets is attribute duplication when one vertex is reused across meshlets, and using tiny 32-wide meshlets makes this way worse. Add padding on top for encode and the compression ratio isn’t that amazing anymore. The primitive to vertex ratio is ~1.95 here which is really solid, but turning things into meshlets tends to converge to ~1.0.
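The ratios quoted here follow directly from the exporter log. A tiny C++ sketch of the arithmetic, with made-up function names:

```cpp
#include <cassert>
#include <cstdint>

// Primitives per vertex, given the index count of a triangle list.
inline double prim_to_vert_ratio(uint32_t total_indices, uint32_t vertices)
{
    assert(total_indices % 3 == 0 && vertices != 0);
    return double(total_indices / 3) / double(vertices);
}

// Encoded size as a fraction of the uncompressed size.
inline double size_fraction(uint32_t payload_bytes, uint32_t uncompressed_bytes)
{
    assert(uncompressed_bytes != 0);
    return double(payload_bytes) / double(uncompressed_bytes);
}
```

Plugging in the numbers from the log: 86832 indices over 14856 unique attributes gives about 1.95 primitives per vertex, while the same 28944 primitives over the 23738 encoded vertices gives only about 1.22. The 200060 payload bytes are roughly 28% of the 703872 uncompressed bytes.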

I tried different sublet sizes, but NVIDIA performance collapsed when I didn’t use 32-wide sublets, and going to 64 primitive / 32 vertex only marginally helped P/V ratios. AMD runtime performance did not like that in my testing (~30% throughput loss), so 32/32 it is!

After writing this section, AMD released a blog post suggesting that the 2N/N structure is actually good, but I couldn’t replicate that in my testing at least and I don’t have the energy anymore to rewrite everything (again) to test that.

Test scene

The classic “instance the same mesh a million times” strategy. This was tested on RTX 3070 (AMD numbers to follow, there are way more permutations to test there …). The mesh is instanced in a 13x13x13 grid. Here we’re throwing 63.59 million triangles at the GPU in one go.

Spam vkCmdDrawIndexed with no culling

5.5 ms

layout(location = 0) in vec3 POS;
layout(location = 1) in mediump vec3 NORMAL;
layout(location = 2) in mediump vec4 TANGENT;
layout(location = 3) in vec2 UV;

layout(location = 0) out mediump vec3 vNormal;
layout(location = 1) out mediump vec4 vTangent;
layout(location = 2) out vec2 vUV;

// The most basic vertex shader.
void main()
{
  vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
  vNormal = mat3(M) * NORMAL;
  vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
  vUV = UV;
  gl_Position = VP * vec4(world_pos, 1.0);
}

With per-object frustum culling

This is the most basic thing to do, so for reference.

4.3 ms

One massive MDI

Here we’re just doing basic frustum culling of meshlets as well as back-face cone culling and emitting one draw call per meshlet that passes test.

3.9 ms

Significantly more geometry is rejected now due to back-face cull and tighter frustum cull, but performance isn’t that much better. Once we start considering occlusion culling, this should turn into a major win over normal draw calls. In this path, we have a bit more indirection in the vertex shader, so that probably accounts for some loss as well.

void main()
{
    // Need to index now, but shouldn't be a problem on desktop hardware.
    mat4 M = transforms.data[draw_info.data[gl_DrawIDARB].node_offset];

    vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
    vNormal = mat3(M) * NORMAL;
    vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
    vUV = UV;

    // Need to pass down extra data to sample materials, etc.
    // Fragment shader cannot read gl_DrawIDARB.
    vMaterialID = draw_info.data[gl_DrawIDARB].material_index;

    gl_Position = VP * vec4(world_pos, 1.0);
}
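The per-meshlet tests that feed this MDI path can be sketched on the CPU as follows. The Bound layout here is an assumption (the post only shows that bounds carry a radius and a cone cutoff), the plane convention is chosen for the example, and the cone test follows the formula documented by meshoptimizer for its cluster bounds.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// A point p is inside the half-space when dot(n, p) + d >= 0. n is unit length.
struct Plane { Vec3 n; float d; };

// Hypothetical bound layout: bounding sphere plus normal cone.
struct Bound
{
    Vec3 center;
    float radius;
    Vec3 cone_axis;
    float cone_cutoff;
};

inline float dot(Vec3 a, Vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Sphere vs. frustum planes: reject only if fully outside one plane.
inline bool sphere_in_frustum(const Bound &b, const Plane *planes, int count)
{
    for (int i = 0; i < count; i++)
        if (dot(planes[i].n, b.center) + planes[i].d < -b.radius)
            return false;
    return true;
}

// Reject when every triangle in the cluster must face away from the camera.
inline bool cone_backfacing(const Bound &b, Vec3 camera)
{
    Vec3 v = { b.center.x - camera.x, b.center.y - camera.y, b.center.z - camera.z };
    float len = std::sqrt(dot(v, v));
    return dot(v, b.cone_axis) >= b.cone_cutoff * len + b.radius;
}

inline bool meshlet_visible(const Bound &b, const Plane *planes, int count, Vec3 camera)
{
    assert(count >= 0);
    return sphere_in_frustum(b, planes, count) && !cone_backfacing(b, camera);
}
```

Larger meshlets tend to have wider normal cones and therefore larger cutoffs, which is why the 256-wide bounds in the exporter log cull noticeably worse than the 32-wide ones.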

Meshlet – Encoded payload

Here, the meshlet will read directly from the encoded payload, and decode inline in the shader. No per-primitive culling is performed.

4.1 ms

Meshlet – Decoded payload

4.0 ms

We’re at the point where we are bound on fixed function throughput. Encoded and Decoded paths are basically both hitting the limit of how much data we can pump to the rasterizer.

Per-primitive culling

To actually make good use of mesh shading, we need to consider per-primitive culling. For this section, I’ll be assuming a subgroup size of 32, and a meshlet size of 32. There are other code paths for larger workgroups, which require some use of groupshared memory, but that’s not very exciting for this discussion.

The gist of this idea was implemented in https://gpuopen.com/geometryfx/. Various AMD drivers adopted the idea as well to perform magic driver culling, but the code here isn’t based on any other code in particular.

Doing back-face culling correctly

This is tricky, but we only need to be conservative, not exact. We can only reject when we know for sure the primitive is not visible.

Perspective divide and clip codes

The first step is to do W divide per vertex and study how that vertex clips against the X, Y, and W planes. We don’t really care about Z. Near-plane clip is covered by negative W tests, and far plane should be covered by simple frustum test, assuming we have a far plane at all.

vec2 c = clip_pos.xy / clip_pos.w;

uint clip_code = clip_pos.w <= 0.0 ? CLIP_CODE_NEGATIVE_W : 0;
if (any(greaterThan(abs(c), vec2(4.0))))
    clip_code |= CLIP_CODE_INACCURATE;
if (c.x <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_X;
if (c.y <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_Y;
if (c.x >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_X;
if (c.y >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_Y;

vec2 window = roundEven(c * viewport.zw + viewport.xy);

There are things to unpack here. The INACCURATE clip code is used to denote a problem where we might start to run into accuracy issues when converting to fixed point, or GPUs might start doing clipping due to guard band exhaustion. I picked the value arbitrarily.

The window coordinate is then computed by simulating the fixed point window coordinate snapping done by real GPUs. Any GPU supporting DirectX will have a very precise way of doing this, so this should be okay in practice. Vulkan also exposes the number of sub-pixel bits in the viewport transform. On all GPUs I know of, this is 8. DirectX mandates exactly 8.

vec4 viewport =
    float(1 << 8 /* shader assumes 8 */) *
        vec4(cmd->get_viewport().x +
               0.5f * cmd->get_viewport().width - 0.5f,
             cmd->get_viewport().y +
               0.5f * cmd->get_viewport().height - 0.5f,
             0.5f * cmd->get_viewport().width,
             0.5f * cmd->get_viewport().height) -
             vec4(1.0f, 1.0f, 0.0f, 0.0f);

This particular way of doing it comes into play later when discussing micro-poly rejection. One thing to note here is that Vulkan clip-to-window coordinate transform does not flip Y-sign. D3D does however, so beware.
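Putting the viewport setup and the per-vertex snippet together, a CPU port in C++ could look like the sketch below. The bit values of the CLIP_CODE_* flags are arbitrary here since the post does not list them, and make_viewport and snap_vertex are names invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bit assignments are an assumption of this sketch.
enum : uint32_t
{
    CLIP_CODE_INACCURATE = 1u << 0,
    CLIP_CODE_NEGATIVE_W = 1u << 1,
    CLIP_CODE_NEGATIVE_X = 1u << 2,
    CLIP_CODE_NEGATIVE_Y = 1u << 3,
    CLIP_CODE_POSITIVE_X = 1u << 4,
    CLIP_CODE_POSITIVE_Y = 1u << 5,
};

struct Vec4 { float x, y, z, w; };
struct Snapped { uint32_t clip_code; float wx, wy; };

// Same math as the host-side snippet: xy is the bias, zw is the scale,
// both in units of 1/256 pixel.
inline Vec4 make_viewport(float x, float y, float width, float height)
{
    const float scale = float(1 << 8);
    return { scale * (x + 0.5f * width - 0.5f) - 1.0f,
             scale * (y + 0.5f * height - 0.5f) - 1.0f,
             scale * 0.5f * width,
             scale * 0.5f * height };
}

inline Snapped snap_vertex(Vec4 clip_pos, Vec4 viewport)
{
    float cx = clip_pos.x / clip_pos.w;
    float cy = clip_pos.y / clip_pos.w;

    uint32_t code = clip_pos.w <= 0.0f ? uint32_t(CLIP_CODE_NEGATIVE_W) : 0u;
    if (std::fabs(cx) > 4.0f || std::fabs(cy) > 4.0f)
        code |= CLIP_CODE_INACCURATE;
    if (cx <= -1.0f)
        code |= CLIP_CODE_NEGATIVE_X;
    if (cy <= -1.0f)
        code |= CLIP_CODE_NEGATIVE_Y;
    if (cx >= 1.0f)
        code |= CLIP_CODE_POSITIVE_X;
    if (cy >= 1.0f)
        code |= CLIP_CODE_POSITIVE_Y;

    // nearbyint rounds half to even in the default rounding mode,
    // which matches GLSL roundEven.
    return { code,
             std::nearbyint(cx * viewport.z + viewport.x),
             std::nearbyint(cy * viewport.w + viewport.y) };
}
```

For a 1920x1080 viewport at the origin, the center of clip space lands on window coordinate (245631, 138111) in this sketch, i.e. pixel (959.5, 539.5) scaled by 256, minus the one sub-pixel bias.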

Shuffle clip codes and window coordinates

void meshlet_emit_primitive(uvec3 prim, vec4 clip_pos, vec4 viewport)
{
  // ...
  vec2 window = roundEven(c * viewport.zw + viewport.xy);

  // vertex ID maps to gl_SubgroupInvocationID
  // Fall back to groupshared as necessary
  vec2 window_a = subgroupShuffle(window, prim.x);
  vec2 window_b = subgroupShuffle(window, prim.y);
  vec2 window_c = subgroupShuffle(window, prim.z);
  uint code_a = subgroupShuffle(clip_code, prim.x);
  uint code_b = subgroupShuffle(clip_code, prim.y);
  uint code_c = subgroupShuffle(clip_code, prim.z);
}

Early reject or accept

Based on clip codes we can immediately accept or reject primitives.

uint or_code = code_a | code_b | code_c;
uint and_code = code_a & code_b & code_c;
bool culled_planes = (and_code & CLIP_CODE_PLANES) != 0;
bool is_active_prim = false;

if (!culled_planes)
{
    is_active_prim =
        (or_code & (CLIP_CODE_INACCURATE |
                    CLIP_CODE_NEGATIVE_W)) != 0;

    if (!is_active_prim)
        is_active_prim = cull_triangle(window_a,
                                       window_b,
                                       window_c);
}

If all three vertices are outside one of the clip planes, reject immediately
If any vertex is considered inaccurate, accept immediately
If one or two of the vertices have negative W, we have clipping. Our math won’t work, so accept immediately. (If all three vertices have negative W, the first test rejects).
Perform actual back-face cull.
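The decision order above can be written as a small standalone function. In this C++ sketch the flag values are arbitrary, and CLIP_CODE_PLANES is assumed to be the union of the four X/Y flags and the negative-W flag, which is what the rules imply but is not spelled out in the shader snippet.

```cpp
#include <cassert>
#include <cstdint>

// Bit assignments are an assumption of this sketch.
enum : uint32_t
{
    CLIP_CODE_INACCURATE = 1u << 0,
    CLIP_CODE_NEGATIVE_W = 1u << 1,
    CLIP_CODE_NEGATIVE_X = 1u << 2,
    CLIP_CODE_NEGATIVE_Y = 1u << 3,
    CLIP_CODE_POSITIVE_X = 1u << 4,
    CLIP_CODE_POSITIVE_Y = 1u << 5,
    // Assumed definition: every flag that denotes "outside a clip plane".
    CLIP_CODE_PLANES = CLIP_CODE_NEGATIVE_W | CLIP_CODE_NEGATIVE_X |
                       CLIP_CODE_NEGATIVE_Y | CLIP_CODE_POSITIVE_X |
                       CLIP_CODE_POSITIVE_Y,
};

enum class Triage { Reject, Accept, TestWinding };

inline Triage triage_primitive(uint32_t code_a, uint32_t code_b, uint32_t code_c)
{
    uint32_t or_code = code_a | code_b | code_c;
    uint32_t and_code = code_a & code_b & code_c;

    // All three vertices are outside the same plane.
    if ((and_code & CLIP_CODE_PLANES) != 0)
        return Triage::Reject;

    // The snapped window coordinates cannot be trusted, so keep the primitive.
    if ((or_code & (CLIP_CODE_INACCURATE | CLIP_CODE_NEGATIVE_W)) != 0)
        return Triage::Accept;

    // Safe to run the 2D winding test.
    return Triage::TestWinding;
}
```

Note that a triangle with one vertex off the left edge and another off the right edge is not rejected by the first test, since the AND of the codes is zero. It falls through to the winding test like any other on-screen primitive.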

Actual back-face cull

bool cull_triangle(vec2 a, vec2 b, vec2 c)
{
  precise vec2 ab = b - a;
  precise vec2 ac = c - a;

  // This is 100% accurate as long as the primitive
  // is no larger than ~4k subpixels, i.e. 16x16 pixels.
  // Normally, we'd be able to do a strict greater-than test,
  // but the greater-or-equal test is conservative,
  // even with FP error in play.

  // Depending on your engine and API conventions, swap these two.
  precise float pos_area = ab.y * ac.x;
  precise float neg_area = ab.x * ac.y;

  // If the pos value is (-2^24, +2^24),
  // the FP math is exact,
  // if not, we have to be conservative.
  // Less-than check is there to ensure that 1.0 delta
  // in neg_area *will* resolve to a different value.
  bool active_primitive;
  if (abs(pos_area) < 16777216.0)
    active_primitive = pos_area > neg_area;
  else
    active_primitive = pos_area >= neg_area;

  return active_primitive;
}

To compute winding, we need a 2D cross product. While noodling with this code, I noticed that we can still do it in FP32 instead of full 64-bit integer math. We’re working with integer-rounded values here, so when the magnitudes involved are small enough for the products to be exact, we can use the exact strict greater-than test. If we risk FP rounding error, we use the greater-or-equal test instead. If pos_area tests strictly smaller, we know for sure the area must be negative. If the two test equal, it’s possible the area could have been positive, but the intermediate values rounded to the same value in the end, so we keep the primitive.
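The conservative behavior is easy to demonstrate on the CPU by comparing an FP32 port of cull_triangle against exact 64-bit integer math. This C++ sketch uses made-up function names and keeps the same sign convention as the shader; it assumes coordinates small enough that the FP32 subtractions are exact.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct IVec2 { int32_t x, y; };

// Reference: exact comparison of the two cross product terms.
inline bool front_facing_exact(IVec2 a, IVec2 b, IVec2 c)
{
    int64_t abx = int64_t(b.x) - a.x, aby = int64_t(b.y) - a.y;
    int64_t acx = int64_t(c.x) - a.x, acy = int64_t(c.y) - a.y;
    return aby * acx > abx * acy;
}

// FP32 port of cull_triangle. Returns true if the primitive is kept.
inline bool front_facing_fp32(IVec2 a, IVec2 b, IVec2 c)
{
    float abx = float(b.x) - float(a.x), aby = float(b.y) - float(a.y);
    float acx = float(c.x) - float(a.x), acy = float(c.y) - float(a.y);

    float pos_area = aby * acx;
    float neg_area = abx * acy;

    // Below 2^24 the product is an exactly represented integer,
    // so the strict test is exact. Otherwise fall back to >=.
    if (std::fabs(pos_area) < 16777216.0f)
        return pos_area > neg_area;
    return pos_area >= neg_area;
}
```

A concrete case where the fallback matters: with a = (0, 0), b = (4096, 4097), c = (4097, 4098), the exact terms are 16785409 and 16785408. In FP32 both round to 16785408, so a strict test would wrongly cull a front-facing triangle, while the greater-or-equal test keeps it. The flip side is that the mirrored, back-facing triangle is kept as well, which is fine since the hardware will cull it later.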

3.3 ms

Culling primitives helped as expected. Less pressure on the fixed function units.

Micro-poly rejection

Given how pathologically geometry dense this scene is, we expect that most primitives never trigger the rasterizer at all.

If we can prove that the bounding box of the primitive lands between two pixel grids, we can reject it since it will never have coverage.

if (active_primitive)
{
    // Micropoly test.
    const int SUBPIXEL_BITS = 8;
    vec2 lo = floor(ldexp(min(min(a, b), c), ivec2(-SUBPIXEL_BITS)));
    vec2 hi = floor(ldexp(max(max(a, b), c), ivec2(-SUBPIXEL_BITS)));
    active_primitive = all(notEqual(lo, hi));
}

There is a lot to unpack in this code. If we re-examine the viewport transform:

vec4 viewport = float(1 << 8 /* shader assumes 8 */) *
  vec4(cmd->get_viewport().x +
        0.5f * cmd->get_viewport().width - 0.5f,
      cmd->get_viewport().y +
        0.5f * cmd->get_viewport().height - 0.5f,
      0.5f * cmd->get_viewport().width,
      0.5f * cmd->get_viewport().height) -
      vec4(1.0f, 1.0f, 0.0f, 0.0f);

First, we need to shift by 0.5 pixels. The rasterization test happens at the center of a pixel, and it’s more convenient to sample at integer points. Then, due to top-left rasterization rules on all desktop GPUs (a DirectX requirement), we shift the result by one sub-pixel. This ensures that should a primitive have a bounding box of [1.0, 2.0], we will consider it for rasterization, but [1.0 + 1.0 / 256.0, 2.0] will not. Top-left rules are not technically guaranteed in Vulkan however (it just has to have some rule), so if you’re paranoid, increase the upper bound by one sub-pixel.

1.9 ms

Now we’re only submitting 1.2 M primitives to the rasterizer, which is pretty cool, given that we started with 31 M potential primitives. Of course, this is a contrived example with ridiculous micro-poly issues.
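The bounding box example from the previous paragraph can be checked numerically. This C++ sketch ports the micro-poly test to the CPU; the inputs are window coordinates in 1/256 pixel units that already include the half-pixel shift and the one sub-pixel bias from the viewport transform. The function name is made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Returns true if the primitive's bounding box can contain a sample point.
inline bool micropoly_active(Vec2 a, Vec2 b, Vec2 c)
{
    const int SUBPIXEL_BITS = 8;
    float lo_x = std::floor(std::ldexp(std::min(std::min(a.x, b.x), c.x), -SUBPIXEL_BITS));
    float lo_y = std::floor(std::ldexp(std::min(std::min(a.y, b.y), c.y), -SUBPIXEL_BITS));
    float hi_x = std::floor(std::ldexp(std::max(std::max(a.x, b.x), c.x), -SUBPIXEL_BITS));
    float hi_y = std::floor(std::ldexp(std::max(std::max(a.y, b.y), c.y), -SUBPIXEL_BITS));
    // If lo and hi land in the same integer cell on either axis,
    // the box sits between two sample rows or columns.
    return lo_x != hi_x && lo_y != hi_y;
}
```

A box of [1.0, 2.0] pixels becomes [255, 511] after the bias, which floors to 0 and 1, so the primitive is kept. A box of [1.0 + 1/256, 2.0] becomes [256, 511], which floors to 1 and 1, so it is rejected.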

We’re actually at the point here where reporting the invocation stats (one atomic per workgroup) becomes a performance problem, so turning that off:

1.65 ms

With inline decoding there’s some extra overhead, but we’re still well ahead:

2.5 ms

Build active vertex / primitive masks

This is quite straightforward. Once we have the counts, SetMeshOutputCounts is called and we can compute the packed output indices with a mask and popcount.

uint vert_mask = 0u;
if (is_active_prim)
    vert_mask = (1u << prim.x) | (1u << prim.y) | (1u << prim.z);

uvec4 prim_ballot = subgroupBallot(is_active_prim);

shared_active_prim_offset = subgroupBallotExclusiveBitCount(prim_ballot);
shared_active_vert_mask = subgroupOr(vert_mask);

shared_active_prim_count_total = subgroupBallotBitCount(prim_ballot);
shared_active_vert_count_total = bitCount(shared_active_vert_mask);
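To see what the subgroup operations compute, here is a scalar C++ emulation for one 32-wide sublet. The loops stand in for the lanes of a subgroup, and the struct and function names are invented for this sketch.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

struct UVec3 { uint32_t x, y, z; };

struct Compaction
{
    uint32_t prim_count;      // subgroupBallotBitCount(prim_ballot)
    uint32_t vert_count;      // bitCount(vert_mask)
    uint32_t vert_mask;       // subgroupOr(vert_mask)
    uint32_t prim_offset[32]; // subgroupBallotExclusiveBitCount per lane
};

inline uint32_t popcount32(uint32_t v)
{
    return uint32_t(std::bitset<32>(v).count());
}

inline Compaction compact_sublet(const UVec3 *prims, const bool *is_active,
                                 uint32_t prim_total)
{
    assert(prim_total <= 32);
    Compaction out = {};
    uint32_t prim_ballot = 0;
    for (uint32_t lane = 0; lane < prim_total; lane++)
    {
        if (!is_active[lane])
            continue;
        assert(prims[lane].x < 32 && prims[lane].y < 32 && prims[lane].z < 32);
        prim_ballot |= 1u << lane;
        out.vert_mask |= (1u << prims[lane].x) | (1u << prims[lane].y) |
                         (1u << prims[lane].z);
    }
    // Exclusive bit count: number of active lanes below this one.
    for (uint32_t lane = 0; lane < 32; lane++)
        out.prim_offset[lane] = popcount32(prim_ballot & ((1u << lane) - 1u));
    out.prim_count = popcount32(prim_ballot);
    out.vert_count = popcount32(out.vert_mask);
    return out;
}

// Packed output slot of an active vertex: active vertices below it.
inline uint32_t compacted_vertex_index(uint32_t vert_mask, uint32_t vertex)
{
    assert(vertex < 32);
    return popcount32(vert_mask & ((1u << vertex) - 1u));
}
```

With three primitives (0,1,2), (2,1,3) and (4,5,6) where the middle one is culled, vertex 3 drops out of the mask, and vertex 4 moves to output slot 3.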

Special magic NVIDIA optimization

Can we improve things from here? On NVIDIA, yes. NVIDIA seems to under-dimension the shader export buffers in their hardware compared to peak triangle throughput, and their developer documentation on the topic suggests:

Replace attributes with barycentrics and allowing the Pixel Shader to fetch and interpolate the attributes

Using VK_KHR_fragment_shader_barycentrics we can write code like:

// Mesh output
layout(location = 0) flat out uint vVertexID[];
layout(location = 1) perprimitiveEXT out uint vTransformIndex[];

// Fragment
layout(location = 0) pervertexEXT in uint vVertexID[];
layout(location = 1) perprimitiveEXT flat in uint vTransformIndex;

// Fetch vertex IDs
uint va = vVertexID[0];
uint vb = vVertexID[1];
uint vc = vVertexID[2];

// Load attributes from memory directly
uint na = attr.data[va].n;
uint nb = attr.data[vb].n;
uint nc = attr.data[vc].n;

// Interpolate by hand
mediump vec3 normal = gl_BaryCoordEXT.x * decode_rgb10a2(na) +
    gl_BaryCoordEXT.y * decode_rgb10a2(nb) +
    gl_BaryCoordEXT.z * decode_rgb10a2(nc);

// Have to transform normals and tangents as necessary.
// Need to pass down some way to load transforms.
normal = mat3(transforms.data[vTransformIndex]) * normal;
normal = normalize(normal);

1.0 ms

Quite the dramatic gain! Nsight Graphics suggests we’re finally SM bound at this point (> 80% utilization), where we used to be ISBE bound (primitive / varying allocation). An alternative that I assume would work just as well is to pass down a primitive ID to a G-buffer similar to Nanite.

There are a lot of caveats with this approach however, and I don’t think I will pursue it:

Moves a ton of extra work to fragment stage
- I’m not aiming for Nanite-style micro-poly hell here, so doing work per-vertex seems better than per-fragment
- This result isn’t representative of a real scene where fragment shader load would be far more significant
Incompatible with encoded meshlet scheme
- It is possible to decode individual values, but it sure is a lot of dependent memory loads to extract a single value
Very awkward to write shader code like this at scale
- Probably need some kind of meta compiler that can generate code, but that’s a rabbit hole I’m not going down
- Need fallbacks, barycentrics is a very modern feature
Makes skinning even more annoying
- Loading multiple matrices with fully dynamic index in fragment shader does not scream performance, then combine that with having to compute motion vectors on top …
Only seems to help throughput on NVIDIA
We’re already way ahead of MDI anyway

Either way, this result was useful to observe.

AMD

Steam Deck

Before running the numbers, we have to consider that the RADV driver already does some mesh shader optimizations for us automatically. The NGG geometry pipeline automatically converts vertex shading workloads into pseudo-meshlets, and RADV also does primitive culling in the driver-generated shader.

To get the raw baseline, we’ll first consider the tests without that path, so we can see how well RADV’s own culling is doing. The legacy vertex path is completely gone on RDNA3 as far as I know, so these tests have to be done on RDNA2.

No culling, plain vkCmdDrawIndexed, RADV_DEBUG=nongg

Even locked to 1600 MHz (peak), GPU is still just consuming 5.5 W. We’re 100% bound on fixed function logic here, the shader cores are sleeping.

44.3 ms

Basic frustum culling

As expected, performance scales as we cull. Still 5.5 W.

27.9 ms

NGG path, no primitive culling, RADV_DEBUG=nonggc

Not too much changed in performance here. We’re still bound on the same fixed function units pumping invisible primitives through.

28.4 ms

Actual RADV path

When we don’t cripple RADV, we get a lot of benefit from driver culling. GPU hits 12.1 W now.

9.6 ms

MDI

Slight win.

8.9 ms

Forcing Wave32 in mesh shaders

Using Vulkan 1.3’s subgroup size control feature, we can force RDNA2 to execute in Wave32 mode. This requires support in

 VkShaderStageFlags requiredSubgroupSizeStages;

The Deck drivers and upstream Mesa ship support for requiredSize task/mesh shaders now, which is very handy. AMD’s Windows drivers or AMDVLK/amdgpu-pro do not, however 🙁. It’s possible Wave32 isn’t the best idea for AMD mesh shaders in the first place, it’s just that the format favors Wave32, so I enable it if I can.

Testing various parameters

While NVIDIA really likes 32/32 (anything else I tried fell off the perf cliff), AMD should in theory favor larger workgroups. However, it’s not that easy in practice, as I found.

Decoded meshlet – Wave32 – N/N prim/vert

32/32: 9.3 ms
64/64: 10.5 ms
128/128: 11.2 ms
256/256: 12.8 ms

These results are … surprising.

Encoded meshlet – Wave32 N/N prim/vert

32/32: 10.7 ms
64/64: 11.8 ms
128/128: 12.7 ms
256/256: 14.7 ms

Apparently Deck (or RDNA2 in general) likes small meshlets?

Wave64?

No meaningful difference in performance on Deck.

VertexID passthrough?

No meaningful difference either. This is a very NVIDIA-centric optimization I think.

A note on LocalInvocation output

In Vulkan, there are some properties that AMD sets for mesh shaders.

VkBool32 prefersLocalInvocationVertexOutput;
VkBool32 prefersLocalInvocationPrimitiveOutput;

This means that we should write outputs using LocalInvocationIndex, which corresponds to how RDNA hardware works. Each thread can export one primitive and one vertex and the thread index corresponds to primitive index / vertex index. Due to culling and compaction, we will have to roundtrip through groupshared memory somehow to satisfy this.

For the encoded representation, I found that it’s actually faster to ignore this suggestion, but for the decoded representation, we can just send the vertex IDs through groupshared, and do split vertex / attribute shading. E.g.:

if (meshlet_lane_has_active_vert())
{
    uint out_vert_index = meshlet_compacted_vertex_output();
    uint vert_id = meshlet.vertex_offset + linear_index;

    shared_clip_pos[out_vert_index] = clip_pos;
    shared_attr_index[out_vert_index] = vert_id;
}

barrier();

if (gl_LocalInvocationIndex < shared_active_vert_count_total)
{
    TexturedAttr a =
      attr.data[shared_attr_index[gl_LocalInvocationIndex]];
    mediump vec3 n = unpack_bgr10a2(a.n).xyz;
    mediump vec4 t = unpack_bgr10a2(a.t);
    gl_MeshVerticesEXT[gl_LocalInvocationIndex].gl_Position =
      shared_clip_pos[gl_LocalInvocationIndex];
    vUV[gl_LocalInvocationIndex] = a.uv;
    vNormal[gl_LocalInvocationIndex] = mat3(M) * n;
    vTangent[gl_LocalInvocationIndex] = vec4(mat3(M) * t.xyz, t.w);
}

Only computing visible attributes is a very common optimization in GPUs in general and RADV’s NGG implementation does it roughly like this.
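
The snippet above leans on meshlet_compacted_vertex_output(), which is not shown. As a rough CPU-side sketch in C++ of one way such a compaction index can be computed, assuming a 32-bit mask with one bit per lane that still has a visible vertex (the helper names here are made up for illustration and are not Granite's):

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: the output slot a lane writes to after compaction.
// active_mask has bit N set when lane N holds a visible vertex.
inline uint32_t compacted_output_index(uint32_t active_mask, uint32_t lane)
{
    assert(lane < 32u);
    assert(((active_mask >> lane) & 1u) != 0u); // Only active lanes get a slot.
    // Exclusive prefix bit count: number of active lanes below this one.
    uint32_t below = active_mask & ((1u << lane) - 1u);
    return uint32_t(std::bitset<32>(below).count());
}

// Hypothetical helper: how many threads end up doing attribute shading.
inline uint32_t compacted_output_count(uint32_t active_mask)
{
    return uint32_t(std::bitset<32>(active_mask).count());
}
```

With lanes 1, 2 and 4 active, those lanes land in slots 0, 1 and 2, and only three threads need to fetch attributes after the barrier.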

Either way, we’re not actually beating the driver-based meshlet culling on Deck. It’s more or less already doing this work for us. Given how close the results are, it’s possible we’re still bound on something that’s not raw compute. On the positive side, the cost of using encoded representation is very small here, and saving RAM for meshes is always nice.

Already, the permutation hell is starting to become a problem. It’s getting quite obvious why mesh shaders haven’t taken off yet 🙂

RX 7600 numbers

Data dump section incoming …

NGG culling seems obsolete now?

By default RADV disables NGG culling on RDNA3, because apparently it has a much stronger fixed function culling in hardware now. I tried forcing it on with RADV_DEBUG=nggc, but found no uplift in performance for normal vertex shaders. Curious. Here’s with no culling, where the shader is completely export bound.

But, force NGG on, and it still doesn’t help much. Culling path takes as much time as the other, the instruction latencies are just spread around more.

RADV

vkCmdDrawIndexed, no frustum culling: 5.9 ms
With frustum cull: 3.7 ms
MDI: 5.0 ms

Wave32 – Meshlet

Encoded – 32/32: 3.3 ms
Encoded – 64/64 : 2.5 ms
Encoded – 128/128: 2.7 ms
Encoded – 256/256: 2.9 ms
Decoded – 32/32: 3.3 ms
Decoded – 64/64: 2.4 ms
Decoded – 128/128: 2.6 ms
Decoded – 256/256: 2.7 ms

Wave64 – Meshlet

Encoded – 64/64: 2.4 ms
Encoded – 128/128: 2.6 ms
Encoded – 256/256: 2.7 ms
Decoded – 64/64: 2.2 ms
Decoded – 128/128: 2.5 ms
Decoded – 256/256: 2.7 ms

Wave64 mode is doing quite well here. From what I understand, RADV hasn’t fully taken advantage of the dual-issue instructions in RDNA3 specifically yet, which is important for Wave32 performance, so that might be a plausible explanation.

There was also no meaningful difference in doing VertexID passthrough.

It’s not exactly easy to deduce anything meaningful out of these numbers, other than 32/32 being bad on RDNA3, while good on RDNA2 (Deck)?

AMD doesn’t seem to like the smaller 256 primitive draws on the larger desktop GPUs. I tried 512 and 1024 as a quick test and that improved throughput considerably. Still, with finer grained culling in place, it should be a significant win.

amdgpu-pro / proprietary (Linux)

Since we cannot request specific subgroup size, the driver is free to pick Wave32 or Wave64 as it pleases, so I cannot test the difference. It won’t hit the subgroup optimized paths however.

vkCmdDrawIndexed, no culling : 6.2 ms
With frustum cull: 4.0 ms
MDI: 5.3 ms
Meshlet – Encoded – 32/32: 2.5 ms
Meshlet – Encoded – 64/64 : 2.6 ms
Meshlet – Encoded – 128/128: 2.7 ms
Meshlet – Encoded – 256/256: 2.6 ms
Meshlet – Decoded – 32/32: 2.1 ms
Meshlet – Decoded – 64/64: 2.1 ms
Meshlet – Decoded – 128/128: 2.1 ms
Meshlet – Decoded – 256/256: 2.1 ms

I also did some quick spot checks on AMDVLK, and the numbers are very similar.

The proprietary driver is doing quite well here in mesh shaders. On desktop, we can get significant wins on both RADV and proprietary with mesh shaders, which is nice to see.

It seems like the AMD Windows driver skipped NGG culling on RDNA3 as well. Performance is basically the same.

Task shader woes

The job of task shaders is to generate mesh shader work on the fly. In principle this is nicer than indirect rendering with mesh shaders for two reasons:

No need to allocate temporary memory to hold for indirect draw
No need to add extra compute passes with barriers

However, it turns out that this shader stage is even more vendor specific when it comes to tuning for performance. So far, no game I know of has actually shipped with task shaders (or the D3D12 equivalent amplification shader), and I think I now understand why.

The basic task unit I settled on was:

struct TaskInfo
{
    uint32_t aabb_instance;  // AABB, for top-level culling
    uint32_t node_instance;  // Affine transform
    uint32_t material_index; // To be eventually forwarded to fragment
    uint32_t mesh_index_count;
    // Encodes count [1, 32] in lower bits.
    // Mesh index is aligned to 32.
    uint32_t occluder_state_offset;
    // For two-phase occlusion state (for later)
};

An array of these is prepared on CPU. Each scene entity translates to one or more TaskInfos. Those are batched up into one big buffer, and off we go.
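
The comment on mesh_index_count only says that the count lives in the lower bits and that the mesh index is aligned to 32. A minimal C++ sketch of one packing that fits that description, assuming the low 5 bits store count - 1 (that exact encoding is my assumption, as are the function names):

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: mesh_index is a multiple of 32, so its low 5 bits are free
// to hold (count - 1) for a meshlet count in [1, 32].
inline uint32_t pack_mesh_index_count(uint32_t mesh_index, uint32_t count)
{
    assert((mesh_index & 31u) == 0u);
    assert(count >= 1u && count <= 32u);
    return mesh_index | (count - 1u);
}

// Recover the 32-aligned index of the first meshlet.
inline uint32_t unpack_mesh_index(uint32_t packed)
{
    return packed & ~31u;
}

// Recover the meshlet count in [1, 32].
inline uint32_t unpack_meshlet_count(uint32_t packed)
{
    return (packed & 31u) + 1u;
}
```

Storing count - 1 is what lets a full group of 32 meshlets fit in 5 bits.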

The logical task shader for me was to have N = 32 threads which tests AABB of N tasks in parallel. For the tasks that pass the test, test 32 meshlets in parallel. This makes it so the task workgroup can emit up to 1024 meshlets.

When I tried this on NVIDIA however …

18.8 ms

10x slowdown … The NVIDIA docs do mention that large outputs are bad, but I didn’t expect it to be this bad:

Avoid large outputs from the amplification shader, as this can incur a significant performance penalty. Generally, we encourage a flexible implementation that allows for fine-tuning. With that in mind, there are a number of generic factors that impact performance:

Size of the payloads. The AS payload should preferably stay below 108 bytes, but if that is not possible, then keep it at least under 236 bytes.

If we remove all support for hierarchical culling, the task shader runs alright again. 1 thread emits 0 or 1 meshlet. This means a lot of threads are dedicated to culling, but performance is similar to plain indirect mesh shading.

AMD however, is a completely different story. Task shaders are implemented by essentially emitting a bunch of tiny indirect mesh shader dispatches anyway, so the usefulness of task shaders on AMD is questionable from a performance point of view. While writing this blog, AMD released a new blog on the topic, how convenient!

When I tried NV-style task shader on AMD, performance suffered quite a lot.

However, the only thing that gets us to max perf on both AMD and NV is to forget about task shaders and go with vkCmdDrawMeshTasksIndirectCountEXT instead. While the optimal task shader path for each vendor gets close to indirect mesh shading, having a universal fast path is good for my sanity. The task shader loss was about 10% for me even in ideal situations on both vendors, which isn’t great. As rejection ratios increase, this loss grows even more. This kind of occupancy looks way better 🙂

The reason for using multi-indirect-count is to deal with the limitation that we can only submit about 64k workgroups in any dimension, similar to compute. This makes 1D atomic increments awkward, since we’ll easily blow past the 64k limit. One alternative is to take another tiny compute pass that prepares a multi-indirect draw, but that’s not really needed. Compute shader code like this works too:

// global_offset = atomicAdd() in thread 0

if (gl_LocalInvocationIndex == 0 && draw_count != 0)
{
  uint max_global_offset = global_offset + draw_count - 1;
  // Meshlet style.
  // Only guaranteed to get 0xffff meshlets,
  // so use 32k as cutoff for easy math.
  // Allocate the 2D draws in-place, avoiding an extra barrier.
  uint multi_draw_index = max_global_offset / 0x8000u;
  uint local_draw_index = max_global_offset & 0x7fffu;
  const int INC_OFFSET = NUM_CHUNK_WORKGROUPS == 1 ? 0 : 1;
  atomicMax(output_draws.count[1], multi_draw_index + 1);
  atomicMax(output_draws.count[
    2 + 3 * multi_draw_index + INC_OFFSET],
    local_draw_index + 1);

  if (local_draw_index <= draw_count)
  {
    // This is the thread that takes us over the threshold.
    output_draws.count[
      2 + 3 * multi_draw_index + 1 - INC_OFFSET] =
      NUM_CHUNK_WORKGROUPS;
    output_draws.count[2 + 3 * multi_draw_index + 2] = 1;
  }

  // Wrapped around, make sure last bucket sees 32k meshlets.
  if (multi_draw_index != 0 && local_draw_index < draw_count)
  {
    atomicMax(output_draws.count[
      2 + 3 * (multi_draw_index - 1) +
      INC_OFFSET], 0x8000u);
  }
}

This prepares a bunch of (8, 32k, 1) dispatches that are processed in one go. No chance to observe a bunch of dead dispatches back-to-back like task shaders can cause. In the mesh shader, we can use DrawIndex to offset the WorkGroupID by the appropriate amount (yay, Vulkan). A dispatchX count of 8 is to shade the full 256-wide meshlet through 8x 32-wide workgroups. As the workgroup size increases to handle more sublets per group, dispatchX count decreases similarly.
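
To make the end result of that allocation easier to picture, here is a CPU-side sketch in C++ of the state the indirect buffer should settle into once all meshlets have been allocated. It ignores the atomics and the INC_OFFSET dimension swap entirely, and the function names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kMeshletsPerDraw = 0x8000u; // 32k cutoff from the shader.

// Hypothetical helper: meshlet count of each indirect draw for a total.
inline std::vector<uint32_t> split_meshlets_into_draws(uint32_t total_meshlets)
{
    std::vector<uint32_t> draws;
    if (total_meshlets == 0u)
        return draws;

    uint32_t last = total_meshlets - 1u;
    uint32_t last_draw = last / kMeshletsPerDraw;
    // Every draw before the last one is full.
    draws.assign(last_draw + 1u, kMeshletsPerDraw);
    draws.back() = (last & (kMeshletsPerDraw - 1u)) + 1u;
    return draws;
}

// Hypothetical helper: what the mesh shader does with DrawIndex.
inline uint32_t linear_meshlet_index(uint32_t draw_index, uint32_t workgroup_id)
{
    assert(workgroup_id < kMeshletsPerDraw);
    return draw_index * kMeshletsPerDraw + workgroup_id;
}
```

For example, 70000 meshlets turn into three draws of 32768, 32768 and 4464 meshlets.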

Occlusion culling

To complete the meshlet renderer, we need to consider occlusion culling. The go-to technique for this these days is two-phase occlusion culling with HiZ depth buffer. Some references:

https://advances.realtimerendering.com/s2015/ – GPU-Driven Rendering Pipelines
https://medium.com/@mil_kru/two-pass-occlusion-culling-4100edcad501 – This is a quite nice tutorial on the subject.
https://www.youtube.com/watch?v=eviSykqSUUw – 07:53 – Nanite deep-dive presentation
https://github.com/zeux/niagara – Niagara renderer by Zeux. Basically implemented all of this a long time ago.

Basic gist is to keep track of which meshlets are considered visible. This requires persistent storage of 1 bit per unit of visibility. Each pass in the renderer needs to keep track of its own bit-array. E.g. shadow passes have different visibility compared to main scene rendering.

For Granite, I went with an approach where 1 TaskInfo points to one uint32_t bitmask. Each of the 32 meshlets within the TaskInfo gets 1 bit. This makes the hierarchical culling nice too, since we can just test for visibility != 0 on the entire word. Nifty!
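
As an illustration of that layout, a small C++ mirror of such a bit-array could look like the following. This is only a sketch of the idea with hypothetical names; the real thing lives in a GPU buffer and is updated by shaders:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side mirror of a per-pass visibility buffer:
// one 32-bit word per TaskInfo, one bit per meshlet in that task.
struct VisibilityBits
{
    explicit VisibilityBits(std::size_t task_count) : words(task_count, 0u) {}

    void set(std::size_t task, uint32_t meshlet, bool visible)
    {
        assert(task < words.size() && meshlet < 32u);
        uint32_t bit = 1u << meshlet;
        if (visible)
            words[task] |= bit;
        else
            words[task] &= ~bit;
    }

    bool meshlet_visible(std::size_t task, uint32_t meshlet) const
    {
        assert(task < words.size() && meshlet < 32u);
        return ((words[task] >> meshlet) & 1u) != 0u;
    }

    // Hierarchical test: a single compare covers all 32 meshlets.
    bool task_has_visible_meshlets(std::size_t task) const
    {
        assert(task < words.size());
        return words[task] != 0u;
    }

    std::vector<uint32_t> words;
};
```

Each render pass with its own visibility (main view, each shadow pass) would own a separate instance of this.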

First phase

Here we render all objects which were considered visible last frame. It’s extremely likely that whatever was visible last frame is visible this frame, unless there was a full camera cut or similar. It’s important that we’re actually rendering to the framebuffer now. In theory, we’d be done rendering now if there were no changes to camera or objects in the scene.

HiZ pass

Based on the objects we drew in phase 1, build a HiZ depth map. This topic is actually kinda tricky. Building the mip-chain in one pass is great for performance, but causes some problems. With NPOT textures and single pass, there is no obvious way to create a functional HiZ, and the go-to shader for this, FidelityFX SPD, doesn’t support that use case.

The problem is that the size of mip-maps round down, so if we have a 7×7 texture, LOD 1 is 3×3 and LOD 2 is 1×1. In LOD2, we will be able to query a 4×4 depth region, but the edge pixels are forgotten.

The “obvious” workaround is to pad the texture to POT, but that is a horrible waste of VRAM. The solution I went with instead was to fold in the neighbors as the mips are reduced. This makes it so that the edge pixels in each LOD also remembers depth information for pixels which were truncated away due to NPOT rounding.
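
To show the folding idea in isolation, here is a 1D C++ sketch of a single reduction step. It is not the actual shader, which is 2D and single-pass. It assumes a max reduction, which matches the max() used when sampling the HiZ in the culling code, and the function name is made up:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reduce one row of depth values to the next mip level (width rounds down).
// When the source width is odd, the last output texel also folds in the
// leftover source texel that would otherwise be forgotten.
inline std::vector<float> reduce_hiz_row(const std::vector<float> &src)
{
    assert(!src.empty());
    std::size_t dst_width = std::max<std::size_t>(src.size() / 2, 1);
    std::vector<float> dst(dst_width);

    for (std::size_t x = 0; x < dst_width; x++)
    {
        std::size_t lo = 2 * x;
        std::size_t hi = std::min(lo + 1, src.size() - 1);
        float d = std::max(src[lo], src[hi]);

        // NPOT case: fold the truncated edge texel into the last output.
        if (x + 1 == dst_width && lo + 2 < src.size())
            d = std::max(d, src[lo + 2]);

        dst[x] = d;
    }

    return dst;
}
```

Reducing a 7-wide row { 1, 2, 3, 4, 5, 6, 9 } gives { 2, 4, 9 } and then { 9 }. Without the fold, the 9 at the edge would be dropped at the first step.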

I rolled a custom HiZ shader similar to SPD with some extra subgroup shenanigans because why not (SubgroupShuffleXor with 4 and 8).

Second phase

In this pass we submit for rendering any object which became visible this frame, i.e. the visibility bit was not set, but it passed occlusion test now. Again, if camera did not change, and objects did not move, then nothing should be rendered here.

However, we still have to test every object, in order to update the visibility buffer for next frame. We don’t want visibility to remain sticky, unless we have dedicated proxy geometry to serve as occluders (might still be a thing if game needs to handle camera cuts without large jumps in rendering time).
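
Per 32-bit visibility word, the bookkeeping across the two phases boils down to a few bit operations. A minimal C++ sketch, assuming frustum culling is handled elsewhere and using hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

struct TwoPhaseMasks
{
    uint32_t phase1_draw;  // Drawn before the HiZ is built.
    uint32_t phase2_draw;  // Drawn after testing against the HiZ.
    uint32_t next_visible; // Stored for the next frame.
};

// visible_last_frame: persistent bits from the previous frame.
// passes_hiz_now: result of testing every meshlet against this frame's HiZ.
inline TwoPhaseMasks two_phase_masks(uint32_t visible_last_frame,
                                     uint32_t passes_hiz_now)
{
    TwoPhaseMasks m;
    m.phase1_draw = visible_last_frame;
    // Only meshlets which became visible this frame are drawn in phase 2.
    m.phase2_draw = passes_hiz_now & ~visible_last_frame;
    // Not sticky: a meshlet which fails the test now loses its bit.
    m.next_visible = passes_hiz_now;
    // No meshlet is ever drawn in both phases.
    assert((m.phase1_draw & m.phase2_draw) == 0u);
    return m;
}
```

If nothing moved, passes_hiz_now equals visible_last_frame and phase2_draw comes out as zero.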

In this pass we can cull meshlet bounds against the HiZ.

Because I cannot be arsed to make a fancy SVG for this, the math to compute a tight AABB bound for a sphere is straightforward once the geometry is understood.

The gist is to figure out the angle, then rotate the (X, W) vector with positive and negative angles. X / W becomes the projected lower or upper bound. Y bounds are computed separately.

vec2 project_sphere_flat(float view_xy, float view_z, float radius)
{
    float len = length(vec2(view_xy, view_z));
    float sin_xy = radius / len;

    float cos_xy = sqrt(1.0 - sin_xy * sin_xy);
    vec2 rot_lo = mat2(cos_xy, sin_xy, -sin_xy, cos_xy) *
      vec2(view_xy, view_z);
    vec2 rot_hi = mat2(cos_xy, -sin_xy, +sin_xy, cos_xy) *
      vec2(view_xy, view_z);

    return vec2(rot_lo.x / rot_lo.y, rot_hi.x / rot_hi.y);
}

The math is done in view space where the sphere is still a sphere, which is then scaled to window coordinates afterwards. To make the math easier to work with, I use a modified view space in this code where +Y is down and +Z is in view direction.
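
For sanity checking, here is a straight C++ port of project_sphere_flat with the mat2 multiplies written out. The struct name is made up, and the sketch assumes the sphere sits fully in front of the camera, so the divides are well-behaved:

```cpp
#include <cassert>
#include <cmath>

struct ProjectedRange
{
    float lo;
    float hi;
};

inline ProjectedRange project_sphere_flat(float view_xy, float view_z,
                                          float radius)
{
    float len = std::sqrt(view_xy * view_xy + view_z * view_z);
    assert(len > radius); // Camera must be outside the sphere.

    // Angle between the center direction and the tangent lines.
    float sin_xy = radius / len;
    float cos_xy = std::sqrt(1.0f - sin_xy * sin_xy);

    // GLSL mat2 is column-major: mat2(c, s, -s, c) * v.
    float lo_x = cos_xy * view_xy - sin_xy * view_z;
    float lo_w = sin_xy * view_xy + cos_xy * view_z;

    // mat2(c, -s, s, c) * v.
    float hi_x = cos_xy * view_xy + sin_xy * view_z;
    float hi_w = -sin_xy * view_xy + cos_xy * view_z;

    return { lo_x / lo_w, hi_x / hi_w };
}
```

A unit sphere centered on the view axis at distance 2 has its tangent lines at 30 degrees, so the bounds come out as -tan(30°) and +tan(30°), roughly ±0.577.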

bool hiz_cull(vec2 view_range_x, vec2 view_range_y, float closest_z)
// view_range_x: .x -> lower bound, .y -> upper bound
// view_range_y: same
// closest_z: linear depth. ViewZ - Radius for a sphere

First, convert to integer coordinates.

// Viewport scale first applies any projection scale in X/Y
// (without Y flip).
// The scale also does viewport size / 2 and then
// offsets into integer window coordinates.

vec2 range_x = view_range_x *
  frustum.viewport_scale_bias.x +
  frustum.viewport_scale_bias.z;
vec2 range_y = view_range_y *
  frustum.viewport_scale_bias.y +
  frustum.viewport_scale_bias.w;

ivec2 ix = ivec2(range_x);
ivec2 iy = ivec2(range_y);

ix.x = clamp(ix.x, 0, frustum.hiz_resolution.x - 1);
ix.y = clamp(ix.y, ix.x, frustum.hiz_resolution.x - 1);
iy.x = clamp(iy.x, 0, frustum.hiz_resolution.y - 1);
iy.y = clamp(iy.y, iy.x, frustum.hiz_resolution.y - 1);

Figure out a LOD where we only have to sample a 2×2 footprint. findMSB to the rescue.

int max_delta = max(ix.y - ix.x, iy.y - iy.x);
int lod = min(findMSB(max_delta - 1) + 1, frustum.hiz_max_lod);
ivec2 lod_max_coord = max(frustum.hiz_resolution >> lod, ivec2(1)) - 1;

// Clamp to size of the actual LOD.
ix = min(ix >> lod, lod_max_coord.xx);
iy = min(iy >> lod, lod_max_coord.yy);
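
The LOD selection is the part that is easy to get off by one, so here is a small C++ sketch of it with an emulation of GLSL's findMSB for signed input. The function names are mine. The property to check is that, at the chosen LOD, the two ends of the range land at most one texel apart:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Emulates GLSL findMSB() for a signed 32-bit input:
// returns -1 for 0 and for -1, otherwise the index of the most
// significant bit that differs from the sign bit.
inline int find_msb(int32_t v)
{
    uint32_t u = v < 0 ? ~uint32_t(v) : uint32_t(v);
    int msb = -1;
    while (u != 0u)
    {
        u >>= 1;
        msb++;
    }
    return msb;
}

// Smallest LOD where a range of max_delta texels fits in a 2-wide footprint.
inline int select_hiz_lod(int max_delta, int hiz_max_lod)
{
    assert(max_delta >= 0);
    return std::min(find_msb(max_delta - 1) + 1, hiz_max_lod);
}
```

A delta of 0 or 1 stays at LOD 0, a delta of 2 needs LOD 1, and deltas of 3 and 4 need LOD 2.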

And finally, sample:

ivec2 hiz_coord = ivec2(ix.x, iy.x);

float d = texelFetch(uHiZDepth, hiz_coord, lod).x;
bool nx = ix.y != ix.x;
bool ny = iy.y != iy.x;

if (nx)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 0)).x);

if (ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(0, 1)).x);

if (nx && ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 1)).x);

return closest_z < d;

Trying to get up-close, it’s quite effective.

Without culling:

With two-phase:

As the culling becomes more extreme, GPU go brrrrr. Mostly just bound on HiZ pass and culling passes now which can probably be tuned a lot more.

Conclusion

I’ve spent way too much time on this now, and I just need to stop endlessly tuning various parameters. This is the true curse of mesh shaders, there’s always something to tweak. Given the performance I’m getting, I can call this a success, even if there might be some wins left on the table by tweaking some more. Now I just need to take a long break from mesh shaders before I actually rewrite the renderer to use this new code … And maybe one day I can even think about how to deal with LODs, then I would truly have Nanite at home!

The “compression” format ended up being something that can barely be called a compression format. To chase decode performance of tens of billions of primitives per second, I suppose that’s just how it is.

Vulkan video shenanigans – FFmpeg + RADV integration experiments

Vulkan video is finally here and it’s a fierce battle to get things working fully. The leaders of the pack right now with the full release are RADV (Dave Airlie) and FFmpeg (Lynne).

In Granite, I’ve been wanting a solid GPU video decoding solution and I figured I’d work on a Vulkan video implementation over the holidays to try helping iron out any kinks with real-world application integration. The goal was achieving everything a 3D engine could potentially want out of video decode.

Hardware accelerated
GPU decode to RGB without round-trip through system memory (with optional mip generation when placed in a 3D world)
Audio decode
A/V synchronization

This blog is mostly here to demonstrate the progress in FFmpeg + RADV. I made a neat little sample app that fully uses Vulkan video to do a simple Sponza cinema. It supports A/V sync and seeking, which covers most of what a real media player would need. Ideally, this can be used as a test bench.

Place a video feed as a 3D object inside Sponza, why not?

Introduction blog post – read this first

This blog post by Lynne summarizes the state of Vulkan video at the time it was written. Note that none of this is merged upstream as of writing and APIs are changing rapidly.

Building FFmpeg + RADV + Granite

FFmpeg

Make sure to install the very latest Vulkan headers. On Arch Linux, install vulkan-headers-git from AUR for example.

Check out the branch in the blog and build. Make sure to install it in some throwaway prefix, e.g.

./configure --disable-doc --disable-shared --enable-static --disable-ffplay --disable-ffprobe --enable-vulkan --prefix=$HOME/ffmpeg-vulkan

Mesa

Check out https://gitlab.freedesktop.org/airlied/mesa/-/commits/radv-vulkan-video-decode. Then build with:

mkdir build
cd build
meson setup .. -Dvideo-codecs=h264dec,h265dec --buildtype release
ninja

Granite

git clone https://github.com/Themaister/Granite
cd Granite
git submodule update --init
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DGRANITE_FFMPEG=ON -DGRANITE_AUDIO=ON -DGRANITE_FFMPEG_VULKAN=ON -G Ninja -DCMAKE_PREFIX_PATH=$HOME/ffmpeg-vulkan
ninja video-player

Running test app

Basic operation, a weird video player where the image is a flat 3D object floating in space. For fun the video is also mip-mapped and the plane is anisotropically filtered, because why not.

RADV_PERFTEST=video_decode GRANITE_FFMPEG_VULKAN=1 ./tests/video-player /tmp/test.mkv

Controls

WASD: move camera
Arrow keys: rotate camera
Space: Toggle pause
HJKL: Vim style for seeking

If you have https://github.com/KhronosGroup/glTF-Sample-Models checked out you can add a glTF scene as well for fun. I hacked it together with Sponza in mind, so:

RADV_PERFTEST=video_decode GRANITE_FFMPEG_VULKAN=1 ./tests/video-player $HOME/git/glTF-Sample-Models/2.0/Sponza/glTF/Sponza.gltf /tmp/test.mkv

and then you get the screenshot above with whatever video you’re using 🙂

Integration API

The Granite implementation can be found in https://github.com/Themaister/Granite/blob/master/video/ffmpeg_decode.cpp. It will probably be different in the final upstreamed version, so beware. I’m not an FFmpeg developer either FWIW, so take this implementation with a few grains of salt.

To integrate with Vulkan video, there are some steps we need to take. This assumes some familiarity with FFmpeg APIs. This is mostly interesting for non-FFmpeg developers. I had to figure this out with help from Lynne, spelunking in mpv and looking over the hardware decode samples in FFmpeg upstream.

Creating shared device

Before opening the decode context with:

avcodec_open2(ctx, codec, nullptr)

we will provide libavcodec with a hardware device context. Iterate over the indices with

avcodec_get_hw_config(codec, index)

to scan through until you find a Vulkan configuration.

AVBufferRef *hw_dev = av_hwdevice_ctx_alloc(config->device_type);
auto *hwctx = reinterpret_cast<AVHWDeviceContext *>(hw_dev->data);
auto *vk = static_cast<AVVulkanDeviceContext *>(hwctx->hwctx);

hwctx->user_opaque = this; // For callbacks later.

To interoperate with FFmpeg, we have to provide it our own Vulkan device and lots of information about how we created the device.

vk->get_proc_addr = Vulkan::Context::get_instance_proc_addr();
vk->inst = device->get_instance();
vk->act_dev = device->get_device();
vk->phys_dev = device->get_physical_device();
vk->device_features = *device->get_device_features().pdf2;
vk->enabled_inst_extensions =
  device->get_device_features().instance_extensions;
vk->nb_enabled_inst_extensions =
  int(device->get_device_features().num_instance_extensions);
vk->enabled_dev_extensions =
  device->get_device_features().device_extensions;
vk->nb_enabled_dev_extensions =
  int(device->get_device_features().num_device_extensions);

Fortunately, I had most of this query scaffolding in place for Fossilize integration already. Vulkan 1.3 core is required here as well, so I had to bump that too when Vulkan video is enabled.

auto &q = device->get_queue_info();

vk->queue_family_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_GRAPHICS]);
vk->queue_family_comp_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_COMPUTE]);
vk->queue_family_tx_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_TRANSFER]);
vk->queue_family_decode_index =
  int(q.family_indices[Vulkan::QUEUE_INDEX_VIDEO_DECODE]);

vk->nb_graphics_queues = int(q.counts[Vulkan::QUEUE_INDEX_GRAPHICS]);
vk->nb_comp_queues = int(q.counts[Vulkan::QUEUE_INDEX_COMPUTE]);
vk->nb_tx_queues = int(q.counts[Vulkan::QUEUE_INDEX_TRANSFER]);
vk->nb_decode_queues = int(q.counts[Vulkan::QUEUE_INDEX_VIDEO_DECODE]);

vk->queue_family_encode_index = -1;
vk->nb_encode_queues = 0;

We need to let FFmpeg know about how it can query queues. Close match with Granite, but I had to add some extra APIs to make this work.

We also need a way to lock Vulkan queues:

vk->lock_queue = [](AVHWDeviceContext *ctx, int, int) {
   auto *self = static_cast<Impl *>(ctx->user_opaque);
   self->device->external_queue_lock();
};

vk->unlock_queue = [](AVHWDeviceContext *ctx, int, int) {
   auto *self = static_cast<Impl *>(ctx->user_opaque);
   self->device->external_queue_unlock();
};

For integration purposes, not making vkQueueSubmit internally synchronized in Vulkan was a mistake I think, oh well.

Once we’ve created a hardware context, we can let the codec context borrow it:

int ret = av_hwdevice_ctx_init(hw_dev); // Returns 0 on success.
// Check error.
hw.device = hw_dev; // Unref later.

ctx->hw_device_ctx = av_buffer_ref(hw.device);

We also have to override get_format() and return the hardware pixel format.

ctx->opaque = this;
ctx->get_format = [](
    AVCodecContext *ctx,
    const enum AVPixelFormat *pix_fmts) -> AVPixelFormat {
  auto *self = static_cast<Impl *>(ctx->opaque);
  while (*pix_fmts != AV_PIX_FMT_NONE)
  {
    if (*pix_fmts == self->hw.config->pix_fmt)
      return *pix_fmts;
    pix_fmts++;
  }

  return AV_PIX_FMT_NONE;
};

This will work, but we’re also supposed to create a frames context before returning from get_format(). This also lets us configure how Vulkan images are created.

int ret = avcodec_get_hw_frames_parameters(
      ctx, ctx->hw_device_ctx,
      AV_PIX_FMT_VULKAN, &ctx->hw_frames_ctx);
// Check error.

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
auto *vk = static_cast<AVVulkanFramesContext *>(frames->hwctx);

vk->img_flags |= VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT;

ret = av_hwframe_ctx_init(ctx->hw_frames_ctx);
// Check error.

The primary motivation for overriding image creation was that I wanted to do YCbCr to RGB conversion in a more unified way, i.e. using individual planes. That would be compatible with non-Vulkan video as well, but taking plane views of an image requires VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT.

Using per-plane views is important, as we’ll see later. YCbCr samplers fall flat when dealing with practical video use cases.

Processing AVFrames

In FFmpeg, decoding works by sending AVPackets to a codec and it spits out AVFrame objects. If these frames are emitted by a software codec, we just poke at AVFrame::data[] directly, but with hardware decoders, AVFrame::format is an opaque hardware pixel format.

There are two ways we can deal with this. For non-Vulkan hardware decoders, just read-back and upload planes to a VkBuffer staging buffer later, ewwww.

AVFrame *sw_frame = av_frame_alloc();

if (av_hwframe_transfer_data(sw_frame, av_frame, 0) < 0)
{
   LOGE("Failed to transfer HW frame.\n");
   av_frame_free(&sw_frame);
   av_frame_free(&av_frame);
}
else
{
   sw_frame->pts = av_frame->pts;
   av_frame_free(&av_frame);
   av_frame = sw_frame;
}

Each hardware pixel format lets you reinterpret AVFrame::data[] in a “magical” way if you’re willing to poke into low-level data structures. For VAAPI, VDPAU and APIs like that there are ways to use buffer sharing somehow, but the details are extremely hairy and are best left to experts. For Vulkan, we don’t even need external memory!

First, we need to extract the decode format:

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
active_upload_pix_fmt = frames->sw_format;

Then we can query the VkFormat if we want to stay multi-plane.

auto *hwdev =
  reinterpret_cast<AVHWDeviceContext *>(hw.device->data);
const VkFormat *fmts = nullptr;
VkImageAspectFlags aspects;
VkImageUsageFlags usage;
int nb_images;

int ret = av_vkfmt_from_pixfmt2(hwdev, active_upload_pix_fmt,
                                VK_IMAGE_USAGE_SAMPLED_BIT, &fmts,
                                &nb_images, &aspects, &usage);

However, this has some pitfalls in practice. Video frames tend to be aligned to a macro-block size or similar, meaning that the VkImage dimension might not be equal to the actual size we’re supposed to display. Even 1080p falls in this category for example since 1080 does not cleanly divide into 16×16 macro blocks. The only way to resolve this without extra copies is to view planes separately with VK_IMAGE_ASPECT_PLANE_n_BIT and do texture coordinate clamping manually. This way we avoid sampling garbage when converting to RGB. av_vkfmt_from_pixfmt can help here to deduce the per-plane Vulkan formats, but I just did it manually either way.

// Real output size.
ubo.resolution = uvec2(video.av_ctx->width, video.av_ctx->height);

if (video.av_ctx->hw_frames_ctx && hw.config &&
    hw.config->device_type == AV_HWDEVICE_TYPE_VULKAN)
{
   // Frames (VkImages) may be padded.
   auto *frames = reinterpret_cast<AVHWFramesContext *>(
       video.av_ctx->hw_frames_ctx->data);
   ubo.inv_resolution = vec2(
       1.0f / float(frames->width),
       1.0f / float(frames->height));
}
else
{
   ubo.inv_resolution = vec2(1.0f / float(video.av_ctx->width),
                             1.0f / float(video.av_ctx->height));
}

// Have to emulate CLAMP_TO_EDGE to avoid filtering against garbage.
ubo.chroma_clamp =
  (vec2(ubo.resolution) - 0.5f * float(1u << plane_subsample_log2[1])) *
  ubo.inv_resolution;
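
To put numbers on the 1080p case, here is a self-contained C++ sketch of the same clamp computation. It assumes the image height is padded up to a multiple of 16, which is only an illustration; the real padded size comes from AVHWFramesContext as shown above. Vec2, align_up and chroma_clamp_uv are made-up names:

```cpp
#include <cassert>
#include <cstdint>

struct Vec2
{
    float x;
    float y;
};

// Round v up to a power-of-two alignment.
inline uint32_t align_up(uint32_t v, uint32_t alignment)
{
    assert(alignment != 0u && (alignment & (alignment - 1u)) == 0u);
    return (v + alignment - 1u) & ~(alignment - 1u);
}

// Largest normalized UV which still filters only against real chroma texels.
// display_*: size we are supposed to show, coded_*: padded VkImage size.
inline Vec2 chroma_clamp_uv(uint32_t display_w, uint32_t display_h,
                            uint32_t coded_w, uint32_t coded_h,
                            uint32_t chroma_subsample_log2)
{
    // Half a chroma texel, expressed in luma pixels.
    float half_texel = 0.5f * float(1u << chroma_subsample_log2);
    return { (float(display_w) - half_texel) / float(coded_w),
             (float(display_h) - half_texel) / float(coded_h) };
}
```

For 4:2:0 content at 1920×1080 in a 1920×1088 image, the clamp ends up at 1919/1920 horizontally and 1079/1088 vertically, which keeps bilinear filtering away from the 8 padding rows.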

Processing the frame itself starts with magic casts:

auto *frames =
  reinterpret_cast<AVHWFramesContext *>(ctx->hw_frames_ctx->data);
auto *vk = static_cast<AVVulkanFramesContext *>(frames->hwctx);
auto *vk_frame = reinterpret_cast<AVVkFrame *>(av_frame->data[0]);

We have to lock the frame while accessing it, FFmpeg is threaded.

vk->lock_frame(frames, vk_frame);
// Do stuff
vk->unlock_frame(frames, vk_frame);

Now, we have to wait on the timeline semaphore (note that Vulkan 1.3 is required, so this is guaranteed to be supported).

// Acquire the image from FFmpeg.
if (vk_frame->sem[0] != VK_NULL_HANDLE && vk_frame->sem_value[0])
{
   // vkQueueSubmit(wait = sem[0], value = sem_value[0])
}

Create a VkImageView from the provided image. Based on av_vkfmt_from_pixfmt2 or per-plane formats from earlier, we know the appropriate Vulkan format to use when creating a view.

Queue family ownership transfer is not needed. FFmpeg uses CONCURRENT for sake of our sanity.

Transition the layout:

cmd->image_barrier(
    *wrapped_image,
    vk_frame->layout[0],
    VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT /* sem wait stage */, 0,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_ACCESS_2_SHADER_SAMPLED_READ_BIT);

vk_frame->layout[0] = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

Now, we can convert this to RGB as we desire. I went with an async compute formulation. If this were a pure video player we could probably blit this directly to screen with some fancy scaling filters.

When we’re done, we have to “release” the image back to FFmpeg.

// Release the image back to FFmpeg.
if (vk_frame->sem[0] != VK_NULL_HANDLE)
{
   vk_frame->sem_value[0] += 1;
   // vkQueueSubmit(signal = sem[0], value = sem_value[0]);
}

And that’s it!

Test results

I tried various codec configurations to see the state of things.

RADV

H.264 – 8bit: Works
H.264 – 10bit: Not supported by hardware
H.265 – 8bit: Works
H.265 – 10bit: Works

nvidia

H.264: Broken
H.265: Seems to work

ANV

There’s a preliminary branch by Airlie again, but it doesn’t seem to have been updated for final spec yet.

Conclusion

Exciting times for Vulkan video. The API is ridiculously low level and way too complicated for mere graphics programming mortals, which is why having first class support in FFmpeg and friends will be so important to make the API usable.

Compressed GPU texture formats – a review and compute shader decoders – part 1

Compressed texture formats are one of the esoteric aspects of graphics programming almost no one cares all that much about. Neither did I. However, I’ve recently taken an academic interest in the zoo of compressed texture formats.

During development in Granite, I occasionally find it useful to test scenes which target mobile on desktop and vice versa, and in Vulkan, where there are no fallback paths for unsupported compression formats, we gotta roll our own decompression.

While it really isn’t all that useful to write a decoder for these formats, my goal is to create a suite of reasonably understandable compute shader kernels which can decode all of the standard formats I care about. Of course, I could just use a Frankenstein decoder which merges together a lot of C reference decoders and call it a day, but that’s not aesthetically pleasing or interesting to me. By implementing these formats straight from the Khronos Data Format specification, I learned a lot of things I would not otherwise know about these formats.

There are several major families of formats we can consider multi-vendor and standardized. Each of them fills its own niche. Unfortunately, desktop and mobile each have their own timelines with different texture compression standards, which is not fully resolved to this day in GPU hardware. (Basis Universal is something I will need to study eventually as well as it aims to solve this problem in software.)

By implementing all these formats, I got to see the evolution of block compression formats, see the major differences and design decisions that went into each format.

The major format families

First, it is useful to summarize all the families of texture compression I’ve looked at.

S3TC / DXT

The simplest family of formats. These formats are also known as the “BC” formats in Vulkan, or rather, BC 1, 2 and 3. This is the granddad of texture compression, similar to how I view MPEG1 in the video compression world.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs, probably for historical patent reasons.

RGTC

A very close relative of S3TC. These formats are very simple formats which specialize in encoding 1 and 2 uncorrelated channels, perfect for normal maps, metallic-roughness maps, etc. It is somewhat questionable to call these a separate family of formats (the Data Format specification separates them), since the basic format is basically exactly equal to the alpha format of S3TC, except that it extends the format to also support SNORM (-1, 1 range) alongside UNORM. These formats represent BC4 and BC5 in Vulkan.

These formats are firmly rooted in desktop GPUs. They are basically non-existent on mobile GPUs.

ETC

The ETC family of formats is very similarly laid out to S3TC in how different texture types are supported, but the implementation detail is quite different (and ETC2 is quite the interesting format). To support encoding full depth alpha and 1/2-component textures, there is the EAC format, which mirrors the RGTC formats.

These formats are firmly rooted in mobile GPUs. ETC1 was originally the only mandated format for OpenGLES 2.0 implementations, and ETC2 was mandated for OpenGLES 3.0 GPUs. It has almost no support on desktop GPUs. Intel iGPU is an exception here.

BPTC

This is where complexity starts to explode and where things get interesting. BC6 and BC7 are designed to compress high quality color images at 8bpp. BC6 adds support for HDR and is, to this day, one of only two ways to compress HDR images.

On desktop, BPTC is the state of the art in texture compression and was introduced around 2010.

ASTC

ASTC is the final boss of texture compression, and is the current state of the art in texture compression. Its complexity is staggering and aims to dominate the world with 128 bits. Mere mortals are not supposed to understand this format.

ASTC’s roots are on Mali GPUs, but it was always a Khronos standard, and it is now widely supported on mobile Vulkan implementations (and Intel iGPU :3), at least the LDR profile. What you say, profiles in a texture compression format? Yes … yes, this is just the beginning of the madness that is ASTC.

PVRTC?

PVRTC is a PowerVR-exclusive format that has had some staying power due to iOS, and I will likely ignore it in this series. It does seem like a very different kind of format to all the others, so studying it might be interesting. Still, there is zero reason to use this format in Granite, and I don’t want to bite off more than I can chew.

What is a texture compression format anyway?

In a texture compression format, the specification describes a process for taking whatever bits are given to it and decoding that bit-soup into texels. There are fundamental constraints in texture compression which are unique to this problem domain, and these restrictions heavily influence the design of the formats in question.

Fixed block size

To be able to randomly access any texel in a texture, there must be an O(1) mapping from texture coordinate to memory address. The only reasonable way to do this is to have a fixed block size. In all formats, 4×4 is the most common one. (As you can guess, ASTC can do odd-ball block sizes like 6×5).

Similarly, for reasons of random access, the number of bits spent per block must be constant. The typical block sizes are 64-bits and 128-bits, which is 4bpp and 8bpp respectively at 4×4 block size.
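To make the O(1) mapping concrete, here is a minimal sketch of how a texel coordinate maps to a block address and how the bits-per-texel figures fall out. It assumes blocks are stored in plain row-major order, which is an assumption for illustration only (real GPU memory layouts are typically tiled), and the function names are hypothetical.

```python
def block_byte_offset(x, y, width, block_w=4, block_h=4, bytes_per_block=8):
    """Byte offset of the block containing texel (x, y).

    Assumes blocks are laid out row-major (an illustrative assumption).
    """
    blocks_per_row = (width + block_w - 1) // block_w  # round up for partial blocks
    block_x = x // block_w
    block_y = y // block_h
    return (block_y * blocks_per_row + block_x) * bytes_per_block


def bits_per_texel(bytes_per_block, block_w=4, block_h=4):
    """Average number of bits spent per texel for a given block footprint."""
    return bytes_per_block * 8 / (block_w * block_h)


# A 64-bit block at 4x4 is 4bpp, a 128-bit block at 4x4 is 8bpp.
print(bits_per_texel(8), bits_per_texel(16))
```

Because both the block footprint and the block size in bytes are constant, no per-texture index or prefix sum is needed to find a texel.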

Image and video compression has none of these restrictions. That is a major reason why image and video compression is so much more efficient.

A set of coding tools

Each format has certain things it can do. The more complex the operations the format can do, the more expensive the decoding hardware becomes (and the more complex a software decoder becomes), so there’s always a challenge to balance complexity with quality per bit when standardizing a format. The most typical way to add coding tools is to be able to select between different modes of operation based on the content of the block, where each mode is suited to certain patterns of input. Use the right tool for the job! As we will see in this study, the number of coding tools will increase exponentially, and it starts to become impossible to make good use of all the tools given to you by the format.

Encoding becomes an optimization task where we aim to figure out the best coding tools to use among the ones given to us. In simpler formats, there are very few things to try, and approaching the optimal solution is straightforward, but as we get into the more esoteric formats, the real challenge is to prune dead ends early, since brute forcing our way through a near-infinite configuration space is not practical (but maybe it is with GPU encode? :3)

Commonalities across formats

Image compression and video compression use the Discrete Cosine Transform (DCT) even to this day. This fundamental compression technique has been with us since the 80s and refuses to die. All the new compression formats just keep piling complexity on top of more complexity, but in the center of it all, we find the DCT, or some equivalent of it.

Very similarly, texture compression achieves its compression through interpolation between two color values. Somehow, the formats all let us specify two endpoints which are constant over the 4×4 block and interpolation weights, which vary per pixel. Most of the innovation in the formats all comes down to how complicated and esoteric we can make the process of generating the endpoints and weights.

The weight values are typically expressed with very few bits of precision per texel (usually 2 or 3), and this is the main way we will keep bits spent per pixel down. This snippet is the core coding tool in all the formats I have studied:

decoded_texel = mix(endpoint0, endpoint1, weight_between_0_and_1);
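As a rough illustration of that core tool, the sketch below expands low-precision weight indices into texels by interpolating between two endpoints. The helper names and the idea of passing already-unpacked endpoints are assumptions for illustration; no real format is laid out exactly like this.

```python
def mix(a, b, w):
    """GLSL-style linear interpolation between two color tuples."""
    return tuple(x * (1.0 - w) + y * w for x, y in zip(a, b))


def decode_texels(endpoint0, endpoint1, weight_indices, weight_bits=2):
    """Turn per-texel weight indices into colors.

    With 2 bits per texel the weights are 0, 1/3, 2/3 and 1.
    """
    max_index = (1 << weight_bits) - 1
    return [mix(endpoint0, endpoint1, i / max_index) for i in weight_indices]


black, white = (0, 0, 0), (255, 255, 255)
print(decode_texels(black, white, [0, 1, 2, 3]))
```

The endpoints are paid for once per block, while each texel only costs `weight_bits` bits, which is where the compression comes from.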

To correlate, or not to correlate?

The endpoint model blends all components in lock-step. Typically the endpoint will be an RGB value. We call this correlated, because this interpolation will only work well if chrominance remains fairly constant with luminance being the only component which varies significantly. In uncorrelated input, say, RGB with an alpha mask, many formats let us express decorrelated inputs with two sets of weights.

decoded_rgb = mix(endpoint0_rgb, endpoint1_rgb, rgb_weight);
decoded_alpha = mix(endpoint0_alpha, endpoint1_alpha, alpha_weight);

This costs a lot more bits to encode since alpha_weight is very different from rgb_weight, but it should be worth it.

Many formats let us express if there is correlation or not. Correlation should always be exploited.

Working around the horrible endpoint interpolation artifacts

Almost all formats beyond the most trivial ones try really hard to come up with ways to work around the fact that endpoint interpolation leads to horrible results in all but the simplest input. The most common approach here is to split the block into partitions, where each partition has its own endpoints.

S3TC – The basics

A compute shader decoder:

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/s3tc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC1 – 4×4 – 64 bits

The BC1 format is extremely simple and a good starting point. 32 bits are used to encode two RGB endpoints in RGB565 format. The other 32 bits encode 16 weights, with 2 bits allocated to each texel.

This lets us represent interpolation weights of 0, 1/3, 2/3 and 1.

Since there is a symmetry in this design, i.e.:

mix(a, b, l) == mix(b, a, 1.0 - l)

there would be two ways to specify the same block, where we swap endpoints and invert the weights to compensate. This is an extra bit of information we can exploit. Based on the integer representation of the two endpoints, we can check if one is greater than the other, and use a different decoding mode based on that information. This exploitation of symmetry will pop up again in many formats later!

In the secondary mode, we add support for 1-bit alpha, called punch-through in most formats. In this mode, the interpolation weights become 0, 1/2, 1 and BLACK. This lets us represent fully transparent pixels. However, color becomes BLACK, so this will only work with pre-multiplied alpha schemes, otherwise there will be black rings around textures. I don’t think this mode is used all that much these days, but it is an option.

That is it for this format, it really is that simple.
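To show how little there is to it, here is a minimal CPU-side sketch of a BC1 block decoder. It is not the Granite compute shader. The integer rounding used for the interpolated colors is an assumption, since (as noted below) the format is not specified bit-exactly, and the function names are made up for this example.

```python
import struct


def expand_565(c):
    """Expand an RGB565 value to 8 bits per channel by bit replication."""
    r, g, b = (c >> 11) & 31, (c >> 5) & 63, c & 31
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def decode_bc1(block):
    """Decode an 8-byte BC1 block into 16 RGBA tuples (row-major, 4x4)."""
    c0, c1, bits = struct.unpack("<HHI", block)
    e0, e1 = expand_565(c0), expand_565(c1)
    if c0 > c1:
        # Primary mode: weights 0, 1, 1/3, 2/3 (in index order).
        palette = [
            e0 + (255,),
            e1 + (255,),
            tuple((2 * a + b) // 3 for a, b in zip(e0, e1)) + (255,),
            tuple((a + 2 * b) // 3 for a, b in zip(e0, e1)) + (255,),
        ]
    else:
        # Punch-through mode: midpoint plus transparent black.
        palette = [
            e0 + (255,),
            e1 + (255,),
            tuple((a + b) // 2 for a, b in zip(e0, e1)) + (255,),
            (0, 0, 0, 0),
        ]
    # Two bits per texel, texel 0 in the least significant bits.
    return [palette[(bits >> (2 * i)) & 3] for i in range(16)]


# White and black endpoints, with the first four texels using indices 0..3.
texels = decode_bc1(struct.pack("<HHI", 0xFFFF, 0x0000, 0xE4))
print(texels[:4])
```

Swapping the two endpoints in the example flips the comparison and selects the punch-through palette instead, which is the symmetry trick described above.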

One thing to note is that the specification is defined in terms of floating point with under-specified requirements for precision, and thus there is no bit-exact representation of the decoded values. Almost all hardware decoders of this format will give slightly different results, which is unfortunate. MPEG1 and MPEG2 also made the same mistake back in the day, where the DCT is specified in terms of floating point.

BC2 – 4×4 – 128 bits

BC2 is a format which adds alpha support by splicing together two blocks. A BC1 block describes color, and a second block adds an alpha plane with 4-bit UNORM. This format is quite obscure since the next format, BC3, generally does a much better job at compressing alpha.

A curious side effect of BC2 and BC3 is that the punch-through mode in the BC1 block no longer exists, i.e. the endpoint-order symmetry is still there but goes unused, so we lose 1 bit of information. I wonder why that information bit was not used to toggle between BC2 alpha decode (noisy alpha) and BC3 alpha decode (smooth alpha) …

BC3 – 4×4 – 128 bits

The alpha encoding of BC3 is very similar to how BC1 works. It also forms the basis of RGTC. The 64 bits of the alpha block spends 16 bits to encode 2 8-bit endpoints for alpha. We now get 3 bits as interpolation weights instead of 2 since there’s 48 bits left for this purpose. Similar to BC1, we can exploit symmetry and get two different modes based on the endpoints. In the first mode, the interpolation weights are as expected, [0, 1, 2, …, 7] / 7. In the second mode, we have 6 interpolated values [0, 1, 2, …, 5] / 5, and two values are reserved to represent 0.0 and 1.0. This can be very useful. This is essentially a very early attempt to introduce partitions into the mix, as we can essentially split up a block into 3 partitions on demand: (Fully opaque texels, fully transparent texels, the in-betweens). This can let us specify a tighter range for the endpoints as there is never a need to use a full [0, 0xff] endpoint range.
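The two alpha modes can be sketched as a palette builder plus a 3-bit index lookup. This is a hedged illustration rather than the shader linked above: the integer division is an assumed rounding (the format is not bit-exact), and the function names are invented for the example.

```python
def bc3_alpha_palette(a0, a1):
    """Build the 8-entry alpha palette from two 8-bit endpoints."""
    if a0 > a1:
        # First mode: 8 evenly spaced values between the endpoints.
        return [a0, a1] + [((7 - i) * a0 + i * a1) // 7 for i in range(1, 7)]
    # Second mode: 6 interpolated values, plus explicit 0.0 and 1.0.
    return [a0, a1] + [((5 - i) * a0 + i * a1) // 5 for i in range(1, 5)] + [0, 255]


def decode_bc3_alpha(block):
    """Decode an 8-byte alpha block into 16 alpha values."""
    palette = bc3_alpha_palette(block[0], block[1])
    bits = int.from_bytes(block[2:8], "little")  # 16 texels x 3 bits = 48 bits
    return [palette[(bits >> (3 * i)) & 7] for i in range(16)]


print(bc3_alpha_palette(255, 0))
print(bc3_alpha_palette(50, 200))
```

In the second palette the endpoints only need to span the in-between texels (50 to 200 here), while fully transparent and fully opaque texels use the two reserved entries.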

Summary

The S3TC formats are very simple, but there are certainly things to note about them. Alpha support is just bolted on, it is not integrated into the format. This means that even though the block is 128 bits, there is no way to spend more than 64 bits on color, even if the alpha plane has a completely flat value.

RGTC

RGTC (red-green) is basically BC3’s alpha block format turned into its own thing. Its main use is with non-color textures, e.g. normal maps, metallic-roughness maps, luminance maps, etc. It is very simple, and quality is quite good.

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.comp

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/rgtc.h

BC4 – 4×4 – 64 bits

This is BC3’s alpha block format as its own format, which returns one component. The only real difference from BC3 alpha is that it also supports an SNORM variant, which is very useful for normal maps, although I only bothered with UNORM, since my shaders need to assume input can be from any format.

BC5 – 4×4 – 128 bits

RGTC assumes uncorrelated channels, and thus the only sensible choice was to just slap together two BC4 blocks side-by-side, and voila, we can encode 2 channels instead of 1.

Summary

RGTC is simple and nice. It only needs to consider single channels of data, and writing encoders for it is very easy, and it is probably the simplest format out there. For what they do, I really like these formats.

Like S3TC, there is no bit-exact decoding, which is rather unfortunate.

ETC – Refining S3TC

ETC, or Ericsson Texture Compression is a family with multiple generations. ETC2 is backwards compatible with ETC1 in that all valid ETC1 blocks will decode the same way in ETC2, but ETC2 exploits some undefined behavior of ETC1 to extend the format into something more interesting.

ETC has a bit-exact decode, which makes verification very easy. 😀

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/etc2.comp

ETC1 – 4×4 – 64 bits

In many ways, ETC1 is quite similar to BC1, but there are some key differences. Just like BC1, 32 bits are spent to encode endpoints, and 32 bits are spent to give 16 texels 2 bits of weight information each. The main difference between ETC1 and BC1 is how endpoints work.

Sub-blocks

As a very crude form of partitioning, ETC1 allows you to split a block into either 2×4 or 4×2 sub-blocks, where each sub-block has its own endpoints. To do this, endpoints are expressed in a more compact way than BC1. Rather than specifying two RGB values, ETC in general likes to express endpoints as RGB +/- delta-intensity, where delta-intensity is described by a table. This makes things far more compact since we enforce constant chrominance. By saving so many bits, we can express 4 endpoints in total, 2 for each sub-block.

Uncorrelated or correlated endpoints?

Since we have to specify two sets of endpoints, the format gives us a way to specify if the two endpoints have completely different colors, or if the endpoints should be specified in base + offset form. This is controlled with a single bit, which changes the encoding from ep0 = RGB444, ep1 = RGB444 to ep0 = RGB555, ep1 = ep0 + sign_extend(RGB333). These values are not allowed to overflow in any way, which is something ETC2 exploits to great effect later.
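A small sketch of the two endpoint modes, assuming the per-channel bit fields have already been extracted from the block (the exact bit positions are omitted here, and the function names are hypothetical). Colors are expanded to 8 bits by bit replication.

```python
def sign_extend_3(v):
    """Interpret a 3-bit field as a two's complement value in [-4, 3]."""
    return v - 8 if v & 4 else v


def etc1_base_colors(diff_bit, c0, c1):
    """Return the two 8-bit base colors of an ETC1 block.

    c0 and c1 are (r, g, b) bit fields: two RGB444 values in individual
    mode, or RGB555 plus signed RGB333 deltas in differential mode.
    """
    if not diff_bit:
        # Individual ("uncorrelated") mode: two independent RGB444 colors.
        return (tuple((v << 4) | v for v in c0),
                tuple((v << 4) | v for v in c1))

    # Differential ("correlated") mode: base + offset.
    second = tuple(base + sign_extend_3(delta) for base, delta in zip(c0, c1))
    if any(v < 0 or v > 31 for v in second):
        # Not valid in ETC1; ETC2 reuses exactly these encodings for new modes.
        raise ValueError("base + delta overflows the 5-bit range")
    return (tuple((v << 3) | (v >> 2) for v in c0),
            tuple((v << 3) | (v >> 2) for v in second))


print(etc1_base_colors(0, (15, 0, 8), (1, 2, 3)))
print(etc1_base_colors(1, (31, 16, 0), (7, 3, 0)))
```

The differential form gives the first color an extra bit of precision per channel, at the cost of forcing the second color to stay close to it.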

NOTE: I found it more instructional to call it uncorrelated and correlated endpoint modes, but the specification calls it “individual” and “differential” modes.

Summary

ETC1 is somewhat different than BC1, but overall, it’s quite similar. It turns out, if you add a few restrictions on top of ETC1, you get ETC1S, which can trivially be transcoded to BC1. Basically, enforce the correlated endpoint mode, set the RGB333 bits to 0, enforce that delta-luma is the same for both sub-blocks.

There is also no way to express alpha with ETC1, which is unfortunate, but ETC1 is completely obsolete for my use cases either way.

ETC2 RGB – 4×4 – 64 bits

As mentioned earlier, ETC1 is a sub-set of ETC2. ETC2 adds a bunch of new and curious modes which gives us a small glimpse into more flexible ways to express endpoints, and even adds a mode which I have never seen in any other formats ever since.

Exploiting undefined behavior

When you select the correlated (differential) endpoint mode, there were some restrictions on overflow. We can exploit this fact in order to gain 3 new modes of operation for the ETC2 color codec!

First, we check if R + sign_extend(dR) is outside [0, 31] range. If so, we activate the so-called “T” mode. In this mode, we essentially add a partitioning scheme to the codec. We now remove the concept of two sub-blocks and let all texels access all available endpoints. We encode two RGB444 values (A, B), and a delta value (d). We form a T-shape by specifying 4 possible color values as A, B, B + d, B – d. This can be useful if the block is smooth, except for some weird outliers. A would be the outlier color, and B represents the middle of the smooth colors.

If G + sign_extend(dG) overflows, we enter a very similar “H” mode. In this mode, we do the exact same thing, except that the 4 possible colors become A + d, A – d, B + d, B – d.

If B + sign_extend(dB) overflows, we enter a very interesting mode (the specification calls it “planar”), which I have never seen again in future formats. I’m not sure why, since it seems very useful for expressing smooth gradients. Essentially, in this mode we don’t encode weights per texel, but rather express RGB at texel (0, 0), texel (4, 0) and texel (0, 4), and just interpolate linearly in X and Y across the block to obtain the actual color. This is very different from the other endpoint interpolation we’ve seen earlier, because that flattens everything into a single line in the color space, but now we can access an entire 2D plane in the space instead.
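The mode selection and the planar interpolation can be sketched as below. This assumes the 5-bit bases and 3-bit deltas have already been pulled out of the block and that the three planar colors are already expanded to 8 bits; the function names are made up for this example, and the rounding follows my reading of the Khronos Data Format specification.

```python
def sign_extend_3(v):
    """Interpret a 3-bit field as a two's complement value in [-4, 3]."""
    return v - 8 if v & 4 else v


def etc2_mode(r, dr, g, dg, b, db):
    """Pick the ETC2 color mode when the differential bit is set."""
    checks = (("T", r, dr), ("H", g, dg), ("planar", b, db))
    for name, base, delta in checks:
        if not 0 <= base + sign_extend_3(delta) <= 31:
            return name
    return "differential"


def planar_texel(origin, horiz, vert, x, y):
    """Color at texel (x, y), given colors at (0, 0), (4, 0) and (0, 4)."""
    def channel(o, h, v):
        value = (x * (h - o) + y * (v - o) + 4 * o + 2) >> 2
        return max(0, min(255, value))
    return tuple(channel(o, h, v) for o, h, v in zip(origin, horiz, vert))


print(etc2_mode(31, 1, 0, 0, 0, 0))
print(planar_texel((0, 0, 0), (255, 255, 255), (0, 0, 0), 2, 0))
```

Note that the two outer colors sit one texel outside the 4×4 block, which is what lets adjacent planar blocks line up into a continuous gradient.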

Punch-through alpha

ETC2’s punch-through alpha is very similar to BC1’s 1-bit alpha scheme. When this format is enabled, we remove the capability for uncorrelated endpoints (individual mode), and reuse that bit to select whether all texels are opaque or potentially transparent. The idea is exactly the same as in BC1. In the transparent mode, code == 2 marks the texel as being transparent black. This does not work in planar mode though; the bit is ignored there.

Alpha support

Very similar to BC3, ETC2 also supports full 8-bit alpha by slapping together a separate block alongside the color block. The way this works is very similar to how RGTC works, but instead of two endpoints, ETC2 encodes a center point, and then uses tables to expand that range into 8 possible values using a table selector and a multiplier. These 8 possible values for the block are then selected with 3bpp indices. We lose the capability to cleanly represent 0.0 and 1.0 though, which is somewhat curious.

EAC – 4×4 – 64 bits

https://github.com/Themaister/Granite/blob/master/assets/shaders/decode/eac.comp

EAC is ETC2’s version of RGTC. It is designed as a way to encode 1 and 2 de-correlated channels, basically the exact same approach as RGTC, where the alpha block format is reused for the 1/2-channel formats. EAC is a bit different in that the internal precision is 11 bits (for some reason).

Unfortunately, EAC is kinda awkward since it’s technically bit-exact in the final fixed point value between [0, 2047], but it specifies many different ways how this can be converted to a floating point value.

Next up, BPTC

S3TC, RGTC and ETC represent the simpler formats, and hopefully I’ve summarized what these formats can do. Next up, I’ll go through the BPTC formats, which significantly increase complexity.

Clustered shading evolution in Granite

Shading many lights in a 3D engine is kinda hard once you step outside the bubble of classic deferred shading. Granite supports a fair amount of different ways to do lighting, mostly because I like to experiment with different rendering structures.

I presented some work on this topic at SIGGRAPH 2018 in the Moving Mobile Graphics course when I still worked at Arm: https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-20-66/5127.siggraph_2D00_2018_2D00_mmg_2D00_3_2D00_rendering_2D00_hanskristian.pdf. Unfortunately, some slides have videos embedded, and they didn’t seem to have made the transition unscathed to PDF.

Since then I’ve looked into VK_EXT_descriptor_indexing in order to remove some critical limitations with my older implementation and I ended up with some uncommon implementation details.

Classic deferred

Just to summarize what I mean by this, this is the good old method of rendering light volumes on top of a G-buffer and light is accumulated per-light by blending on top of the frame buffer. This method is considered completely obsolete on desktop these days, but it’s still quite viable on mobile with on-chip G-buffers.

Where classic deferred breaks down

Very high fill-rate and bandwidth cost on desktop
No forward shading support (well, duh)
No transparent objects
No volumetric lighting

Thus, I explored some alternatives …

Why I don’t like tile deferred shading

This is a somewhat overloaded term, but this is the method where you send the G-buffer to a compute shader. The G-buffer is split into tiles, assigned to a workgroup, depth range (or multiple ranges) is found for the tile, which then forms a frustum. Lights are then culled against this frustum, and shaded in one go. This technique was a really great fit for the PS3 SPUs, as shown by Battlefield 3 back in the day.

It still doesn’t really solve the underlying issues of classic deferred except that it is far more bandwidth efficient on desktop-class GPUs.

Forward shading isn’t feasible unless you split the algorithm into multiple stages with Z-prepass -> build light list per tile -> resubmit geometry and shade, but then it’s probably called something entirely different … (Is this what’s known as Forward+ perhaps?)

Transparency isn’t doable unless you render all transparent geometry in a separate pass to find min/max depth per pixel or something, and forget about volumetrics.

On mobile TBDR architectures you lose all bandwidth savings from staying on-tile, and doing FRAGMENT -> COMPUTE barriers on a tile-based architecture is usually a terrible idea for pipelining. The exception here is tile shaders on recent iPhone hardware which seems almost designed to do this algorithm in hardware.

Clustered shading – the old implementation

Clustered shading is really nice in that it is completely agnostic to a depth buffer, so all the problems mentioned earlier just go away. Lights are assigned in 3D-space rather than screen-space. The original paper on the subject is from around the same time tile deferred was getting popular.

Abandoning the frustum structure

In Granite, I chose early on to abandon the “frustum” layout. Culling spot lights and point lights against elongated frustums analytically is very hard and computationally expensive, and getting the Z slices correct is also very fiddly.

A common workaround for culling is using conservative rasterization, but that is a feature I cannot rely on. I figured that using a more grid-like structure I could get away with much simpler culling math. Since all elements in the grid-based cluster are near-perfect cubes, I could get away with treating the elements as spheres, and https://bartwronski.com/2017/04/13/cull-that-cone/ makes spot light-to-sphere culling very cheap and quite tight. Here’s a visualization of the structure. This structure is stored in a 3D texture. Each “mip-level” is packed in the Z dimension as the resolution of each level is the same.
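For reference, a spot light (cone) versus sphere test of the kind the linked article describes can be sketched like this. This is my own reconstruction, not Granite’s shader code, and the function and parameter names are hypothetical. The cone axis is assumed to be normalized, and `half_angle` is the angle between the axis and the cone’s side.

```python
import math


def cone_intersects_sphere(apex, axis, cone_range, half_angle, center, radius):
    """Conservative test: False only if the sphere is certainly outside the cone."""
    v = [c - a for c, a in zip(center, apex)]
    v_len_sq = sum(x * x for x in v)
    along = sum(x * d for x, d in zip(v, axis))          # distance along the axis
    perp = math.sqrt(max(v_len_sq - along * along, 0.0))  # distance from the axis

    # Signed distance from the sphere center to the cone's side.
    dist_to_side = math.cos(half_angle) * perp - math.sin(half_angle) * along

    angle_cull = dist_to_side > radius
    front_cull = along > radius + cone_range
    back_cull = along < -radius
    return not (angle_cull or front_cull or back_cull)


def cube_bounding_radius(side):
    """Radius of the sphere enclosing a cubic cluster element."""
    return side * math.sqrt(3.0) / 2.0


spot = dict(apex=(0, 0, 0), axis=(0, 0, 1), cone_range=10.0,
            half_angle=math.radians(30.0))
print(cone_intersects_sphere(center=(0, 0, 5), radius=1.0, **spot))
print(cone_intersects_sphere(center=(10, 0, 1), radius=1.0, **spot))
```

Since the cluster elements are near-perfect cubes, wrapping each one in a bounding sphere wastes little volume, which is why this cheap test stays fairly tight.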

Bitscan loops instead of light lists

The light list approach is where each element in the cluster contains a (start, count) pair, and all lights found in that range are shaded. Computing this list on the GPU is rather messy. The memory footprint for a single element is not known on the CPU timeline, and we cannot deal properly with worst-case scenarios. This is easier when we cluster on the CPU instead, but that’s boring!

I really wanted to cluster lights on the GPU, so I landed on a bitmask approach instead. The worst case storage is just 1 bit per light per element rather than 32 bit.

The main limitation of this technique is still the number of lights we could feasibly support. With a bitmask structure we need to allocate for worst-case and it can get out of hand when we consider worst case with 1000+ lights. I only had modest ideas in the beginning, so I supported 32 spot lights and 32 point lights, which were encoded in RG32UI per element in a 3D texture. At a resolution of the cluster at 64x32x16x9, culling on the GPU is very fast, even on mobile. We can set the ceiling higher of course if we expand to RGBA32UI or use more texels per element.
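To put numbers on the worst-case allocation, here is a quick back-of-the-envelope sketch. It assumes the 64x32x16x9 figure means nine levels of 64x32x16 elements and that masks are stored as whole 32-bit words; the function name is made up for this example.

```python
def cluster_bitmask_bytes(dims, num_lights):
    """Worst-case storage for a bitmask cluster: 1 bit per light per element."""
    elements = 1
    for d in dims:
        elements *= d
    words_per_element = (num_lights + 31) // 32  # round up to 32-bit words
    return elements * words_per_element * 4


dims = (64, 32, 16, 9)
# 32 spot + 32 point lights, i.e. one RG32UI texel per element.
print(cluster_bitmask_bytes(dims, 64) / (1024 * 1024), "MiB")
# The same structure scaled to 1024 lights.
print(cluster_bitmask_bytes(dims, 1024) / (1024 * 1024), "MiB")
```

The footprint grows linearly with the light count whether or not the lights touch anything, which is what makes the full 3D bitmask awkward beyond modest light counts.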

Bitscan loops are great for scalarization

A thing I realized quickly when doing clustered forward shading is the importance of keeping VGPRs down on AMD hardware. The trick to move VGPRs to more plentiful SGPRs is to ensure that values are uniform across a subgroup. E.g. instead of doing this:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data must be loaded into VGPRs since lsb is a VGPR.
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

we can do a simple trick with subgroup operations:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

// OR over all active threads.
// As this is the same value for all threads, the compiler can promote it to an SGPR.
light_bitmask = subgroupOr(light_bitmask);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data can be loaded into SGPRs instead.
    // Far better occupancy, much amaze, wow!
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

Uniformly loading light data from buffers is excellent. I’ve observed up to 15% uplift on AMD by doing this. The light list approach mentioned earlier has a much harder time employing this kind of optimization. We would have to scalarize on the cluster element itself, which could lead to very bad worst-case performance.
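A tiny CPU-side simulation may help show what the subgroup trick changes. It models a subgroup as a list of per-thread bitmasks and counts how many light loads the scalarized loop performs; everything here (function names, the callback) is a made-up illustration, not how a GPU actually executes.

```python
def find_lsb(mask):
    """Index of the lowest set bit, like GLSL findLSB for non-zero input."""
    return (mask & -mask).bit_length() - 1


def shade_subgroup(thread_masks, shade_light):
    """Simulate the scalarized bitscan loop for one subgroup."""
    uniform_mask = 0
    for mask in thread_masks:       # stand-in for subgroupOr()
        uniform_mask |= mask

    colors = [0.0] * len(thread_masks)
    scalar_loads = 0
    while uniform_mask != 0:
        lsb = find_lsb(uniform_mask)
        scalar_loads += 1           # light data is fetched once for the subgroup
        for thread in range(len(thread_masks)):
            colors[thread] += shade_light(lsb, thread)
        uniform_mask &= ~(1 << lsb)
    return colors, scalar_loads


masks = [0b0011, 0b0110]
# A light contributes nothing to a thread it does not actually reach.
contribution = lambda light, thread: 1.0 if (masks[thread] >> light) & 1 else 0.0
print(shade_subgroup(masks, contribution))
```

The trade-off is visible here: every thread walks the union of all masks, so a thread may evaluate lights that do not affect it, in exchange for uniform (scalar) loads of the light data.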

No bindless – ugly atlasing

Another problem with clustered shading (and tile deferred for that matter) is that we need to shade a lot of lights in one go, and those lights can have shadow maps. Without bindless, all shadow maps for spot lights must fit into one texture, and all shadow maps for point lights must fit into one texture. Atlasing is the classic solution here, but it is a little too messy for my taste. As the number of lights was rather low, I just had a plain 2D texture for spot lights, and a cube array for point lights. Implementing variable resolution with an atlas is also rather annoying, and for point lights, I would be forced to flatten the cube down to 6 2D rects and do manual cube lookup instead without proper seam filtering, ugh.

Scaling to “arbitrary” number of lights

While performance for a reasonable number of lights was quite excellent compared to alternative techniques, I couldn’t really scale the number of lights arbitrarily, and it has been nagging me a bit. Memory becomes a concern, and while the “list of lights” approach is likely less memory hungry in the average case, it has even worse worst-case memory requirements, and it’s not very friendly for GPU culling.

Kinda clustered shading? New bindless hotness

The talk on Call of Duty’s renderer in Advances in Real-Time Rendering 2017 presents a very fresh idea on how to do shading, and it hits all the right buttons. Culling on GPU, bitscan loops, scalarization, scales to a lot of lights, a lot of things to like here.

I spent some time this holiday season implementing a new path for clustered shading based on this technique, and I ended up deviating in a few places. I’ll go through my implementation of it, but it will make more sense once you study the presentation first.

Decoupling XY culling from Z

The key feature of the Call of Duty implementation is how it partitions space. Rather than a full 3D cluster, we decouple XY and Z dimensions, so rather than O(X * Y * Z) we get O(X * Y + Z).

Z is also binned linearly, and we can have several thousand bins in Z. This makes everything much nicer to deal with later. Culling here is trivial, since we compute min/max Z in view space for each light, which is very simple.

For each Z-slice, we just need to figure out the minimum light index and maximum light index which hits our Z-slice. Of course, to make the ranges as tight as possible, we sort lights by Z distance.

Data structures

Each tile in XY needs a bitmask array, u32[ceil(num_lights / 32)] for each tile. This can be tightly packed in a single buffer.
A buffer containing Z-slices as described above.
Per-light information: position/radius/cone/light type/shadow matrix/etc
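The Z-slice buffer can be sketched on the CPU like this. It assumes lights are already sorted by view-space Z and that an empty slice is marked with min > max; both the empty marker and the function name are assumptions for illustration, not necessarily what Granite does.

```python
import math

EMPTY_RANGE = (0xFFFFFFFF, 0)  # assumed marker for "no lights in this slice"


def build_z_ranges(light_z_extents, z_near, z_far, num_slices):
    """Per-slice (min_light_index, max_light_index) for linearly binned Z.

    light_z_extents holds (min_z, max_z) per light, sorted by Z distance.
    """
    ranges = [EMPTY_RANGE] * num_slices
    scale = num_slices / (z_far - z_near)
    for index, (lo, hi) in enumerate(light_z_extents):
        first = max(math.floor((lo - z_near) * scale), 0)
        last = min(math.floor((hi - z_near) * scale), num_slices - 1)
        for s in range(first, last + 1):
            lo_index, hi_index = ranges[s]
            ranges[s] = (min(lo_index, index), max(hi_index, index))
    return ranges


lights = [(1.0, 3.0), (2.5, 6.0), (8.0, 9.0)]
for slice_index, r in enumerate(build_z_ranges(lights, 0.0, 10.0, 10)):
    print(slice_index, r)
```

Because the lights are sorted, each slice’s range is a contiguous run of indices, so a single (min, max) pair per slice is enough. Storage is one bitmask array per XY tile plus one pair per Z slice, rather than one bitmask per 3D cell.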

Going back to frustum culling

Now that we cluster in XY and Z separately, I went back to frustum partitioning, and now we need a way to do frustum culling against spot and point lights … Conservative rasterization really is the perfect extension to use here, just a shame it’s not widely available yet on all relevant hardware.

The presentation has an alternative for conservative rasterization as current consoles do not support this feature, which is mostly to render light volumes a-la classic deferred (not completely dead yet!) at full-resolution and splatting out bits with atomics as fragment threads are spawned. If you have depth information you can eliminate coverage using classic deferred techniques. However, I went in a completely different direction without using depth at all.

Compute shader conservative rasterization

It felt natural to do all the culling work in compute shaders. This is where most of the “fun” is. This is also a rather esoteric way of doing it, but I like doing esoteric stuff with compute shaders, see https://themaister.net/blog/2019/10/12/emulating-a-fake-retro-gpu-in-vulkan-compute/ as proof of that.

Conservative sphere rasterization

To solve this problem, we’re going to tackle it geometrically instead. First, we solve the screen-space bounding box problem. Fortunately, this problem is separable, so we can compute screen-space bounds in X and Y separately.

(Behold glorious Inkscape skillz ._.)

What we want to do is to figure out where P_lo and P_hi intersect the near plane. P can be rotated in 2D by the half-angle in two directions. This way we find tangent points on the circle. sin(theta) is conveniently equal to r / length(L), so building a 2×2 rotation matrix is very easy. After rotating P, we can project P_lo and P_hi, and now we have clip space bounds in one dimension. Compute separately for XZ and YZ dimensions and we have screen-space boundaries.
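The 1D bound computation described above can be sketched as follows. This is a simplified illustration, not the shader: it assumes +Z points into the screen, that the camera is outside the sphere, and that both tangent directions stay in front of the camera. The result is the raw x/z ratio, before any projection matrix scale is applied. The function name is hypothetical.

```python
import math


def sphere_bounds_1d(center_x, center_z, radius):
    """Projected (lo, hi) bounds of a circle in the XZ (or YZ) plane."""
    length = math.hypot(center_x, center_z)
    if length <= radius:
        raise ValueError("camera is inside the sphere, use a fallback")

    # sin(theta) = r / length(L), where theta is the half-angle of the cone
    # of tangent lines from the camera to the circle.
    sin_t = radius / length
    cos_t = math.sqrt(1.0 - sin_t * sin_t)

    # Rotate the vector to the center by +theta and -theta (2x2 rotation).
    lo_x = cos_t * center_x - sin_t * center_z
    lo_z = sin_t * center_x + cos_t * center_z
    hi_x = cos_t * center_x + sin_t * center_z
    hi_z = -sin_t * center_x + cos_t * center_z

    # Perspective divide gives the bounds on the projection plane.
    return lo_x / lo_z, hi_x / hi_z


print(sphere_bounds_1d(0.0, 5.0, 3.0))
print(sphere_bounds_1d(5.0, 5.0, 1.0))
```

Running it once with (x, z) and once with (y, z) gives the screen-space bounding box. Only the direction of the rotated vector matters for the projection, so there is no need to compute the actual tangent points.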

Projecting a sphere to clip space creates an ellipse, and to compute the ellipse, we need to rotate view space such that the sphere center lies perfectly on the X or Y axis. For simplicity, we orient it on the +X axis. We can then perform the range test, and an ellipse is formed. We can now test any point directly against this ellipse. If the sphere intersects the near plane in any way, we can fall back to the screen-space bounding box.

Here’s a ShaderToy demonstrating the math. Of course, Inigo Quilez did it first :p

In the real implementation, we only need to compute setup data once for each point light. To rasterize a pixel, we apply a transform and perform a conservative ellipse test, which is rather straightforward.

Conservative spot light rasterization

I tried an analytical approach, but I gave up. Spot-frustum culling is really hard if you want tight culling, so I went with the simpler approach of just straight up rasterizing the 6 triangles which form a spot light. We can rasterize in floating point since we don’t care about water-tight rasterization rules, and it’s conservative anyways. It’s not the prettiest thing in the world to do primitive clipping inside a shader, but you gotta do what you gotta do …

The … peculiar shader can be found here.

Binning shader

Once we have setup data for point lights and spot lights, we do the classic culling optimization where we bin N lights in parallel over a tile and broadcast the results to all threads, which then compute the relevant lights per pixel. Subgroup ballots are a nice trick here, replacing the old shared memory approach. Each workgroup preferably works on 32 lights at a time to compute a 32-bit bitmask.

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless_binning.comp

The shading loop

The final loop to shade becomes something like:

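uint cluster_mask_range(uint mask, uvec2 range, uint start_index)
{
    // Clamp the inclusive [range.x, range.y] light range to the 32 lights
    // covered by this mask word, and make range.y exclusive.
    range.x = clamp(range.x, start_index, start_index + 32u);
    range.y = clamp(range.y + 1u, range.x, start_index + 32u);

    uint num_bits = range.y - range.x;
    // Shifting a 32-bit value by 32 is undefined, so special-case it.
    uint range_mask = num_bits == 32u ?
        0xffffffffu :
        ((1u << num_bits) - 1u) << (range.x - start_index);
    return mask & range_mask;
}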

vec3 shade_clustered(Material material, vec3 world_pos)
{
    vec3 result = vec3(0.0);

    ivec2 cluster_coord = compute_clustered_coord(gl_FragCoord.xy);
    int linear_cluster_coord = linearize_coord(cluster_coord);
    int z = compute_z_slice(dot(world_pos - camera_pos, camera_front));

    uvec2 z_range = cluster_range[z];

    // Find min/max light we need to consider when shading slice Z.
    // Make this uniform across subgroup. SGPR.
    int z_start = int(subgroupMin(z_range.x) >> 5u);
    int z_end = int(subgroupMax(z_range.y) >> 5u);

    for (int i = z_start; i <= z_end; i++)
    {
        // SGPR
        uint mask =
            cluster_bitmask[linear_cluster_coord *
                            num_lights_div_32 + i];

        // Restrict to lights within our Z-range. VGPR now.
        mask = cluster_mask_range(mask, z_range, 32u * i);
        // SGPR again.
        mask = subgroupOr(mask);

        // SGPR
        uint type_mask = cluster_transforms.type_mask[i];

        // Good old scalarized loop <3
        while (mask != 0u)
        {
            int bit_index = findLSB(mask);
            int light_index = 32 * i + bit_index;
            if ((type_mask & (1 << bit_index)) != 0)
            {
                result += compute_point_light(light_index,
                                              material,
                                              world_pos);
            }
            else
            {
                result += compute_spot_light(light_index,
                                             material,
                                             world_pos);
            }

            // Clear the bit we just handled so the loop terminates.
            mask &= ~(1u << uint(bit_index));
        }
    }

    return result;
}

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless.h
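The bit twiddling in cluster_mask_range() is easy to get subtly wrong, so here is a small C port that can be poked at on the CPU. This is a sketch, not code from Granite: `collect_lights` is a hypothetical helper, the `uvec2` range is passed as two scalars (`lo`, `hi`, both inclusive light indices), and the zero-width case is handled explicitly because shifting a 32-bit value by 32 is undefined in C.

```c
#include <stdint.h>

/* Keep only the bits of `mask` whose light index (start_index + bit)
 * lies in the inclusive range [lo, hi]. */
static uint32_t cluster_mask_range(uint32_t mask, uint32_t lo, uint32_t hi,
                                   uint32_t start_index)
{
    uint32_t end_index = start_index + 32u;

    /* Clamp the first light to [start_index, end_index]. */
    if (lo < start_index) lo = start_index;
    if (lo > end_index) lo = end_index;

    /* Exclusive end, clamped to [lo, end_index]. */
    uint32_t hi_excl = hi + 1u;
    if (hi_excl < lo) hi_excl = lo;
    if (hi_excl > end_index) hi_excl = end_index;

    uint32_t num_bits = hi_excl - lo;
    if (num_bits == 0u) return 0u;    /* range misses this 32-light word */
    if (num_bits == 32u) return mask; /* range covers the whole word */
    return mask & (((1u << num_bits) - 1u) << (lo - start_index));
}

/* Walk the set bits lowest-first, clearing each one after use.
 * Writes up to `capacity` light indices and returns how many were written. */
static uint32_t collect_lights(uint32_t mask, uint32_t start_index,
                               uint32_t *out, uint32_t capacity)
{
    uint32_t count = 0u;
    while (mask != 0u && count < capacity)
    {
        uint32_t bit = 0u;
        while (((mask >> bit) & 1u) == 0u) bit++; /* portable findLSB */
        out[count++] = start_index + bit;
        mask &= mask - 1u;                        /* clear lowest set bit */
    }
    return count;
}
```

For example, with a full mask, a Z-range of lights 40..50 and the word starting at light 32, only bits 8..18 survive.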

Potential performance problems

By decoupling XY and Z in culling there’s a lot of potential for false positives where large lights might dominate how Z-ranges are computed and trigger a lot of over-shading. I haven’t done much testing here though, but this is probably the only real weakness I can think of with this technique. Regular tile-deferred has similar issues.

Culling tightness

I’m using smaller lights here to demonstrate. Red or green light signifies that a light was computed for that pixel:

Placing point light in volumetric fog for good measure. The weird red/green “artifacting” around the edges is caused by the forward shading and subgroupOr logic when shading to ensure subgroup uniform behavior.

Potential improvements

Clipping Z-range calculation against view frustum might help a fair bit, since Z-range can be way too conservative for large positional lights like these. Especially spot lights which point to the side like the image above. Classic deferred has a very similar problem case unless ancient stencil culling techniques are used to get double sided depth tests.

Conclusion

I’m happy with the implementation. Performance seems very good, but I haven’t dug deep in analysis there yet. I was mostly concerned with getting it to work. Just waiting for mobile GPU vendors to support bindless, so I can test there as well I guess.

The weird world of shader divergence and LOD

Mip-mapping is hard – importance of keeping your quads full

Sampling textures with mip-mapping is ancient, but it’s still hard apparently. Implicit LOD calculation is the first instance where we poke a hole into the “single threaded” abstraction of high level shading languages like GLSL and HLSL and dive into the maddening world of warps, waves, quads and everything in-between. For fragment shading, at least a group of 2×2 threads (a quad) need to run side by side so we can have gradient information over the screen. On any modern GPU, these 2×2 threads are actually running in lock-step as we’ll see when looking at GPU ISA later …

The Vulkan/GL/GLES ecosystem has always specified that implicit LOD instructions must happen in dynamically uniform control flow. Dynamically uniform just means that either all threads have to execute a texture() instruction, or no threads do. This ensures that there are always 4 valid texture coordinates from which to compute derivatives. The easiest way to ensure that this guarantee holds is simply to never sample in control flow, but that’s not really practical in more interesting shaders.

If you’re sampling in control flow you better make sure you uphold the guarantees of the spec.

Having to be dynamically uniform over an entire draw call is a bit silly, so the Vulkan specification recently tightened the scope such that if you have subgroupSize >= 4, you only need to be dynamically uniform on a per-quad granularity. This makes sense. We only need correct derivatives in the quad, we shouldn’t have to care if some unrelated quad or even triangle is diverging.

An interesting case came up recently where apparently developers expect that you actually can sample with implicit LOD in diverging control flow. Apparently HLSL “defines” this control flow to be valid code.

vec2 uv = from_somewhere();
if (weight > 0.0)
    sum += weight * texture(Texture, uv);

The idea is that we shouldn’t have to sample the texture unless we’re going to use it, but it’s still nice to provide UV for LOD purposes. Unfortunately, there is no obvious way to express this optimization in high level languages. UV is well defined in the outer scope which is dynamically uniform, so that’s something … Intuitively, this code makes sense, but it gets really murky once we dig deeper.

With subgroup ops, we can probably get a good approximation on the HLL side.

bool quadAny(bool value)
{
    // Perhaps this can be translated into s_wqm on AMD
    // if compiler checks this pattern?
    return subgroupClusteredOr(int(value), 4) != 0;
}

vec2 uv = from_somewhere();
// Hoist texture sampling out of branch and force quad uniformity.
vec4 tex;
if (quadAny(weight > 0.0))
    tex = texture(Texture, uv);
if (weight > 0.0)
    sum += weight * tex;
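To see what the hoist buys us, here is a tiny CPU emulation of a single quad in C. It is only a sketch: `quad_bools`, `quad_any` and the two counting helpers are hypothetical names for this illustration, and a real quad is of course four GPU lanes, not an array.

```c
#include <stdbool.h>

/* One 2x2 quad: four lanes evaluating the same branch condition. */
typedef struct { bool lane[4]; } quad_bools;

/* CPU emulation of quadAny(): every lane receives the OR over the quad. */
static bool quad_any(const quad_bools *q)
{
    return q->lane[0] || q->lane[1] || q->lane[2] || q->lane[3];
}

/* Number of lanes that execute texture() when sampling inside the
 * original per-lane branch. */
static int lanes_sampling_naive(const quad_bools *q)
{
    int n = 0;
    for (int i = 0; i < 4; i++)
        n += q->lane[i] ? 1 : 0;
    return n;
}

/* Number of lanes that execute texture() when the sample is hoisted
 * behind quadAny(): always all four or none. */
static int lanes_sampling_hoisted(const quad_bools *q)
{
    return quad_any(q) ? 4 : 0;
}
```

Implicit LOD is only well-defined when that count is 0 or 4. The naive branch can produce 1, 2 or 3, while the hoisted version never does, at the cost of sampling in lanes that will throw the result away.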

Querying gradients and then sampling with that in the branch is fine as well, but it is slow, and not really a fix, at best a workaround.

HLSL seems a bit murky about whether this kind of code is legal, it’s all “that one app did this thing that one time and now we’re screwed”. From my understanding compilers can do some heroics here to work around this in applications.

I wanted to try this kind of code on all Vulkan devices I have available to see what happens. We’re in undefined territory as far as LOD goes, so anything can happen. There are three outcomes I’m looking for which seem like plausible HW behavior:

It just happens to work. This is kinda scary, since it’ll probably break in 5 years anyways.
The LOD computed is garbage.
The LOD is forced to some value on divergence.

Here’s the concrete shader I’m using, from https://github.com/Themaister/Granite/blob/master/tests/assets/shaders/divergent_lod.frag. A test to run the shader is https://github.com/Themaister/Granite/blob/master/tests/divergent_lod_test.cpp

#version 450

layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;
layout(set = 0, binding = 1) uniform sampler2D uSampler;
layout(set = 0, binding = 0, std140) uniform Weights
{
    vec4 weights[4];
};

void main()
{
    vec3 tex = vec3(0.0);
    float lod = -10.0;
    vec2 uv = vUV;
    if (weights[int(gl_FragCoord.x) + 2 * int(gl_FragCoord.y)].x > 0.0)
    {
        tex = texture(uSampler, uv).rgb;
        lod = textureQueryLod(uSampler, uv).y;
    }

    FragColor = vec4(tex, lod);
}

I render this on a 2×2 frame buffer with a full-screen “expanded triangle” to not get any helper lane shenanigans. Let’s try to run this across a wide range of hardware and see what happens. NOTE: any result here is equally valid in Vulkan, this is intentionally going out of spec.

AMD

I tested this on a Navi card. RDNA ISA seems similar enough to GCN … We effectively have 4 driver stacks for AMD cards now to test.

RADV (LLVM 10)

Garbage LOD

main:
BB16_0:
	s_mov_b64 s[0:1], exec    
	s_wqm_b64 exec, exec
	v_cvt_i32_f32_e32 v3, v3
	s_mov_b32 s6, s3
	s_movk_i32 s7, 0x8000          
	v_cvt_i32_f32_e32 v2, v2     
	v_mov_b32_e32 v5, 0xc1200000    
	s_load_dwordx4 s[8:11], s[6:7], 0x0    
	v_lshlrev_b32_e32 v3, 1, v3  
	v_add_lshl_u32 v2, v3, v2, 4   
	s_waitcnt lgkmcnt(0)    
	buffer_load_dword v4, v2, s[8:11], 0 offen    
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, v2      
	s_waitcnt vmcnt(0)        
	v_cmp_lt_f32_e32 vcc, 0, v4
	v_mov_b32_e32 v4, v2     
	s_and_saveexec_b64 s[8:9], vcc
	s_cbranch_execz BB16_2
BB16_1:
	s_mov_b32 s3, s7      
	s_add_i32 s6, s2, 0x50           
	s_mov_b32 m0, s4           
	s_load_dwordx8 s[12:19], s[2:3], 0x0       
	s_load_dwordx4 s[20:23], s[6:7], 0x0          
	v_interp_p1_f32_e32 v7, v0, attr0.x
	v_interp_p1_f32_e32 v8, v0, attr0.y      
	v_interp_p2_f32_e32 v7, v1, attr0.x          
	v_interp_p2_f32_e32 v8, v1, attr0.y   
	s_waitcnt lgkmcnt(0)     
	image_sample v[2:4], v[7:8], s[12:19], s[20:23] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v5, v[7:8], s[12:19], s[20:23] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB16_2:
	v_nop                 
	s_or_b64 exec, exec, s[8:9]              
	s_and_b64 exec, exec, s[0:1]         
	s_waitcnt vmcnt(0)                
	exp mrt0 v2, v3, v4, v5 done vm          
	s_endpgm

We see that v7 and v8 hold the UV coordinates, but they are actually only computed inside the branch (v_interp). The optimizer is allowed to place UV computation inside the branch here. If there is divergence in a quad, the disabled lanes won’t get correct values for v7 and v8 (since execution is masked), and LOD becomes garbage.
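The failure mode can be mimicked on the CPU. The sketch below assumes, as the ISA suggests, that gradients are formed from the neighboring lanes’ registers regardless of the exec mask. The lane layout (0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right) and all names (`uv2`, `interpolate_masked`, `quad_ddx_u`, `quad_ddy_v`) are assumptions made up for this illustration.

```c
#include <stdbool.h>

typedef struct { float u, v; } uv2;

/* Emulates v_interp executed inside the branch: only lanes whose exec bit
 * is set get their UV register written. Masked lanes keep stale contents. */
static void interpolate_masked(uv2 reg[4], const uv2 true_uv[4],
                               const bool exec[4])
{
    for (int lane = 0; lane < 4; lane++)
        if (exec[lane])
            reg[lane] = true_uv[lane];
}

/* Coarse derivatives as differences between neighboring lane registers.
 * Note that the exec mask plays no part here. */
static float quad_ddx_u(const uv2 reg[4]) { return reg[1].u - reg[0].u; }
static float quad_ddy_v(const uv2 reg[4]) { return reg[2].v - reg[0].v; }
```

If lane 1 is masked off when the interpolation runs, quad_ddx_u() subtracts against whatever happened to be in that register, which is exactly the garbage LOD case. Interpolating before the branch, as ACO happens to do, fills all four registers and the same differences come out right.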

RADV (ACO)

Coupled with Navi cards, this is probably the most bleeding edge setup you can run. It’s a completely new compiler backend for AMD cards, not based on LLVM.

Just happens to work

BB0:
	s_wqm_b64 exec, exec 
	s_mov_b32 s0, s3    
	s_movk_i32 s1, 0x8000  
	s_load_dwordx4 s[8:11], s[0:1], 0x0 
	s_mov_b32 m0, s4 
	v_interp_p1_f32_e32 v4, v0, attr0.y 
	v_cvt_i32_f32_e32 v2, v2   
	v_cvt_i32_f32_e32 v3, v3   
	v_lshl_add_u32 v2, v3, 1, v2 
	v_lshlrev_b32_e32 v2, 4, v2 
	s_waitcnt lgkmcnt(0)   
	buffer_load_dword v2, v2, s[8:11], 0 offen 
	v_interp_p2_f32_e32 v4, v1, attr0.y 
	v_interp_p1_f32_e32 v0, v0, attr0.x  
	v_interp_p2_f32_e32 v0, v1, attr0.x 
	v_mov_b32_e32 v1, v4  
	s_waitcnt vmcnt(0)   
	v_cmp_lt_f32_e32 vcc, 0, v2  
	s_and_saveexec_b64 s[0:1], vcc     
	s_cbranch_execz BB3  
BB1:
	s_movk_i32 s3, 0x8000 
	s_load_dwordx8 s[4:11], s[2:3], 0x0  
	s_load_dwordx4 s[12:15], s[2:3], 0x50 
	s_waitcnt lgkmcnt(0)   
	image_sample v[2:4], v[0:1], s[4:11], s[12:15] dmask:0x7 dim:SQ_RSRC_IMG_2D
	image_get_lod v0, v[0:1], s[4:11], s[12:15] dmask:0x2 dim:SQ_RSRC_IMG_2D
BB3:
	s_andn2_b64 exec, s[0:1], exec 
	s_cbranch_execz BB6  
BB4:
	v_mov_b32_e32 v0, 0xc1200000 
	v_mov_b32_e32 v2, 0  
	v_mov_b32_e32 v3, 0   
	v_mov_b32_e32 v4, 0  
BB6:
	s_mov_b64 exec, s[0:1] 
	s_waitcnt vmcnt(0)   
	exp mrt0 v2, v3, v4, v0 done vm 
	s_endpgm

This time, UV is interpolated outside the branch, so sampling in divergent control flow ends up working after all. The registers are well defined as they enter the branch. For AMD, it seems like it just comes down to whether or not the lanes have correct values placed in them and not having them be clobbered by the time we get around to sampling. There doesn’t seem to be any hardware level checks for divergence.

AMDVLK

Garbage LOD

AMDVLK uses the same LLVM stack that RADV LLVM uses, and no surprise, same result, and basically same exact ISA is generated.

Windows

Also just happens to work

I guess it’s the exact same case as the ACO compiler here. No need to paste disassembly.

Intel

Tested on UHD 620 (8th gen mobile CPU I think).

Anvil (Mesa)

The Mesa compiler can spit out assembly, which is nice.

Just happens to work

ISA (a little too wide to embed): https://gist.github.com/Themaister/7c5b011cde3c7585459b089f80f897e2

From what I can make out of the ISA, the UV is interpolated outside control flow, and then only the sampling takes place in control flow. It seems like Intel has similar behavior as AMD here, in that just as long as the registers are valid, divergent sampling “works”.

Windows

Just happens to work

Doesn’t seem to be a way to get ISA from Windows driver, but I suppose it’s same as ANV here.

Nvidia

Tested on a Turing GPU on Linux. Didn’t bother testing on Windows as well considering the driver stack is basically the same.

LOD is clamped to 0, textureQueryLod returns -32.0.

Apparently, now we start seeing interesting behavior. Unfortunately, there is no public ISA to look at. The -32.0 LOD might look weird, but this is kind of expected. This is apparently the smallest possible representable LOD on this GPU. LOD is usually represented in some kind of fixed point, log2(0) = -inf after all.

I confirmed it worked as expected when using non-divergent execution as a sanity check.

Arm

Tested on Mali-G72.

LOD is clamped to 0, textureQueryLod returns -128.0.

Very similar behavior to Nvidia here, except the LOD is -128.0 rather than -32.0. I confirmed it worked as expected when using non-divergent execution as a sanity check.

QCOM

Tested on Adreno 506.

Garbage LOD

Again, no ISA to look at. I confirmed it worked as expected when using non-divergent execution as a sanity check.

Conclusion

Never ever rely on LOD behavior with divergent quads (EDIT: at least the way it’s specced out and implemented on Vulkan drivers right now). You’d be contributing to the pain and suffering of compiler engineers the world over. Staying quad-uniform is fine though.

Yet another blog explaining Vulkan synchronization

After playing Fire Emblem: Three Houses for an ungodly 160 hours over the past weeks, I guess it’s time to put on my professor hat on the internet instead.

One topic I’ve been meaning to write about for a long time is synchronization in Vulkan. It’s a large hurdle to overcome when learning the API, and rather than mechanically explaining how it works, my goal here is to instill a mental model in the reader. Despite its reputation for maddening complexity, it is actually understandable and quite logical once you get over the initial hurdles.

Where appropriate, I will use terms which match the Vulkan specification.

The Vulkan queue

For this part of the discussion we will only consider a single VkQueue. There is a lot to consider for single-queue synchronization, and dealing with multiple queues is a small extension on top of single-queue synchronization, which is covered at the end when discussing semaphores.

The Vulkan queue is simply an abstraction where command buffers are submitted and the GPU churns through commands. Let’s get some common beginner mistakes out of the way first.

Command buffer misconceptions

Many developers seem to think that command buffer boundaries are somehow special in Vulkan. It is very important to clarify that for purposes of synchronization, everything submitted to a queue is simply a linear stream of commands. Any synchronization applies globally to a VkQueue; there is no concept of an only-inside-this-command-buffer synchronization.

Command overlap

The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order. Reordering may happen across command buffers and even vkQueueSubmits. This makes sense, considering that Vulkan only sees a linear stream of commands once you submit, it is a pitfall to assume that splitting command buffers or submits adds some magic synchronization for you.

NOTE: Unlike Vulkan, I do believe D3D12 disables any overlap across queue submits, but don’t quote me on that. Might be something to consider if you’re coming from D3D-land.

NOTE: Frame buffer operations inside a render pass happen in API-order, of course. This is a special exception which the spec calls out.

Pipeline stages

Every command you submit to Vulkan goes through a set of stages. These stages are represented by the VK_PIPELINE_STAGE_* flag bits (VkPipelineStageFlagBits). See chapter 6.1.2 in spec. When we synchronize work in Vulkan, we synchronize work happening in these pipeline stages as a whole, and not individual commands of work.

Draw calls, copy commands and compute dispatches all go through pipeline stages one by one.

The mysterious TOP_OF_PIPE and BOTTOM_OF_PIPE stages

A common stumbling block is the TOP_OF_PIPE and BOTTOM_OF_PIPE stages. These are essentially “helper” stages, which do no actual work, but serve some important purposes. Every command will first execute the TOP_OF_PIPE stage. This is basically the command processor on the GPU parsing the command. BOTTOM_OF_PIPE is where commands retire after all work has been done. TOP_OF_PIPE and BOTTOM_OF_PIPE are useful in specific scenarios, keep them in mind for later, as they are a little tricky and beginners make many mistakes with these.

In-queue execution barriers

Before we tackle memory barriers, we must fully understand execution barriers, as they are a subset of memory barriers. The primary mechanism in Vulkan to introduce execution barriers is the pipeline barrier.

To introduce the simplest form of an execution dependency we use a pipeline barrier:

void vkCmdPipelineBarrier(
    VkCommandBuffer                             commandBuffer,
    VkPipelineStageFlags                        srcStageMask,
    VkPipelineStageFlags                        dstStageMask,
    VkDependencyFlags                           dependencyFlags,
    uint32_t                                    memoryBarrierCount,
    const VkMemoryBarrier*                      pMemoryBarriers,
    uint32_t                                    bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier*                pBufferMemoryBarriers,
    uint32_t                                    imageMemoryBarrierCount,
    const VkImageMemoryBarrier*                 pImageMemoryBarriers);

If we ignore the memory barriers and flags here, we’re essentially left with two arguments, srcStageMask and dstStageMask. This represents the heart of the Vulkan synchronization model. We’re splitting the command stream in two with a barrier, where we consider “everything before” the barrier, and “everything after” the barrier, and these two halves are synchronized in some way.

Section 6.1 lays this out in rather obtuse language, but we boil it down to:

srcStageMask

This represents what we are waiting for. Vulkan does not let you add fine-grained dependencies between individual commands. Instead you get to look at all work which happens in certain pipeline stages. For example, if we were to submit this series of commands starting off a fresh VkDevice:

vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
vkCmdCopyBuffer (VK_PIPELINE_STAGE_TRANSFER_BIT)
vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
vkCmdPipelineBarrier (srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)

We would be referring to the two vkCmdDispatch commands, as they perform their work in the COMPUTE stage. Even if we split these 4 commands into 4 different vkQueueSubmits, we would still consider the same commands for synchronization. Essentially, the work we are waiting for is all commands which have ever been submitted to the queue including any previous commands in the command buffer we’re recording. srcStageMask then restricts the scope of what we are waiting for. Only work happening in COMPUTE_SHADER_BIT stage is relevant in this example. srcStageMask is a bit-mask as the name suggests, so it’s perfectly fine to wait for both COMPUTE and TRANSFER work.

There are also flags to refer to “all commands”, ALL_COMMANDS_BIT, which basically drains the entire queue for work. ALL_GRAPHICS_BIT is the same, but only for render passes.

NOTE: Here we will find a potential use case for TOP_OF_PIPE. srcStageMask of TOP_OF_PIPE is basically saying “wait for nothing”, or to be more precise, we’re waiting for the GPU to parse all commands, which is a complete no-op. We had to parse all commands before getting to the pipeline barrier command to begin with. When we get to memory barriers, this can be very useful.
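As a mental model, srcStageMask can be thought of as a filter over everything submitted to the queue so far. The C sketch below is a toy, not Vulkan: the `STAGE_*` constants and `before_set` are made up for this illustration, and real code would use the VK_PIPELINE_STAGE_*_BIT values from the Vulkan headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stage bits for this sketch only. */
enum {
    STAGE_TOP_OF_PIPE = 1,
    STAGE_COMPUTE     = 2,
    STAGE_TRANSFER    = 4
};

/* cmd_stages[i] is the stage in which the i-th submitted command does its
 * work. Returns a bitmask (bit i = command i) of the commands a barrier
 * with the given srcStageMask waits for. Command buffer and submit
 * boundaries are deliberately absent: only queue order matters. */
static uint32_t before_set(const uint32_t *cmd_stages, size_t num_cmds,
                           uint32_t src_stage_mask)
{
    uint32_t set = 0u;
    for (size_t i = 0; i < num_cmds && i < 32; i++)
        if ((cmd_stages[i] & src_stage_mask) != 0u)
            set |= 1u << i;
    return set;
}
```

Feeding in the dispatch / copy / dispatch sequence from above with a COMPUTE mask picks out commands 0 and 2, adding TRANSFER picks all three, and TOP_OF_PIPE picks nothing, since no command does its actual work there.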

dstStageMask

This represents the second half of the barrier. Any work submitted after this barrier will need to wait for the work represented by srcStageMask before it can execute. Only work in the specified stages is affected. For example, if dstStageMask is FRAGMENT_SHADER_BIT, vertex shading for future commands can begin executing early, we only need to wait once FRAGMENT_SHADER_BIT is reached.

NOTE: As an analog to srcStageMask with TOP_OF_PIPE, for dstStageMask, using BOTTOM_OF_PIPE can be kind of useful. This basically translates to “block the last stage of execution in the pipeline”. Basically, we translate this to mean “no work after this barrier is going to wait for us”. This might seem meaningless, but it will be useful when we discuss semaphores and memory barriers later.

A crude example

Let’s assume we record and submit some commands on a fresh VkDevice:

vkCmdDispatch
vkCmdDispatch
vkCmdDispatch
vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = COMPUTE)
vkCmdDispatch
vkCmdDispatch
vkCmdDispatch

With this barrier, the “before” set is commands {1, 2, 3}. The “after” set is {5, 6, 7}. A possible execution order here could be 3, 2, 1, 7, 6, 5.

{1, 2, 3} can execute out-of-order, and so can {5, 6, 7}, but these two sets of commands can not interleave execution. In spec lingo {1, 2, 3} happens-before {5, 6, 7}.
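One way to pin down “can not interleave” is to check a timeline. The following C sketch is a toy model with made-up names (`span`, `respects_barrier`) and arbitrary time units. It says nothing about how a GPU actually schedules work, only which schedules the barrier permits.

```c
#include <stdbool.h>
#include <stddef.h>

/* When a command runs, in arbitrary time units. */
typedef struct { int start, end; } span;

/* True if every command in `before` has finished by the time any command
 * in `after` starts. Ordering inside each set is unconstrained. */
static bool respects_barrier(const span *before, size_t num_before,
                             const span *after, size_t num_after)
{
    for (size_t i = 0; i < num_before; i++)
        for (size_t j = 0; j < num_after; j++)
            if (after[j].start < before[i].end)
                return false;
    return true;
}
```

Commands within the “before” set may overlap each other and finish in any order, as long as the last of them ends before the first “after” command begins.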

https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples has some examples of how these stages are used in practice.

Events aka. split barriers

Vulkan provides a way to get overlapping work in-between barriers. The idea of VkEvent is to get some unrelated commands in-between the “before” and “after” set of commands, e.g.:

vkCmdDispatch
vkCmdDispatch
vkCmdSetEvent(event, srcStageMask = COMPUTE)
vkCmdDispatch
vkCmdWaitEvents(event, dstStageMask = COMPUTE)
vkCmdDispatch
vkCmdDispatch

The “before” set is now {1, 2}, and the “after” set is {6, 7}. Command 4 is not affected by any synchronization, so it can fill in the parallelism “bubble” we get when draining the GPU of work from 1 and 2. For advanced compute, this is a very important thing to know about, but not all GPUs and drivers can take advantage of this feature.
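The bookkeeping for a split barrier can be written down in a few lines of C. This is a simplified sketch with hypothetical names (`cmd_role`, `classify`): it assumes every command runs in the stages named by the event’s masks, and positions are 1-based indices into the command stream, as in the listing.

```c
typedef enum {
    ROLE_BEFORE,         /* must finish before the wait is satisfied */
    ROLE_UNSYNCHRONIZED, /* free to overlap with both sets */
    ROLE_AFTER,          /* blocked until the "before" set is done */
    ROLE_EVENT_COMMAND   /* the set / wait commands themselves */
} cmd_role;

/* Classifies the command at `pos` relative to where the event is set
 * and where it is waited on. */
static cmd_role classify(int pos, int set_event_pos, int wait_events_pos)
{
    if (pos == set_event_pos || pos == wait_events_pos)
        return ROLE_EVENT_COMMAND;
    if (pos < set_event_pos)
        return ROLE_BEFORE;
    if (pos > wait_events_pos)
        return ROLE_AFTER;
    return ROLE_UNSYNCHRONIZED;
}
```

A plain pipeline barrier is the degenerate case where the set and the wait sit at the same position, which leaves no room for unsynchronized work in between.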

Execution dependency chain

This is a subtle – but very important – point which I don’t think is well enough understood. The general gist of it is that when we use dstStageMask to block stages, the dependencies in srcStageMask are carried forward into the blocked stages. Waiting for dstStageMask later will also wait for any dependencies dstStageMask had. It is easier to show an example here:

vkCmdDispatch
vkCmdDispatch
vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
vkCmdDispatch
vkCmdDispatch

In this example we actually get a dependency between {1, 2} and {5, 6}. This is because we created a chain of dependencies between COMPUTE -> TRANSFER -> COMPUTE. When we wait for TRANSFER in 4. we must also wait for anything which is currently blocking TRANSFER. This might seem confusing, but it makes sense if we consider a slightly modified example.

vkCmdDispatch
vkCmdDispatch
vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
vkCmdMagicDummyTransferOperation
vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
vkCmdDispatch
vkCmdDispatch

In this scenario, it’s clear that {4} must wait for {1, 2}. And {6, 7} must wait for {4}. So, we have created a chain where {1, 2} -> {4} -> {6, 7}, and as {4} is noop, {1, 2} -> {6, 7} is achieved. That’s essentially the chain.
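The chain can be modeled with a little bookkeeping per stage. The C sketch below is a toy approximation of the rule, not the formal definition from the spec: `queue_model`, `record_command` and `record_barrier` are hypothetical names, and only two stages are modeled.

```c
#include <stdint.h>

enum { STAGE_COMPUTE = 0, STAGE_TRANSFER = 1, NUM_STAGES = 2 };

typedef struct {
    uint32_t ran_in[NUM_STAGES];     /* commands recorded so far, per stage */
    uint32_t blocked_by[NUM_STAGES]; /* commands new work in a stage waits for */
} queue_model;

/* Records command `id` (0..31) in `stage` and returns the set of earlier
 * commands it has to wait for, as a bitmask (bit n = command n). */
static uint32_t record_command(queue_model *q, int stage, unsigned id)
{
    q->ran_in[stage] |= 1u << id;
    return q->blocked_by[stage];
}

/* Barrier from one source stage to one destination stage. The chain rule:
 * whatever currently blocks the source stage is carried forward into the
 * destination stage, together with the work that ran in the source stage. */
static void record_barrier(queue_model *q, int src_stage, int dst_stage)
{
    uint32_t deps = q->ran_in[src_stage] | q->blocked_by[src_stage];
    q->blocked_by[dst_stage] |= deps;
}
```

Recording dispatches 1 and 2, then COMPUTE -> TRANSFER, then TRANSFER -> COMPUTE, makes the next dispatch wait on {1, 2} even though no transfer command was ever recorded. Drop the second barrier and the dependency disappears.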

This has some uses when you want to “link up” barriers for whatever reason. I kinda wish Vulkan had some special “scoreboard” pipeline stages just for this use case …

Pipeline stages and render passes

COMPUTE and TRANSFER work is very simple when it comes to pipeline stages. The only stages they execute are:

TOP_OF_PIPE
DRAW_INDIRECT (for indirect compute only)
COMPUTE / TRANSFER
BOTTOM_OF_PIPE

Render passes are a bit more intricate, and it’s very easy to confuse which pipeline stages do what.

In render passes there are two “families” of pipeline stages, those which concern themselves with geometry processing, and the fragment family, which does rasterization / frame buffer operations.

Aside from TOP_OF_PIPE/BOTTOM_OF_PIPE, we have

Geometry

DRAW_INDIRECT – Parses indirect buffers
VERTEX_INPUT – Consumes fixed function VBOs and IBOs
VERTEX_SHADER – Actual vertex shader
TESSELLATION_CONTROL_SHADER
TESSELLATION_EVALUATION_SHADER
GEOMETRY_SHADER

Fragment

EARLY_FRAGMENT_TESTS
FRAGMENT_SHADER
LATE_FRAGMENT_TESTS
COLOR_ATTACHMENT_OUTPUT

For the most part, it’s the Fragment stages which are a bit confusing. Each of them has its own use cases.

EARLY_FRAGMENT_TESTS

This is the stage where early depth/stencil tests happen. This stage isn’t all that useful or meaningful except in some very obscure scenarios with frame buffer self-dependencies (aka, GL_ARB_texture_barrier). This is also where a render pass performs its loadOp of a depth/stencil attachment.

LATE_FRAGMENT_TESTS

This is where late depth-stencil tests take place, and also where depth-stencil attachments are stored with storeOp when a render pass is done.

Helpful tip on fragment test stages

It’s somewhat confusing to have two stages which basically do the same thing. When you’re waiting for a depth map to have been rendered in an earlier render pass, you should use srcStageMask = LATE_FRAGMENT_TESTS_BIT, as that will wait for the storeOp to finish its work.

When blocking a render pass with dstStageMask, just use a mask of EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS.

NOTE: dstStageMask = EARLY_FRAGMENT_TESTS alone might work since that will block loadOp, but there might be shenanigans with memory barriers if you are 100% pedantic about any memory access happening in LATE_FRAGMENT_TESTS. If you’re blocking an early stage, it never hurts to block a later stage as well.

COLOR_ATTACHMENT_OUTPUT

This one is where loadOp, storeOp, MSAA resolves and the frame buffer blend stage take place, basically anything which touches a color attachment in a render pass in some way. If you’re waiting for a render pass which uses color to be complete, use srcStageMask = COLOR_ATTACHMENT_OUTPUT, and similar for dstStageMask when blocking render passes from execution.

Memory barriers

Now that we have the basics for execution barriers, we can kick it up a notch and consider memory barriers.

Execution order and memory order are two different things. GPUs are notorious for having multiple, incoherent caches which all need to be carefully managed to avoid glitched out rendering. This means that just synchronizing execution alone is not enough to ensure that different units on the GPU can transfer data between themselves.

If you are familiar with how C++11 introduced memory order and atomics, it is a good start, but the C++11 memory model does not consider that memory access can be incoherent to my knowledge. All CPU memory is assumed to be coherent, but memory order is weak on basically anything non-x86. Vulkan expands on this concept.

The two concepts in the Vulkan specification we need to understand are memory being available and memory being visible. This is an abstraction over the fact that GPUs have incoherent caches. To explain this I will describe a mental model of a hypothetical GPU design which should make sense if you are familiar with how caches work.

NOTE: There is a formal Vulkan memory model now which covers all of this in extreme detail. I admit I have not studied it enough to make references to it here, but developers really don’t need to know that level of detail to use Vulkan correctly.

The L2 cache / main memory

We will let the last level of the cache hierarchy represent the “master” memory controller which all caches are connected to. This cache is connected to all the L1 caches, and to external DDR memory. The GPU DDR memory is connected to the CPU memory controller in some way (PCI-e or UMA).

When our L2 cache contains the most up-to-date data there is, we can say that memory is available, because L1 caches connected to L2 can pull in the most up-to-date data there is.

Incoherent L1 caches

Vulkan specifies a bunch of flags in the VK_ACCESS_ series of enums. These flags represent memory access which can be performed. Each pipeline stage can perform certain memory access, and thus we take the combination of pipeline stage + access mask and we get potentially a very large number of incoherent caches on the system. Each GPU core has its own set of L1 caches as well.

Of course, real GPUs will only have a fraction of the possible caches here, but as long as we are explicit about this in the API, any GPU driver can simplify this as needed.

Under section 6.1.3, table 4 in the Vulkan spec you can see a list of all possible access masks which can be used with a pipeline stage. These access masks either read from a cache, or write to an L1 cache in our mental model.

We say that memory is visible to a particular stage + access combination if memory has been made available and we then make that memory visible to the relevant stage + access mask.

Once a shader stage writes to memory, the L2 cache no longer has the most up-to-date data there is, so that memory is no longer considered available. If other caches try to read from L2, they will see undefined data. Whatever wrote that data must make those writes available before the data can be made visible again.

Cache flush and invalidate

To be clear, we can say that “making memory available” is all about flushing caches, and “making memory visible” is invalidating caches. This should make it more obvious what is going on. I will use the spec terminology however.
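The hypothetical GPU from this mental model is small enough to write down in C. The sketch below tracks a single memory location with one L2 and one incoherent L1 per unit. All names (`l1_cache`, `memory`, `make_available`, `make_visible`) are made up for this illustration, and no real hardware is claimed to work exactly like this.

```c
#include <stdbool.h>

/* One incoherent L1 per unit, caching a single memory location. */
typedef struct { int value; bool valid; bool dirty; } l1_cache;

/* The L2 / main memory that all L1 caches hang off. */
typedef struct { int l2; } memory;

/* A write lands in the writer's own L1 and goes no further. */
static void l1_write(l1_cache *c, int v)
{
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* A read hits the local L1 if it holds a line, otherwise fetches from L2. */
static int l1_read(l1_cache *c, const memory *m)
{
    if (!c->valid)
    {
        c->value = m->l2;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* "Make available": flush the writer's dirty line out to L2. */
static void make_available(l1_cache *writer, memory *m)
{
    if (writer->dirty)
    {
        m->l2 = writer->value;
        writer->dirty = false;
    }
}

/* "Make visible": invalidate the reader's L1 so the next read refetches. */
static void make_visible(l1_cache *reader)
{
    reader->valid = false;
}
```

A reader that cached the old value keeps seeing it after the write, and still keeps seeing it after make_available() alone. Only once make_visible() has also run does the read return the new data, which is why a memory barrier needs both halves.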

VkMemoryBarrier

If we revisit vkCmdPipelineBarrier, we can pass in a list of global memory barriers.

typedef struct VkMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
} VkMemoryBarrier;

A global memory barrier deals with access to any resource, and it’s the simplest form of a memory barrier. This means that in vkCmdPipelineBarrier, we are specifying 4 things to happen in order:

Wait for srcStageMask to complete
Make all writes performed in possible combinations of srcStageMask + srcAccessMask available
Make available memory visible to possible combinations of dstStageMask + dstAccessMask.
Unblock work in dstStageMask.

A common misconception I see is that _READ flags are passed into srcAccessMask, but this is redundant. It does not make sense to make reads available, i.e. you don’t flush caches when you’re done reading data.

Memory access and TOP_OF_PIPE/BOTTOM_OF_PIPE

Never use AccessMask != 0 with these stages. These stages do not perform memory accesses, so any srcAccessMask and dstAccessMask combination with either stage will be meaningless, and spec disallows this. TOP_OF_PIPE and BOTTOM_OF_PIPE are purely there for the sake of execution barriers, not memory barriers.

Split memory barriers

A very important point here is that it’s perfectly possible to split up the “make available” and “make visible” operations. This is similar to the execution dependency chain discussed earlier.

We can do something silly like:

vkCmdDispatch – writes to an SSBO, VK_ACCESS_SHADER_WRITE_BIT
vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER, srcAccessMask = SHADER_WRITE_BIT, dstAccessMask = 0)
vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE, srcAccessMask = 0, dstAccessMask = SHADER_READ_BIT)
vkCmdDispatch – read from the same SSBO, VK_ACCESS_SHADER_READ_BIT

While StageMask cannot be 0, AccessMask can be 0.

VkBufferMemoryBarrier

This is not very interesting; we’re just restricting memory availability and visibility to a specific buffer. No GPU I know of actually cares, so I think it makes more sense to just use VkMemoryBarrier rather than bothering with buffer barriers.

VkImageMemoryBarrier

Unlike VkBufferMemoryBarrier, this one is critical. You have to change image layouts at some point and this is done as part of an image memory barrier.

typedef struct VkImageMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkImageLayout oldLayout;
    VkImageLayout newLayout;
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    VkImage image;
    VkImageSubresourceRange subresourceRange;
} VkImageMemoryBarrier;

The interesting bits are oldLayout and newLayout. The layout transition happens in-between the make available and make visible stages of a memory barrier. The layout transition itself is considered a read/write operation, and the rules are basically that memory for the image must be available before the layout transition takes place. After a layout transition, that memory is automatically made available (but not visible!). Basically, think of the layout transition as some kind of in-place data munging which happens in L2 cache somehow.

A practical TOP_OF_PIPE example

Now we can actually make a practical example with TOP_OF_PIPE. If we just allocated an image and want to start using it, what we want to do is to just perform a layout transition, but we don’t need to wait for anything in order to do this transition. This is where TOP_OF_PIPE is useful. Let’s say that we’re allocating a fresh image, and we’re going to use it in a compute shader as a storage image. The pipeline barrier looks like:

srcStageMask = TOP_OF_PIPE – Wait for nothing
dstStageMask = COMPUTE – Unblock compute after the layout transition is done
srcAccessMask = 0 – This is key, there are no pending writes to flush out. This is the only way to use TOP_OF_PIPE in a memory barrier. It’s important to note that freshly allocated memory in Vulkan is always considered available and visible to all stages and access types. You cannot have stale caches when the memory was never accessed … What about recycled/aliased memory you ask? Excellent question, we’ll cover that too later.
oldLayout = UNDEFINED – Input is garbage
newLayout = GENERAL – Storage image compatible layout
dstAccessMask = SHADER_READ | SHADER_WRITE

A practical BOTTOM_OF_PIPE example

My favourite example here is swapchain images. We have to transition them into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before passing the image over to the presentation engine.

After transitioning into this PRESENT layout, we’re not going to touch the image again until we reacquire the image, so dstStageMask = BOTTOM_OF_PIPE is appropriate.

srcStageMask = COLOR_ATTACHMENT_OUTPUT (assuming we rendered to swapchain in a render pass)
srcAccessMask = COLOR_ATTACHMENT_WRITE
oldLayout = COLOR_ATTACHMENT_OPTIMAL
newLayout = PRESENT_SRC_KHR
dstStageMask = BOTTOM_OF_PIPE
dstAccessMask = 0

Having dstStageMask = BOTTOM_OF_PIPE and access mask being 0 is perfectly fine. We don’t care about making this memory visible to any stage beyond this point. We will use semaphores to synchronize with the presentation engine anyways.

Implicit memory ordering – semaphores and fences

Semaphores and fences are quite similar things in Vulkan, but serve different purposes. Semaphores facilitate GPU <-> GPU synchronization across Vulkan queues, and fences facilitate GPU -> CPU synchronization.

These objects are signaled as part of a vkQueueSubmit. However, one very important thing to note about semaphores and fences is how they interact with memory. To signal a semaphore or fence, all previously submitted commands to the queue must complete. If this were a regular pipeline barrier, we would have srcStageMask = ALL_COMMANDS_BIT. However, we also get a full memory barrier, in the sense that all pending writes are made available. Essentially, srcAccessMask = MEMORY_WRITE_BIT.

Implicit memory guarantees on vkQueueSubmit

While signaling a fence or semaphore works like a full cache flush, submitting commands to the Vulkan queue makes all memory accesses performed by the host visible to all stages and access masks. Basically, submitting a batch issues a cache invalidation on host visible memory. A common mistake is to think that you need to do this invalidation manually when the CPU is writing into staging buffers or similar. Something like:

srcStageMask = HOST
srcAccessMask = HOST_WRITE_BIT
dstStageMask = TRANSFER
dstAccessMask = TRANSFER_READ

If the write happened before vkQueueSubmit, this is automatically done for you.

NOTE: This kind of barrier is necessary if you are using vkCmdWaitEvents where you wait for host to signal the event with vkSetEvent. In that case, you might be writing the necessary host data after vkQueueSubmit was called, which means you need a pipeline barrier like this. This is not exactly a common use case, but it’s important to understand when these API constructs are useful.

Implicit memory guarantees when waiting for a semaphore

While signalling a semaphore makes all memory available, waiting for a semaphore makes memory visible. This basically means you do not need a memory barrier if you use synchronization with semaphores since signal/wait pairs of semaphores work like a full memory barrier. Let’s see an example where queue 1 writes to an SSBO in compute, and queue 2 consumes that buffer as a UBO in a fragment shader. We’re going to assume the buffer was created with VK_SHARING_MODE_CONCURRENT.

NOTE: If you need to transfer ownership to a different queue family, you need memory barriers, one in each queue to release/acquire ownership.

Queue 1

vkCmdDispatch
vkQueueSubmit(signal = my_semaphore)

There is no pipeline barrier needed here. Signalling the semaphore waits for all commands, and all writes in the dispatch are made available to the device before the semaphore is actually signaled.

Queue 2

vkCmdBeginRenderPass
vkCmdDraw
vkCmdEndRenderPass
vkQueueSubmit(wait = my_semaphore, pDstWaitStageMask = FRAGMENT_SHADER)

When we wait for the semaphore, we specify which stages should wait for this semaphore, in this case the FRAGMENT_SHADER stage. All relevant memory access is automatically made visible, so we can safely access UNIFORM_READ_BIT in FRAGMENT_SHADER stage, without having extra barriers. The semaphores take care of this automatically, nice!

Execution dependency chain with semaphore

Just like pipeline barriers having execution dependency chains, we can create execution dependency chains with semaphores as well. pDstWaitStageMask in vkQueueSubmit blocks certain stages from executing.

If we create a pipeline barrier with srcStageMask targeting one of the stages in the wait stage mask, we also wait for the semaphore to be signaled. This is extremely useful for doing image layout transitions on swapchain images. We need to wait for the image to be acquired, and only then can we perform a layout transition. The best way to do this is to use pDstWaitStageMask = COLOR_ATTACHMENT_OUTPUT_BIT, and then use srcStageMask = COLOR_ATTACHMENT_OUTPUT_BIT in a pipeline barrier which transitions the swapchain image after the semaphore is signaled.

Host memory reads

While signalling a fence makes all memory available, it does not make it available to the CPU, just within the device. This is where dstStageMask = PIPELINE_STAGE_HOST and dstAccessMask = ACCESS_HOST_READ_BIT come in. If you intend to read back data to the CPU, you must issue a pipeline barrier which makes memory available to the HOST as well. In our mental model, we can think of this as flushing the GPU L2 cache out to GPU main memory, so that the CPU can access it over some bus interface.

Safely recycling memory and aliasing memory

We earlier had an example with creating a fresh VkImage and transitioning it from UNDEFINED, and waiting for TOP_OF_PIPE. As explained, we did not need to specify any srcAccessMask since we knew that memory was guaranteed to be available. The reason for this is because of the implied guarantee of signalling a fence. In order to recycle memory, we must have observed that the GPU was done using the image with a fence. In order to signal that fence, any pending writes to that memory must have been made available, so even recycled memory can be safely reused without a memory barrier. This point is kind of subtle, but it really helps your sanity not having to inject memory barriers everywhere.

However, what if we want to alias memory inside a command buffer? The rule here is that in order to safely alias, all memory access from the active alias must be made available before a new alias can take its place. Here’s an example for a case where we have two VkImages which are used in two render passes, and they alias memory. When one image alias is written to, all other aliases immediately become “undefined”. There are some exceptions in the specification for when multiple aliases can be valid at the same time, but for now we assume that is not the case.

vkCmdPipelineBarrier(image = image1, oldLayout = UNDEFINED, newLayout = COLOR_ATTACHMENT_OPTIMAL, srcStageMask = COLOR_ATTACHMENT_OUTPUT, srcAccessMask = COLOR_ATTACHMENT_WRITE, dstStageMask = COLOR_ATTACHMENT_OUTPUT, dstAccessMask = COLOR_ATTACHMENT_WRITE|READ)

image1 will contain garbage here so we need to transition away from UNDEFINED. We need to make any pending writes to COLOR_ATTACHMENT_WRITE available before the layout transition takes place, assuming that we’re running these commands every frame. The following render pass will wait for this transition to take place using dstStageMask/dstAccessMask.

vkCmdBeginRenderPass/EndRenderPass
vkCmdPipelineBarrier(image = image2, …)
vkCmdBeginRenderPass/EndRenderPass

image1 was written to, so image2 was invalidated. Similar to the pipeline barrier for image1, we need to transition away from UNDEFINED. We need to make sure any write to image1 is made available before we can perform the transition. Next frame, image1 needs to take ownership again, and so on.
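The bookkeeping behind this ping-pong can be sketched as a small tracker. This is a hypothetical helper, not something from Granite or the Vulkan API: `AliasGroup` and the local `Layout` enum are stand-ins, and the tracker only answers which oldLayout the next barrier for an alias should use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for the two image layouts used in the example.
enum class Layout { Undefined, ColorAttachmentOptimal };

// Hypothetical tracker for images which share one memory allocation.
class AliasGroup {
public:
    explicit AliasGroup(std::size_t image_count)
        : layouts_(image_count, Layout::Undefined) {}

    // oldLayout to use in the barrier which activates `image`.
    Layout old_layout_for(std::size_t image) const {
        assert(image < layouts_.size());
        return layouts_[image];
    }

    // Record that `image` was transitioned and written to.
    // Every other alias of the same memory now contains garbage.
    void activate(std::size_t image, Layout new_layout) {
        assert(image < layouts_.size());
        for (auto &layout : layouts_)
            layout = Layout::Undefined;
        layouts_[image] = new_layout;
    }

private:
    std::vector<Layout> layouts_;
};
```

With two aliases, activating image2 resets image1 to Undefined, so the barrier for image1 in the next frame has to start from UNDEFINED again. The srcStageMask/srcAccessMask needed to make the previous alias's writes available are still up to the caller.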

External subpass dependencies

Render passes in Vulkan have a concept of EXTERNAL subpass dependencies. This is arguably the most misunderstood aspect of Vulkan sync. I’d like to dedicate a section to this, because too many developers are lured into using it in cases where it’s not terribly useful and very likely to cause bugs.

The main purpose of external subpass dependencies is to deal with initialLayout and finalLayout of an attachment description. If initialLayout != layout used in the first subpass, the render pass is forced to perform a layout transition.

If you don’t specify anything else, that layout transition will wait for nothing before it performs the transition. Or rather, the driver will inject a dummy subpass dependency for you with srcStageMask = TOP_OF_PIPE_BIT. This is not what you want since it’s almost certainly going to be a race condition. You can set up a subpass dependency with the appropriate srcStageMask and srcAccessMask. The external subpass dependency is basically just a vkCmdPipelineBarrier injected for you by the driver. The whole premise here is that it’s theoretically better to do it this way because the driver has more information, but this is questionable, at least on current hardware and drivers.

There is a very similar external subpass dependency setup for finalLayout. If finalLayout differs from the last use in a subpass, driver will transition into the final layout automatically. Here you get to change dstStageMask/dstAccessMask. If you do nothing here, you get BOTTOM_OF_PIPE/0, which can actually be just fine. A prime use case here is swapchain images which have finalLayout = PRESENT_SRC_KHR.

Essentially, you can ignore external subpass dependencies. Their added complexity gives very little gain. Render pass compatibility rules also imply that if you change even minor things like which stages to wait for, you need to create new pipelines! This is dumb, and will hopefully be fixed at some point in the spec.

However, while the usefulness of external subpass dependencies is questionable, they have some convenient use cases I’d like to go over:

Automatically transitioning TRANSIENT_ATTACHMENT images

If you’re on mobile, you should be using transient images where possible. When using these attachments in a render pass, it makes sense to always have them as initialLayout = UNDEFINED. Since we know that these images can only ever be used in COLOR_ATTACHMENT_OUTPUT or EARLY/LATE_FRAGMENT_TEST stages depending on their image format, the external subpass dependency writes itself, and we can just use transient attachments without having to think too hard about how to synchronize them. This is what I do in my Granite engine, and it’s quite useful. Of course, we could just inject a pipeline barrier for this exact same purpose, but that’s more boilerplate.

Automatically transitioning swapchain images

Typically, swapchain images are always just used once per frame, and we can deal with all synchronization using external subpass dependencies. We want initialLayout = UNDEFINED, and finalLayout = PRESENT_SRC_KHR.

srcStageMask is COLOR_ATTACHMENT_OUTPUT which lets us link up with the swapchain acquire semaphore. For this case, we will need an external subpass dependency. For the finalLayout transition after the render pass, we are fine with BOTTOM_OF_PIPE being used. We’re going to use semaphores here anyways.

I also do this in Granite.

Conclusion

I hope this was useful. The post got a bit more mechanical than I hoped for, but it should be a more distilled version of the specification.

A tour of Granite’s Vulkan backend – Part 6

Pipelines – what is your pain tolerance?

A lot of thought goes into pipelines. Eager or lazy creation, dynamic or static render state. Forget about one size fits all. How close will you approach the volcano? Make sure there is no lava under your feet when you’re done.

My pain tolerance is kinda low, I’d rather watch it on TV. Granite is a bit similar, it prefers to be cooled off magma instead.

The ideal case

Vulkan is designed to let you forget about filthy, filthy render state management and work exclusively with pristine VkPipeline objects. These objects encode every possible choice you can make when flipping the fixed-function bits and bobs on the GPU.

Getting to a point where you only think in terms of VkPipelines, and all pipelines are compiled up front at load time, is a holy grail of modern graphics API implementation. Gone are the stutters, the hitches, the sad 100 ms glitches which throw you off guard when you peek around the wall.

To get there, you must sacrifice all notions of flexibility, no last minute decisions, everything must be planned out in detail ahead of time. There is a lot of state which is pulled together to form a VkPipeline, an all-star cast of colorful characters and a plot with a lot of depth.

… ahem, that got a bit weird.

Shader modules

Obviously, the core part of a pipeline is the shader modules, the Vulkan::Program in Granite. From the program we automatically know the VkPipelineLayout because of reflection, so no problems there.

Render pass

We also need to know the render pass (and subpass index!) in order to create a pipeline. This one can be really counter-intuitive. The shader compiler often needs to know which render target formats are in use in order to generate final ISA. This is where we start running into problems. There is no obvious reason to combine a render pass and shader modules together. In my mental model these two should not know about each other, but drivers would really like that to be the case. For example, if I were to render a scene it would look something like:

Start rendering to some attachments (VkRenderPass is known here)
Set up the default rendering state appropriate for the pass. There are different “default” states for depth-only, opaque, lighting, and transparency rendering. Part of the render state vector is determined here.
Ask the renderer to render some list of visible objects which survived culling. Shader modules are known at this level, and some render state might be per-material, like two-sided rendering, etc.

There are a few ways to make this work, but somewhere you must have higher-level knowledge of which shader modules are used in which render passes. If an application has a baking step during build, that might be a nice place to do it, but not all graphics API use cases work this way. Emulation comes to mind where you cannot know what an application will do until you execute it. User scripting could be a nightmare as well …

Render passes also have a lot of combinatorial explosion. If we just change from MSAA 2x to MSAA 4x, that means new render passes, and new pipelines which are compatible with those render passes. Clearly we see that something trivial like changing a setting in the options menu of most games will trickle down into a completely different set of pipelines for all materials. This kind of coupling isn’t what I call clean, but sometimes sanity must be sacrificed for performance. I’d prefer to keep my sanity.

Fixed-function vertex bindings

This consists of attributes, bindings, strides and input rates. This one is usually not a problem if you control the asset pipeline. You can decide on a “standard” vertex buffer layout and forget about it. There is some slight annoyance here if we want to support glTF or other scene transmission formats unless we’re prepared to rewrite all vertex buffers to match the standard layout.

Shader compilers like to know about this information since some ISAs need to fetch vertices in software, and therefore need to be able to compute correct offsets based on VertexIndex/InstanceIndex.

10 – Fixed-function render state

When rendering triangles in Vulkan, there is still a ton of state to deal with. Vulkan takes all the gunk you’d set in glEnable/glDisable and various other functions and bundles it together into one massive struct. I wrote up a sample which demonstrates how render state is set, saved and restored.

I have to admit I kinda like the old-school way of setting state individually. Isolating render state to a command buffer avoids almost all the horrifying issues with state management in OpenGL. In GL, the state is global, and leaked between modules and render passes. This is really scary, and you’re basically forced to make a custom state tracker on top of GL to keep yourself sane. There was also no good way of “saving” just the state you cared about and restoring it without writing a lot of custom code. I like the idea of setting some “standard” state which clears out any possible leakage of state. Overall, Granite’s model is maximum convenience.

A concept I’ve seen in other projects is the idea of creating big structures on the user side which mimic a pipeline, but I don’t think this is very useful unless it’s basically a full VkGraphicsPipelineCreateInfo with all the bells and whistles. If we don’t, we still don’t have the information we need to create a pipeline anyways, like render pass information for example, and we’re back to hashing with lazy creation.

Even just render state tends to be split in two halves for me. Some state tends to be “global” in nature and some tends to be “local”. The global state is set by the higher-level renderer, which thinks in terms of:

Opaque pass vs transparent pass (alpha blending)
Depth-only? (depth write enable, depth bias?, equal test?)
Lighting pass? (additive blending?)
Stencil? (for deferred)

This state is saved and restored as necessary, then we have the objects which are rendered in a render pass which typically think in terms of:

Two sided mesh? (face culling)
Primitive restart?
Topology?
Shader program?
Vertex attributes?

I don’t like to couple these parts of the renderer together, so a tightly packed blob of state in Vulkan::CommandBuffer does the job for me. At the end of the day, the only real cost of this flexibility is some extra hashing cost. It doesn’t light up in the profile for me.
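As a rough sketch of what a tightly packed state blob plus hashing can look like: the field names, bit encodings and the choice of FNV-1a below are all assumptions made for illustration, not Granite's actual layout or hash function. The point is only that "global" and "local" state land in one small word which, together with the program and render pass, forms the lookup key for a pipeline.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative state only; real render state has far more fields.
struct RenderState {
    bool depth_test = true;
    bool depth_write = true;
    bool blend_enable = false;
    bool primitive_restart = false;
    uint8_t cull_mode = 1; // made-up encoding: 0 = none, 1 = back, 2 = front
    uint8_t topology = 0;  // made-up encoding, must fit in 4 bits
};

// Pack the state into a single word so it is cheap to hash and compare.
inline uint32_t pack_state(const RenderState &s) {
    assert(s.cull_mode < 4 && s.topology < 16);
    return uint32_t(s.depth_test) |
           (uint32_t(s.depth_write) << 1) |
           (uint32_t(s.blend_enable) << 2) |
           (uint32_t(s.primitive_restart) << 3) |
           (uint32_t(s.cull_mode) << 4) |
           (uint32_t(s.topology) << 6);
}

// FNV-1a over the 8 bytes of one 64-bit word.
inline uint64_t fnv1a_word(uint64_t hash, uint64_t word) {
    for (int i = 0; i < 8; i++) {
        hash ^= (word >> (8 * i)) & 0xffu;
        hash *= 0x100000001b3ull;
    }
    return hash;
}

// program_id and render_pass_id are hypothetical identifiers for the
// shader program and the compatible render pass.
inline uint64_t pipeline_key(const RenderState &s, uint64_t program_id,
                             uint64_t render_pass_id) {
    uint64_t hash = 0xcbf29ce484222325ull;
    hash = fnv1a_word(hash, pack_state(s));
    hash = fnv1a_word(hash, program_id);
    hash = fnv1a_word(hash, render_pass_id);
    return hash;
}
```

Because the state is a plain value, saving and restoring it is just a copy: the renderer copies the struct before drawing objects which tweak local state, then assigns the copy back.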

Overall, I like the “immediate” nature of the CommandBuffer interface. There’s always a hybrid solution if that is ever needed where I would set the state I’m interested in, then pull out a persistent VkPipeline handle which can be used later and bypasses any hashing of state when bound.

Avoiding stutters

The real problem with lazily creating pipelines is vkCreateGraphicsPipelines in my opinion. Doing this at the last minute is almost a guaranteed hitch, and it should be avoided at all costs. Avoiding last minute pipeline compilation is the real reason we should know all state combinations up front, not because we get to bind VkPipelines directly and avoid some small hashing cost.

My strategy for dealing with this problem is pre-warming the hashmaps with previously seen data. Granite integrates the Fossilize project to solve the problem of serializing all information needed to create pipelines in a GPU and driver independent way. In theory, I would be able to ship a Fossilize database as part of an application and use that to pre-warm all historically observed pipelines and their dependent objects at Vulkan::Device creation time.
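The pre-warming idea can be sketched with a toy cache. Note the simplification: Fossilize serializes the full information needed to create pipelines, whereas this sketch only records opaque keys, and `Pipeline` is a stand-in for the result of an expensive vkCreateGraphicsPipelines call. `PipelineCache` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stand-in for a compiled pipeline.
struct Pipeline {
    uint64_t key;
};

class PipelineCache {
public:
    // Replay keys recorded by a previous run, e.g. at device creation time.
    void prewarm(const std::vector<uint64_t> &recorded_keys) {
        for (uint64_t key : recorded_keys)
            get_or_create(key, false);
    }

    // Called at draw time with the hash of the current state.
    const Pipeline &request(uint64_t key) {
        return get_or_create(key, true);
    }

    // Every key seen so far, to be written out for the next run.
    std::vector<uint64_t> recorded_keys() const {
        std::vector<uint64_t> keys;
        for (const auto &entry : pipelines_)
            keys.push_back(entry.first);
        return keys;
    }

    // Number of compiles which happened at draw time (potential hitches).
    unsigned late_compiles() const { return late_compiles_; }

private:
    const Pipeline &get_or_create(uint64_t key, bool at_draw_time) {
        auto it = pipelines_.find(key);
        if (it == pipelines_.end()) {
            if (at_draw_time)
                late_compiles_++;
            it = pipelines_.emplace(key, Pipeline{key}).first;
            assert(it->second.key == key);
        }
        return it->second;
    }

    std::unordered_map<uint64_t, Pipeline> pipelines_;
    unsigned late_compiles_ = 0;
};
```

A second run which pre-warms from the first run's recorded keys only pays a draw-time compile for state combinations it has never seen before.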

To my knowledge, this is basically how all GL and D3D11 drivers behave. Cache all the things.

Conclusion

Granite’s render state management is old-school, but I like it. Pre-warming the various hashmaps in Vulkan::Device is the strategy I used to avoid any pipeline compilation stutters.

There are many alternatives for any graphics API abstraction. There are things I like in legacy APIs, and things I hate. I wanted to keep the parts I liked, and avoid the parts I disliked.

… that’s all folks!

I think this is the end of this series for now. I’ve gone over the Vulkan backend in broad strokes, and I hope it was interesting and useful.

A tour of Granite’s Vulkan backend – Part 5

Render passes and synchronization

This is part 5 in the tour of Granite’s Vulkan backend. We’re going to get knee-deep in the aspects of Vulkan which are the most difficult to learn in my opinion, and getting comfortable with these topics is the real hurdle towards mastery. This level of understanding is something high-level APIs will prevent you from reaching.

This post isn’t intended to be a tutorial on Vulkan synchronization, so I’ll assume some basic level of knowledge.

Render passes

Render passes are a new fundamental part of Vulkan which does not exist in any of the legacy APIs. In older APIs you can freewheel how you render to frame buffers, but that approach was always terrible on tile-based GPUs, and these days with hybrid tilers, it’s probably terrible on desktop as well. Vulkan forces you to think about rendering all you need in one go to a frame buffer and then proceed to the next.

In Granite, I wanted to make sure most of the flexibility and explicitness of Vulkan render passes could be expressed with minimal boilerplate. Most projects don’t seem to pay attention to this part except that it’s something you just have to do, and very few see the benefits they bring. That is probably a reasonable stance for 2019 if you do not care about mobile performance. If you need to target mobile though, it is worth the extra work. As of writing, the feature is quite mobile-centric, but desktop GPUs seem to be inching towards tile-based architectures, so it will be interesting to see if this view on render passes will shift in the future. Even D3D12 recently got render passes too, albeit in a simplified form.

In the most basic form, render passes in Vulkan can be rather daunting to set up, and it’s one of the many battles you have to fight to get hello triangle on screen. If we take a render pass with just one sub-pass (the case we care about 99% of the time), we need to specify up front:

How many attachments?
Which formats are used?
How many MSAA samples?
initialLayout and finalLayout
Which image layouts to use while rendering?
Do we load from memory or clear the attachment on render pass begin?
Do we bother storing the attachments to memory?

Most of this information is boilerplate we can automate, but things like load/clear/store we cannot deduce in the backend before it is too late. Knowing this kind of information up-front can be very beneficial for bandwidth consumption, at least on mobile.

The ugly framebuffer objects

An ugly aspect of Vulkan is the use of VkFramebuffer. I want an API where I just say “start a render pass where we render to these attachments”. Creating “FBOs” up front was really ugly in GL, and I think it’s a bad abstraction to have API users carry around ownership of objects which represent little to no useful work. FBOs are empty husks which might as well just be an array of image views.

We could just create VkFramebuffers every render pass we begin and defer the deletion of them right away, but creating these objects has some cost. There’s a handle allocation in the driver at minimum, and probably a little more on certain drivers. Here I just reuse the temporary hashmap allocator which I introduced in the descriptor set model post. VkFramebuffer objects can be reused over multiple frames, but if they aren’t used for a while, they are just deleted since VkFramebuffer objects are immutable.
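A minimal sketch of such a temporary cache, assuming a simple "idle for N frames" eviction rule. The real allocator in Granite may be organized differently; `TemporaryFramebufferCache`, `Framebuffer` and the key (imagined as a hash of the render pass and attachment views) are illustrative stand-ins, and creation and destruction are only modelled, not actual Vulkan calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Stand-in for a VkFramebuffer handle.
struct Framebuffer {
    uint64_t key;
};

class TemporaryFramebufferCache {
public:
    explicit TemporaryFramebufferCache(unsigned max_idle_frames)
        : max_idle_frames_(max_idle_frames) {}

    // Look up a framebuffer by key, creating it on first use.
    Framebuffer &request(uint64_t key) {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            it = entries_.emplace(key, Entry{Framebuffer{key}, frame_}).first;
            created_++;
        }
        it->second.last_used_frame = frame_;
        return it->second.fb;
    }

    // Call once per frame: delete entries which have been idle for too long.
    void begin_frame() {
        frame_++;
        for (auto it = entries_.begin(); it != entries_.end();) {
            assert(frame_ >= it->second.last_used_frame);
            if (frame_ - it->second.last_used_frame > max_idle_frames_)
                it = entries_.erase(it);
            else
                ++it;
        }
    }

    unsigned created() const { return created_; }
    std::size_t live() const { return entries_.size(); }

private:
    struct Entry {
        Framebuffer fb;
        uint64_t last_used_frame;
    };

    std::unordered_map<uint64_t, Entry> entries_;
    uint64_t frame_ = 0;
    unsigned max_idle_frames_;
    unsigned created_ = 0;
};
```

A render pass which shows up every frame keeps hitting the same entry, while one-off render targets quietly disappear a few frames after their last use.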

Automating VkRenderPass creation

This topic is actually quite complicated when we start diving into the deep end of Vulkan render passes, but we can start with the trivial cases. The API in Granite looks something like:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 1;
rp.color_attachments[0] = &rt->get_view();
rp.store_attachments = 1 << 0;
rp.clear_attachments = 1 << 0;

rp.clear_color[0].float32[0] = 1.0f;
rp.clear_color[0].float32[1] = 0.0f;
rp.clear_color[0].float32[2] = 1.0f;
rp.clear_color[0].float32[3] = 0.0f;

cmd->begin_render_pass(rp);

This is an immediate way of starting a render pass, no reason to create frame buffers up front and all that. VkRenderPass can be created lazily on-demand like everything else.
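How bitmasks like clear_attachments and store_attachments could translate to per-attachment load and store ops can be sketched as follows. This is an assumption about the mapping rather than Granite's code: `deduce_ops` is a hypothetical helper, `load_mask` is an assumed third mask for attachments whose old contents should be preserved, and the enums are local stand-ins for VkAttachmentLoadOp and VkAttachmentStoreOp.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-ins for the Vulkan load/store op enums.
enum class LoadOp { Load, Clear, DontCare };
enum class StoreOp { Store, DontCare };

struct AttachmentOps {
    LoadOp load;
    StoreOp store;
};

// Bit N of each mask refers to color attachment N.
inline AttachmentOps deduce_ops(unsigned index, uint32_t clear_mask,
                                uint32_t load_mask, uint32_t store_mask) {
    assert(index < 32);
    const uint32_t bit = 1u << index;
    // Asking to both clear and load the same attachment is contradictory.
    assert((clear_mask & load_mask & bit) == 0);

    AttachmentOps ops;
    ops.load = LoadOp::DontCare; // neither bit set: old contents are discarded
    if (clear_mask & bit)
        ops.load = LoadOp::Clear;
    else if (load_mask & bit)
        ops.load = LoadOp::Load;
    ops.store = (store_mask & bit) ? StoreOp::Store : StoreOp::DontCare;
    return ops;
}
```

Attachments which are neither loaded nor stored never have to touch main memory on a tiler, which is where the bandwidth savings come from.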

Formats / MSAA sample counts

Render passes need to know formats and sample counts, and since we pass the concrete attachments directly in begin_render_pass(), we have the information we need right here.

Image layouts and VK_SUBPASS_EXTERNAL dependencies

There are three kinds of attachments in Granite:

User-created. These attachments are render targets which are created with Device::create_image() and the backend does not own the resource or know anything about how long the resource will live. Common case for user-created render targets.
WSI images. These images are special since they came from VkSwapchainKHR or some equivalent mechanism. We know that these images are only used for rendering and they are only consumed by the presentation engine, or some other mechanism.
Transient images. Images with transient usage flags only live inside render passes. They cannot be sampled from, their memory does not necessarily exist except in page tables which point to /dev/null. We don’t care what happens to these images once the render pass is done.

To deduce image layouts for a render pass we have a few different paths.

WSI images

I don’t care about preserving WSI images over multiple frames, and I don’t care about sampling from WSI images or any such weird use case after rendering, so the flow of image layouts is:

initialLayout = UNDEFINED (discard)
VkAttachmentReference -> COLOR_ATTACHMENT_OPTIMAL or whatever is required for the subpass
finalLayout = PRESENT_SRC_KHR or whatever layout we need when using external WSI. For something like libretro, this will be SHADER_READ_ONLY_OPTIMAL since the image will be handed off to some other render pass which we don’t know or care about. For headless PNG/video dumping, it might be TRANSFER_SRC_OPTIMAL.

When initialLayout != the layout used in the first subpass, vkCmdBeginRenderPass will actually need to perform a memory barrier implicitly to make this work. The big question is when this memory barrier takes place, and the answer is “as soon as possible” (TOP_OF_PIPE_BIT) if we don’t specify it anywhere. For this case, Granite will add a subpass dependency which waits for VK_SUBPASS_EXTERNAL in the COLOR_ATTACHMENT_OUTPUT stage. This latches onto the WSI acquire semaphore, more on that later.

finalLayout != the last layout used, so we get a transition at the end of the render pass, but we don’t need to care about external subpass dependencies here. The automatically generated one is perfect, and we’re going to use the WSI release semaphore to properly synchronize this image anyways.

When we see a WSI image in a render pass, we can trivially mark this command buffer as “touching WSI”. This will affect command buffer submission later. This is indeed the kind of tracking which I have been arguing against in earlier posts, but it’s so trivial that it’s a no-brainer to me in this case.

Transient images

For transient images, we automate it just like WSI images, except that finalLayout will match last used layout in the render pass. This way we avoid some useless image layout transition at the end of the render pass. Next time we use the image, it’s going to be discarded anyways.

Because we deal with transitions automatically, users can freely pull images from Vulkan::Device with get_transient_attachment(), render to it, and forget about it. This is super convenient for things like MSAA rendering where the multi-sampled attachment just needs to exist temporarily for purposes of being resolved into the single-sampled attachment we care about. Having to care about synchronization for resources we don’t own is weird I think.
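The WSI and transient rules described so far boil down to a small decision function. This is a sketch with local stand-in enums, not Granite's implementation: `deduce_flow` and `wsi_final` are hypothetical names, with `wsi_final` standing for whatever layout the presentation path wants (PRESENT_SRC_KHR for a real swapchain).

```cpp
#include <cassert>

// Local stand-ins for the image layouts mentioned in the text.
enum class Layout {
    Undefined,
    ColorAttachmentOptimal,
    DepthStencilAttachmentOptimal,
    PresentSrc,
    ShaderReadOnlyOptimal,
    TransferSrcOptimal
};

enum class AttachmentKind { WSI, Transient };

struct LayoutFlow {
    Layout initial_layout;
    Layout final_layout;
    bool transition_at_begin; // implicit barrier in vkCmdBeginRenderPass
    bool transition_at_end;   // implicit barrier in vkCmdEndRenderPass
};

// first_use / last_use are the layouts of the attachment in the first and
// last subpass which reference it.
inline LayoutFlow deduce_flow(AttachmentKind kind, Layout first_use,
                              Layout last_use,
                              Layout wsi_final = Layout::PresentSrc) {
    assert(first_use != Layout::Undefined && last_use != Layout::Undefined);

    LayoutFlow flow;
    // Both kinds discard whatever was in the image before.
    flow.initial_layout = Layout::Undefined;
    // WSI images end up where the presentation path wants them;
    // transient images just stay in their last used layout.
    flow.final_layout = kind == AttachmentKind::WSI ? wsi_final : last_use;
    flow.transition_at_begin = flow.initial_layout != first_use;
    flow.transition_at_end = flow.final_layout != last_use;
    return flow;
}
```

Passing a different `wsi_final` covers the libretro and headless cases, where the last transition goes to SHADER_READ_ONLY_OPTIMAL or TRANSFER_SRC_OPTIMAL instead.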

Other images

For any other image, we need to avoid any implicit layout transition, so we simply force initialLayout to match the first use in the render pass, and finalLayout will match the last use. In our small sample, it’s all going to be COLOR_ATTACHMENT_OPTIMAL. It’s up to the API user to know what layouts a render pass will expect, but it’s straightforward to map a render pass to expected layouts. Color attachments are COLOR_ATTACHMENT_OPTIMAL, depth-stencil is DEPTH_STENCIL_ATTACHMENT_OPTIMAL or DEPTH_STENCIL_READ_ONLY_OPTIMAL based on the read/write mode for depth, input attachments are SHADER_READ_ONLY_OPTIMAL, etc. It’s possible to use an attachment for multiple purposes in a subpass, and Granite supports that as well. Some examples:

Color attachment + input attachment: Feedback loop ala GL_ARB_texture_barrier (super useful for certain emulators) -> GENERAL
RW Depth attachment + input attachment (some weird decal algorithm?) -> GENERAL
Read-Only depth + input attachment (deferred shading use case) -> DEPTH_STENCIL_READ_ONLY_OPTIMAL

All of this is analyzed when a newly observed VkRenderPass is created, and subpass dependencies are set up automatically. Anything which happens outside the render pass is the user’s responsibility.
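The mapping from how a subpass uses an attachment to the layout it needs can be written down directly. The following is a sketch of the rules listed above with local stand-in types: `SubpassUse`, `DepthMode` and `layout_for` are hypothetical names and not part of Granite's API.

```cpp
#include <cassert>

// Local stand-ins for the image layouts mentioned in the text.
enum class Layout {
    ColorAttachmentOptimal,
    DepthStencilAttachmentOptimal,
    DepthStencilReadOnlyOptimal,
    ShaderReadOnlyOptimal,
    General
};

enum class DepthMode { None, ReadOnly, ReadWrite };

// How one attachment is referenced within a single subpass.
struct SubpassUse {
    bool color = false;
    bool input = false;
    DepthMode depth = DepthMode::None;
};

inline Layout layout_for(const SubpassUse &use) {
    // An attachment is never both color and depth-stencil,
    // and it has to be used for something.
    assert(!(use.color && use.depth != DepthMode::None));
    assert(use.color || use.input || use.depth != DepthMode::None);

    if (use.color) // color + input is a feedback loop
        return use.input ? Layout::General : Layout::ColorAttachmentOptimal;
    if (use.depth == DepthMode::ReadWrite)
        return use.input ? Layout::General
                         : Layout::DepthStencilAttachmentOptimal;
    if (use.depth == DepthMode::ReadOnly) // with or without input attachment
        return Layout::DepthStencilReadOnlyOptimal;
    return Layout::ShaderReadOnlyOptimal; // plain input attachment
}
```

Anything which is written and read as an input attachment in the same subpass falls back to GENERAL, while read-only depth can stay in its read-only layout.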

08 – Bare-bones “deferred rendering” sample

I made a cut-down sample to show how the API expresses multi-pass in the context of deferred rendering with transient gbuffer + depth. The meat of it is:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 3;
rp.color_attachments[0] = &device.get_swapchain_view();

rp.color_attachments[1] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		0);
rp.color_attachments[2] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		1);

rp.depth_stencil = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		device.get_default_depth_format());

rp.store_attachments = 1 << 0;
rp.clear_attachments = (1 << 0) | (1 << 1) | (1 << 2);
rp.op_flags = Vulkan::RENDER_PASS_OP_CLEAR_DEPTH_STENCIL_BIT;

Vulkan::RenderPassInfo::Subpass subpasses[2];
rp.num_subpasses = 2;
rp.subpasses = subpasses;

rp.clear_depth_stencil.depth = 1.0f;

subpasses[0].num_color_attachments = 2;
subpasses[0].color_attachments[0] = 1;
subpasses[0].color_attachments[1] = 2;

subpasses[0].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadWrite;

subpasses[1].num_color_attachments = 1;
subpasses[1].color_attachments[0] = 0;
subpasses[1].num_input_attachments = 3;
subpasses[1].input_attachments[0] = 1;
subpasses[1].input_attachments[1] = 2;
subpasses[1].input_attachments[2] = 3;  // Depth attachment
subpasses[1].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadOnly;

cmd->begin_render_pass(rp);
// Do work
cmd->next_subpass();
// Do work
cmd->end_render_pass();

See code comments in sample for more detail. To write this kind of sample in raw Vulkan would be almost a full day’s project.

Synchronization

Unlike many aspects of Granite which are reasonably high-level, synchronization in Granite is almost 100% explicit. The general philosophy of Granite is that excessive tracking of resource use is a no-no, unless it is trivial to do so (e.g. WSI images). Synchronization is a case where you need a lot of tracking to do a good job, and it is impossible to do a perfect job since you end up relying on heuristics, at least if you are to implement automatic synchronization on top of Vulkan. GL and D3D11 drivers have an advantage here since they can tap into GPU-specific synchronization mechanisms which might be better suited for implicit synchronization. A good example here is the i915 driver stack in the Linux DRM stack which can do automatic synchronization in kernel space somehow. I’m sure that simplifies the Mesa GL driver a lot, but I don’t know the details.

Let’s go through a thought experiment where we look at the big problems we run into if we are to implement a fully automatic barrier system. (I have tried :p)

Problem #1 – We cannot rewind time

When touching a resource, we must ask ourselves: “When and where was this resource accessed last?” We have three potential solutions to resolve a hazard:

Pipeline barrier (was used just now)
Event (was used some time ago in this queue)
Semaphore (was used in a different queue)

Ideally, we need to inject a barrier at the exact point where a resource was last used, but we cannot inject new commands in the middle of a command buffer which has already been recorded.

There is no winning this fight. Either we eagerly inject barriers after every command in the hope that some future command will need to synchronize against it (VkEvent is nice here), or we inject barriers too late, stalling the GPU needlessly.

Eagerly injecting barriers is pure insanity if we take into account that the resource might be used on a different queue in the future. That means signalling a heavy semaphore after every render pass or command. We could simply ignore supporting multiple queues, but that’s a huge compromise to make.

Problem #2 – Redundant tracking of read-only resources

A problem I found while trying to implement automatic barrier tracking was that static resources might be written in the future, so we need to keep track of them anyway. This is a waste of CPU time. It might be possible to explicitly mark these resources as “do not track, they’re truly static, pinky promise”, but I feel this is bolting on hacks.

Problem #3 – Multi-threading

The question of “where was this resource touched last” might not actually be possible to answer in a multi-threaded scenario. If we are recording command buffers in parallel, the backend has no idea what is going on until we serialize execution in vkQueueSubmit. A common solution I have seen for this problem is to resolve synchronization internally in command buffers as they are recorded, and on command buffer submission time, we can look at all used resources and resolve any cross-command buffer synchronization points right before submitting the command buffers in vkQueueSubmit. The complexity starts to shoot through the roof though. That’s a good sign we need to rethink.

I think this is the kind of solution you end up with when you have no choice but to port some old legacy API to Vulkan, and breaking the abstraction API is not an option.

Render graphs

A Vulkan backend which solves synchronization can only look back in time and deal with hazards at the last minute, but that is only because we do not have any context about what the application is doing. At a higher level, we can know what is going to happen in the future, and we can make automated decisions at that level, where we actually have context about what is going on. This is another reason why I do not want to have automatic synchronization in a Vulkan backend. Either we get a sub-optimal solution, or we try to close the gap with heuristics, but now run-time behavior is completely non-deterministic. We should aim for something better.

I believe the synchronization problem should be solved at a higher level, like a render graph. I wrote a blog post some time ago about this topic: http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

Signalling fences

Granite’s way of signalling fences is very similar to plain Vulkan.

Vulkan::Fence fence;
device.submit(cmd, &fence);

// Somewhere else, potentially in a different thread.
fence->wait();
// fence ref-count goes to 0, queued up for recycling.

There is a pool of VkFence objects which can be reused. Signalling a fence forces a vkQueueSubmit. Once the lifetime of a Vulkan::Fence ends it is recycled back to the frame context. Nothing out of the ordinary here.
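
As a rough sketch of the recycling idea, here is a pool that hands out reference-counted handles whose deleter returns the object to the pool instead of destroying it. Everything here is hypothetical: RawFence stands in for a VkFence, and the pool recycles immediately rather than going through a frame context as Granite does. It is also not thread-safe.

```cpp
#include <cassert>
#include <memory>
#include <stddef.h>
#include <vector>

// Stand-in for a VkFence (hypothetical).
struct RawFence
{
	bool signalled = false;
};

class FencePool
{
public:
	using Handle = std::shared_ptr<RawFence>;

	~FencePool()
	{
		// Every handle must have been released before the pool dies.
		assert(free_.size() == storage_.size());
	}

	Handle request()
	{
		RawFence *fence = nullptr;
		if (!free_.empty())
		{
			fence = free_.back();
			free_.pop_back();
		}
		else
		{
			storage_.emplace_back(new RawFence);
			fence = storage_.back().get();
		}

		fence->signalled = false; // plays the role of vkResetFences
		// The deleter recycles the fence instead of destroying it.
		return Handle(fence, [this](RawFence *f) { free_.push_back(f); });
	}

	size_t allocated() const { return storage_.size(); }
	size_t available() const { return free_.size(); }

private:
	std::vector<std::unique_ptr<RawFence>> storage_;
	std::vector<RawFence *> free_;
};
```

When the last handle goes out of scope, the fence lands back in the free list and the next request() reuses it without creating a new object.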

Semaphores

Semaphores work very similarly to fences and are requested in Device::submit() like fences. Like Vulkan, they have a restriction that they can only be waited on once. Semaphores can be waited on in other queues with Device::add_wait_semaphore() in a particular queue and pipeline stage. This matches pWaitDstStageMask in VkSubmitInfo. Semaphores are also recycled like fences.

Events

Events can be signalled and later waited on in the same queue. Again, we have a pool of VkEvent objects. CommandBuffer::signal_event() requests a fresh event, signals it with vkCmdSetEvent and hands it to the user. VkEvents are recycled using the frame context. There is a similar CommandBuffer::wait_event() which maps 1:1 to vkCmdWaitEvents.

Barriers

Granite has many different methods to inject pipeline barriers, the most common ones are:

cmd->barrier(srcStage, srcAccess, dstStage, dstAccess);

which maps to a vkCmdPipelineBarrier with a VkMemoryBarrier, and image barriers which act on all subresources:

cmd->image_barrier(image, oldLayout, newLayout, srcStage, srcAccess, dstStage, dstAccess);

There are cases where we want to batch barriers or otherwise use more complicated commands than this, so there are also 1:1 interfaces to vkCmdPipelineBarrier where the full structures are passed in, but these are only really used by the render graph since writing out full structures is super painful.

The automatic barriers in Granite

There are a few instances where I think having automatic barriers makes sense. These are cases where it’s convenient to do so, and there is no tracking required, so we can resolve all hazards right away and forget about it. Some of them we’ve seen already, like WSI images and transient images in render passes.

The other major case is static read-only resources. If you pass in initial_data to Device::create_buffer() or Device::create_image(), we generally have a desire to upload some data, and never touch it again.

The general gist of it is that we can upload data with a staging buffer over the transfer queue and inject semaphores which block all possible pipeline stages (based on bufferUsage/imageUsage flags). The downside is that we might end up creating too many submissions if we somehow want to upload a ridiculous amount of buffers or images in one go, but we can opt-out of this automatic behavior by simply not passing initial_data and do all the batching and synchronization work ourselves.
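
To illustrate what “all possible pipeline stages based on usage flags” could look like, here is a minimal sketch. The flag values and the possible_stages() helper are hypothetical and much coarser than the real thing. Actual code would translate VkBufferUsageFlags into VkPipelineStageFlags.

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical flag values; real code would use VkBufferUsageFlags and
// VkPipelineStageFlags from the Vulkan headers.
enum Usage : uint32_t
{
	USAGE_VERTEX_BUFFER = 1u << 0,
	USAGE_INDEX_BUFFER = 1u << 1,
	USAGE_UNIFORM_BUFFER = 1u << 2,
	USAGE_STORAGE_BUFFER = 1u << 3,
	USAGE_INDIRECT_BUFFER = 1u << 4,
	USAGE_TRANSFER = 1u << 5
};

enum Stage : uint32_t
{
	STAGE_DRAW_INDIRECT = 1u << 0,
	STAGE_VERTEX_INPUT = 1u << 1,
	STAGE_VERTEX_SHADER = 1u << 2,
	STAGE_FRAGMENT_SHADER = 1u << 3,
	STAGE_COMPUTE_SHADER = 1u << 4,
	STAGE_TRANSFER = 1u << 5
};

// Returns every stage which could legally consume a buffer with this usage.
inline uint32_t possible_stages(uint32_t usage)
{
	assert(usage != 0);

	uint32_t stages = 0;
	if (usage & (USAGE_VERTEX_BUFFER | USAGE_INDEX_BUFFER))
		stages |= STAGE_VERTEX_INPUT;
	if (usage & USAGE_INDIRECT_BUFFER)
		stages |= STAGE_DRAW_INDIRECT;
	if (usage & (USAGE_UNIFORM_BUFFER | USAGE_STORAGE_BUFFER))
		stages |= STAGE_VERTEX_SHADER | STAGE_FRAGMENT_SHADER | STAGE_COMPUTE_SHADER;
	if (usage & USAGE_TRANSFER)
		stages |= STAGE_TRANSFER;
	return stages;
}
```

The resulting mask is conservative by construction: it blocks every stage that might read the resource, so no further tracking is needed after the upload.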

The end goal is that we should be able to call create_buffer or create_image and just use the static resource right away without having to think about synchronization at all.

09 – Rendering to image and reading it back to CPU on transfer queue

I wrote a sample which flexes most of the synchronization APIs. It renders a small 4×4 texture on the graphics queue, synchronizes that with the transfer queue with a semaphore and reads it back to a CACHED host buffer. We spawn threads which wait on a fence, map the buffer and read the results.

Conclusion

In these parts of the backend, the low-level explicit nature of Vulkan shines through. I think we have to be fairly low-level, or we inherit most of the problems with the older APIs.

… up next!

In the next installment, we’ll have a look at pipeline creation.

A tour of Granite’s Vulkan backend – Part 4

Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. It’s not just uniform buffers, it’s fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL_ARB_buffer_storage (core since GL 4.4), which supports persistently mapped buffers like Vulkan, but relying on GL 4.4+ (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite’s main codebase normally uses GLFW, so to get a less redundant sample working, I wrote this sample to use SDL2’s Vulkan support, which is very similar to GLFW’s support starting with SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, the WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us and owns a VkSwapchainKHR. The only thing it cannot do on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2’s built-in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially, and automatically handling synchronization for WSI resources is straightforward to the point that there is no point in exposing that to the user. (Synchronization in Granite is in general very explicit, but WSI is one of few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
device.submit(cmd);
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI code is a must to abstract in Vulkan, and I’m happy with the flexibility and simplicity in use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, we’re getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. This is the perfect allocator for when we want to allocate a lot of small blobs, and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset to 0 again, i.e. all memory in one go is just “winked away”.
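
A minimal sketch of such an allocator over a fixed block of memory is shown below. The class and method names are illustrative only and do not mirror Granite's implementation. The sketch assumes the base pointer is already aligned to the largest alignment that will be requested.

```cpp
#include <cassert>
#include <stddef.h>

// Minimal bump allocator over a fixed-size block. Names are illustrative.
class LinearAllocator
{
public:
	LinearAllocator(void *base, size_t size)
		: base_(static_cast<unsigned char *>(base)), size_(size)
	{
	}

	// Returns nullptr when the block is exhausted.
	void *allocate(size_t size, size_t alignment)
	{
		// Alignment must be a non-zero power of two.
		assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

		// Round the current offset up to the requested alignment.
		size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
		if (aligned > size_ || size > size_ - aligned)
			return nullptr;

		offset_ = aligned + size;
		return base_ + aligned;
	}

	// "Wink away" every allocation in one go.
	void reset()
	{
		offset_ = 0;
	}

	size_t get_offset() const
	{
		return offset_;
	}

private:
	unsigned char *base_;
	size_t size_;
	size_t offset_ = 0;
};
```

Allocation is an add and a mask, and freeing is a single store, which is why this pattern suits per-frame scratch data so well.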

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight and once exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious limit to use (very content and scene dependent). I opted for a recycled pool of smaller buffers instead since Granite’s descriptor binding model is very flexible as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.
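
The recycling scheme can be sketched like this. It is a simplified model under stated assumptions: Block stands in for a small GPU buffer, BlockPool and its methods are hypothetical names, and begin_frame() is assumed to be called only after the fences for that frame context have been waited on.

```cpp
#include <cassert>
#include <stddef.h>
#include <vector>

// Stand-in for a small GPU buffer plus its bump offset (hypothetical).
struct Block
{
	size_t capacity = 0;
	size_t offset = 0;
};

class BlockPool
{
public:
	BlockPool(size_t block_size, unsigned num_frames)
		: block_size_(block_size), pending_(num_frames)
	{
		assert(num_frames != 0);
	}

	// Hand out a recycled block if one is ready, otherwise create a new one.
	Block request()
	{
		if (!ready_.empty())
		{
			Block block = ready_.back();
			ready_.pop_back();
			return block;
		}

		Block block;
		block.capacity = block_size_;
		total_created_++;
		return block;
	}

	// Called on submission or when the block is exhausted. The GPU may
	// still be reading it, so park it on the current frame context.
	void recycle(const Block &block, unsigned frame_index)
	{
		pending_[frame_index % pending_.size()].push_back(block);
	}

	// Called when a frame context is reused, i.e. after its GPU work is
	// known to be complete. Parked blocks become available again.
	void begin_frame(unsigned frame_index)
	{
		auto &list = pending_[frame_index % pending_.size()];
		for (auto &block : list)
		{
			block.offset = 0;
			ready_.push_back(block);
		}
		list.clear();
	}

	size_t total_created() const { return total_created_; }

private:
	size_t block_size_;
	std::vector<std::vector<Block>> pending_;
	std::vector<Block> ready_;
	size_t total_created_ = 0;
};
```

In steady state the pool stops growing, since every frame returns roughly as many blocks as the next one requests.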

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory can be accessed over PCI-e at a far slower rate. For staging data like vertex, index and uniform buffers, it might be reasonable to assume that we should copy the CPU-side data over to the GPU side and let the GPU consume the streamed data in fast VRAM. Granite supports two modes: one where we let the GPU read data read-only directly from HOST_VISIBLE memory, and one where we automatically perform staging buffer copies over to GPU from the CPU buffer.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on persistently allocated buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.
cmd->draw_indexed();

// Pointers are now invalidated.
device.submit(cmd);

The allocation functions are just light wrappers which allocate, and bind the buffer at the appropriate offset. It’s perfectly possible to roll your own linear allocation system, e.g. you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.

Conclusion

I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.

A tour of Granite’s Vulkan backend – Part 3

Shaders and descriptor sets

This is part 3 of a blog series I’m writing on Granite’s Vulkan backend. In this episode we are looking at how we deal with shaders and descriptor sets. At this point in our design process, there are many, many choices to make. Descriptor sets especially need to be carefully considered.

Hash all the things

A theme we start to see now is hashmaps and lazy creation of objects. One thing you run into with Vulkan’s pipeline-related types is how much work it is to be explicit all the time. The amount of information we need to provide is staggering. I believe it is not healthy for mind and soul to work at low levels here except in special cases, and we should aggressively hide away detail where we can. There is naturally a clock cycles vs. sanity tradeoff to be made here.

You can argue that the lines between high-level GL/D3D11-style design and Granite’s model are quite blurred. The (mental) price to pay to be explicit is just not worth it in my opinion. I will try to explore the obvious alternatives here and provide more context why the design is the way it is.

04 – Shaders and pipeline layouts

The first step in creating a pipeline is of course, to create a VkShaderModule from our SPIR-V code. This is a no-brainer, but next we need a pipeline layout, which in turn requires VkDescriptorSetLayouts. The sample is here https://github.com/Themaister/Granite-MicroSamples/blob/master/04_shaders_and_programs.cpp.

Rather than manually declaring the pipeline layout like a caveman I think using reflection to automatically generate layouts is a good idea. There is no reason for users to copy information which exists in the shaders already. For the reflection, I use SPIRV-Cross. If we never need to compile SPIR-V in runtime (game engine scenario), there is no reason why we cannot shift the reflection step to off-line as well and just pass the side-band data along to remove a runtime dependency. I never got as far as building a nice off-line SPIR-V baking pipeline, so I just compile GLSL on the fly with shaderc. However, the interface in the Vulkan backend just consumes raw SPIR-V.

A common mistake beginners tend to make is to think that names are important in binding interfaces. This is a mistake carried over from the GL and D3D11 days. The only things we should care about are descriptor sets, bindings and location decorations as well as push constant use. This is the only semantic information we need to create binding interfaces, i.e. pipeline layouts and pipelines.

A pipeline layout in Vulkan needs to know all shader stages a-la GL programs, so we also need a step to combine shaders into a Vulkan::Program. Here we take the union of reflection information and request handles for Vulkan::DescriptorSetAllocator and Vulkan::PipelineLayout. This is hashed, but there is no performance concern here since we should do all of this work at load time when possible anyways. These handles are all owned internally in Vulkan::Device, and there is no reason to worry about object lifetime for these objects.
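
The “union of reflection information, then hash” step might look something like the sketch below. The ShaderLayout struct is a hypothetical, heavily simplified model of per-stage reflection output, and the hash is an FNV-1a style mix over 32-bit words chosen purely for illustration.

```cpp
#include <cassert>
#include <stdint.h>

constexpr unsigned NumSets = 4;

// Per-stage reflection output (hypothetical, heavily simplified).
struct ShaderLayout
{
	uint32_t uniform_buffer_mask[NumSets] = {};
	uint32_t sampled_image_mask[NumSets] = {};
	uint32_t push_constant_size = 0;
};

// Union of two stages, which is what a pipeline layout needs.
inline ShaderLayout combine(const ShaderLayout &a, const ShaderLayout &b)
{
	ShaderLayout out;
	for (unsigned set = 0; set < NumSets; set++)
	{
		out.uniform_buffer_mask[set] = a.uniform_buffer_mask[set] | b.uniform_buffer_mask[set];
		out.sampled_image_mask[set] = a.sampled_image_mask[set] | b.sampled_image_mask[set];
		// In this simplified model a binding has exactly one descriptor type.
		assert((out.uniform_buffer_mask[set] & out.sampled_image_mask[set]) == 0);
	}
	out.push_constant_size =
	    a.push_constant_size > b.push_constant_size ? a.push_constant_size : b.push_constant_size;
	return out;
}

// The hash serves as the lookup key for the lazily created pipeline layout.
inline uint64_t hash_layout(const ShaderLayout &layout)
{
	uint64_t h = 0xcbf29ce484222325ull;
	auto mix = [&h](uint32_t word) { h = (h ^ word) * 0x100000001b3ull; };

	for (unsigned set = 0; set < NumSets; set++)
		mix(layout.uniform_buffer_mask[set]);
	for (unsigned set = 0; set < NumSets; set++)
		mix(layout.sampled_image_mask[set]);
	mix(layout.push_constant_size);
	return h;
}
```

Two programs whose shaders use the same sets, bindings and push constant range produce the same key, so they can share one pipeline layout object regardless of resource names.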

I don’t think there is a reason to deviate far from this design unless you have a very specific scheme in mind with descriptor set allocation. As I’ll explore later, using bindless descriptors extensions or explicit descriptor set allocation could motivate use of a “standard” pipeline layout, in which case reflection gets kind of meaningless anyways.

05 – The binding model – embracing laziness

I never really had a problem with the old-school way of binding resources to binding slots. It just isn’t the part of the old APIs I felt were lacking, so Granite is kind of old school here, but it does have full consideration for descriptor sets and I removed any impedance mismatch with Vulkan (i.e. no translation needed to bridge between Granite and Vulkan). E.g.:

cmd->set_storage_buffer(set, binding, *resource);
cmd->set_texture(set, binding, resource->get_view(), Vulkan::StockSampler::LinearClamp);

The old binding models in GL/D3D11 have flat binding spaces with no separation of per-frame, per-material, or per-draw bindings. In Granite I wanted to take full advantage of the descriptor set feature where we can assign some kind of “frequency” and relation between bindings. Here is an example to illustrate how it is used: https://github.com/Themaister/Granite-MicroSamples/blob/master/05_descriptor_sets_and_binding_model.cpp.

At draw time, we can use the current pipeline layout and pull in the binding points which are active and make sure we bind descriptor sets with the correct resources. This is actually hot code, so I spent time designing a nice system here which tries to be as optimal as possible, given these restrictions.

Because of mobile, we need some conservative limits. I use 4 descriptor sets and 16 (dense) binding points per set (minimum spec of Vulkan). This allows for fairly compact pipeline layout descriptions, and we can loop over bitsets to look at resources. This is also just fine for my use cases.
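
Looping over bitsets could be sketched as below. With at most 16 bindings per set, one 32-bit mask per descriptor type is enough to describe a set. The for_each_bit() helper, the SetLayout struct and count_bindings() are written for this illustration. A production version would likely use a count-trailing-zeros intrinsic instead of the inner loop.

```cpp
#include <cassert>
#include <stdint.h>

constexpr unsigned NumBindings = 16;

// Calls func(bit_index) for every set bit in mask, lowest bit first.
template <typename Func>
inline void for_each_bit(uint32_t mask, const Func &func)
{
	while (mask)
	{
		uint32_t bit = 0;
		while (!((mask >> bit) & 1u))
			bit++;
		func(bit);
		mask &= mask - 1; // clear the lowest set bit
	}
}

// One mask per descriptor type for a single set (hypothetical, simplified).
struct SetLayout
{
	uint32_t sampled_image_mask = 0;
	uint32_t uniform_buffer_mask = 0;
};

inline unsigned count_bindings(const SetLayout &layout)
{
	uint32_t active = layout.sampled_image_mask | layout.uniform_buffer_mask;
	// 16 bindings per set means only the low 16 bits may be used.
	assert((active >> NumBindings) == 0);

	unsigned count = 0;
	for_each_bit(active, [&](uint32_t) { count++; });
	return count;
}
```

Only active bindings are ever visited, so a sparse set costs almost nothing to walk at draw time.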

When it comes to allocation of descriptor sets themselves, I think I have a very different approach to most. A Vulkan::DescriptorSetAllocator is represented as:

The VkDescriptorSetLayout
A bunch of VkDescriptorPools which can only allocate VkDescriptorSets of this set layout. Pools are added on-demand.
A pool of unused VkDescriptorSets which are already allocated and can be freely updated.
A temporary hashmap which keeps track of which descriptor sets have been requested recently. This allows us to reuse descriptor sets directly. In the ideal case, we almost never actually need to call vkUpdateDescriptorSets. We end up with hash -> get VkDescriptorSet -> vkCmdBindDescriptorSets. When a descriptor set has not been used for a couple of frames (8), we assume that it is no longer relevant, and the set is recycled, so some other descriptor set can reuse it and just call vkUpdateDescriptorSets. We definitely do not want to keep track of when any buffer or image resource is destroyed, and recycle early. That’s tracking hell which slows everything down.
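
The aging behavior of that last component can be sketched as follows. This is not the temporary hashmap from Granite, only a simplified model of the idea: AgedCache is a hypothetical name, T stands in for a VkDescriptorSet handle, and eviction scans every entry per frame where a real implementation would keep per-frame lists.

```cpp
#include <cassert>
#include <stdint.h>
#include <unordered_map>
#include <vector>

// Sketch of a frame-aged cache. T stands in for a VkDescriptorSet handle.
template <typename T, unsigned RingSize>
class AgedCache
{
public:
	// Advance one frame. Entries which have gone unused for RingSize
	// frames are evicted and their objects become available for reuse.
	void begin_frame()
	{
		frame_++;
		for (auto itr = entries_.begin(); itr != entries_.end();)
		{
			assert(frame_ >= itr->second.last_used);
			if (frame_ - itr->second.last_used >= RingSize)
			{
				vacant_.push_back(itr->second.object);
				itr = entries_.erase(itr);
			}
			else
				++itr;
		}
	}

	// Returns true on a cache hit. On a miss, object is a recycled or
	// newly created one which the caller must update before binding it.
	template <typename Factory>
	bool request(uint64_t hash, T &object, const Factory &create)
	{
		auto itr = entries_.find(hash);
		if (itr != entries_.end())
		{
			itr->second.last_used = frame_;
			object = itr->second.object;
			return true;
		}

		if (!vacant_.empty())
		{
			object = vacant_.back();
			vacant_.pop_back();
		}
		else
			object = create();

		entries_[hash] = Entry{ object, frame_ };
		return false;
	}

private:
	struct Entry
	{
		T object;
		uint64_t last_used;
	};

	std::unordered_map<uint64_t, Entry> entries_;
	std::vector<T> vacant_;
	uint64_t frame_ = 0;
};
```

A hit means the set can be bound as-is. A miss means the caller gets a set to write with vkUpdateDescriptorSets, either recycled from an evicted entry or freshly allocated.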

The temporary hashmap is a data structure I’m quite happy with. It’s used for a few other resources as well. See https://github.com/Themaister/Granite/blob/master/util/temporary_hashmap.hpp for the implementation.

On certain GPUs, allocating descriptor sets is, or at least used to be, very costly. The descriptor pools might not be implemented as true pools (sigh …), so every vkAllocateDescriptorSets would mean a global heap allocation, absolutely horrible for performance. This is the reason I’m not a big fan of the “one large pool” design. In this model, we just allocate a massive VkDescriptorPool, and we just allocate from that, for any kind of descriptor set. This means recycling VkDescriptorSet handles over many frames is impractical. The intended use pattern is to call vkResetDescriptorPool and allocate new descriptor sets which are only valid for one frame at a time, just like command buffers. There is also the problem of knowing how to balance the descriptor load for these massive pools, what’s the ratio of image descriptors vs uniform buffer descriptors, etc. With per-descriptor set layout allocators, there is zero guesswork involved.

Alternative design – Bindless

Bindless is all the rage right now. The only real complaint I have is that it’s only supported on desktop and requires an EXT extension. It also means writing shaders in a very specific way. I don’t really need it for my use cases, but bindless enables certain complex algorithms which benefit from accessing a huge set of resources dynamically.

Alternative design – persistent explicit VkDescriptorSets

An alternative is exposing descriptor sets directly and only allowing users to bind descriptor sets rather than individual resources. The API user would need to build the sets manually. While this is an idea, I think there are too many hurdles to make it practical.

We need to know and declare the target imageLayout of textures up front. This is obvious 99% of the time (e.g. a group of material textures which are SHADER_READ_ONLY_OPTIMAL), but in certain cases, especially with depth textures, things can get rather ambiguous. This does seem to me like an API design fault. It is unclear why this information is needed.
Some resources are completely transient in nature and it does not make sense to place them in persistent descriptor sets. The perfect example here is uniform buffers. In later samples, we’ll look at the linear allocator system for transient data.
Some resources depend on the frame buffer, i.e. input attachments. Baking descriptor sets for these resources is not obvious, since we need to know the combination pipeline layout + frame buffer, which should have nothing to do with each other.
We need to know the descriptor set layout (and by extension, the shaders as well) up-front. This is problematic if resources are to be used in more than one shader. The common fix here is to settle on a “standard” pipeline layout so we can decouple shaders and resources. This usually means a lot of padding and redundant descriptor allocations instead. We have a limited amount of descriptor sets when targeting mobile (4). We do not have the luxury of splitting every individual “group” of resources into their own sets, some combinatorial effects are inevitable, making persistent descriptor sets less practical. On desktop, 8 sets is the norm, so that might be something to consider.
Hybrid solutions are possible, but complexity is increased for little obvious gain.

Conclusion

I’m happy with my design. It’s very easy to use, but there is a CPU price I’m willing to pay, and I honestly never saw it in the profiler. I think resource binding models are cases where shaving overhead away will shave your sanity away as well, at least if you want to be compatible with a wide range of hardware. It’s much easier if you only cater to high-end desktop where bindless can be deployed.

… up next!

Next up we will explore the linear allocators for uniform, vertex, index and staging data.