Clustered shading evolution in Granite

Shading many lights in a 3D engine is kinda hard once you step outside the bubble of classic deferred shading. Granite supports a fair number of different ways to do lighting, mostly because I like to experiment with different rendering structures.

I presented some work on this topic at SIGGRAPH 2018 in the Moving Mobile Graphics course when I still worked at Arm: https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-20-66/5127.siggraph_2D00_2018_2D00_mmg_2D00_3_2D00_rendering_2D00_hanskristian.pdf. Unfortunately, some slides have videos embedded, and they don’t seem to have made the transition to PDF unscathed.

Since then I’ve looked into VK_EXT_descriptor_indexing in order to remove some critical limitations of my older implementation, and I ended up with some rather uncommon implementation details.

Classic deferred

Just to summarize what I mean by this: the good old method of rendering light volumes on top of a G-buffer, where light is accumulated per-light by blending on top of the frame buffer. This method is considered completely obsolete on desktop these days, but it’s still quite viable on mobile with on-chip G-buffers.

Where classic deferred breaks down

  • Very high bandwidth cost on desktop due to fill-rate
  • No forward shading support (well, duh)
  • No transparent objects
  • No volumetric lighting

Thus, I explored some alternatives …

Why I don’t like tile deferred shading

This is a somewhat overloaded term, but this is the method where you send the G-buffer to a compute shader. The G-buffer is split into tiles, assigned to a workgroup, depth range (or multiple ranges) is found for the tile, which then forms a frustum. Lights are then culled against this frustum, and shaded in one go. This technique was a really great fit for the PS3 SPUs, as shown by Battlefield 3 back in the day.

It still doesn’t really solve the underlying issues of classic deferred except that it is far more bandwidth efficient on desktop-class GPUs.

Forward shading isn’t feasible unless you split the algorithm into multiple stages with Z-prepass -> build light list per tile -> resubmit geometry and shade, but then it’s probably called something entirely different … (Is this what’s known as Forward+ perhaps?)

Transparency isn’t doable unless you render all transparent geometry in a separate pass to find min/max depth per pixel or something, and forget about volumetrics.

On mobile TBDR architectures you lose all bandwidth savings from staying on-tile, and doing FRAGMENT -> COMPUTE barriers on a tile-based architecture is usually a terrible idea for pipelining. The exception here is tile shaders on recent iPhone hardware, which seem almost designed to run this algorithm in hardware.

Clustered shading – the old implementation

Clustered shading is really nice in that it is completely agnostic to a depth buffer, so all the problems mentioned earlier just go away. Lights are assigned in 3D-space rather than screen-space. The original paper on the subject is from around the same time tile deferred was getting popular.

Abandoning the frustum structure

In Granite, I chose early on to abandon the “frustum” layout. Culling spot lights and point lights against elongated frustums analytically is very hard and computationally expensive, and getting the Z slices correct is also very fiddly.

A common workaround for culling is using conservative rasterization, but that is a feature I cannot rely on. I figured that with a more grid-like structure I could get away with much simpler culling math. Since all elements in the grid-based cluster are near-perfect cubes, I can treat the elements as spheres, and https://bartwronski.com/2017/04/13/cull-that-cone/ makes spot light-to-sphere culling very cheap and quite tight. Here’s a visualization of the structure. This structure is stored in a 3D texture. Each “mip-level” is packed in the Z dimension, as the resolution of each level is the same.
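
For reference, here is a sketch of the cone/sphere test from that post, adapted to this use case: the spot light is the cone, the cluster element is the bounding sphere. Names are mine, not Granite’s.

// Cull a cluster element (bounding sphere) against a spot light (cone).
bool spot_light_hits_element(vec3 cone_origin, vec3 cone_dir, float cone_range,
                             float sin_angle, float cos_angle, vec4 sphere)
{
    vec3 V = sphere.xyz - cone_origin;
    float V_len_sq = dot(V, V);
    float V1 = dot(V, cone_dir);                    // distance along the axis
    float V2 = sqrt(max(V_len_sq - V1 * V1, 0.0));  // distance from the axis
    // Signed distance from the sphere center to the cone boundary.
    float dist = cos_angle * V2 - sin_angle * V1;

    bool angle_cull = dist > sphere.w;
    bool front_cull = V1 > cone_range + sphere.w;
    bool back_cull = V1 < -sphere.w;
    return !(angle_cull || front_cull || back_cull);
}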

Bitscan loops instead of light lists

Light lists are the approach where each element in the cluster contains a (start, count) pair, and all lights found in that range are shaded. Computing this list on the GPU is rather messy. The memory footprint for a single element is unknown on the CPU timeline, and we cannot deal properly with worst-case scenarios. This is easier when we cluster on the CPU instead, but that’s boring!

I really wanted to cluster lights on the GPU, so I landed on a bitmask approach instead. The worst-case storage is just 1 bit per light per element rather than 32 bits.

The main limitation of this technique is still the number of lights we can feasibly support. With a bitmask structure we need to allocate for the worst case, and that can get out of hand once we consider 1000+ lights. I only had modest ideas in the beginning, so I supported 32 spot lights and 32 point lights, encoded as RG32UI per element in a 3D texture. At a cluster resolution of 64x32x16 with 9 levels packed in Z, culling on the GPU is very fast, even on mobile. We can of course set the ceiling higher if we expand to RGBA32UI or use more texels per element.
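
To illustrate, fetching an element from that structure could look roughly like this (a sketch with illustrative names and indexing, not Granite’s actual code):

// R holds the spot light mask, G the point light mask. "Mip levels"
// are packed along Z, so level L of slice Z lives at plane
// L * CLUSTER_DEPTH + Z.
layout(set = 0, binding = 0) uniform usampler3D cluster_tex;

const int CLUSTER_DEPTH = 16; // Z resolution of a single level

uvec2 fetch_cluster_masks(ivec3 coord, int level)
{
    return texelFetch(cluster_tex,
                      ivec3(coord.xy, level * CLUSTER_DEPTH + coord.z), 0).rg;
}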

Bitscan loops are great for scalarization

A thing I realized quickly when doing clustered forward shading is the importance of keeping VGPR pressure down on AMD hardware. The trick to move values from VGPRs to the more plentiful SGPRs is to ensure that they are uniform across a subgroup. E.g. instead of doing this:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data must be loaded into VGPRs since lsb is a VGPR.
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

we can do a simple trick with subgroup operations:

// VGPR
int light_bitmask = fetch_bitmask_for_world_coord(coord);

// OR over all active threads.
// As this is the same value for all threads, the compiler promotes it to an SGPR.
light_bitmask = subgroupOr(light_bitmask);

vec3 color = vec3(0.0);
while (light_bitmask != 0)
{
    int lsb = findLSB(light_bitmask);

    // All light data can be loaded into SGPRs instead.
    // Far better occupancy, much amaze, wow!
    color += shade_light(lsb);
    light_bitmask &= ~(1 << lsb);
}

Uniformly loading light data from buffers is excellent. I’ve observed up to 15% uplift on AMD by doing this. The light list approach mentioned earlier has a much harder time employing this kind of optimization. We would have to scalarize on the cluster element itself, which could lead to very bad worst-case performance.

No bindless – ugly atlasing

Another problem with clustered shading (and tile deferred for that matter) is that we need to shade a lot of lights in one go, and those lights can have shadow maps. Without bindless, all shadow maps for spot lights must fit into one texture, and likewise for point lights. Atlasing is the classic solution here, but it is a little too messy for my taste. As the number of lights was rather low, I just had a plain 2D texture for spot lights and a cube array for point lights. Implementing variable resolution with an atlas is also rather annoying, and for point lights I would be forced to flatten the cube down to six 2D rects and do a manual cube lookup without proper seam filtering, ugh.

Scaling to “arbitrary” number of lights

While performance for a reasonable number of lights was quite excellent compared to alternative techniques, I couldn’t really scale the number of lights arbitrarily, and it has been nagging me a bit. Memory becomes a concern, and while the “list of lights” approach is likely less memory hungry in the average case, it has even worse worst-case memory requirements, and it’s not very friendly for GPU culling.

Kinda clustered shading? New bindless hotness

The talk on Call of Duty’s renderer in Advances in Real-Time Rendering 2017 presents a very fresh idea on how to do shading, and it hits all the right buttons: culling on GPU, bitscan loops, scalarization, scaling to a lot of lights. A lot of things to like here.

I spent some time this holiday season implementing a new path for clustered shading based on this technique, and I ended up deviating in a few places. I’ll go through my implementation of it, but it will make more sense once you study the presentation first.

Decoupling XY culling from Z

The key feature of the Call of Duty implementation is how it partitions space. Rather than a full 3D cluster, we decouple the XY and Z dimensions, so instead of O(X * Y * Z) elements we get O(X * Y + Z).

Z is also binned linearly, and we can have several thousand bins in Z. This makes everything much nicer to deal with later. Culling here is trivial: we just compute min/max Z in view space for each light.

For each Z-slice, we just need to figure out the minimum and maximum light index which hit that slice. Of course, to make the ranges as tight as possible, we sort lights by Z distance.
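
A minimal sketch of how those ranges could be built in a compute shader, assuming lights are already sorted by view-space Z and binning is linear (buffer layouts and names are illustrative, not Granite’s actual shader):

#version 450

// One invocation per Z-slice; a brute-force O(slices * lights) loop is
// fine for a sketch. Assumes the dispatch covers exactly the number of
// Z-slices.
layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly buffer LightZRanges
{
    vec2 light_z_range[]; // per light: min/max view-space Z
};

layout(std430, binding = 1) writeonly buffer ClusterRange
{
    uvec2 cluster_range[]; // per Z-slice: min/max light index
};

layout(push_constant) uniform Registers
{
    uint num_lights;
    float inv_z_slice_size; // slices per unit of view-space Z
};

void main()
{
    uint slice = gl_GlobalInvocationID.x;
    float z_lo = float(slice) / inv_z_slice_size;
    float z_hi = float(slice + 1u) / inv_z_slice_size;

    uint lo = ~0u;
    uint hi = 0u;
    for (uint i = 0u; i < num_lights; i++)
    {
        vec2 range = light_z_range[i];
        if (range.x <= z_hi && range.y >= z_lo)
        {
            lo = min(lo, i);
            hi = max(hi, i);
        }
    }
    cluster_range[slice] = uvec2(lo, hi);
}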

Data structures

  • Each tile in XY needs a bitmask array, u32[ceil(num_lights / 32)]. This can be tightly packed in a single buffer (see the sketch after this list).
  • A buffer containing Z-slices as described above.
  • Per-light information: position/radius/cone/light type/shadow matrix/etc
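
As Vulkan GLSL declarations, these could look something like this (a sketch with illustrative layouts; see the linked shaders later for the real thing):

layout(std430, set = 0, binding = 0) readonly buffer ClusterBitmask
{
    // num_tiles_x * num_tiles_y * ceil(num_lights / 32) words; bit N of
    // word W set means light (32 * W + N) hits the tile.
    uint cluster_bitmask[];
};

layout(std430, set = 0, binding = 1) readonly buffer ClusterRange
{
    uvec2 cluster_range[]; // per Z-slice: min/max light index
};

struct LightInfo
{
    vec3 position;
    float radius;
    vec4 cone;             // e.g. direction + cos/sin of the half-angle
    mat4 shadow_transform;
};

layout(std430, set = 0, binding = 2) readonly buffer Lights
{
    LightInfo lights[];
};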

Going back to frustum culling

Now that we cluster in XY and Z separately, I went back to frustum partitioning, which means we need a way to cull spot and point lights against frustums … Conservative rasterization really is the perfect extension to use here; it’s just a shame it’s not widely available yet on all relevant hardware.

The presentation offers an alternative for current consoles, which do not support conservative rasterization: render light volumes à la classic deferred (not completely dead yet!) at full resolution and splat out bits with atomics as fragment threads are spawned. If you have depth information, you can eliminate coverage using classic deferred techniques. However, I went in a completely different direction without using depth at all.

Compute shader conservative rasterization

It felt natural to do all the culling work in compute shaders. This is where most of the “fun” is. This is also a rather esoteric way of doing it, but I like doing esoteric stuff with compute shaders, see https://themaister.net/blog/2019/10/12/emulating-a-fake-retro-gpu-in-vulkan-compute/ as proof of that.

Conservative sphere rasterization

We are going to tackle this problem geometrically instead. First, we solve the screen-space bounding box problem. Fortunately, this problem is separable, so we can compute the screen-space bounds in X and Y separately.

(Behold glorious Inkscape skillz ._.)

What we want to do is to figure out where P_lo and P_hi intersect the near plane. P can be rotated in 2D by the half-angle in two directions; this way we find the tangent points on the circle. sin(theta) is conveniently equal to r / length(L), so building a 2×2 rotation matrix is very easy. After rotating P, we can project P_lo and P_hi, and now we have the clip-space bounds in one dimension. Compute this separately for the XZ and YZ planes and we have the screen-space boundaries.
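
Here’s a sketch of that math for one dimension (names are mine; it assumes the sphere is fully in front of the near plane, otherwise we fall back to the screen-space bounding box as mentioned below):

// C = sphere center in one 2D plane (C.x = view-space X or Y,
// C.y = view-space Z), r = radius. Returns the projected bounds as
// x/z ratios; which end is "lo" or "hi" depends on sign conventions.
vec2 projected_bounds_1d(vec2 C, float r)
{
    float sin_theta = r * inversesqrt(dot(C, C)); // sin(theta) = r / length(L)
    float cos_theta = sqrt(max(1.0 - sin_theta * sin_theta, 0.0));

    // Rotate by +/- theta to get the tangent directions P_hi and P_lo.
    vec2 P_hi = mat2(cos_theta, sin_theta, -sin_theta, cos_theta) * C;
    vec2 P_lo = mat2(cos_theta, -sin_theta, sin_theta, cos_theta) * C;

    // Any point along a tangent ray projects to the same x/z, so the
    // rotated directions give the clip-space bounds directly.
    return vec2(P_lo.x / P_lo.y, P_hi.x / P_hi.y);
}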

Projecting a sphere to clip space creates an ellipse, and to compute it, we need to rotate view space such that the sphere center lies perfectly on the X or Y axis. For simplicity, we orient it on the +X axis. We can then perform the range test, and an ellipse is formed. We can now test any point directly against this ellipse. If the sphere intersects the near plane in any way, we fall back to the screen-space bounding box.

Here’s a ShaderToy demonstrating the math. Of course, Inigo Quilez did it first :p

In the real implementation, we only need to compute setup data once for each point light. To rasterize a pixel, we apply a transform and perform a conservative ellipse test, which is rather straightforward.
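
The test itself can be as simple as checking the pixel’s footprint against the axis-aligned ellipse in the transformed space. A sketch of what I mean (parameter names are mine):

// center/axes describe the ellipse from the per-light setup. The clamp
// finds the point of the pixel rect closest to the ellipse center in
// the per-axis scaled norm, which makes the test conservative over the
// whole pixel.
bool pixel_hits_ellipse(vec2 pixel_center, vec2 half_pixel, vec2 center, vec2 axes)
{
    vec2 closest = clamp(center, pixel_center - half_pixel, pixel_center + half_pixel);
    vec2 v = (closest - center) / axes;
    return dot(v, v) <= 1.0;
}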

Conservative spot light rasterization

I tried an analytical approach, but I gave up. Spot-frustum culling is really hard if you want tight culling, so I went with the simpler approach of just straight up rasterizing the 6 triangles which form a spot light. We can rasterize in floating point since we don’t care about water-tight rasterization rules, and it’s conservative anyway. It’s not the prettiest thing in the world to do primitive clipping inside a shader, but you gotta do what you gotta do …

The … peculiar shader can be found here.

Binning shader

Once we have setup data for point lights and spot lights, we do the classic culling optimization where we bin N lights in parallel over a tile and broadcast the results to all threads, which then compute the relevant lights per pixel. Subgroup ballots are a nice trick here, replacing the old shared memory approach. Each workgroup preferably works on 32 lights at a time to compute a 32-bit bitmask.

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless_binning.comp
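
The core ballot trick boils down to something like this minimal sketch (not the linked shader; it assumes a subgroup size of exactly 32, and the culling helper is illustrative):

#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_ballot : require

// One invocation per light within a group of 32; one workgroup per
// (32-light group, tile). Assumes gl_SubgroupSize == 32, so the whole
// workgroup is one subgroup.
layout(local_size_x = 32) in;

layout(std430, binding = 0) writeonly buffer Bitmask
{
    uint cluster_bitmask[];
};

layout(push_constant) uniform Registers
{
    uint num_tiles_x;
    uint num_masks_per_tile; // ceil(num_lights / 32)
};

// The interesting part: the conservative tests described above.
// Omitted in this sketch.
bool light_intersects_tile(uint light_index, uvec2 tile);

void main()
{
    uvec2 tile = gl_WorkGroupID.yz;
    uint light_index = gl_WorkGroupID.x * 32u + gl_SubgroupInvocationID;
    bool hit = light_intersects_tile(light_index, tile);

    // Each invocation contributes one bit; the ballot is the tile's mask.
    uvec4 ballot = subgroupBallot(hit);
    if (subgroupElect())
    {
        uint offset = (tile.y * num_tiles_x + tile.x) * num_masks_per_tile;
        cluster_bitmask[offset + gl_WorkGroupID.x] = ballot.x;
    }
}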

The shading loop

The final loop to shade becomes something like:

uint cluster_mask_range(uint mask, uvec2 range, uint start_index)
{
    range.x = clamp(range.x, start_index, start_index + 32u);
    range.y = clamp(range.y + 1u, range.x, start_index + 32u);

    uint num_bits = range.y - range.x;
    uint range_mask = num_bits == 32u ?
        0xffffffffu :
        ((1u << num_bits) - 1u) << (range.x - start_index);
    return mask & range_mask;
}

vec3 shade_clustered(Material material, vec3 world_pos)
{
    ivec2 cluster_coord = compute_clustered_coord(gl_FragCoord.xy);
    int linear_cluster_coord = linearize_coord(cluster_coord);
    int z = compute_z_slice(dot(world_pos - camera_pos, camera_front));

    uvec2 z_range = cluster_range[z];

    // Find min/max light we need to consider when shading slice Z.
    // Make this uniform across subgroup. SGPR.
    int z_start = int(subgroupMin(z_range.x) >> 5u);
    int z_end = int(subgroupMax(z_range.y) >> 5u);

    vec3 result = vec3(0.0);

    for (int i = z_start; i <= z_end; i++)
    {
        // SGPR
        uint mask =
            cluster_bitmask[linear_cluster_coord *
                            num_lights_div_32 + i];

        // Restrict to lights within our Z-range. VGPR now.
        mask = cluster_mask_range(mask, z_range, 32u * i);
        // SGPR again.
        mask = subgroupOr(mask);

        // SGPR
        uint type_mask = cluster_transforms.type_mask[i];

        // Good old scalarized loop <3
        while (mask != 0u)
        {
            int bit_index = findLSB(mask);
            // Clear the bit so the loop terminates.
            mask &= ~(1u << bit_index);
            int light_index = 32 * i + bit_index;
            if ((type_mask & (1u << bit_index)) != 0u)
            {
                result += compute_point_light(light_index,
                                              material,
                                              world_pos);
            }
            else
            {
                result += compute_spot_light(light_index,
                                             material,
                                             world_pos);
            }
        }
    }

    return result;
}

Shader: https://github.com/Themaister/Granite/blob/master/assets/shaders/lights/clusterer_bindless.h

Potential performance problems

By decoupling XY and Z in culling, there’s a lot of potential for false positives where large lights dominate how Z-ranges are computed and trigger a lot of over-shading. I haven’t done much testing here, but this is probably the only real weakness I can think of with this technique. Regular tile deferred has similar issues.

Culling tightness

I’m using smaller lights here to demonstrate. Red or green light signifies that a light was computed for that pixel:

Placing a point light in volumetric fog for good measure. The weird red/green “artifacting” around the edges is caused by the forward shading and the subgroupOr logic used when shading to ensure subgroup-uniform behavior.

Potential improvements

Clipping the Z-range calculation against the view frustum might help a fair bit, since the Z-range can be way too conservative for large positional lights like these, especially spot lights which point to the side, like in the image above. Classic deferred has a very similar problem case unless ancient stencil culling techniques are used to get double-sided depth tests.

Conclusion

I’m happy with the implementation. Performance seems very good, but I haven’t dug deep into analysis there yet; I was mostly concerned with getting it to work. Now I’m just waiting for mobile GPU vendors to support bindless, so I can test there as well, I guess.