A cute little trick to running classic IIR filters on the GPU

In my previous behemoth post I mentioned the struggle of implementing a comb + notch filter that looked passable for analog composite video. The blurring effect of adding the notch with FIR filtering was far too great, and there’s no way TVs at the time did it that way.

Nagging issues like these haunt me until I solve them, so I started experimenting with IIR filtering. It’s one of those topics I never really explored on GPUs, since it is very GPU unfriendly. A notch IIR biquad filter is extremely effective on paper, and there has to be a way to do it well in parallel.

A biquad IIR

This is the building block of a ton of audio filters out there, since they can easily model classic analog filters and are easy to tweak parametrically. In the Z plane they have two zeroes and two poles:

// Z is a complex number representing a zero,
// where the frequency response hits 0.
// P is a complex number representing a pole,
// where the frequency response hits infinity (or pure resonance).
// Frequency response at a given frequency is measured by setting
// z = exp(j * w), with w in radians per sample.

// I probably screwed this formula up somehow,
// don't read too much into it.
H(z) = (z - Z) * (z - conj(Z)) / ((z - P) * (z - conj(P)));

Having conjugated poles and zeroes ensures the filter coefficients are real, so the actual filter looks more like this in code:

out[x] =
  in[x] * b0 + in[x - 1] * b1 + in[x - 2] * b2 + // FIR portion
  out[x - 1] * a1 + out[x - 2] * a2 // IIR portion

The Z plane is often a little abstract, but in this case, we will design the notch directly in the Z domain. For example, we place zeroes where we want to remove frequency response, e.g. at the chroma subcarrier, then place poles close to it at the same frequency. The intuition is that close to the carrier, the response will fall to 0, and further away the pole and zero mostly negate each other, leaving a unity frequency response. The response can be made as sharp as desired, but sharper notches lead to more ringing.

E.g. for a pole set at 0.9 times the zero. Quite sharp. Getting this kind of response out of a normal convolution requires a comical number of taps. The response at DC here isn’t exactly 0 dB, but the FIR portion can be rescaled to ensure unity gain.

struct
{
  float b0, b1, b2, a1, a2; // a0 is implicitly 1.0.
} push = {};

// Compute a biquad IIR notch filter.
double w = 2.0 * muglm::pi<double>() / 27.0;
constexpr double fir_q = 0.99;
constexpr double iir_q = 0.85;

// Don't do a complete notch since that leads to really bad ringing.
// We just need a fairly narrow band-stop though ...

if (options.system == System::NTSC)
  w *= 315.0 / 88.0;
else
  w *= 4.43361875;

// Preconvolves two zeroes at fir_q * exp(i * w) and fir_q * exp(-i * w).
double b0 = 1.0;
double b1 = -2.0 * fir_q * cos(w);
double b2 = fir_q * fir_q;

// Preconvolves two poles at iir_q * exp(i * w) and iir_q * exp(-i * w).
// Flip sign here since we're designing the filter as B(z) / A(z) and
// when evaluating the filter we flip the signs.
double a1 = 2.0 * iir_q * cos(w);
double a2 = -iir_q * iir_q;

// Normalize the FIR gain to obtain a 0 dB gain at DC.
double fir_gain = b0 + b1 + b2;
double target_fir_gain = 1.0 - a1 - a2;
double fir_scale = target_fir_gain / fir_gain;

b0 *= fir_scale;
b1 *= fir_scale;
b2 *= fir_scale;

push.b0 = float(b0);
push.b1 = float(b1);
push.b2 = float(b2);
push.a1 = float(a1);
push.a2 = float(a2);

cmd.push_constants(&push, 0, sizeof(push));
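To sanity check coefficients like these, the transfer function can be evaluated directly on the unit circle. The following Python sketch is not taken from the project; it mirrors the C++ above for the NTSC case, and the helper names `notch_coeffs` and `response` are made up for illustration.

```python
import cmath
import math

def notch_coeffs(subcarrier_mhz, sample_rate_mhz=27.0, fir_q=0.99, iir_q=0.85):
    """Mirror of the C++ design above. a1/a2 use the flipped-sign convention."""
    w = 2.0 * math.pi * subcarrier_mhz / sample_rate_mhz
    b = [1.0, -2.0 * fir_q * math.cos(w), fir_q * fir_q]
    a1 = 2.0 * iir_q * math.cos(w)
    a2 = -iir_q * iir_q
    # Rescale the FIR part so the gain at DC (z = 1) is exactly 1.
    scale = (1.0 - a1 - a2) / sum(b)
    return [x * scale for x in b], a1, a2, w

def response(b, a1, a2, w):
    """Magnitude of H(z) at z = exp(j * w)."""
    zi = cmath.exp(-1j * w)  # z^-1
    num = b[0] + b[1] * zi + b[2] * zi * zi
    den = 1.0 - a1 * zi - a2 * zi * zi
    return abs(num / den)

b, a1, a2, w_carrier = notch_coeffs(315.0 / 88.0)  # NTSC subcarrier in MHz
dc_gain = response(b, a1, a2, 0.0)
carrier_gain = response(b, a1, a2, w_carrier)
print(f"DC gain:      {20.0 * math.log10(dc_gain):.2f} dB")
print(f"carrier gain: {20.0 * math.log10(carrier_gain):.2f} dB")
```

With zeroes at radius 0.99 the carrier is not removed completely, which matches the "not a complete notch" comment in the C++ code.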

The trick – a spicy parallel scan

While IIRs have infinite impulse response, we’re only really interested in the output of the active signal area, which is ~1440 pixels. A 1440-tap convolution is fine if we want to use FFT convolution, but that’s quite overkill for what’s literally a 5-tap filter. Surely there are ways to do IIRs effectively without falling back to serial algorithms.

Many parallel GPU algorithms boil down to scan operations in some way, or have scans as a primary ingredient of their madness. I found an older paper from 1990 (Prefix Sums and Their Applications, Guy E. Blelloch, section 1.4) which goes through this algorithm in a way that was easy enough to digest. I don’t think it’s the state of the art, but it gets the job done more than adequately for my needs. For the IIR filter, the FIR portion is handled trivially with a filter kernel, so we only need to concern ourselves with the feedback portion of the filter.

Usually we think of scans as just simple binary operators.

for i in range(1, len(data)):
  data[i] += data[i - 1]

This is common enough that we have special subgroup operations for them, like subgroupInclusiveAdd. This is not enough to implement IIRs since it cannot capture the “decay” of an impulse response. The two intuitions from the paper are that the binary operator can be as complex as we want it to be, as long as it follows the properties expected of a scan operator (associativity, mainly), and that the data doesn’t have to be a scalar either. For an IIR with two histories, we need a 2×2 matrix and a 2-element vector, then define a binary operator which is effectively:

struct Biquad
{
  mat2 A;
  vec2 B;
};

Biquad init(float v)
{
  // Note that GLSL is column major here.
  return Biquad(mat2(vec2(a1, a2), vec2(1, 0)), vec2(v, 0));
}

Biquad scan(Biquad prev, Biquad current)
{
  return Biquad(prev.A * current.A, prev.B * current.A + current.B);
}

The intuition is that the A matrix keeps track of impulse response decay over time, while the B vector serves as the history buffer. Once the scan is done, the result is extracted from B.x and written out to memory. The performance is excellent and processes my frame at like sub 10 microseconds. Can’t complain about that.
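A quick way to convince yourself that the operator is right is to run it through a textbook scan on the CPU and compare against the plain serial recurrence. This is a minimal Python sketch, not the shader: matrices are stored as rows, `parallel_scan` is a simple Hillis-Steele style scan standing in for whatever the GPU does, and the test signal is arbitrary.

```python
import math

def init(v, a1, a2):
    # A is stored as rows; the state is the row vector (y[n], y[n-1]).
    return (((a1, 1.0), (a2, 0.0)), (v, 0.0))

def vecmat(v, m):
    # Row vector times matrix, like `vec2 * mat2` in GLSL.
    return (v[0] * m[0][0] + v[1] * m[1][0],
            v[0] * m[0][1] + v[1] * m[1][1])

def scan(prev, cur):
    (pa, pb), (ca, cb) = prev, cur
    a = (vecmat(pa[0], ca), vecmat(pa[1], ca))  # prev.A * current.A
    b = vecmat(pb, ca)                          # prev.B * current.A
    return (a, (b[0] + cb[0], b[1] + cb[1]))

def parallel_scan(items):
    items = list(items)
    step = 1
    while step < len(items):
        # Every element reads its neighbor `step` slots back before anyone writes.
        items = [scan(items[i - step], items[i]) if i >= step else items[i]
                 for i in range(len(items))]
        step *= 2
    return items

def serial_iir(xs, a1, a2):
    y1 = y2 = 0.0
    out = []
    for x in xs:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

w = 2.0 * math.pi * (315.0 / 88.0) / 27.0
a1, a2 = 2.0 * 0.85 * math.cos(w), -0.85 * 0.85
xs = [math.sin(0.3 * n) + (1.0 if n == 5 else 0.0) for n in range(64)]
scanned = [q[1][0] for q in parallel_scan(init(x, a1, a2) for x in xs)]
max_err = max(abs(p - s) for p, s in zip(scanned, serial_iir(xs, a1, a2)))
print("max difference vs. serial filter:", max_err)
```

The order of the arguments matters: the operator is associative but not commutative, so the earlier element always goes on the left.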

For the full shader example, see https://github.com/Arntzen-Software/parallel-gs/blob/main/analog/shaders/iir.comp, but the general gist of it is:

256 thread workgroup – work on N pixels per thread

The horizontal resolution is expected to be reasonable, so N is about 6 in my use case. Not having to split the scan over multiple kernels is good.

This means we can load 8 inputs and compute the FIR portion of it for 6 inputs all in registers. The initial scan can also be done in registers:

float xs[PER_THREAD + 2];
Biquad quad[PER_THREAD];

[[unroll]]
for (int i = -2; i < PER_THREAD; i++)
  xs[i + 2] = LoadInputInRow(PER_THREAD * int(index) + i);

// Do the FIR portion and produce 6 values.
[[unroll]]
for (int i = 0; i < PER_THREAD; i++)
  quad[i] = init(xs[i + 2] * b0 + xs[i + 1] * b1 + xs[i + 0] * b2);

[[unroll]]
for (int i = 1; i < PER_THREAD; i++)
  quad[i] = scan(quad[i - 1], quad[i]);

Custom scan with magic biquad operator

Then we do the custom scan over the subgroup using ShuffleUp.

for (uint step = 1; step < gl_SubgroupSize; step *= 2)
{
  Biquad prev;
  prev = subgroupShuffleUp(quad[PER_THREAD - 1], step);
  if (gl_SubgroupInvocationID >= step)
  {
    [[unroll]]
    for (int i = 0; i < PER_THREAD; i++)
      quad[i] = scan(prev, quad[i]);
  }
}

… and then complete the scan via groupshared memory with the same strategy. Groupshared memory is kept to a minimum since we only need to store the last value per subgroup. That’s about it, and with that I have a decoded composite signal I’m happy with.
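The blocked structure (serial scan in registers, then a scan across lanes that only ever shuffles the last element of each lane) can be mimicked on the CPU as well. This is a rough Python sketch of that idea under my own assumptions: `LANES` stands in for `gl_SubgroupSize`, the snapshot of `tails` plays the role of `subgroupShuffleUp`, and the groupshared step across subgroups is left out.

```python
import math

PER_THREAD = 6
LANES = 8  # stand-in for gl_SubgroupSize

def init(v, a1, a2):
    # A flattened as (m00, m01, m10, m11); B is the row vector (y[n], y[n-1]).
    return ((a1, 1.0, a2, 0.0), (v, 0.0))

def scan(prev, cur):
    (p, pb), (c, cb) = prev, cur
    a = (p[0] * c[0] + p[1] * c[2], p[0] * c[1] + p[1] * c[3],
         p[2] * c[0] + p[3] * c[2], p[2] * c[1] + p[3] * c[3])
    b = (pb[0] * c[0] + pb[1] * c[2] + cb[0],
         pb[0] * c[1] + pb[1] * c[3] + cb[1])
    return (a, b)

def blocked_scan(xs, a1, a2):
    lanes = [[init(x, a1, a2) for x in xs[t * PER_THREAD:(t + 1) * PER_THREAD]]
             for t in range(LANES)]
    # Serial scan inside each lane ("in registers").
    for quad in lanes:
        for i in range(1, PER_THREAD):
            quad[i] = scan(quad[i - 1], quad[i])
    # Scan across lanes; only the last element of each lane is exchanged.
    step = 1
    while step < LANES:
        tails = [quad[-1] for quad in lanes]  # what ShuffleUp would read
        for t in range(step, LANES):
            lanes[t] = [scan(tails[t - step], q) for q in lanes[t]]
        step *= 2
    return [q[1][0] for quad in lanes for q in quad]

def serial_iir(xs, a1, a2):
    y1 = y2 = 0.0
    out = []
    for x in xs:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

w = 2.0 * math.pi * 4.43361875 / 27.0
a1, a2 = 2.0 * 0.85 * math.cos(w), -0.85 * 0.85
xs = [math.cos(0.2 * n) for n in range(PER_THREAD * LANES)]
blocked = blocked_scan(xs, a1, a2)
max_err = max(abs(p - s) for p, s in zip(blocked, serial_iir(xs, a1, a2)))
print("max difference vs. serial filter:", max_err)
```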

For very large IIR filters, this approach gets out of hand quickly since we need a big N x N matrix to hold the recurring state per element, but chaining together biquads is done trivially by just scanning multiple times. Since we can do the horizontal filter in one compute workgroup, the intermediate steps can easily live in shared memory if desired to avoid excessive memory bandwidth.

Emulating old junk from yesteryear – or my obsession making native resolution PS2 emulation look good

Lately I’ve been on a kick to tackle the latter part of PS2 graphics emulation that never seems to come up, analog TV emulation. 99% of the PS2 emulation space is all about upscaling to the max, cranking out 4K with polygons so sharp they can cleave mountains in half, but that’s not particularly to my taste, which is why paraLLEl-GS focuses more on the super-sampling aspects rather than raw upscaling. Earlier blog post on the paraLLEl-GS here if you have no idea what I’m referring to.

For example, a raw native render with progressive scan (640×448) with paraLLEl-GS backend:

With Vulkan HW renderer in PCSX2, a basic 2x upscale would look like:

which is obviously higher resolution, but does not attempt to resolve aliasing either, and as upscaling factors go up, the mismatch between texture resolution, polygon counts and output resolution creates a jarring effect for me. The approach I tend to favor is super-sampling and keeping the resolution native, even if it means a more blurred resolve. Here’s how it’d look with 16x SSAA. This is quite overkill for most cases, but why not.

There is a special mode in paraLLEl-GS that can scanout the 16x super-samples at double the resolution, which is effectively 4x SSAA at double the resolution. It can look great when it works well, but the key word is when. This approach is not playing to the strengths of paraLLEl-GS at all, but it’s a thing.

The main problem is that 2D images like HUD elements remain at native resolution with integer scale, which can create a jarring look, and post processing passes can mess up things in some cases. My main focus is on the native resolution, super sampled output, since it has very few gotchas compared to other upscaling styles.

The stopgap solution – FSR1

For a while, I’ve relied on FSR1 to do the upscaling, which is just a temporary hack. While properly anti-aliased 3D can look surprisingly good with FSR1 when blown up to large resolutions (what it is designed for), 2D game elements still create questionable artifacts. Text upscaling starts looking like those old HQ2x, SuperEagle and xBR filters that used to be popular with SNES emulators. That’s likely because FSR1 is “just” edge-aware Lanczos filtering at its core. Still, at SD resolutions there’s only so much you can do to blow it up to a modern display. The result here is of course quite blurry, but looking at it from a reasonable distance, it can look sort-of okay on a good day.

The only proper way to display SD content is in my opinion to use a CRT, or in lieu of getting a nerd tan, CRT shaders. Simulating CRTs has been done to death to the point Hel is blushing, so I feel a bit uncomfortable even trying to write about it, but this post wouldn’t be complete without it. RetroArch has like 23428323 CRT shader presets already, and there’s nothing novel about any of this. However, there are some considerations for PS2 that most CRT shaders don’t target:

  • Analog TV input path
  • 480i instead of 240p
  • HDR output
  • High refresh rate simulation

A lot of CRT shaders assume a high quality VGA-style monitor. Nothing wrong with that of course, but I find most of them a bit “too” good for PS2. Where we’re going, we’ll need that analog fuzz too.

Analog signal paths

I’ve been deep down the rabbit hole looking up old specifications to get a deeper appreciation for the brilliant engineering involved in making color TV work all those years ago, and the PS2 era was the last hurrah for SD CRTs, 480i warts and all. I’ve sort of gone through all of this stuff before back in my early days of programming, but I’d like to think I’m a little smarter than I was back then, and I learned far more details than I knew before.

From a graphics programming PoV this is hardly “difficult” stuff like debugging random GPU hangs at 10pm on a Friday coming from the latest and greatest AAA game, but hey, sometimes you just gotta relax with some good old DSP coding to stay sane. The signal processing for NTSC and PAL is fairly straight forward and it’s actually a good entry point into signal processing for graphics programmers since it combines very visual things with actual real world problems.

The cables

Component is the highest quality analog cable (actually, 3 cables!) available to consumers. It supports progressive scan which very few PS2 games natively supported. Simulating this cable is quite trivial.

Composite is the yellow cable “everyone” used. This one is the hardest since separating luma from chroma in the same single-channel signal is actually kinda complicated to do well. Most of the time I spent went into trying various strategies for this problem and trying to come up with a look that seems authentic and “tastefully shitty” 😀

S-Video is very similar to composite, except that luma and chroma are separate signals. Chroma is packed into one cable and chroma decoding is basically the same as for composite signals, where phase and amplitude dictate the hue and saturation. The main difference from component is that the bandwidth for chroma should be quite low, so more color smearing should be present.

Us Euro-bros technically had RGB SCART too, but I’ve never seen those cables myself for a Sony console. I believe our old N64 and GameCube had those, but I’d have to double check next time I visit the family …

Reference literature

While it is very easy to find a ton of content online about old TV standards, your random YouTube video is not going to have the minute detail needed to implement much. I found “Video Demystified – A Handbook for the Digital Engineer (2005)” by Keith Jack, which has a ton of great detail that is often left out elsewhere, and I leaned on it to make sure my implementation stays grounded.

The bridge between digital and analog – BT.601

BT.601 defined how to take NTSC and PAL signals and turn them into the digital domain. It is the foundation for all digital video today. The primary interest for us is that the standard defines a 13.5 MHz sampling rate. This was supposedly chosen since it was convenient for both NTSC and PAL and filtering requirements. This scheme works out to 720 horizontal pixels for NTSC and PAL when you account for 525 lines at 29.97 Hz and 625 lines at 25 Hz. The H-sync part of the analog signal pads out to a bit over 800 pixels, but that’s irrelevant here.
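The "bit over 800 pixels" figure falls straight out of the sampling rate and the line rate. A small Python sketch of the arithmetic, using exact fractions (29.97 Hz is really 30000/1001 Hz); the function name is just for illustration:

```python
from fractions import Fraction

SAMPLE_RATE = Fraction(13_500_000)  # BT.601 luma sampling rate in Hz

def samples_per_line(lines_per_frame, frame_rate):
    """Total samples per scanline, including blanking and H-sync."""
    return SAMPLE_RATE / (lines_per_frame * frame_rate)

ntsc_total = samples_per_line(525, Fraction(30000, 1001))  # 29.97 Hz
pal_total = samples_per_line(625, Fraction(25))
print(ntsc_total, pal_total)  # prints 858 864
```

Both systems land on a whole number of samples per line, and 720 of those are the active picture.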

I also learned just recently why it’s called YUV444, 422, 420, etc:

The “4:2:2” notation now commonly used originally applied to NTSC and PAL video, implying that Y, U and V were sampled at 4×, 2× and 2× the color subcarrier frequency, respectively. The “4:2:2” notation was then adapted to BT.601 digital component video, implying that the sampling frequencies of Y, Cb and Cr were 4×, 2× and 2× 3.375 MHz, respectively.

Now you know! The Nyquist frequency of SD video is thus 6.75 MHz. Talking about video like this is a little weird, but it will make more sense later since analog signals like NTSC and PAL have a certain bandwidth, which limits how much horizontal resolution we can cram into it. The number of video lines is hardcoded. The video signal is really just a 1D signal if you look at it on an oscilloscope (haven’t seen those since my Bachelor’s …).

The next question was to determine the sampling rate of the PS2 CRTC. Given it generates analog signals, it’s not obvious that PS2 would use a standard frequency here. However, several old forum threads I found did indeed suggest that PS2 is based around the 13.5 MHz rate. This is likely because it could use off-the-shelf video DACs at the time. The nominal maximum horizontal resolution of PS2 is 640 pixels, but it seems like overscan is supposed to stretch the 640 pixels out to fill the screen anyway. Even if 480 lines are “visible” in NTSC, games typically just render 448 lines since the top and bottom is eaten by overscan by most TVs.

The actual CRTC clock seems to be 54 MHz, because when programming the CRTC, you’re supposed to set some dividers which end up determining the resolution. E.g. 640 pixel width is a divider of 4, 512 pixels a divider of 5 and so on. It conveniently supports most relevant horizontal resolutions like this.

My armchair theory for how this works is that the CRTC runs at 54 MHz and uses the divider to do a zero order hold (aka nearest filtering) which is then fed into the DAC. Either that or the console is really doing horizontal linear filtering up to 640 pixels and does composite video generation at 13.5 MHz. I’m not sure what really happens, so I just have to guess. I’m not quite obsessed enough to have an oscilloscope to suss out micro-details like these from a real PS2. The PS2 is technically able to emit 1080i and 720p signals, so there’s no reason why it couldn’t do analog video processing at the full 54 MHz.

Step 0 – Prepare a test image

To debug any of this we need test images. I don’t exactly have test signal generators that TV engineers had back in the day, so I had to synthesize something. The use for these is to validate various edge cases in the NTSC and PAL decoding process.

The basic idea is to have some 75% color bars at the top to validate the color pipeline. The middle section is a sweeping increase of horizontal frequency up to Nyquist of 6.75 MHz. Each 1 MHz section is delimited by a single line color bar.

The lower portion is a sweep with increasing frequencies. This is used to test the comb filter.

Step 0.5 – Install GNU Octave

Doing FIR filter design by hand is uh … not something I have time for. Anything beyond the basic windowed sinc design process gets annoying very quickly. GNU Octave is a free alternative to Matlab that does what we need for these tasks.

Step 1 – Generate a bandlimited signal

While we can do composite video generation at the 13.5 MHz rate, it is not easy to avoid aliasing, especially since we will be modulating with a chroma subcarrier that will shift the spectrum all over the place, potentially creating aliases out of thin air. The handbook calls for a 2x oversampling to avoid this.

First, the image is padded out to fill 720 horizontal pixels for BT.601 reasons. Then, integer scale the image up to 2880xN (to match with how I understand the CRTC to work), and downsample that with a low-pass to 1440xN which completes a clean 2x oversampling. A polyphase upsampling filter is of course possible too to avoid the intermediate upscale, but that’s needless complexity for something that literally takes 10 microseconds on the GPU 😀 The intermediate upscale is not written to memory at least; the filter just assumes a 2880 pixel input and samples the input texture redundantly instead.
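As a rough illustration of the "virtual 4x upscale, then low-pass and decimate" idea, here is a Python sketch for a single line. The Hann-windowed sinc, the 31 taps and the 6.75 MHz cutoff are my assumptions for the example, not the filter actually used; the point is only that the 2880-wide line never exists in memory and the kernel fetches `line[src // 4]` instead.

```python
import math

def lowpass_kernel(cutoff, taps):
    """Hann-windowed sinc. `cutoff` is a fraction of the sample rate, `taps` is odd."""
    mid = (taps - 1) // 2
    kernel = []
    for n in range(taps):
        t = n - mid
        ideal = 2.0 * cutoff if t == 0 else math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.5 - 0.5 * math.cos(2.0 * math.pi * (n + 1) / (taps + 1))
        kernel.append(ideal * window)
    gain = sum(kernel)
    return [k / gain for k in kernel]  # unity gain at DC

def oversample_2x(line, kernel):
    half = len(kernel) // 2
    virtual_width = 4 * len(line)
    out = []
    for x in range(0, virtual_width, 2):  # keep every other virtual sample
        acc = 0.0
        for k, weight in enumerate(kernel):
            src = min(max(x + k - half, 0), virtual_width - 1)  # clamp to edge
            acc += weight * line[src // 4]  # nearest (zero-order hold) fetch
        out.append(acc)
    return out

kernel = lowpass_kernel(6.75 / 54.0, 31)  # cutoff relative to the 54 MHz virtual rate
line = [0.5] * 720
line[360] = 1.0  # a single bright pixel
result = oversample_2x(line, kernel)
print(len(result), max(result))
```

Flat regions pass through unchanged because the kernel is normalized, while the single bright pixel gets smeared into a small bandlimited bump.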

Since these are fundamentally analog signals, we only filter horizontally. The natural image type for these is 1D Array. All the processing shaders are 1D as well.

(The 1504 width is just extra padding for convolutions.)

Step 2 – Convert to YUV / YIQ / whatever

It’s all the same thing really. NTSC is the oddball one with YIQ but in practice this difference is completely irrelevant. The original idea of IQ was to take the blue and red difference signals and rotate them by 33 degrees to make it so that I would align with skin tones better, and give I more bandwidth than Q. However, the handbook doesn’t seem to give it too much consideration. Especially not for composite signals which are not meant to be sent over the air.

The implementation of this is really just doing a 3×3 matrix multiply with RGB, nothing special.
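For reference, a minimal Python sketch of that matrix, assuming the usual composite-video scaling of U = 0.492 (B - Y) and V = 0.877 (R - Y) with BT.601 luma weights, plus the 33 degree rotation that turns UV into IQ. The exact constants in any given implementation may differ slightly.

```python
import math

KR, KG, KB = 0.299, 0.587, 0.114  # BT.601 luma weights

RGB_TO_YUV = [
    [KR, KG, KB],
    [0.492 * -KR, 0.492 * -KG, 0.492 * (1.0 - KB)],  # U = 0.492 * (B - Y)
    [0.877 * (1.0 - KR), 0.877 * -KG, 0.877 * -KB],  # V = 0.877 * (R - Y)
]

def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))

def yuv_to_yiq(yuv, angle_deg=33.0):
    """I/Q are just U/V rotated by 33 degrees."""
    y, u, v = yuv
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return (y, v * c - u * s, v * s + u * c)

white = mat_vec(RGB_TO_YUV, (1.0, 1.0, 1.0))
red_yuv = mat_vec(RGB_TO_YUV, (1.0, 0.0, 0.0))
red_yiq = yuv_to_yiq(red_yuv)
print(white, red_yuv, red_yiq)
```

Since the rotation leaves the chroma amplitude untouched, saturation is the same in both representations; only the axes move.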

Step 3 – Low-pass Y and low-pass the hell out of chroma

NTSC and PAL have fairly narrow bandwidth defined for their broadcast signals, but composite signals don’t really have those strict limits. However, since the digital input has Nyquist at 6.75 MHz the handbook calls for bandlimiting the signal to about 6 MHz anyway. Super sharp falloff filters just lead to ringing.

Component video is specified by BT.1358 and doubles the sampling rate to 27 MHz. Y should fall off at about 11 MHz and chroma half that. Interpreting that in interlaced terms, the falloff should start at 5.5 MHz, getting close-ish to the Nyquist limit for SD video. After filtering luma and chroma, just decode back to RGB and we have a nice signal, done.

Chroma is nuked down to ~1.3 MHz or so for composite / S-Video. Sometimes, 0.6 MHz is called for it seems, but it’s quite unclear … S-Video can ignore the modulation + demodulation part if we just want to simulate the smear, but it was easier to let it go through the full chain for completeness. Despite the chroma bandwidth being so horrible, it sort of looks good. Amazing how terrible our eyes are at seeing color detail. I wonder if this is where the esoteric 4:1:1 YCbCr subsampling mode comes from now that I think about it …

E.g. NTSC luma filter, which starts falling off at about 4.2 MHz and stops at 6.75 MHz-ish. Broadcast NTSC is capped well below this bandwidth.

Chroma encode, with ~1.3 MHz passband:

Main difference for PAL is that luma passband is a bit higher. PAL-B (which Norway used) seems to specify 5 MHz rather than 4.2 MHz. Generating these filters can be done easily with firls and friends in Octave.

The handbook calls for a gaussian kernel for chroma to avoid any ringing, but I missed that memo 😀 Either way, these are implemented with trivial convolutions which GPUs eat up like butter.

Step 4 – Modulate chroma and combine into composite signal

This is the first point where NTSC and PAL differ significantly. While NTSC and PAL have different color subcarrier frequencies, PAL is also a bit more sophisticated.

// NTSC
C = Y + I * cos(wt + offset) + Q * sin(wt + offset)

// PAL
C = Y + U * sin(wt + offset) + sign * V * cos(wt + offset)

The “sign” of V is where Phase Alternate Line comes in. It flips every scanline. This was meant to fix bad colors being introduced when broadcasting through complicated terrain (and boy do we have that over here in Norway). In NTSC, phase errors picked up along the way manifest as hue shifts. The basic idea behind PAL is that if phase shifts are introduced by signal reflections, the sign flipping of V every scanline ensures that the decoded hue error has the opposite sign from one scanline to the next. Left alone, this manifests as the Hanover Bar artifact. By averaging chroma from neighboring scanlines, the errors cancel out and we recover the correct color with slightly less saturation. The cost is of course reduced vertical chroma precision, but given how comically smeared chroma is horizontally, I’m not sure this matters, and digital video uses 4:2:0 subsampling anyway.

Now, broadcasting considerations are kind of irrelevant for something like composite video (I would think), but I’m not sure if PAL TVs skip the filter for anything not coming from an antenna. I kept the vertical chroma filter in my implementation because it’s neat to have.

The NTSC chroma subcarrier is constructed such that every scanline completes 227.5 cycles. Every line flips chroma phase which is very convenient and makes luma and chroma separation less complicated. The NTSC chroma pattern is a checkerboard as a result.

PAL is more annoying here. A half cycle method would not work since V is already flipping every line, so PAL chose 283.75 cycles. On top of that a tiny 1 / 625 cycle offset per line is added for … reasons. The V flipping leads to a chroma pattern where U takes a diagonal pattern while V has a pattern along the other diagonal.

Frame progression is also a concern. As fields are scanned, each time the same field is drawn, the chroma subcarrier should have opposite phases. This follows naturally from how fields are drawn. The period of NTSC is 4 fields:

  • Field 0: 262 lines, chroma carrier starts in + phase, ends in + phase due to even number of lines
  • Field 1: 263 lines, chroma carrier starts in + phase, ends in – phase due to odd number of lines
  • Field 2: 262 lines, chroma carrier starts in – phase, …

As we can see, things repeat after 4 fields and this point is easy to miss. The artifacts introduced by composite video should ideally cancel themselves out over time which manifests itself as flickery noise rather than a horribly glitched image.

PAL is annoying and has a longer cycle of 8 fields due to the three-quarter of a cycle setup. After completing 2500 lines (625 * 4), the chroma subcarrier has completed an integer number of cycles, and the sign of V is back to where it started.
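The cycles-per-line numbers and the 4 and 8 field periods can be double checked with exact fractions. A small Python sketch, assuming the standard subcarrier frequencies (315/88 MHz for NTSC, 4.43361875 MHz for PAL) and ignoring everything except carrier phase and the V sign; `frames_until_repeat` is a made-up helper:

```python
from fractions import Fraction

def cycles_per_line(subcarrier_hz, lines_per_frame, frame_rate):
    return subcarrier_hz / (lines_per_frame * frame_rate)

ntsc = cycles_per_line(Fraction(315_000_000, 88), 525, Fraction(30000, 1001))
pal = cycles_per_line(Fraction(443_361_875, 100), 625, Fraction(25))

def frames_until_repeat(cycles, lines_per_frame, v_flips):
    """Smallest number of frames after which carrier phase (and V sign) repeats."""
    frames = 1
    while True:
        lines = frames * lines_per_frame
        whole_cycles = (cycles * lines).denominator == 1
        v_restored = (not v_flips) or lines % 2 == 0
        if whole_cycles and v_restored:
            return frames
        frames += 1

ntsc_fields = 2 * frames_until_repeat(ntsc, 525, v_flips=False)
pal_fields = 2 * frames_until_repeat(pal, 625, v_flips=True)
print(float(ntsc), float(pal))
print(ntsc_fields, "fields for NTSC,", pal_fields, "fields for PAL")
```

The PAL value comes out as 283.75 plus exactly 1/625, which is where the extra 25 Hz in the subcarrier frequency goes.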

After modulation, the NTSC color bar signal looks more like:

and the next line flips phase as expected:

A note on really old consoles

Some old Nintendo consoles (and likely others) emit NTSC and PAL in non-standard ways. E.g. NES is infamous for shifting the chroma carrier by 120 degrees instead of 180 which leads to very particular artifacts. See NESdev Wiki for more detail.

Step 5 – Validate composite output against color bar reference charts

It’s easy to mess up RGB to YUV conversions and the compositing process. The handbook had reference outputs for RGB inputs in NTSC and PAL where I could confirm that the math was indeed correct. The things to check are that minimum and maximum of the signal are what they should be. NTSC, at least the US variant of it maps black to 7.5 IRE (just think of it as some abstract voltage) and white to 100.0 IRE, and it tripped me up a bit at first since the NTSC color bars were defined in terms of this shifted and scaled IRE value.

Looking at the peaks and valleys of the generated composite signal in RenderDoc is enough since we just need to eyeball it. Close enough is good enough.

Step 6 – Separate chroma and luma – NTSC

This is non-trivial and a source of endless head scratching. Chroma information lives in the frequency spectrum around the carrier, but so does higher frequency luma detail.

The basic theory for comb filters is to take advantage of the opposing phase of the chroma carrier. For NTSC, I used this basic structure from the handbook:

In code, this translates to:

void chroma_separation_ntsc()
{
  float line0 = sample_previous_line();
  float line1 = sample_current_line();
  float line2 = sample_next_line();
  float estimate = 0.5 * (line1 - 0.5 * (line2 + line0));
  store(estimate); // Bandpass the result.
}

Ideally, by adding two neighbor lines, chroma should cancel out and only luma remains. Subtracting, we cancel luma and only chroma remains.
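For a flat color the cancellation is exact, which is easy to show numerically. A minimal Python sketch with three identical NTSC lines at the 2x oversampled rate (1716 samples per line, 227.5 carrier cycles per line); the color values are arbitrary and the bandpass step is left out:

```python
import math

SAMPLES_PER_LINE = 1716  # 858 BT.601 samples per NTSC line, 2x oversampled
CYCLES_PER_LINE = 227.5

def chroma(line, n, i, q):
    phase = 2.0 * math.pi * CYCLES_PER_LINE * (line + n / SAMPLES_PER_LINE)
    return i * math.cos(phase) + q * math.sin(phase)

def composite_line(line, y, i, q):
    return [y + chroma(line, n, i, q) for n in range(SAMPLES_PER_LINE)]

Y, I, Q = 0.4, 0.2, -0.1
prev_line, cur_line, next_line = (composite_line(l, Y, I, Q) for l in (0, 1, 2))

# Same expression as chroma_separation_ntsc() above.
estimate = [0.5 * (c - 0.5 * (n + p))
            for p, c, n in zip(prev_line, cur_line, next_line)]
luma = [c - e for c, e in zip(cur_line, estimate)]

chroma_err = max(abs(e - chroma(1, n, I, Q)) for n, e in enumerate(estimate))
luma_err = max(abs(v - Y) for v in luma)
print(chroma_err, luma_err)
```

As soon as the three lines differ in color or brightness, the estimate picks up the difference too, which is where the artifacts discussed next come from.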

We know that chroma won’t (or at least shouldn’t) exist outside its bandwidth, so the result is run through a bandpass filter that centers around the carrier frequency, and we have an estimate for the modulated chroma signal. Since the composite signal is Y + C, we subtract the chroma estimate from composite signal to form a Y estimate. Chroma can now be demodulated and low-passed to remove the harmonics introduced by demodulation.
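Demodulation itself is just multiplying by the carrier and low-passing. In this Python sketch a plain average over a whole number of carrier cycles stands in for the real low-pass filter, which is my simplification for the example; 3432 samples at 27 MHz span exactly 455 NTSC carrier cycles.

```python
import math

FSC_OVER_FS = 455.0 / 3432.0  # NTSC subcarrier relative to the 27 MHz rate
BLOCK = 3432                  # spans exactly 455 carrier cycles

def carrier_phase(n):
    return 2.0 * math.pi * FSC_OVER_FS * n

def demodulate(chroma_samples):
    i_acc = q_acc = 0.0
    for n, c in enumerate(chroma_samples):
        # Multiplying by the carrier moves I/Q to DC and creates a copy at 2x
        # the carrier frequency, which the averaging below removes.
        i_acc += c * 2.0 * math.cos(carrier_phase(n))
        q_acc += c * 2.0 * math.sin(carrier_phase(n))
    return i_acc / len(chroma_samples), q_acc / len(chroma_samples)

I, Q = 0.2, -0.1
modulated = [I * math.cos(carrier_phase(n)) + Q * math.sin(carrier_phase(n))
             for n in range(BLOCK)]
i_rec, q_rec = demodulate(modulated)
print(i_rec, q_rec)
```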

This filter works “perfectly” for regions where the chroma is constant, but not so much where there are discontinuities. This results in “chroma dots” where the color subcarrier bleeds into decoded luminance.

Notice the dot pattern on the bottom of the image.

Thus, different colors modulate the luminance intensity differently, creating a “dot” pattern on the scan line between two colors. To eliminate these “hanging dots,” a chroma trap filter is sometimes used after the comb filter.

In the real world of analog circuitry, having a perfectly locked signal like this is probably also not too realistic to assume.

The literature also calls for notch filtering as an approach.

I attempted combining a comb filter with a notch filter on top to reduce the artifacts, but it is quite tricky to create a notch filter that works well. A simple FIR notch filter with zeroes is easy enough to make:

% 2x oversampled, so 27 MHz.
ntsc_subcarrier = 315e6 / 88; % ~3.579545 MHz
bt601_rate = 13.5e6;
ntsc_subcarrier_w = ntsc_subcarrier / (2 * bt601_rate);

% Hope you know your complex numbers.
subcarrier_zero = exp(2 * pi * j * ntsc_subcarrier_w);

% To get a real-valued signal
% we need zeroes at positive and negative freqs.
% Using real() here is just a precaution.
% It should not be needed with perfect FP math accuracy.
subcarrier_notch = ...
  real(conv([1, -subcarrier_zero], [1, -conj(subcarrier_zero)]));

% Normalize DC gain to 0 dB
subcarrier_notch = subcarrier_notch' / sum(subcarrier_notch);

freqz(subcarrier_notch);

This filter is convolved with a simple low-pass to complete the luma decoding filter.

This approach just leads to severe blurring for NTSC, and a band-stop filter approach just led to less severe blurring and ringing instead, so I’m not sure what should be done. Unfortunately, the handbook isn’t clear on what kind of filtering is called for here. IIR notch filter designs can be super sharp to surgically carve out the carrier, but IIR filtering is also a massive pain in the ass on GPUs. It’s also likely to ring heavily, which I found rather annoying in my testing.

E.g. 3-line comb NTSC without the notch (integer nearest upscale from 640×240):

and with notch:

Yikes. There’s no way this notch approach is correct. It’s like we’re getting double vision here. It does clean up the chroma dots though, so … yay?

Going beyond these base techniques there’s adaptive filtering where the filtering strategy changes based on which kind of case we’re dealing with. And even more sophisticated is taking advantage of temporal information (look ma’, TAA has been a thing since forever :D) since N fields in the past we have complementary chroma phase perfectly aligned to our pixel grid.

Very cool stuff, but I doubt consumer TVs at the time would have those. The added latency for doing this kind of analysis doesn’t sound like something you’d want for games at least …

Either way, I’m not designing high-end TV circuitry in the late 90s/early 2000s here. We can just flip on S-Video to simulate the perfect Y/C separator, so at some point I have to decide that I’ve done enough filter masturbation and move on.

Step 7 – Separating chroma and luma in PAL

This was way harder than expected and I had to bang my head against the wall for a while to come up with a good solution. The ~90 degree shift every scanline means the basic comb filter for NTSC won’t work at all.

The handbook has two main strategies here. Either a delay line which is slightly longer than a scanline to align the phases:

Or use a highly magical “PAL modifier”:

The function of this modifier is esoteric as all hell, but I think the purpose of it is to phase shift the signal by 90 degrees to “realign” the carrier somehow (it will still be off by 0.6 degrees). This filter path with two bandpass filters just got so messy, and I couldn’t figure out how to debug the thing effectively (were the inevitable visual artifacts my bugs or just the filter being bad?) that I eventually gave up and designed my own filter. That’s more fun anyway.

The magic cross filter

I started from first principles and designed a 3×3 kernel that should be able to perfectly pass a chrominance signal and 100% reject any luminance signal that is DC in either horizontal or vertical direction.

To make things simple I started with the assumption of 4 samples per subcarrier cycle to make the examples easier. Given a constant value of 1.0 for U, a signal would look something like:

 0  1  0 -1  0  1  0 -1  0  1  0 -1 ...
-1  0  1  0 -1  0  1  0 -1  0  1  0 ...
 0 -1  0  1  0 -1  0  1  0 -1  0  1 ...
 1  0 -1  0  1  0 -1  0  1  0 -1  0 ...
 0  1  0 -1  0  1  0 -1  0  1  0 -1 ...
-1  0  1  0 -1  0  1  0 -1  0  1  0 ...
 0 -1  0  1  0 -1  0  1  0 -1  0  1 ...
 1  0 -1  0  1  0 -1  0  1  0 -1  0 ...

U = sin(wt) with N + 0.75 cycles per line here. A kernel that satisfies the criteria is:

|  0.25  -0.25   0.0  |
| -0.25   0.5   -0.25 |
|  0.0   -0.25   0.25 |

Every row and every column sums to 0, meaning that if the signal is DC either horizontally or vertically, the result is completely filtered out. The filter also rejects V signals.

V = +/- cos(wt) and looks like

 1  0 -1  0  1  0 -1  0  1 ...
 0 -1  0  1  0 -1  0  1  0 ...
-1  0  1  0 -1  0  1  0 -1 ...
 0  1  0 -1  0  1  0 -1  0 ...
 1  0 -1  0  1  0 -1  0  1 ...
 0 -1  0  1  0 -1  0  1  0 ...
-1  0  1  0 -1  0  1  0 -1 ...
 0  1  0 -1  0  1  0 -1  0 ...

This is just the same signal flipped horizontally, so:

|  0.0  -0.25  0.25 |
| -0.25  0.5  -0.25 |
|  0.25 -0.25  0.0  |

Then a combined filter is made that accepts U and V signals together. U and V can be perfectly split later during demodulation so that is okay.

|  0.25 -0.50  0.25 |
| -0.50  1.00 -0.50 |
|  0.25 -0.50  0.25 |

This just boils down to a simple diagonal edge detection filter (high pass vertically and horizontally), but actually works quite well.
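The claims about these kernels are easy to check numerically. The following is a minimal sketch, assuming the idealized 4 samples per cycle patterns from the tables above; the helper names (quarter_sin, apply_kernel, luma_flat_h and so on) are made up for illustration and are not taken from the actual shader.

```cpp
#include <cassert>
#include <cmath>

// One subcarrier cycle at 4 samples per cycle: sin at 0, 90, 180, 270 degrees.
static double quarter_sin(int n)
{
    static const double table[4] = { 0.0, 1.0, 0.0, -1.0 };
    return table[((n % 4) + 4) % 4];
}

// Test signals matching the tables above (row = line, col = sample).
static double u_signal(int row, int col) { return quarter_sin(col - row); }
static double v_signal(int row, int col) { return quarter_sin(col + row + 1); }

// Luma that is DC horizontally (changes per line only) or vertically.
static double luma_flat_h(int row, int) { return 0.25 * row; }
static double luma_flat_v(int, int col) { return 0.25 * col; }

static const double kernel_u[3][3] = {
    {  0.25, -0.25,  0.0  },
    { -0.25,  0.5,  -0.25 },
    {  0.0,  -0.25,  0.25 },
};

static const double kernel_uv[3][3] = {
    {  0.25, -0.5,  0.25 },
    { -0.5,   1.0, -0.5  },
    {  0.25, -0.5,  0.25 },
};

// Apply a 3x3 kernel to a signal, centered on (row, col).
static double apply_kernel(const double (&k)[3][3],
                           double (*signal)(int, int), int row, int col)
{
    double sum = 0.0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += k[dy + 1][dx + 1] * signal(row + dy, col + dx);
    return sum;
}
```

Calling apply_kernel(kernel_u, u_signal, row, col) returns the U sample unchanged, the same kernel on v_signal returns 0, and kernel_uv passes both while returning 0 for either flat luma signal.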

To deal with the actual 2x oversampled rate of 27 MHz and the PAL subcarrier at ~4.433 MHz, the ~90 degree shift per line is about 1.51 samples, so to make this sort of work, I stretched out the horizontal kernel to a 5-tap filter:

[ -0.2602, -0.2398, 1.0000, -0.2398, -0.2602 ]

The vertical kernel remains the same:

[ -0.5, 1.0, -0.5 ]
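For reference, the 1.51 sample figure and the DC rejection of the stretched kernel can be reproduced with a few lines of arithmetic. This is a rough sketch assuming nominal PAL-B/G timing (subcarrier at 4433618.75 Hz, line rate of 15625 Hz); the function names are invented for this example, and it says nothing about how the tap weights themselves were derived.

```cpp
#include <cassert>
#include <cmath>

// Nominal PAL-B/G constants (assumed here, not taken from the shader).
static const double pal_subcarrier_hz = 4433618.75;
static const double pal_line_rate_hz = 15625.0;
static const double sample_rate_hz = 27.0e6; // 2x oversampled rate

// How far the subcarrier falls short of a whole number of cycles
// per line, in degrees. Comes out at roughly 89.4.
static double phase_shift_per_line_degrees()
{
    double cycles_per_line = pal_subcarrier_hz / pal_line_rate_hz;   // ~283.7516
    double fraction = cycles_per_line - std::floor(cycles_per_line); // ~0.7516
    return (1.0 - fraction) * 360.0;
}

// The same shift expressed in samples at 27 MHz.
static double phase_shift_per_line_samples()
{
    double samples_per_cycle = sample_rate_hz / pal_subcarrier_hz; // ~6.09
    return phase_shift_per_line_degrees() / 360.0 * samples_per_cycle;
}

static const double horiz_taps[5] = { -0.2602, -0.2398, 1.0, -0.2398, -0.2602 };
static const double vert_taps[3] = { -0.5, 1.0, -0.5 };

// Weight of the separable 5x3 (15-tap) kernel at (row, col).
static double kernel_weight(int row, int col)
{
    return vert_taps[row] * horiz_taps[col];
}

// Sum of one kernel row; zero means horizontally flat signals are rejected.
static double kernel_row_sum(int row)
{
    double sum = 0.0;
    for (int col = 0; col < 5; col++)
        sum += kernel_weight(row, col);
    return sum;
}

// Sum of one kernel column; zero means vertically flat signals are rejected.
static double kernel_column_sum(int col)
{
    double sum = 0.0;
    for (int row = 0; row < 3; row++)
        sum += kernel_weight(row, col);
    return sum;
}
```

The roughly 89.4 degree result is consistent with the 0.6 degree misalignment mentioned earlier, and both tap sets summing to zero means the stretched kernel keeps the DC rejection of the 3×3 version.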

Some error is introduced since we’re not sampling the signal 100% correctly anymore (theoretically we need a sinc to reconstruct the signal perfectly), but I measured the reconstruction error to be -40 dB, which is good enough I think. The measured errors for U and V were also similar, which indicates no weird artifacts from the V flips.

With this 15-tap kernel, we get a pretty good chroma estimate even in PAL. From here the same ideas as NTSC apply: bandpass the estimate and subtract it from the composite signal to get luma.

Notch filtering to clean up the chroma dots worked way, way better for PAL than NTSC, likely because the carrier has a much higher frequency on PAL, so the low pass behavior of the notch isn’t as devastating to image quality as it is on NTSC.

In the end I think I prefer 3-line comb + notch for PAL and just plain 3-line comb for NTSC. These screenshots are just one still frame (or rather, field). The color fringing will cancel out in the next field and it’s hard to show the effect without seeing it at full 60 fields per second.

Bonus hack – PAL60

While PlayStation 2 didn’t support this hack mode, GameCube did back in the day. It’s a non-standard video mode that has the same refresh rate and vertical resolution as NTSC, but retains the bandwidth and chroma encoding system of PAL, the best of both worlds! My implementation can trivially implement this by just enabling PAL on 60 Hz games. The only thing I’m not quite clear on is how the 1 / 625 subcarrier offset per line is supposed to work, but it’s a non-standard mode anyway, so eeeeeh.

Step 8 – Validate the color cards in different configurations

With comb filter and notch on NTSC:

As expected, luma detail is murdered around ~3.58 MHz carrier. Also serious color fringing due to the extreme high frequency diagonal patterns. In the pattern at the bottom, no fringing is observed since the comb filter did its job as expected.

PAL is similar, but the carrier and notch moves to ~4.43 MHz instead:

Step 9 – Validate the Hanover Bar filter for PAL

The main feature of PAL is being robust against phase shifts during analog broadcasts. It’s a little unclear if composite inputs cared about this case, but for completeness’ sake, I implemented it. This path is naturally skipped for S-Video and Component outputs since I can’t imagine a TV caring about that for the more luxurious inputs. It doesn’t take many degrees of error in the phase to get quite different colors for NTSC.

For PAL, the phase error manifests as complementary errors every line.

However, by averaging out chroma vertically, we can recover the original chroma almost perfectly. At worst, a little less saturation.
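The reason averaging works can be shown with complex numbers, treating chroma as U + jV. This is a simplified sketch that assumes the phase error is identical on both lines and that the decoder has already undone the V flip; the function names are made up for the example and this is not the shader's actual code path.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

using Chroma = std::complex<double>; // real part = U, imaginary part = V

// A line transmitted as U + jV, rotated by the phase error in transit.
static Chroma decode_normal_line(Chroma uv, double phase_error)
{
    return uv * std::polar(1.0, phase_error);
}

// A line transmitted as U - jV. The error rotates it the same way, but
// flipping V back in the decoder mirrors the error to the other side.
static Chroma decode_flipped_line(Chroma uv, double phase_error)
{
    return std::conj(std::conj(uv) * std::polar(1.0, phase_error));
}

// Vertical average of two adjacent lines.
static Chroma average_lines(Chroma a, Chroma b)
{
    return 0.5 * (a + b);
}
```

The two decoded lines are uv * exp(+j * error) and uv * exp(-j * error), so their average is uv * cos(error): the hue is restored exactly and only the saturation drops by a factor of cos(error), about 1.5% for a 10 degree error.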

 

Writing a semi-competent CRT shader

This topic is kind of unfortunate in that it’s done to death already, and it’s a 100% subjective topic meaning that everyone has some kind of opinion, none of which agree with each other. Holy wars have been fought over less.

Trying to write anything fresh about this topic is futile in 2026 – the heyday of CRT hobbyist shader development was in the early 2010s – but I felt the need to explain what I did at the very least. If anything, it’s a useful intro to writing your own shader.

https://nyanpasu64.gitlab.io/blog/crt-appearance-tv-monitor/ is a good read too for more background information.

Scanlines – and what not to do at 480i

The most obvious part of a CRT filter is scanlines, however, the idea that CRT images should have clearly visible scanlines is actually an artifact of 240p. For PS2 games, we’re operating at either 480i or 480p for NTSC.

For interlaced video, we expect each individual field to have clearly visible scanlines, but the complete image (fused together by our brains) should not. The beam profile should be tuned as such. What most shaders do is for each scanline to take a gaussian distribution in the Y direction, sampled for the neighbor lines to cover the useful portion of the kernel.

vec3 inv_stddev = ...;
vec3 inv_variance = inv_stddev * inv_stddev;
vec3 gaussian = exp(-0.5 * inv_variance * phase * phase);

// Scaling factors to ensure integral of the distribution remains 1.
const float one_over_sqrt_two_pi = 0.3989422;
gaussian *= one_over_sqrt_two_pi * inv_stddev;
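To get a feel for what "visible per field, invisible when fused" means for the beam width, here is a small CPU-side sketch. It assumes a standard deviation of half a frame line (inv_stddev = 2.0), which is a number picked for illustration and not the tuning used in the shader; the function names are hypothetical as well.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Normalized gaussian beam profile, same form as the shader snippet.
static double beam(double distance, double inv_stddev)
{
    const double one_over_sqrt_two_pi = 0.3989422804;
    return one_over_sqrt_two_pi * inv_stddev *
           std::exp(-0.5 * inv_stddev * inv_stddev * distance * distance);
}

// Light from one field at vertical position y, measured in frame lines.
// A field lights every other frame line, starting at first_line (0 or 1).
static double field_intensity(double y, int first_line, double inv_stddev)
{
    double sum = 0.0;
    for (int line = first_line - 16; line <= first_line + 16; line += 2)
        sum += beam(y - line, inv_stddev);
    return sum;
}

// Scanline visibility as (max - min) / (max + min) of the vertical
// profile of a uniformly lit image.
static double scanline_modulation(bool both_fields, double inv_stddev)
{
    double lo = 1e30, hi = 0.0;
    for (int i = 0; i < 128; i++)
    {
        double y = i / 64.0; // sweeps across two frame lines
        double v = field_intensity(y, 0, inv_stddev);
        if (both_fields)
            v += field_intensity(y, 1, inv_stddev);
        lo = std::min(lo, v);
        hi = std::max(hi, v);
    }
    return (hi - lo) / (hi + lo);
}
```

With this assumed width, a single field shows strong modulation (well over half of full contrast), while the sum of both fields is flat to within a couple of percent.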

Breathing effect

Another common effect is that very bright scanlines are smeared out, supposedly due to the electron guns not being as stable when they’re driven at high voltages. This can be simulated by varying the standard deviation. It can be subtle, but creates a neat effect I think. Exactly how to come up with the beam profile for a given input voltage is purely up to taste I suppose, I doubt there is a linear relationship between R’G’B’ value and standard deviation of the beam 🙂

// Breathing effect.
vec3 inv_stddev = mix(
  vec3(registers.scan_factor_narrow),
  vec3(registers.scan_factor_wide), sampled);

The sampled RGB value is in gamma-space still, since the CRT gamma curve is due to the phosphor response, not the CRT itself adjusting the gamma curve.

Applying gamma in the appropriate place

The BT.1886 standard calls for a 2.4 gamma for SD content, which is the default and looks good. I also added options to use 2.2 (NTSC legacy) and 2.8 (?!, PAL legacy) for fun. Most CRT shaders I’ve seen apply gamma in this way:

// A little unclear if we should do gamma before or after.
// Before makes a little more sense I think.
color += pow(sampled, vec3(registers.gamma)) * gaussian;

I think it would depend on whether or not the phosphor’s response is a function of how many electrons hit it, or individual “particles” respond to the energy of the electrons hitting them, where the gaussian beam profile is just a distribution of how many particles light up. In the latter interpretation, the code as-is makes sense, while the first interpretation would call for applying a gamma function on the gaussian profile. The visual output as-is looks good to me at any rate.

After this point, all color math happens in linear light, so floating point render targets are a must.

Applying a colored dots mask

Color CRTs get their colors by having colored phosphors that are arranged in some kind of grid. The venerable Trinitrons use vertical stripes of RGB, and I like that look. While CRTs don’t really have a horizontal resolution, there is a “dot pitch”, which sort of dictates resolution. This part is the key to create the “texture” of a CRT.

From what I read online a typical dot pitch for consumer TVs was 0.5 – 1.0 mm, and for a typical 20″ CRT, I estimated a reasonable number of RGB triads to be ~640 or so. Close enough to BT.601 standard horizontal resolution, neat. From what I understand, this value is also referred to as “TVL”, and these values seem ballpark reasonable. When viewed from a distance, the dots blend together nicely as we’d expect. It’s basically just LCD subpixels, just larger.
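The back-of-the-envelope estimate can be written out as follows. This sketch assumes a 4:3 tube whose full nominal diagonal is visible and a dot pitch measured horizontally between triads, both of which are simplifications; the function names are made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Width of a 4:3 screen from its diagonal. A 4:3 rectangle is a 3-4-5
// triangle, so the width is 0.8 times the diagonal.
static double screen_width_mm(double diagonal_inches)
{
    return diagonal_inches * 0.8 * 25.4;
}

// Number of RGB triads that fit across the screen at a given pitch.
static double triads_across(double diagonal_inches, double pitch_mm)
{
    return screen_width_mm(diagonal_inches) / pitch_mm;
}
```

For a 20″ tube this gives roughly 813 triads at 0.5 mm and 406 at 1.0 mm, so ~640 sits comfortably inside the range.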

The dot layout I used was mostly lifted from Lotte’s CRT shader, but the approach can probably be found in a million shaders already. Just alternating stripes of R, G and B.

vec3 grille(vec3 color, vec2 pos)
{
  vec3 mask = vec3(0.25);
  pos.x = fract(pos.x / 3.0);

  if (pos.x < 0.333)
    mask.r = 1.0;
  else if (pos.x < 0.666)
    mask.g = 1.0;
  else
    mask.b = 1.0;

  return color * mask;
}

I suppose a perfect mask of 0.0 should be used here, but it doesn’t end up looking as good as I’d like, even after adding glow effects, so I think the intent behind passing through a portion of other colors is more of a pragmatic decision. At least I cannot think of a physical interpretation of why we’d want to do it.

Avoiding aliasing

The signal we are creating has a ton of high frequency information and we need to be very careful sampling it such that obvious aliasing is avoided. The common mistake is to just render this shader at output resolution and hope for the best. This will almost surely lead to terrible aliasing patterns in the image. Bad aliasing of the scanlines in Y direction leads to a low frequency pumping pattern in the image which is extremely distracting, and bad aliasing in the X direction leads to a horribly noisy pattern due to uneven sampling of the dot mask.

The easy fix is to render the effect at an integer multiple. E.g. if input image is 240 lines, render to a height that is an integer multiple of that. For the color dot mask, make sure the horizontal resolution is e.g. 3 times the dot resolution (one pixel for R, G, B dots). I ended up with something like width = 640 * 3, and height = 240 * 6 (3x sampling for progressive).
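A quick way to see why the integer multiple matters for the dot mask is to count how many rendered pixels land on each phosphor stripe. The sketch below assumes point sampling at pixel centers; stripe_coverage and the Coverage struct are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fewest and most pixels that any single stripe receives.
struct Coverage
{
    int min_hits;
    int max_hits;
};

// Point-samples a mask of `triads` RGB triads (3 stripes each) at the
// center of each of `width` pixels and counts hits per stripe.
static Coverage stripe_coverage(int triads, int width)
{
    const int64_t stripes = int64_t(triads) * 3;
    std::vector<int> hits(std::size_t(stripes), 0);

    for (int x = 0; x < width; x++)
    {
        // floor((x + 0.5) * stripes / width), done in integer math.
        int64_t stripe = (int64_t(2 * x + 1) * stripes) / (2 * int64_t(width));
        hits[std::size_t(stripe)]++;
    }

    auto range = std::minmax_element(hits.begin(), hits.end());
    return { *range.first, *range.second };
}
```

At width = 640 * 3 every stripe gets exactly one pixel. At a width of 1600 some stripes get no pixel at all, and at 2560 some stripes get twice the weight of their neighbors, which is the uneven sampling described above.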

Glow / Bloom effect

A nebulous effect of the CRT is the glow aspect to it. Anyone can tell it’s there, but it’s not entirely clear to me why this happens. Google searches don’t turn up anything useful either. Without knowing the physical reason for it, it’s hard to emulate accurately. Could it be scattering effects inside the thick CRT glass perhaps?

Either way, the common way to emulate this effect is to compute a gaussian blur (lots of those around here) and composite it over the original image. Very similar to the usual HDR + bloom effect that games in the 2010s loved to overuse. The main effect here is that the phosphor dots end up blending together nicely, yet retain the added “texture” that the aperture grille pattern gives. Humans like to see some high frequency detail, even if that detail is completely bogus. That’s the common trick behind video compression after all.

The glow component, boosted up a ton to make it very visible:

With the typical HDR effect in HD games, only very bright pixels participate in the effect, effectively spreading excess light energy over a larger area of pixels. It makes little sense to do anything like this for a CRT shader unless there is a physical threshold where phosphors just randomly start to glow more than they should, but all of this is purely up to taste anyway.

Here’s from Soul Calibur II with progressive scan, component cable emulation and 16x SSAA, without any glow added. The look is very harsh to me. (The full-screen image is needed to see it without the extreme aliasing caused by thumbnailing.)

Some glow on top and it looks like:

Purely up to taste how much to add of course. I like a decent amount of glow.

Feedback / Phosphor persistence

Phosphors don’t turn off right away when they’re lit. The decay is very quick, but adding a few percent of feedback between frames seems to help a bit with making 480i games look better with less flicker. This is not really how things work in the real world I think, but a reasonable approximation.

Scaling the image to screen

We need a high quality rescaler to get the integer sampled CRT to the screen without introducing significant aliasing from the aperture grille or scanlines. The way to go here is simply to use a proper windowed sinc or something like that.

Curvature?

I don’t like it, so I don’t implement it. If you do, make sure to consider resampling it properly to not introduce more aliasing.

Using the correct color space

A point many shaders miss is that the RGB of a modern monitor is not the same as RGB on an old CRT TV. What we think of as RGB today is usually BT.709 sRGB which defines a set of color primaries and white point. Old SDTV era video uses BT.601 which is a bit narrower than BT.709. In linear RGB space, this transform is a trivial 3×3 matrix multiply.
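For anyone who wants to derive that 3×3 matrix rather than copy it, here is a minimal sketch of the usual construction from chromaticity coordinates. The primaries below are the commonly published BT.709 and SMPTE C (NTSC flavor of BT.601) values with a D65 white point; that these match what the shader uses is an assumption, and all type and function names are invented for the example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };
struct Mat3 { double m[3][3]; };
struct Primaries { double rx, ry, gx, gy, bx, by, wx, wy; };

// CIE xy chromaticities of red, green, blue and white (D65).
static const Primaries bt709   = { 0.640, 0.330, 0.300, 0.600, 0.150, 0.060, 0.3127, 0.3290 };
static const Primaries smpte_c = { 0.630, 0.340, 0.310, 0.595, 0.155, 0.070, 0.3127, 0.3290 };

static Vec3 transform(const Mat3 &a, Vec3 v)
{
    return { a.m[0][0] * v.x + a.m[0][1] * v.y + a.m[0][2] * v.z,
             a.m[1][0] * v.x + a.m[1][1] * v.y + a.m[1][2] * v.z,
             a.m[2][0] * v.x + a.m[2][1] * v.y + a.m[2][2] * v.z };
}

// Matrix product a * b.
static Mat3 concat(const Mat3 &a, const Mat3 &b)
{
    Mat3 r = {};
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 3; k++)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Inverse through the adjugate; fine for well-conditioned 3x3 matrices.
static Mat3 inverse(const Mat3 &a)
{
    const double (*m)[3] = a.m;
    double c00 = m[1][1] * m[2][2] - m[1][2] * m[2][1];
    double c01 = m[1][2] * m[2][0] - m[1][0] * m[2][2];
    double c02 = m[1][0] * m[2][1] - m[1][1] * m[2][0];
    double d = 1.0 / (m[0][0] * c00 + m[0][1] * c01 + m[0][2] * c02);
    return {{
        { c00 * d, (m[0][2] * m[2][1] - m[0][1] * m[2][2]) * d, (m[0][1] * m[1][2] - m[0][2] * m[1][1]) * d },
        { c01 * d, (m[0][0] * m[2][2] - m[0][2] * m[2][0]) * d, (m[0][2] * m[1][0] - m[0][0] * m[1][2]) * d },
        { c02 * d, (m[0][1] * m[2][0] - m[0][0] * m[2][1]) * d, (m[0][0] * m[1][1] - m[0][1] * m[1][0]) * d },
    }};
}

// Chromaticity (x, y) to XYZ with luminance Y = 1.
static Vec3 xy_to_xyz(double x, double y)
{
    return { x / y, 1.0, (1.0 - x - y) / y };
}

// Linear RGB to CIE XYZ, scaled so that RGB (1, 1, 1) lands on the white point.
static Mat3 rgb_to_xyz(const Primaries &p)
{
    Vec3 r = xy_to_xyz(p.rx, p.ry), g = xy_to_xyz(p.gx, p.gy);
    Vec3 b = xy_to_xyz(p.bx, p.by), w = xy_to_xyz(p.wx, p.wy);
    Mat3 unscaled = {{ { r.x, g.x, b.x }, { r.y, g.y, b.y }, { r.z, g.z, b.z } }};
    Vec3 s = transform(inverse(unscaled), w);
    return {{ { r.x * s.x, g.x * s.y, b.x * s.z },
              { r.y * s.x, g.y * s.y, b.y * s.z },
              { r.z * s.x, g.z * s.y, b.z * s.z } }};
}

// Matrix taking linear RGB in `from` primaries to linear RGB in `to` primaries.
static Mat3 convert_primaries(const Primaries &from, const Primaries &to)
{
    return concat(inverse(rgb_to_xyz(to)), rgb_to_xyz(from));
}
```

convert_primaries(smpte_c, bt709) is then the matrix to apply to the linear-light CRT output before encoding for the display. Since both sets share a white point, white stays white and only saturated colors shift.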

I actually learned that the very old NTSC 1953 standard defines a set of primaries that were extremely saturated compared to the standards of today. While the primaries of 1953 were aspirational, it was clearly way too early.

SMPTE refined NTSC to use more reasonable primaries as part of BT.601. PAL primaries are almost exactly the same (TVs tended to use the same phosphor formulation across the world I suppose), but there’s a theoretical difference so I added both for good measure.

Supposedly, Japan kept the use of legacy NTSC 1953 primaries, so that opens an interesting question if the same games actually looked vastly different in Japan compared to the rest of the world? I’ve never heard anyone mention this before, so who knows. Either way, I support enabling those primaries for fun. The look of it is quite … something? It would need an HDR monitor with solid gamut to do justice.

Here’s with standard BT.601 primaries: (From Legaia 2, which is a purely field rendered game, hence the scanlines)

and with NTSC 1953 primaries. Almost like the “Interpret sRGB as Display P3” bullshit that phones do these days to “pop”.

Bonus round – HDR

Given all the masking we’re doing which lowers average brightness, it’s beneficial to support HDR10 rendering. Now that HDR is widely available on Linux too, enjoying some HDR CRT shaders is a good time 😀

I added a few modes where I can target a specified maximum nit level. KDE at least respects MaxCLL HDR metadata and disables tonemapping if MaxCLL falls within bounds of the display. I also added a no-tonemap option where MaxCLL = 0 (unknown), which makes KDE tonemap how it wants to.

Experimental round – High refresh rate

Black Frame Insertion and its friends have been a thing in emulation for ages, and it works quite well with CRT simulation, especially to sell the de-interlacing effect properly. In my implementation, I query EXT_present_timing and decide how many frames I should insert in-between. There’s a gentler falloff between the frames. The overall screen brightness decreases a lot as expected, but with HDR, we can crank the brightness of the proper frame up to compensate.

It’s still very experimental and any missed frame leads to horrrrrrrible flicker at the moment (big epilepsy warning), so it’s not something I actually recommend, but it’s fun to experiment with.

Getting de-interlacing for “free”

While simulating each field independently with scanlines, we get de-interlacing the same way a CRT would in theory. This is not free from flicker of course, but most games had mitigation strategies for this.

The two CRTCs of the PS2 – FRAME mode

The PS2 supported blending two frames being sent to the video output. Most games render internally at e.g. 448p @ 30 FPS, but since they cannot output that resolution without component cables, the frame is output interlaced over two fields where the CRTC scans every other line rather than every line. That tends to look quite flickery if done as-is given how aliased PS2 graphics are, so what pretty much every game did was use the two CRTCs to blend the two frames vertically before sending them, by programming a 1 pixel offset with duplicate inputs. By shifting the offset every field, the 30 FPS progressive image could be scanned out nicely into a 60 FPS interlaced image.

This is the “flicker filter” that some games allowed toggling. Here’s a RenderDoc capture showing that CRTC 1 and 2 are configured with one pixel offset in Y:

After merging and blending the frames together, a smoothed image is sent.
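The effect of the blend is easiest to see on a worst-case image where every other line is lit. The sketch below models a frame as one value per line and assumes an equal 50/50 blend of line y and line y + 1; the actual blend weights and which CRTC reads which line are not something this example tries to reproduce, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain interlaced scanout: the field (0 or 1) picks every other line.
static std::vector<double> scan_out_plain(const std::vector<double> &frame, int field)
{
    std::vector<double> out;
    for (std::size_t y = std::size_t(field); y < frame.size(); y += 2)
        out.push_back(frame[y]);
    return out;
}

// Flicker-filtered scanout: two reads of the same frame, offset by one
// line, are averaged before the result is sent out.
static std::vector<double> scan_out_blended(const std::vector<double> &frame, int field)
{
    std::vector<double> out;
    for (std::size_t y = std::size_t(field); y + 1 < frame.size(); y += 2)
        out.push_back(0.5 * (frame[y] + frame[y + 1]));
    return out;
}
```

For a frame of alternating 0 and 1 lines, the plain scanout yields an all-black field followed by an all-white one, which is full-amplitude flicker. The blended scanout yields 0.5 in both fields, trading vertical detail for stability.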

PCSX2 GSdx and paraLLEl-GS have modes to detect this pattern and just scan out the full 480p of course, without the added blurring. Most games fall into this pattern, which is fortunate if we want to avoid interlacing shenanigans, but not all games are so nice to deal with. The “Anti-Blur” option is designed precisely for disabling this filter.

I also added an option to force-disable the automatic progressive scan, mostly to test the video output that the games would actually have output back in the day, which is interlaced video.

FIELD mode

Some games decided that they wanted to render at 60 FPS and sacrifice half the vertical resolution to do so, jittering the rendering to stay in sync with the interlaced output. These are painful to deal with to this day since they absolutely require some kind of de-interlacing solution to look good.

I never got satisfactory results with a typical de-interlacer, but the CRT simulation does a quite good job at it I think. It’s not perfect (interlacing wasn’t exactly perfect on CRTs either), but it’s usable for me to the point now that I can play interlaced games as intended. It’s not really possible to demonstrate this with screenshots.

The really cursed shit

Some games break if you try to promote them to progressive scan, because the games might decide for some stupid reason to use the SCANMSK feature to discard pixels every other line, and rely on the FRAME scanout mode to exactly scan out the pixels that were not masked. King’s Field IV is an example of this absolute insanity.

PCSX2 integration

I maintain a patch for PCSX2-git which supports paraLLEl-GS and now this CRT/analog thing. This is super niche stuff that I don’t really expect many people to actually use, but it’s there for those who are interested.

It does what I need at least, and that’s what’s important to me.

Putting it all together

I put together some test video clips in HEVC/PQ/4:4:4/1440p at ridiculously high bitrates. I tried AV1 but my CPU could not decode it in real time, so it is what it is ._.

Clip 1 is: Raw RGB passed into the CRT. It uses the game’s native 480p and widescreen support. Super sampling is 4x SSAA.

Clip 2 is: “PAL60” with 3-line comb + notch. It also uses the more default 4:3 and interlaced video. It has some frame drops which completely botch the interlace and even at comically large bitrates, the aperture grille effect doesn’t translate well, so it is what it is, but it’s a rough approximation of what it should look like.

Anyway, I’m happy with the results. Time to actually play something instead of debugging stuff 😀

Walking backwards into the future – A look at descriptor heap in Granite

It seems like I can never quite escape the allure of fiddling with bits more efficiently every passing year. I recently went through the process of porting over Granite’s Vulkan backend to use VK_EXT_descriptor_heap. There wasn’t exactly a burning need to do this work, but science demands I sacrifice my limited free time for these experiments. My name may or may not be on the extension summary, and it’s important to eat your own dog food. In this post, I want to explore ways in which we can port over an old school binding model to newer APIs should the need arise.

The basic Vulkan 1.0 model

Granite’s binding model is designed for really old Vulkan. The project started in January 2017 after all, at which point Vulkan was in its infancy. Bindless was not really a thing yet, and I had to contend with really old mobile hardware. Slot-based bindings have been with us since OpenGL and early D3D. I still think it’s a fine model from a user’s perspective. I have no problem writing code like:

cmd.set_storage_buffer(set, binding, buffer);
cmd.set_texture(set, binding, texture);
cmd.set_sampler(set, binding, sampler);
cmd.draw();
// etc

It’s very friendly to tooling and validation and I just find it easy to use overall. GPU performance is great too since vendors have maximal flexibility in how to implement the API. The major downside is the relatively heavy CPU cost associated with it since there are many API calls to make. In my projects, it’s rarely a concern, but when doing heavy CPU-bound workloads like PS2 GS emulation, it did start to matter quite a bit 😀

Automatic reflection and layout generation

When SPIR-V shaders are consumed in Granite, they are automatically reflected. E.g., with GLSL:

layout(set = 0, binding = 1) uniform texture2D A[3];
// For arrays, I map consecutive bindings 1, 2, 3 on host code.
// While Vulkan allows binding = 2 for B below, Granite does not.
layout(set = 0, binding = 4) uniform sampler B;
layout(set = 1, binding = 0) uniform sampler2D C;
layout(set = 2, binding = 2) readonly buffer SSBO { uint v; };

layout(push_constant) uniform PushConstants { ... };

I automatically generate VkDescriptorSetLayout for each unique set, and combine these into a VkPipelineLayout as one does. VkDescriptorSetLayouts are hash’n’cached into a DescriptorSetAllocator.
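The hash'n'cache pattern itself is simple. The following is a stripped-down sketch with stand-in types: BindingDesc, SetAllocator and LayoutCache are hypothetical and do not mirror Granite's real classes, and no Vulkan objects are created.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical description of one reflected binding in a set.
struct BindingDesc
{
    uint32_t binding;
    uint32_t descriptor_type;
    uint32_t array_size;
    uint32_t stages;
};

// Stand-in for the object that would own the VkDescriptorSetLayout.
struct SetAllocator
{
    std::vector<BindingDesc> layout;
};

// FNV-1a style hash over the fields that define the layout.
static uint64_t hash_layout(const std::vector<BindingDesc> &layout)
{
    uint64_t h = 0xcbf29ce484222325ull;
    auto mix = [&h](uint32_t v) { h = (h ^ v) * 0x100000001b3ull; };
    for (const auto &b : layout)
    {
        mix(b.binding);
        mix(b.descriptor_type);
        mix(b.array_size);
        mix(b.stages);
    }
    return h;
}

class LayoutCache
{
public:
    // Returns the cached allocator for this layout, creating it on first use.
    // Hash collisions are ignored here for brevity.
    SetAllocator *request(const std::vector<BindingDesc> &layout)
    {
        auto &slot = cache[hash_layout(layout)];
        if (!slot)
        {
            slot = std::make_unique<SetAllocator>();
            slot->layout = layout;
        }
        return slot.get();
    }

    std::size_t size() const { return cache.size(); }

private:
    std::unordered_map<uint64_t, std::unique_ptr<SetAllocator>> cache;
};
```

Two shaders that reflect to the same set layout end up sharing one allocator, which is the point of keying on the layout contents instead of on the shader.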

The implicit assumption by shaders I write is that low-frequency updates have lower set values. This matches Vulkan’s pipeline layout compatibility rules too.

Given the hardcore descriptor churn this old model can incur, UBOs originally used VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. Since linearly allocating new UBOs per draw is a hot path, I wanted to avoid having to allocate and write new descriptor sets all the time. This is precisely what the dynamic buffer types were designed for. I did not use it for SSBOs since DYNAMIC has some unfortunate interactions with descriptor size: you cannot change the size, only the offset. The size of UBOs is somewhat irrelevant, and I just hardcoded in a 64K window.

Slab allocating VkDescriptorSet

There are two main strategies for allocating sets from a VkDescriptorPool, both of which are kinda bad. The typical model I believe most do is the “jumbo” allocator where you create a big pool with many sets and many descriptors with different descriptor types and pray for the best. When the pool is OOM-ed, allocate another. One unfortunate thing about the jumbo pool is that you can’t really know up front exactly how to balance the descriptor types properly. It will always be a shaky heuristic.

typedef struct VkDescriptorPoolCreateInfo {
  VkStructureType sType;
  const void* pNext;
  VkDescriptorPoolCreateFlags flags;
  uint32_t maxSets; // Who knows?
  uint32_t poolSizeCount;
  const VkDescriptorPoolSize* pPoolSizes; // Parcel out per descriptor type.
} VkDescriptorPoolCreateInfo;

In raw Vulkan 1.0, it was straight up illegal to allocate any further once a limit had been reached, causing even more headaches. The very first maintenance extension to Vulkan fixed this and added OUT_OF_POOL_MEMORY which allows applications to just keep going until the pool is exhausted. Fun fact is that some vendors would never exhaust the pool and just straight up ignore what you pass into vkCreateDescriptorPool, so that’s fun.

Granite went the route of a slab allocator per VkDescriptorSetLayout instead, one allocator per thread. Allocate a group of like 64 VkDescriptorSets in one go and parcel them out as needed. Main advantage here was no need to keep calling vkAllocateDescriptorSets over and over, and in the early years, I even hash’n’cached the descriptor sets. The primary reason for doing that was that some early mobile drivers were extreeeeeeeemely slow at vkUpdateDescriptorSets for some reason. Not a great time. This slab approach led to memory bloat though.
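In outline, the slab idea looks something like the sketch below. It uses plain integers in place of VkDescriptorSet handles and only counts how often a new slab would be requested; SlabSetAllocator is a hypothetical name, and recycling policy, threading and the actual Vulkan calls are all left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using SetHandle = uint32_t; // stand-in for VkDescriptorSet

class SlabSetAllocator
{
public:
    static constexpr unsigned SetsPerSlab = 64;

    // Hands out one set, growing by a whole slab when the free list is empty.
    SetHandle allocate()
    {
        if (free_sets.empty())
            grow();
        SetHandle set = free_sets.back();
        free_sets.pop_back();
        return set;
    }

    // Returns a set to the free list once nothing references it anymore.
    void recycle(SetHandle set) { free_sets.push_back(set); }

    unsigned slab_count() const { return slabs; }

private:
    void grow()
    {
        // A real allocator would create a pool sized for exactly one slab
        // of this layout and call vkAllocateDescriptorSets once here.
        for (unsigned i = 0; i < SetsPerSlab; i++)
            free_sets.push_back(next_handle++);
        slabs++;
    }

    std::vector<SetHandle> free_sets;
    SetHandle next_handle = 0;
    unsigned slabs = 0;
};
```

Allocating 130 sets triggers only three slab allocations instead of 130 individual API calls. The flip side is that a layout which only ever needs one set still pays for a full slab, which is where the memory bloat comes from.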

Update templates

At some point VK_KHR_descriptor_update_template was added which aims to accelerate vkUpdateDescriptorSets. Instead of having the driver parse the structs and switching on the descriptorType to write descriptors, the update template allows drivers in theory to “precompile” a highly optimized function that updates descriptors based on the template that is provided in vkCreateDescriptorUpdateTemplate. This was a nice incremental thing to add to Granite. I don’t think the promise of update templates really worked out in the end though. Most drivers I think just resorted to parsing the original template instead, leading to no speedup. 🙁

Push descriptors

Push descriptors were designed quite early on in Vulkan’s life, but their adoption was … spotty at best. They didn’t make it into core until Vulkan 1.4! Push descriptors solved some issues for us slot and binding troglodytes since there was simply no need to mess around with allocating sets and pools when we could just push descriptors and the driver would deal with it. The major downside is that only one descriptor set can be a push set, but in Granite’s case, I could design for that limitation when writing shaders. The last set index in a VkPipelineLayout would get assigned as a push set. After going push descriptors, I dropped the old UBO_DYNAMIC path, since push descriptors are not compatible with it, and the UBO_DYNAMIC wins were … questionable at best anyway.

It took a while to move to this model though. AMD Windows driver was infamously dragging its feet for years before finally accepting reality and at that point I was ready to move over. It’s still not a hard requirement in Granite due to mobile concerns, but then the driver hits the slow path, and I don’t really care anymore 🙂

Adding descriptor indexing support

At some point, any modern renderer has to deal with this and Granite hit this wall with clustered shading, where an array of shadow maps became a hard necessity. I’m not a big fan of “everything is bindless” myself, since I think it makes debugging way more annoying and stresses tooling and validation more than it should, but sometimes the scissor juggling is necessary.

When Granite reflects a shader looking like this:

layout(set = 1, binding = 0) uniform texture2D Blah[];
// binding must be 0.
// no other bindings can be used alongside the variable size array.

The set layout is converted into an UPDATE_AFTER_BIND set with VARIABLE_COUNT array length. There is also a special helper function to aid in allocating these bindless sets where the API mostly turns into:

pool.begin();
pool.push(imageView0);
pool.push(imageView1);
pool.push(imageView2);
...
cmd.set_bindless(set, pool.commit());

The CPU overhead of this isn’t quite trivial either, but with the set and pool model, it’s not easy to escape this reality without a lot of rewrites. For now, I only support sampled images with bindless and I never really had any need or desire to add more. For bindless buffers, there is the glorious buffer_device_address instead.

The end of the line for legacy model

This model has served and keeps serving Granite well. Once this model is in place, the only real reason to go beyond this for my use cases is performance (and curiosity).

Moving to descriptor buffers

VK_EXT_descriptor_buffer asks the question of what happens when we just remove the worst parts of the descriptor API:

  • VkDescriptorSet
  • VkDescriptorPool
  • vkUpdateDescriptorSets (kinda)

Sets are now backed by a slice of memory, and pools are replaced by a big descriptor buffer that is bound to a command buffer. Some warts remain however, as VkDescriptorSetLayout and VkPipelineLayout persist. If you’re porting from the legacy model like I was, this poses no issues at all, and actually reduces the friction. Descriptor buffers are a perfectly sound middle-ground alternative for those who aren’t complete bindless junkies yet, but want some CPU gains along the way.

Step 1: Add an arena allocator for descriptors

In the ideal use case for descriptor buffers, we have one big descriptor buffer that is always bound. This is allocated with PCI-e BAR on dGPUs, so DEVICE_LOCAL | HOST_VISIBLE. Instead of allocating descriptor sets, the command buffer performs a linear allocation which is backed by slices allocated from the global descriptor buffer. No API calls needed.

CommandBuffer::DescriptorSlice
CommandBuffer::allocate_descriptor_slice(
  VkDeviceSize size, VkDeviceSize alignment)
{
  DescriptorSlice slice = {};

  // The parameter is named "alignment" so it does not shadow the align() helper.
  desc_buffer_alloc_offset = align(desc_buffer_alloc_offset, alignment);

  if (desc_buffer_alloc_offset + size > desc_buffer.get_size())
  {
    // Page in a new block.
    if (desc_buffer.get_size())
      device->free_descriptor_buffer_allocation(desc_buffer);

    VkDeviceSize padded_size =
      std::max<VkDeviceSize>(size, 16 * 1024);
    desc_buffer =
      device->managers.descriptor_buffer.allocate(padded_size);
    desc_buffer_alloc_offset = 0;
  }

  slice.offset = desc_buffer.get_offset() + desc_buffer_alloc_offset;
  slice.mapped =
    device->managers.descriptor_buffer.get_resource_heap().mapped +
    slice.offset;
  desc_buffer_alloc_offset += size;
  return slice;
}

The size to allocate for VkDescriptorSet is queried from the set layout itself, and each descriptor is assigned an offset that the driver controls.

A note on descriptor buffer min-spec

There is a wart in the spec where the min-spec for sampler descriptor buffers is very small (4K samplers). In this case, there is a risk that just linearly allocating out of the heap will trivially OOM the entire thing and we have to allocate new sampler descriptor buffers all the time. In practice, this limitation is completely moot. Granite only opts into descriptor buffers if the limits are reasonable.

There is supposed to be a performance hit to rebinding descriptor buffers, but in practice, no vendor actually ended up implementing descriptor buffers like that. However, since VK_EXT_descriptor_heap will be way more strict about these kinds of limitations, I designed the descriptor_buffer implementation around the single global heap model to avoid rewrites later.

There is certainly a risk of going OOM when linearly allocating like this, but I’ve never come close to the limits. It’s not hard to write an app that would break Granite in half, but I consider that a “doctor, my GPU hurts when I allocate like this” kind of situation.

Step 2: Writing the descriptors

This is where we should have a major win, but it’s not all that clear. For each descriptor type, I have different strategies on how to deal with them. The basic idea of descriptor buffers is that we can call vkGetDescriptorEXT to build a descriptor in raw bytes. This descriptor can now be copied around freely by the CPU with e.g. memcpy, or even on the GPU in shaders (but that’s a level of scissor juggling I am not brave enough for).

Plain images and samplers

These are the simplest ones to contend with. Descriptor buffers still retain the VkImageView and VkSampler objects.

  • VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE
  • VK_DESCRIPTOR_TYPE_STORAGE_IMAGE
  • VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT
  • VK_DESCRIPTOR_TYPE_SAMPLER

The main addition I made was to allocate a small payload up front and write the descriptor once. E.g.:

VkDescriptorImageInfo image_info = {};
image_info.imageView = view.view;

VkDescriptorGetInfoEXT get_info =
  { VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT };

get_info.data.pSampledImage = &image_info;

auto &props =
  device->get_device_features().descriptor_buffer_properties;

if (usage & VK_IMAGE_USAGE_SAMPLED_BIT)
{
  get_info.type = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
  image_info.imageLayout =
    layout == ImageLayout::Optimal ?
    VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL :
    VK_IMAGE_LAYOUT_GENERAL;
  view.sampled = alloc_sampled_image(); // slab allocated
  table.vkGetDescriptorEXT(device->get_device(), &get_info,
    props.sampledImageDescriptorSize, view.sampled.ptr);
}

Instead of vkUpdateDescriptorSets, we can now replace it with a trivial memcpy.

void CommandBuffer::set_texture(unsigned set, unsigned binding, ...)
{
  auto &b = bindings.bindings[set][binding];
  ...
  if (desc_buffer_enable || desc_heap_enable)
  {
    b.image.ptr = payload.ptr;
...
Util::for_each_bit(set_layout.separate_image_mask, [&](unsigned binding) {
    for (unsigned i = 0; i < set_layout.meta[binding].array_size; i++)
    {
       auto *ptr = bindings.bindings[set][binding + i].image.ptr;
       VK_ASSERT(ptr);
       device->managers.descriptor_buffer.copy_sampled_image(
             mapped + set_allocator->get_binding_offset(binding + i), ptr);
    }
});

The memcpy functions are function pointers that resolve the byte count up front. This is a nice optimization since a memcpy with a compile-time size can compile down to fully unrolled SIMD load-stores.

Allocating bindless sets of sampled images with this method becomes super efficient, since it boils down to a special function that does:

template <size_t N>
static void static_memcpy_n(
  uint8_t *dst, const uint8_t * const *src, size_t count, size_t)
{
    // memcpy with static size is way more efficient than dynamic size.
    for (size_t i = 0; i < count; i++, dst += N)
       memcpy(dst, src[i], N);
}
device->managers.descriptor_buffer.copy_sampled_image_n(
       device->managers.descriptor_buffer.get_resource_heap().mapped +
       desc_set.handle.offset + allocator->get_variable_offset(),
       info_ptrs.data(), write_count);
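How the size-specialized copy functions get picked is not shown above, so here is a minimal sketch of one way to do it. The names select_copy_func_n, dynamic_memcpy_n and CopyFuncN are hypothetical, and the set of specialized sizes is an assumption; a real implementation would pick the sizes its target drivers actually report.

```cpp
#include <cassert>
#include <cstring>
#include <stddef.h>
#include <stdint.h>

// Copies `count` payloads of exactly N bytes each into a tightly packed destination.
// The unnamed size_t keeps the signature identical to the dynamic fallback below.
template <size_t N>
void static_memcpy_n(uint8_t *dst, const uint8_t *const *src, size_t count, size_t)
{
    for (size_t i = 0; i < count; i++, dst += N)
        std::memcpy(dst, src[i], N);
}

// Fallback for descriptor sizes that have no dedicated instantiation.
inline void dynamic_memcpy_n(uint8_t *dst, const uint8_t *const *src, size_t count, size_t size)
{
    for (size_t i = 0; i < count; i++, dst += size)
        std::memcpy(dst, src[i], size);
}

using CopyFuncN = void (*)(uint8_t *, const uint8_t *const *, size_t, size_t);

// Hypothetical selector: called once, after the descriptor size is known.
inline CopyFuncN select_copy_func_n(size_t descriptor_size)
{
    switch (descriptor_size)
    {
    case 4: return static_memcpy_n<4>;
    case 8: return static_memcpy_n<8>;
    case 16: return static_memcpy_n<16>;
    case 32: return static_memcpy_n<32>;
    case 64: return static_memcpy_n<64>;
    default: return dynamic_memcpy_n;
    }
}
```

The selection cost is paid once at startup, and every later copy goes through a pointer to a loop whose inner memcpy has a compile-time size.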

Texel buffers

I rarely use these, but they are also quite neat in descriptor buffers. VkBufferView is gone now, so we just need to create a descriptor payload once from VkDeviceAddress and it’s otherwise the same as above.

Combined image samplers

This descriptor type is somewhat of a relic these days, but anyone coming from a GL/GLES background instead of D3D will likely use this descriptor type out of old habit, me included. The API here is slightly more unfortunate, since there is no obvious way to create these descriptors up-front. We don’t necessarily know all the samplers an image will be combined with, so we have to do it last minute, calling vkGetDescriptorEXT to create the combined descriptor.

Buffer descriptors

  • VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER
  • VK_DESCRIPTOR_TYPE_STORAGE_BUFFER
  • VK_DESCRIPTOR_TYPE_ACCELERATION_STRUCTURE_KHR

We cannot meaningfully pre-create descriptors for UBOs and SSBOs, so we’re in a similar situation where we have to call vkGetDescriptorEXT for each buffer last-minute. Unfortunately, there is no array variant of vkGetDescriptorEXT that writes several descriptors in one call, so in extreme cases, descriptor buffers can actually have worse CPU overhead than the legacy model. DXVK going via winevulkan .dll <-> .so translation overhead has been known to hit this, but for everyone else I’d expect the difference to be moot.

Potentially using push descriptors?

Since descriptor buffer is an incremental improvement over the legacy model, we retain optional support for push descriptors. This can be useful in some use cases (it’s critical for vkd3d-proton), but Granite does not need it. Once we’re in descriptor buffer land, we’re locked in.

Are we there yet?

Descriptor buffers are battle tested and very well supported at this point. Perhaps not on very old mobile drivers, but slightly newer devices tend to have it, so there’s that! RenderDoc has solid support these days as well.

The new hotness

At a quick glance, descriptor heap looks very similar to D3D12 (and it is), but there are various additions on top to make it more compatible with the various binding models that exist out there in the wild, especially for people who come from a GL/Vulkan 1.0 kind of engine design. The normal D3D12 model has some flaws if you’re not fully committed to bindless all day every day, mainly that:

  • You very quickly end up having to call CopyDescriptorsSimple a LOT to shuffle descriptors into the heap. Since this is a call into the driver just to copy a few bytes around, it can quickly be a source of performance issues. In vkd3d-proton, we went to hell and back to optimize this case because in many titles it was the number 1 performance overhead.
  • Dealing with samplers is a major pain. The 2K sampler heap limit can be rather limiting since there is no good way to linearly allocate on such a small heap. Static samplers are quite common as a result, but they have other problems. Recompiling shaders because you change Aniso 4x to 8x in the settings menu is kinda a hilarious situation to be in, but some games have been known to do just that …

Step 1: Split the descriptor buffer into a resource heap and sampler heap

This is to match how some hardware works, nothing too complicated. I allocate for the supported ~1 million resource descriptors and 4096 samplers. There is a reserved region for descriptors as well which is new to this extension. In D3D12 this is all abstracted away since applications don’t have direct access to the descriptor heap memory.

Step 2: Split the heap into 3 parts

if (heap)
{
  resource_heap.reserved_offset = info.size -
    heap_props.minResourceHeapReservedRange;
  // Ensure reserved offset is valid.
  resource_heap.reserved_offset &= ~VkDeviceSize(alignment - 1);

  // Split the resource heap in two.
  // Lower half is POT sized and allows for dynamic allocation.
  // This is used to spill out UBOs and SSBOs which
  // must live as descriptors.
  // Also used to allocate bindless images for GPU perf.
  auto heap_dynamic_allocator_size =
    VkDeviceSize(sub_block_size) << max_sub_blocks_log2;
  auto num_application_resources =
    (resource_heap.reserved_offset - heap_dynamic_allocator_size) >>
    device->get_device_features().resource_heap_resource_desc_size_log2;
  auto resource_slab_offset = heap_dynamic_allocator_size >>
    device->get_device_features().resource_heap_resource_desc_size_log2;

  heap_resource_indices.reserve(num_application_resources);
  for (uint32_t i = num_application_resources; i; i--)
    heap_resource_indices.push_back(resource_slab_offset + i - 1);

For the resource heap, we have a 512 K descriptor area which can be freely allocated from, like we did with descriptor buffer. Unlike descriptor buffer where we hammer this arena allocator all the time, we will only rarely need to touch it with descriptor heap.
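To make the arithmetic of the split concrete, here is a small self-contained sketch of the same computation. HeapLayout, compute_heap_layout and build_slab_free_list are hypothetical names, and the numbers used with it are toy values, not what any driver reports.

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// Hypothetical description of how the resource heap is carved up.
struct HeapLayout
{
    uint64_t dynamic_size;     // bytes in the lower, arena-allocated region
    uint64_t reserved_offset;  // byte offset where the driver-reserved region starts
    uint32_t slab_first_index; // first descriptor index of the slab region
    uint32_t slab_count;       // number of persistent descriptor slots
};

inline HeapLayout compute_heap_layout(uint64_t heap_size, uint64_t reserved_range,
                                      uint64_t alignment, uint64_t dynamic_size,
                                      unsigned desc_size_log2)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0); // must be a power of two
    HeapLayout layout = {};
    layout.dynamic_size = dynamic_size;
    // Align the reserved offset down so it is valid for the implementation.
    layout.reserved_offset = (heap_size - reserved_range) & ~(alignment - 1);
    assert(layout.reserved_offset >= dynamic_size);
    layout.slab_first_index = uint32_t(dynamic_size >> desc_size_log2);
    layout.slab_count = uint32_t((layout.reserved_offset - dynamic_size) >> desc_size_log2);
    return layout;
}

// Free indices are pushed in reverse so that pop_back() hands out the lowest index first.
inline std::vector<uint32_t> build_slab_free_list(const HeapLayout &layout)
{
    std::vector<uint32_t> indices;
    indices.reserve(layout.slab_count);
    for (uint32_t i = layout.slab_count; i; i--)
        indices.push_back(layout.slab_first_index + i - 1);
    return indices;
}
```

With a 4096 byte heap, 1000 reserved bytes, 64 byte alignment, a 1024 byte dynamic region and 32 byte descriptors, the reserved region starts at byte 3072 and the slab region covers descriptor indices 32 through 95.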

The next ~500k or so descriptors are dedicated to holding the descriptor payload for VkImageView, VkSampler and VkBufferView. All of these objects are now obsolete. When Granite creates a Vulkan::ImageView, it internally allocates a free slab index from this upper region, writes the descriptor there and stores the heap index instead. This enables “true” bindless in a performant way. We could have done this before if we wanted to, but in descriptor buffer we would have eaten a painful indirection on a lot of hardware, which is not great. Some Vulkan drivers actually work just like this internally. You can easily tell, because some drivers report that an image descriptor is just sizeof(uint32_t). We’d have our index into the “heap”, which gets translated into yet another index into the “true” (hidden) heap. Chasing pointers is bad for perf as we all know.

view.sampled = alloc_sampled_image();
view.sampled.heap_index = allocate_single_resource_heap_entry();
if (view.sampled.heap_index == UINT32_MAX)
  return false;

infos[count] = { VK_STRUCTURE_TYPE_RESOURCE_DESCRIPTOR_INFO_EXT };
infos[count].type = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
infos[count].data.pImage = &images[count];

images[count] = { VK_STRUCTURE_TYPE_IMAGE_DESCRIPTOR_INFO_EXT };
images[count].pView = &info; // It takes VkImageViewCreateInfo :D
images[count].layout = layout == ImageLayout::Optimal ?
  VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL : VK_IMAGE_LAYOUT_GENERAL;

addrs[count].address = view.sampled.ptr;
addrs[count].size = 
  get_descriptor_size_for_type(VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE);
count++;

We keep a copy of the descriptor payload in CPU memory too, in case we have to write to the arena allocated portion of the heap later.

The upper region of ~10k descriptors or so (depends on the driver) is just a reserved region we bind and never touch. It’s there so that drivers can deal with CmdResolveImage, CmdBlitImage and other such special APIs that internally require descriptors.

For samplers, there is no arena allocator. It’s so tiny. Instead, when creating a sampler, we allocate a slab index and return a dummy handle by just pointer casting the index instead. We’ll make good use of the mapping APIs later to deal with this lack of arena allocation. In fact, we will never have to copy sampler descriptor payloads around, and we don’t have to mess around with static samplers either, neat! For the static sampler crowd, there is full support for embedded samplers which functions just like D3D12 static samplers, so there’s that but Granite doesn’t use it.
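The idea of handing out a slab index disguised as a handle can be sketched as follows. SamplerHandle and SamplerSlab are stand-ins invented for this example (no Vulkan headers are used), and the +1 bias that keeps index 0 distinct from a null handle is my own assumption, not something the text above specifies.

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// Stand-in for an opaque, pointer-sized sampler handle.
struct SamplerHandle_T;
using SamplerHandle = SamplerHandle_T *;

class SamplerSlab
{
public:
    explicit SamplerSlab(uint32_t capacity)
    {
        // Reverse order so the lowest index is handed out first.
        for (uint32_t i = capacity; i; i--)
            free_indices.push_back(i - 1);
    }

    // Returns nullptr when the sampler heap is exhausted.
    SamplerHandle allocate()
    {
        if (free_indices.empty())
            return nullptr;
        uint32_t index = free_indices.back();
        free_indices.pop_back();
        return encode(index);
    }

    void release(SamplerHandle handle) { free_indices.push_back(decode(handle)); }

    // Assumption: bias by one so that heap index 0 does not look like a null handle.
    static SamplerHandle encode(uint32_t index)
    {
        return reinterpret_cast<SamplerHandle>(uintptr_t(index) + 1);
    }

    static uint32_t decode(SamplerHandle handle)
    {
        assert(handle != nullptr);
        return uint32_t(reinterpret_cast<uintptr_t>(handle) - 1);
    }

private:
    std::vector<uint32_t> free_indices;
};
```

The handle is never dereferenced; it only carries the heap index to wherever sampler indices get written into push data or an indirect table.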

It was a non-trivial amount of code to get to this point, but hey, that’s what happens when you try to support 3 descriptor models at once I guess …

Step 3: Understanding push data – push constants with more spice

Core Vulkan 1.0 settled on 128 bytes of push constants being the limit. This was raised in Vulkan 1.4 but Granite keeps the old limit (I could probably live with 32 or 64 bytes to be fair). Push data expands to 256 bytes as a minimum, and the main idea behind descriptor heap is that pipeline layouts are completely gone, and we get to decide how the driver should interpret the push data space. This is similar to D3D12 root parameters except it’s not abstracted behind a SetRootParameter() kind of interface that is called one at a time. In Vulkan, we can call CmdPushDataEXT once.

VkPipelineLayout and VkDescriptorSetLayout are just gone now, poof, they do not exist at all. This is huge for usability. Effectively, we can pretend that the VkPipelineLayout is now just a push constant range of 256 bytes, and that’s it.

If you’re fully committed to going bindless, you could just use the equivalent of SM 6.6 ResourceDescriptorHeap and SamplerDescriptorHeap together with buffer_device_address to get everything done. However, Granite is still a good old slot based system, so I need to use the mapping features to tell the driver how to translate set/binding into actual descriptors. This mapping can be different per-shader too, which fixes a lot of really annoying problems with EXT_graphics_pipeline_library and EXT_shader_object if I feel like going down that path in the future.

Parceling out push data

The natural thing to do for me was to split up the space into maximum 128 byte push constants, then 32 bytes per descriptor set (I support 4 sets, Vulkan 1.0 min-spec). It’s certainly possible to parcel out the data more intelligently, but that causes some issues with set compatibility which I don’t want to deal with.
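As a sanity check on the numbers, the fixed split can be written down as a handful of constants. This is only a sketch: placing the push constants first and the per-set blocks after them is an assumption on my part, and PushDataBlock, set_push_offset and write_set_data are hypothetical names.

```cpp
#include <cassert>
#include <cstring>
#include <stdint.h>

constexpr uint32_t PushDataBytes = 256;     // minimum push data size
constexpr uint32_t PushConstantBytes = 128; // classic push constant range
constexpr uint32_t NumSets = 4;             // Vulkan 1.0 min-spec set count
constexpr uint32_t BytesPerSet = 32;

static_assert(PushConstantBytes + NumSets * BytesPerSet == PushDataBytes,
              "push constants plus per-set blocks must fill the push data space exactly");

// Byte offset of the block owned by a descriptor set (assumes push constants come first).
constexpr uint32_t set_push_offset(uint32_t set)
{
    return PushConstantBytes + set * BytesPerSet;
}

// CPU-side shadow copy of the push data, uploaded in one go before a draw or dispatch.
struct PushDataBlock
{
    uint8_t bytes[PushDataBytes];
};

inline void write_set_data(PushDataBlock &block, uint32_t set, uint32_t offset_in_set,
                           const void *data, uint32_t size)
{
    assert(set < NumSets);
    assert(offset_in_set + size <= BytesPerSet);
    std::memcpy(block.bytes + set_push_offset(set) + offset_in_set, data, size);
}
```

Because every set owns the same fixed 32 byte window, two shaders that agree on a set's contents also agree on where its push data lives, which is the set compatibility property mentioned above.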

For every set, I split it up into buffers and images and decide on a strategy for each. Buffers are decided first since they have the largest impact on performance in my experience.

Inline descriptors – VK_DESCRIPTOR_MAPPING_SOURCE_PUSH_ADDRESS_EXT

This is very simple. If there are 3 or fewer buffers in a set (24 bytes), we can just stuff the raw pointers into push data and tell the driver to use that pointer. This is D3D12 root descriptors in a nutshell. Especially for UBOs, this is very handy for performance. We lose robustness here, but I never rely on buffer robustness anyway. The push data layout looks something like this:

layout(set = 0, binding = 0) buffer SSBO0;
layout(set = 0, binding = 1) buffer SSBO1;
layout(set = 0, binding = 2) uniform UBO;

struct PushData
{
  VkDeviceAddress SSBO0;
  VkDeviceAddress SSBO1;
  VkDeviceAddress UBO;
} push = { ... fill out on CPU };

vkCmdPushDataEXT(&push, sizeof(PushData));

Too many descriptors – VK_DESCRIPTOR_MAPPING_SOURCE_INDIRECT_ADDRESS_EXT

This is a new Vulkan speciality. Without modifying the shaders, we can tell the driver to load a buffer device address from a pointer in push data instead. This way we don’t have to allocate from the descriptor heap itself, we can just do a normal linear UBO allocation, write some VkDeviceAddresses in there and have fun. Given the single indirection to load the “descriptor” here, this looks a lot like Vulkan 1.0 descriptor sets, except there’s no API necessary to write them.

layout(set = 0, binding = 0) buffer SSBO0;
layout(set = 0, binding = 1) buffer SSBO1;
layout(set = 0, binding = 2) buffer SSBO2;
layout(set = 0, binding = 3) buffer SSBO3;
layout(set = 0, binding = 4) buffer SSBO4;

struct IndirectTable
{
  VkDeviceAddress SSBO0, SSBO1, SSBO2, SSBO3, SSBO4;
};

struct PushData
{
  VkDeviceAddress pointer_to_indirect_table;
};

Fallback to heap – VK_DESCRIPTOR_MAPPING_SOURCE_HEAP_WITH_PUSH_INDEX_EXT

This isn’t the ideal path, but sometimes we’re forced to allocate from the heap. This can happen if we have one of these cases:

  • The shader is using OpArrayLength on an SSBO. We need real descriptors in this case. The current implementation just scans the SPIR-V shader module for this instruction, but could be improved in theory.
  • The shader is using an array of descriptors. For buffers, this should be very rare, but the PUSH_ADDRESS and INDIRECT_ADDRESS interfaces do not support this.
  • Robustness is enabled.

This is pretty much D3D12’s root tables, but in Vulkan we can be a bit more optimal with memory since buffer descriptors tend to be smaller than image descriptors and we can pack them tightly. D3D12 has one global stride for any resource descriptor while Vulkan exposes separate sizes that applications can take advantage of.

layout(set = 0, binding = 0) buffer SSBO { ... } ssbos[2];

struct Heap
{
  // ... somewhere in the heap
  uint8_t real_ssbo_descriptor0[N]; // N = ArrayStride
  uint8_t real_ssbo_descriptor1[N];
  ...
};

struct PushData
{
  uint32_t offset_into_resource_heap; // scaled by IndexStride
};

vkWriteResourceDescriptorsEXT is required here to write the SSBO descriptors.

After buffers are parceled out for a descriptor set, we have some space left for images. At minimum, we have 8 bytes left (32 – 3 * sizeof(VkDeviceAddress)).
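The three buffer strategies and the leftover space can be summarized in a few lines. This is a sketch under stated assumptions: the decision is made once per set, the enum values are stand-ins rather than the real VK_DESCRIPTOR_MAPPING_SOURCE_* constants, and all names here are hypothetical.

```cpp
#include <cassert>

// Stand-ins for the PUSH_ADDRESS, INDIRECT_ADDRESS and HEAP_WITH_PUSH_INDEX mapping sources.
enum class BufferMapping { PushAddress, IndirectAddress, HeapWithPushIndex };

// Hypothetical per-set reflection summary.
struct SetBufferInfo
{
    unsigned buffer_count;
    bool uses_array_length;    // some SSBO is queried with OpArrayLength
    bool has_descriptor_array; // some buffer binding is an array
    bool robustness_enabled;
};

constexpr unsigned BytesPerSet = 32;
constexpr unsigned AddressBytes = 8;   // sizeof(VkDeviceAddress)
constexpr unsigned HeapOffsetBytes = 4; // a single uint32_t heap offset
constexpr unsigned MaxInlineBuffers = 3;

inline BufferMapping choose_buffer_mapping(const SetBufferInfo &info)
{
    // Real descriptors are needed in these cases, so fall back to the heap.
    if (info.uses_array_length || info.has_descriptor_array || info.robustness_enabled)
        return BufferMapping::HeapWithPushIndex;
    // Up to three raw pointers (24 bytes) fit directly in the set's push data.
    if (info.buffer_count <= MaxInlineBuffers)
        return BufferMapping::PushAddress;
    // Otherwise spill the addresses to a table and push a single pointer to it.
    return BufferMapping::IndirectAddress;
}

// Bytes of the 32 byte per-set block that remain for image and sampler indices.
inline unsigned push_bytes_left_for_images(const SetBufferInfo &info)
{
    unsigned used = 0;
    if (info.buffer_count != 0)
    {
        switch (choose_buffer_mapping(info))
        {
        case BufferMapping::PushAddress: used = info.buffer_count * AddressBytes; break;
        case BufferMapping::IndirectAddress: used = AddressBytes; break;
        case BufferMapping::HeapWithPushIndex: used = HeapOffsetBytes; break;
        }
    }
    assert(used <= BytesPerSet);
    return BytesPerSet - used;
}
```

The worst case for images is three inline buffers, which leaves the 8 bytes mentioned above, i.e. room for two 32-bit heap indices.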

Inline image indices

This is the common and ideal case. If we don’t have any arrays of images, we can just have a bunch of uint32_t indices directly into the heap. At image view and buffer view creation time, we already allocated a persistent index into the heap that we can refer to. No API calls required when emitting commands.

layout(set = 0, binding = 0) uniform image2D Storage;
layout(set = 0, binding = 1) uniform texture2D Sampled;
layout(set = 0, binding = 2) uniform sampler Sampler;
layout(set = 0, binding = 3) uniform subpassInput InputAttachment;
layout(set = 0, binding = 4) uniform samplerBuffer TexelBuffer;
layout(set = 0, binding = 5) uniform imageBuffer StorageTexelBuffer;

struct PushData
{
  uint32_t StorageHeapIndex;
  uint32_t SampledHeapIndex;
  uint32_t SamplerHeapIndex;
  uint32_t InputAttachmentHeapIndex;
  uint32_t TexelBufferHeapIndex;
  uint32_t StorageTexelBufferHeapIndex;
};

Combined image samplers work quite well in this model, because Vulkan adds a special mapping mode that packs both sampler index and the image index together. This fixes one of the annoying issues in EXT_descriptor_buffer.

mapping.resourceMask |=
  VK_SPIRV_RESOURCE_TYPE_COMBINED_SAMPLED_IMAGE_BIT_EXT;
mapping.source = VK_DESCRIPTOR_MAPPING_SOURCE_HEAP_WITH_PUSH_INDEX_EXT;
mapping.sourceData.pushIndex.pushOffset =
  push_offset + heap.push_inline_offsets[set_index];
mapping.sourceData.pushIndex.heapOffset = 0;
mapping.sourceData.pushIndex.heapArrayStride =
  device->get_device_features().resource_heap_resource_desc_size;
mapping.sourceData.pushIndex.heapIndexStride =
  device->get_device_features().resource_heap_resource_desc_size;

if ((desc_set.sampled_image_mask & (1u << bit)) != 0)
{
  mapping.sourceData.pushIndex.useCombinedImageSamplerIndex =
    VK_TRUE;
  mapping.sourceData.pushIndex.samplerHeapArrayStride =
    sampler_desc_size;
  mapping.sourceData.pushIndex.samplerHeapIndexStride =
    sampler_desc_size;
}

layout(set = 0, binding = 0) uniform sampler2D CombinedImageSampler;

struct PushData
{
  uint32_t ImageIndex : 20; // This layout is hardcoded by spec.
  uint32_t SamplerIndex : 12;
};

Using the heap

If we cannot use the simple inline indices, we have two options. The preferred one right now is to allocate space in the descriptor heap just like the descriptor buffer path, because I want to avoid unnecessary indirections where possible. At least we get to copy the payloads around without API commands. This path is also used for bindless sets.

layout(set = 0, binding = 0) uniform image2D Storage;
layout(set = 0, binding = 1) uniform texture2D Sampled;
layout(set = 0, binding = 2) uniform subpassInput InputAttachment;
layout(set = 0, binding = 3) uniform samplerBuffer TexelBuffer;
layout(set = 0, binding = 4) uniform imageBuffer StorageTexelBuffer;

struct ResourceHeap
{
  ...
  uint8_t StorageDesc[N]; // N is usually 32 or 64
  uint8_t SampledDesc[N];
  ... etc
  ...
};

struct PushData
{
  uint32_t resource_heap_offset;
};

Unlike the descriptor buffer path, this one has a major problem: linearly allocating from the sampler heap is not viable, since the sampler heap is now really small, just like in D3D12. For this case, Vulkan has an answer.

VK_DESCRIPTOR_MAPPING_SOURCE_HEAP_WITH_INDIRECT_INDEX_ARRAY_EXT

This is a special Vulkan feature that functions like an indirect root table. This one is similar to INDIRECT_ADDRESS in that we don’t have to allocate anything from the heap directly and we can just stuff heap indices straight into a UBO.

layout(set = 0, binding = 0) uniform sampler2D Combined[2];
layout(set = 0, binding = 2) uniform texture2D Tex[2];
layout(set = 0, binding = 4) uniform sampler Samplers[3];

struct IndirectTable
{
   struct
   {
     uint32_t ImageIndex : 20;
     uint32_t SamplerIndex : 12;
   } Combined[2];
   uint32_t TextureIndices[2];
   uint32_t SamplerIndices[3]; 
};

struct PushData
{
  VkDeviceAddress pointer_to_indirect_table;
};
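Filling such a table on the CPU could look like the sketch below. The bit placement is an assumption: I put the 20-bit image index in the low bits and the 12-bit sampler index above it, which is what the bitfield order suggests on common ABIs, but the specification is the authority here. pack_combined_index and build_indirect_table are hypothetical helpers.

```cpp
#include <cassert>
#include <stdint.h>

// Assumed layout: bits [0, 20) hold the image index, bits [20, 32) the sampler index.
inline uint32_t pack_combined_index(uint32_t image_index, uint32_t sampler_index)
{
    assert(image_index < (1u << 20));
    assert(sampler_index < (1u << 12));
    return image_index | (sampler_index << 20);
}

// Mirrors the table above: one 32-bit word per binding slot, seven words in total.
struct IndirectTable
{
    uint32_t combined[2];
    uint32_t texture_indices[2];
    uint32_t sampler_indices[3];
};

static_assert(sizeof(IndirectTable) == 7 * sizeof(uint32_t),
              "the table must be tightly packed 32-bit words");

inline IndirectTable build_indirect_table(const uint32_t (&combined_images)[2],
                                          const uint32_t (&combined_samplers)[2],
                                          const uint32_t (&textures)[2],
                                          const uint32_t (&samplers)[3])
{
    IndirectTable table = {};
    for (int i = 0; i < 2; i++)
        table.combined[i] = pack_combined_index(combined_images[i], combined_samplers[i]);
    for (int i = 0; i < 2; i++)
        table.texture_indices[i] = textures[i];
    for (int i = 0; i < 3; i++)
        table.sampler_indices[i] = samplers[i];
    return table;
}
```

The resulting struct would be copied into a linearly allocated UBO, and only the address of that allocation goes into push data.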

Overall, I think these new mapping types allow us to reuse old shaders quite effectively and it’s possible to start slowly rewriting shaders to take full advantage of descriptor_heap once this machinery is in place.

But performance though?

For GPU performance, it seemed to be on-par with the other descriptor models on NVIDIA and AMD which was expected. Granite does not really hit the cases where descriptor_heap should meaningfully improve GPU performance over descriptor_buffer, but I only did a rough glance.

For CPU performance, things were a bit more interesting, and I learned that Granite has quite significant overhead on its own, which is hardly surprising. That’s the cost of an old school slot and binding model after all, and I never did a serious optimization pass over it. A more forward looking rendering abstraction can eliminate most, if not all, of this overhead.

The numbers here are for RADV, but it’s using the pending merge request for descriptor_heap support.

Test #1 – Building bindless descriptor arrays

for (unsigned i = 0; i < NumPoolIterations; i++)
{
  timer.start();
  pool->reset();
  pool->allocate_descriptors(4096);
  for (unsigned j = 0; j < 4096 / 2; j++)
  {
    pool->push_texture(image0->get_view());
    pool->push_texture(image1->get_view());
  }
  pool->update();
  iter_times[i] = timer.end();
}

Legacy model

~27 us to write 4096 image descriptors on a Ryzen 3950X with an RX 6800.

Descriptor buffer and heap

This is basically exactly the same. ~13 us. This is really just a push_back and memcpy bench at this point.

Test #2 – Spam descriptor sets with a few buffers

layout(set = 0, binding = 0) buffer SSBO00;
layout(set = 0, binding = 1) uniform UBO01;
layout(set = 0, binding = 2) buffer SSBO02;
// Stamp it out 4 times for each set, 12 total descriptors.

This case hits the optimal inline BDA case for heap.

timer.start();
for (unsigned j = 0; j < NumDispatches; j++)
{
  for (unsigned set = 0; set < 4; set++)
  {
    cmd->set_storage_buffer(set, 0, *buffer, 256 * j, 256);
    cmd->set_uniform_buffer(set, 1, *buffer, 256 * j, 256);
    cmd->set_storage_buffer(set, 2, *buffer, 256 * j, 256);
  }

  cmd->dispatch(1, 1, 1);
}
iter_times[i] = timer.end();
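The timer and the reduction from iter_times to a per-dispatch figure are not shown, so the following is only a sketch of how such numbers could be produced. The Timer class, the use of the median and the helper names are all assumptions, not the actual benchmark code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <stddef.h>
#include <vector>

// Hypothetical wall-clock timer with the same start()/end() shape as the loop above.
class Timer
{
public:
    void start() { t0 = std::chrono::steady_clock::now(); }

    // Seconds elapsed since start().
    double end() const
    {
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

private:
    std::chrono::steady_clock::time_point t0;
};

// Median is less sensitive to the occasional scheduling hiccup than the mean.
// For an even number of samples this returns the upper of the two middle values.
inline double median(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Converts the time of one iteration into nanoseconds per dispatch.
inline double ns_per_dispatch(double iteration_seconds, size_t num_dispatches)
{
    assert(num_dispatches != 0);
    return iteration_seconds * 1e9 / double(num_dispatches);
}
```

Something like ns_per_dispatch(median(iter_times), NumDispatches) would then give a figure comparable to the ones quoted in these tests.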

Legacy

~ 279 ns per dispatch. Doesn’t feel very impressive. 🙂

Descriptor buffer

Basically same perf, but lots of overhead has now shifted over to Granite. Certainly things can be optimized further. GetDescriptorEXT is somehow much faster than UpdateDescriptorSetWithTemplate though.

Descriptor heap

~ 157 ns / dispatch now, and most of the overhead is now in Granite itself, which is ideal.

Test #3 – Buffer descriptor spilling

I added an extra buffer descriptor per set which hits the INDIRECT_ADDRESS path. Heap regressed significantly, but it’s all in Granite code at least. It is likely related to having to page in new UBO blocks, but I didn’t look too closely. ~ 375 ns / dispatch, hnnnnnng. The other paths don’t change much, as expected: ~ 310 ns / dispatch for the legacy and descriptor buffer models.

Test #4 – Image descriptors

This is the happy path for descriptor heap.

layout(set = 0, binding = 0) writeonly uniform image2D Write;
layout(set = 0, binding = 1) uniform texture2D A;
layout(set = 0, binding = 2) uniform texture2D B;
layout(set = 0, binding = 3) uniform texture2D C;
layout(set = 0, binding = 4) uniform texture2D D;
layout(set = 0, binding = 5) uniform texture2D E;
layout(set = 0, binding = 6) uniform texture2D F;
layout(set = 0, binding = 7) uniform texture2D G;

Legacy

~ 161 ns / dispatch

Descriptor buffer

~ 166 ns. Quite interesting that it got slower. The slab allocator for legacy sets seems to be doing its job very well. The actual descriptor copying vanished from the top list at least.

Descriptor heap

~ 145 ns. A very modest gain, and most of the overhead is now just Granite jank.

Test #5 – Image descriptor spilling to heap

All the paths look very similar now. ~ 170 ns or so.

Some NV numbers

On RTX 4070 with 595 drivers.

Legacy

  • Test #1: Write 4096 image descriptors: 17.6 us (copies u32 indices)
  • Test #2: 693 ns
  • Test #3: 726 ns
  • Test #4: 377 ns
  • Test #5: 408 ns

Descriptor buffer

  • Test #1: 10.2 us (copies u32 indices)
  • Test #2: 434 ns
  • Test #3: 479 ns
  • Test #4: 307 ns
  • Test #5: 315 ns

Descriptor heap

  • Test #1: 11 us (copies real 32 byte descriptors)
  • Test #2: 389 ns
  • Test #3: 405 ns
  • Test #4: 321 ns
  • Test #5: 365 ns

The improvements, especially for buffers, are quite large on NV, interestingly enough. The legacy buffer tests are heavily biased towards driver overhead.

For the image tests the gains are modest, which is somewhat expected given how NV implements image descriptors before descriptor heap. It’s just some trivial u32 indices.

Conclusion

Overall, it’s interesting how well the legacy Vulkan 1.0 model holds up here, at least on RADV with my implementation. Descriptor buffer and heap cannot truly shine unless the abstraction using them is written with performance in mind. This sentiment is hardly new. Just porting OpenGL-style code over to Vulkan doesn’t give amazing gains, just like porting old and crusty binding models won’t magically perform with newer APIs either.

Either way, this level of performance is good enough for my needs, and the days of spamming out 100k draw calls are kinda over anyway, since it’s all GPU driven with large bindless data sets these days. Adding descriptor buffer and heap support to Granite was generally motivated by curiosity rather than a desperate need for perf, but I hope this post serves as an example of what can be done.

There’s a lot of descriptor heap that hasn’t been explored here. GPU performance for heavily bindless workloads is another topic entirely, and I also haven’t really touched on how it would be more practical to start writing code like:

some_gpu_buffer->blah_index = view->get_heap_index();

// GLSL
layout(descriptor_heap) uniform texture2D Heap[];
access(Heap[some_gpu_buffer.blah_index]);

// HLSL SM 6.6 style
Texture2D<float4> Tex = ResourceDescriptorHeap[some_gpu_buffer.blah_index];

which would side-step almost all Granite overhead.

Overall I quite like what we’ve got now with descriptor heap as an API, a bastard child of descriptor buffer and D3D12 that gets the job done. As tooling and driver support matures, I will likely just delete the descriptor buffer path, keeping the legacy stuff around for compatibility.

A case for learning GPU programming with a compute-first mindset

Beginners coming into our little corner of the programming world have it rough. Normal CPU-centric programming tends to start out with a “Hello World” sample, which can be completed in mere minutes. It takes far longer to simply download the toolchains and set them up. If you’re on a developer friendly OS, this can be completed in seconds as well.

However, in the graphics world, young whippersnappers cut their teeth at rendering the elusive “Hello Triangle” to demonstrate that yes, we can indeed do what our forebears accomplished 40 years ago, except with 20x the effort and 100000x the performance.

There’s no shortage of examples of beginners rendering a simple triangle (or a cube), and with the new APIs having completely displaced the oxygen of older APIs, there is a certain expectation of ridiculous complexity and raw grit required to tickle some pixels on the display. 1000 lines of code, two weeks of grinding, debugging black screens etc, etc. Something is obviously wrong here, and it’s not going to get easier.

I would argue that trying to hammer through the brick wall of graphics is the wrong approach in 2025. Graphics itself is less and less relevant for any hopeful new GPU programmer. Notice I wrote “GPU programmer”, not graphics programmer, because most interesting work these days happens with compute shaders, not traditional “graphics” rendering.

Instead, I would argue we should start teaching compute with a debugger/profiler first mindset, building up the understanding of how GPUs execute code, and eventually introduce the fixed-function rasterization pipeline as a specialization once all the fundamentals are already in place. The raster pipeline was simple enough to teach 20 years ago, but those days are long gone, and unless you plan to hack on pre-historic games as a hobby project, it’s an extremely large API surface to learn.

When compute is the focus, there’s a lot of APIs we could ponder, like CUDA and OpenCL, but I think Vulkan compute is the best compute focused API to start out with. I’m totally not biased obviously 😀 The end goal is of course to also understand the graphics pipeline, and pure compute APIs will not help you there.

Goal for this blog

I don’t intend to write a big book here that has all the answers on how to become a competent GPU programmer. Instead, I want to try outlining some kind of “meta-tutorial” that could be fleshed out further. I’ve been writing compute shaders since the release of OpenGL 4.3 ages ago and I still learn new things.

To abstract or not to abstract

For this exercise, I will rely on a mid-level API abstraction like my own Granite. I don’t think throwing developers into the raw API is the best idea to begin with, but there must be some connection with the underlying API, i.e., no multi-API abstraction that barely resembles the actual API which you’ll typically find in large production engines. Granite is a pure Vulkan abstraction I’ve been chiseling away at for years for my own needs (I ship it in tons of side projects and stuff), but it’s not really an API I’ve intended others to actively use and ship software on. Migrating away from the training wheels quickly is important though and compute makes that fairly painless. Granite is just one of many approaches to tackling Vulkan, and the intent is not to present this as the one true way.

The debugging first approach

Getting something to show up on screen is important to keep the dopamine juices flowing. Fortunately, we can actually do this without messing around with graphics directly. With RenderDoc captures we get debugging + something visual in the same package, and learning tooling early is critical to be effective. Debugging sufficiently interesting GPU code is impossible without this.

Shading language

The debug flow I propose with RenderDoc will rely on a lot of shader replacements and roundtrips via SPIRV-Cross’ GLSL backend, so Vulkan GLSL is the appropriate language to start with. It’s more or less a dead language at this point, but it’s also the most documented language that has support for buffer device addresses, which I will introduce right away to avoid having the brick wall of descriptors and binding models. This is a very compute-centric move, but makes other parts of the API easier to grasp later.

HLSL from the Direct3D ecosystem is a popular option, but as a compute language, HLSL is weaker than Vulkan GLSL in my experience, due to lack of a lot of features that come up in more interesting compute workloads, but being bilingual in this area is unavoidable these days. No matter which language you use, someone will call you a filthy degenerate anyway. :v

Deferring synchronization

Being debugger-centric we can avoid poking at explicit synchronization for a very long time and once we get there, we can simplify a ton. You can do a lot of interesting things with a single dispatch after all.

Writing the first program

Here’s a very basic program that copies some data around. It should trivially build on Linux or Windows on the usual compilers. Make sure to clone or symlink Granite so that it can be picked up by the CMake build.

If you try to run this, the output might look something like this:

...
[INFO]: Enabling VK_LAYER_KHRONOS_validation.
[INFO]: Enabling instance extension: VK_EXT_debug_utils.
[INFO]: Found Vulkan GPU: AMD Radeon RX 6800 (RADV NAVI21)
[INFO]: API: 1.4.328
[INFO]: Driver: 25.2.99
[INFO]: Found Vulkan GPU: NVIDIA GeForce RTX 4070
[INFO]: API: 1.4.312
[INFO]: Driver: 580.328.576
[INFO]: Using Vulkan GPU: AMD Radeon RX 6800 (RADV NAVI21)
[INFO]: Enabling device extension: VK_KHR_external_semaphore_fd.
[INFO]: Enabling device extension: VK_KHR_external_memory_fd.
[INFO]: Enabling device extension: VK_EXT_external_memory_dma_buf.
[INFO]: Enabling device extension: VK_EXT_image_drm_format_modifier.
[INFO]: Enabling device extension: VK_KHR_calibrated_timestamps.
[INFO]: Enabling device extension: VK_EXT_conservative_rasterization.
[INFO]: Enabling device extension: VK_KHR_compute_shader_derivatives.
[INFO]: Enabling device extension: VK_KHR_performance_query.
[INFO]: Enabling device extension: VK_EXT_memory_priority.
[INFO]: Enabling device extension: VK_EXT_memory_budget.
[INFO]: Enabling device extension: VK_EXT_device_generated_commands.
[INFO]: Enabling device extension: VK_EXT_mesh_shader.
[INFO]: Enabling device extension: VK_EXT_external_memory_host.
[INFO]: Enabling device extension: VK_KHR_fragment_shader_barycentric.
[INFO]: Enabling device extension: VK_EXT_image_compression_control.
[INFO]: Graphics queue: family 0, index 0.
[INFO]: Compute queue: family 1, index 0.
[INFO]: Transfer queue: family 1, index 1.
[INFO]: Detected attached tool:
[INFO]: Name: Khronos Validation Layer
[INFO]: Description: Khronos Validation Layer
[INFO]: Version: 327
[INFO]: Detected tool which cares about debug markers.
[INFO]: Allocating 64.0 MiB on heap #1 (mode #3), before allocating budget: (0.0 MiB / 15096.3 MiB) [0.0 / 16368.0].
[INFO]: Allocating 64.0 MiB on heap #0 (mode #0), before allocating budget: (0.0 MiB / 64216.8 MiB) [0.0 / 64372.4].
[ERROR]: Failed to load RenderDoc, make sure RenderDoc started the application in capture mode.
[INFO]: Allocating 64.0 MiB on heap #1 (mode #1), before allocating budget: (64.6 MiB / 15096.3 MiB) [64.0 / 16368.0].

Capturing Vulkan code in RenderDoc

The code we just wrote executes on the GPU, but we have no easy way to observe the code actually running on the device. This is where RenderDoc comes in. Point to the executable we built.

After launching, the capture happens automatically, and when the process terminates, the capture should appear.

Clicking on the copy command and double-clicking the destination buffer, we can see the raw contents:

The zero-initialization flag we passed into buffer creation was technically not needed, but it helped make the capture a little easier to understand. That clear happened automagically inside Granite. Normally, memory is not assumed to be zero-cleared on allocation.

Running some actual shaders

Instead of using copies, we can create our own little memcpy. Here’s an updated sample gist. To keep things simple, we can use shaderc‘s method of compiling GLSL into a C header file.

glslc -o copy.h -mfmt=c --target-env=vulkan1.2 copy.comp
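With -mfmt=c, glslc writes the module as a C-style initializer list of 32-bit words, so the header can be pulled straight into an array definition. The sketch below uses a tiny stand-in array instead of the real copy.h so that it compiles on its own; copy_spirv_stub and the helper functions are hypothetical names, and the stub only contains the five header words of a SPIR-V 1.5 module.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>

// In a real build the initializer comes from the generated header:
//
//   static const uint32_t copy_spirv[] =
//   #include "copy.h"
//   ;
//
// Stand-in with just the SPIR-V header words:
// magic, version (1.5), generator, id bound, schema.
static const uint32_t copy_spirv_stub[] = { 0x07230203u, 0x00010500u, 0u, 1u, 0u };

// A SPIR-V module starts with a five word header whose first word is the magic number.
inline bool has_spirv_header(const uint32_t *words, size_t count)
{
    return count >= 5 && words[0] == 0x07230203u;
}

// The version word encodes major in bits [16, 24) and minor in bits [8, 16).
inline unsigned spirv_major(const uint32_t *words) { return (words[1] >> 16) & 0xffu; }
inline unsigned spirv_minor(const uint32_t *words) { return (words[1] >> 8) & 0xffu; }
```

A check like this is a cheap way to catch a stale or truncated header before handing the words to the shader creation call.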

Vulkan 1.2 is used here since that introduced buffer device addresses in core. Building and capturing again gives us:

The push constants can be inspected under uniform buffers:

RenderDoc understands how to resolve pointers in buffers into links that open the relevant buffer.

If you click on an event before the dispatch you’ll see the writes disappear.

Shader replacement workflow with SPIRV-Cross and debug prints

This workflow is extremely powerful for difficult debugging scenarios and I cannot do my job without this. It’s imperative to learn this early. Debug gives you a more traditional step debugger. Depending on the bug, this may be the correct tool (e.g. inspecting a particular broken pixel), but in my experience, when working with a ton of GPU threads in parallel, it’s often required to study it in aggregate to see what is going on, since you might not even know which thread is at fault to begin with.

For now, select Edit -> Decompile with SPIRV-Cross

The Vulkan API uses the SPIR-V intermediate representation and SPIRV-Cross converts this back to equivalent GLSL. Fortunately, this result looks very similar to our original shader. This is one of the main reasons I prefer working with GLSL since the translation back and forth to SPIR-V is the least lossy compared to alternatives.

E.g. we can hack in some debug prints, hit Apply, and the dispatch will have messages attached to it. RenderDoc implements debugPrintfEXT by rewriting the SPIR-V to write the printf values back to host. The Vulkan driver itself does not understand how to printf; it will just ignore the SPIR-V command to printf.

Shader replacement like this is not just for debug prints. You can also modify the code and see the results without having to recompile the entire program and recapture.

For fun, debug print “Hello World” instead, and you have your checkbox ticked off.

Inspecting ISA

If you’re on a driver that exposes it, you can study the machine code. For this purpose, I highly recommend AMD GPUs with the RADV driver on Linux. The ISA is arguably the most straightforward to read. If you don’t have an AMD card lying around, all Mesa drivers should give you ISA no matter the graphics card.

; s6 = WorkgroupID.x
; v0 = LocalInvocationID.x
; GlobalInvocationID.x =
;   (WorkgroupID.x << 4) + LocalInvocationID.x
v_lshl_add_u32 v0, s6, 4, v0

; v1 = int(v0) >> 31, i.e. a simple sign extend
; s4, s5: Holds src GPU pointer (s2, s3 hold dst, see the store below)
; This is a slower 64-bit address computation path which
; only shows up for raw device addresses
v_ashrrev_i32_e32 v1, 31, v0
v_lshlrev_b64 v[0:1], 2, v[0:1]
v_add_co_u32 v2, vcc_lo, s4, v0
v_add_co_ci_u32_e32 v3, vcc_lo, s5, v1, vcc_lo

; Load 32-bits from pointer
global_load_dword v2, v[2:3], off

; In parallel, compute dest address with 64-bit math.
v_add_co_u32 v4, vcc_lo, s2, v0
v_add_co_ci_u32_e32 v5, vcc_lo, s3, v1, vcc_lo

; Wait until the read request completes
s_waitcnt vmcnt(0)

; Store
global_store_dword v[4:5], v2, off
s_endpgm

Understanding how a compute dispatch is organized

Before trying to make sense of this, we need a mental model for how GPU compute executes on the device. This model was more or less introduced by CUDA in 2007 and has remained effectively unchanged since, neat!

At the highest level, on the CPU side, we dispatch a 3D grid of workgroups. In this sample, we just have a 1x1x1 grid of workgroups.

 cmd->dispatch(1, 1, 1);

For every workgroup, there is another 3D grid of invocations. Multiple invocations work together and are able to efficiently communicate with each other. Communicating across workgroups is possible in some situations, but requires some scissor juggling.
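As a rough CPU-side sketch of how the two levels flatten into one index, here is the math behind gl_GlobalInvocationID.x for a 1D dispatch. The function names are made up for illustration and the local size of 16 is taken from the sample; with a power-of-two local size the multiply becomes the shift-by-4 seen in the ISA listing.

```cpp
#include <cstdint>

// Hypothetical helper, not part of Vulkan or Granite.
// Mirrors the GLSL definition:
//   gl_GlobalInvocationID.x =
//     gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x
constexpr uint32_t global_invocation_id_x(uint32_t workgroup_id,
                                          uint32_t local_size,
                                          uint32_t local_invocation_id)
{
    return workgroup_id * local_size + local_invocation_id;
}

// Same thing specialized for local_size_x = 16, written the way the
// v_lshl_add_u32 instruction computes it: (id << 4) + local.
constexpr uint32_t global_invocation_id_x_16(uint32_t workgroup_id,
                                             uint32_t local_invocation_id)
{
    return (workgroup_id << 4) + local_invocation_id;
}
```

With cmd->dispatch(1, 1, 1) there is only workgroup 0, so the global ID is simply the local ID in the range 0 to 15.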

Why is there a two level hierarchy?

GPUs are extremely parallel machines. To get optimal performance, we have to map the very scalar-looking code to SIMT. The model employed for essentially all modern GPUs is that one lane in the vector units maps to one thread.

Inside the workgroup, the invocations are split up into subgroups. The mental model to understand the distinction is:

  • Workgroup -> runs concurrently on the same shader core
  • Subgroup -> runs in lock-step in a SIMT fashion

For a workgroup to be running well, the number of invocations in it should be an integer multiple of the subgroup size, otherwise there will be lanes doing nothing, and that’s no fun.
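To put a number on the wasted lanes, here is a small sketch under a simplified model: the hardware rounds the workgroup up to whole subgroups and the remainder sits idle. The helper names are hypothetical, and real drivers may pick a different subgroup size than you expect.

```cpp
#include <cstdint>

// Hypothetical helpers for a simplified model of subgroup packing.
constexpr uint32_t subgroups_needed(uint32_t workgroup_invocations,
                                    uint32_t subgroup_size)
{
    // Round up: a partially filled subgroup still occupies a full one.
    return (workgroup_invocations + subgroup_size - 1) / subgroup_size;
}

constexpr uint32_t idle_lanes(uint32_t workgroup_invocations,
                              uint32_t subgroup_size)
{
    return subgroups_needed(workgroup_invocations, subgroup_size) * subgroup_size -
           workgroup_invocations;
}
```

For example, a workgroup of 16 invocations on a subgroup size of 32 leaves 16 lanes idle in this model, while 64 invocations fill subgroups of 16, 32 or 64 exactly.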

The subgroup sizes in the wild vary quite a lot, but there is an upper legal limit of 128. In practice, these are the values you can expect to find:

  • 4: Really old Mali Bifrost, old iPhones
  • 8: Intel Arc Alchemist
  • 16: Intel Arc Battlemage, Mali Valhall, Intel (runs slower if not Battlemage)
  • 32: AMD RDNA, NVIDIA, Intel upper limit (runs even slower)
  • 64: Adreno, AMD GCN + RDNA
  • 128: Adreno

Some vendors support multiple subgroup sizes. Usually you don’t have to care too much about this until you graduate to the more hardcore level of compute shader programming, but Vulkan gives you control to force subgroup sizes when need be.

For desktop use cases, catering to the range of 16 to 64 is reasonable. In the example we’ve been looking at, the workgroup size is just 16, so this is not optimal. On mobile GPUs, you might need to consider a wider range of hardware.

The rule of thumb (for desktop) is to use one of three constellations for local_size:

  • (64, 1, 1) for 1D
  • (8, 8, 1) for 2D
  • (4, 4, 4) for 3D

Integer multiples of these are fine too. This should make almost any GPU happy. The maximum limit is 1024 invocations, but I never recommend going that high unless there are very good reasons to.
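Once local_size is fixed, the dispatch size follows from the problem size by rounding up. A minimal sketch, assuming the (64, 1, 1) and (8, 8, 1) shapes from the rule of thumb; GroupCount and the helper names are made up for illustration:

```cpp
#include <cstdint>

// Hypothetical helper type: the three arguments to a dispatch call.
struct GroupCount
{
    uint32_t x, y, z;
};

constexpr uint32_t div_round_up(uint32_t n, uint32_t d)
{
    return (n + d - 1) / d;
}

// Workgroups needed to cover `count` elements with local_size = (64, 1, 1).
constexpr GroupCount groups_for_1d(uint32_t count)
{
    return GroupCount{ div_round_up(count, 64), 1u, 1u };
}

// Workgroups needed to cover a width x height image with local_size = (8, 8, 1).
constexpr GroupCount groups_for_2d(uint32_t width, uint32_t height)
{
    return GroupCount{ div_round_up(width, 8), div_round_up(height, 8), 1u };
}
```

The result would be passed along as cmd->dispatch(g.x, g.y, g.z). When the size is not an exact multiple, the last workgroups overshoot, so the shader has to skip invocations that fall outside the valid range.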

AMD specifics

For the AMD case, v_ instructions are vector instructions, meaning while it looks like a simple register, there are multiple instances of it, one for every invocation in the subgroup.

s_ instructions are scalar. They run once per subgroup and run in parallel with the vector units. Taking advantage of subgroup uniform code can be very powerful.

One way to think of this is that v_ instructions look more like SIMD, except that the SIMD width is much larger than CPU instruction sets:

addps xmm0, xmm1

where scalar instructions look more like regular CPU instructions:

add eax, ebx

Vector load-store looks more like gather/scatter, and scalar load-store is more like normal CPU load-store.

Shader replacement by modifying SPIR-V assembly

In rare cases, it’s useful to do minor in-place modifications to SPIR-V itself. As a very ad-hoc sample, we can attempt to clean up the awkward 64-bit pointer math by using OpInBoundsAccessChain instead of OpAccessChain. Edit -> Decompile with spirv-dis

Now replace OpAccessChain with OpInBoundsAccessChain and apply. While keeping the shader tab open, go back to Pipeline State viewer and look at GPU ISA:

OpInBoundsAccessChain tells the compiler that we cannot index outside the array, which means negative indices and massively large indices are not allowed, and this allows the compiler to emit u64 + u32 addressing format. Sure looks much nicer now. This is way beyond what a beginner should care about, but the point is to demonstrate that we can replace raw SPIR-V too, and also demonstrates how you can inspect the SPIR-V of shaders easily.
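The difference between the two addressing paths can be modelled on the CPU. This is only a sketch of my reading of the ISA: the first function mirrors the sign-extend, 64-bit shift and 64-bit add sequence from the earlier listing, and the second assumes the u64 + u32 form computes the byte offset in 32 bits. The function names are invented for illustration.

```cpp
#include <cstdint>

// OpAccessChain path: the index is treated as a signed 32-bit value,
// sign-extended to 64 bits, scaled by sizeof(uint32_t) and added to the base.
constexpr uint64_t address_signed_index(uint64_t base, int32_t index)
{
    return base + static_cast<uint64_t>(static_cast<int64_t>(index) * 4);
}

// Assumed OpInBoundsAccessChain path: a 64-bit base plus a 32-bit unsigned
// byte offset. The multiply wraps at 32 bits, which is only harmless when
// the index is known to be non-negative and small.
constexpr uint64_t address_u32_offset(uint64_t base, uint32_t index)
{
    return base + static_cast<uint64_t>(index * 4u);
}
```

Both functions agree for any index that really is inside the array. They only diverge for negative or enormous indices, which is exactly what the in-bounds promise rules out.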

Introducing descriptors

We can do a lot with simple buffer device addresses to process data, but there is a limit to how far we can get with that approach if the end goal is game rendering. There are things GPUs can do that raw pointers cannot:

  • Efficiently sample textures
  • Automatic format conversions
  • “Free” bounds checking

CPU ISAs do none of these. In this updated memcpy sample, I introduce two descriptor types, STORAGE_BUFFER and UNIFORM_TEXEL_BUFFER. Using descriptors like this is the “normal” way to use Vulkan, and should be preferred when feasible.

For pragmatic reasons, it’s easier to debug and validate descriptors compared to raw pointers. Raw pointers are also prone to GPU crashes, which are very painful and annoying to debug. Unlike CPU debugging, we cannot just capture a SIGSEGV in a debugger and call it a day.

Already, the resources show up in a more convenient way. Even though the shader didn’t specify 8-bit inputs anywhere, it just works. Typed formats have up to 4 components. This is enough to cover RGB and Alpha channels, which of course has its origins in graphics concepts. texelFetch cannot know if we have R8_UINT or R16G16B16A16_UINT for example, so we have to select the .x component in the shader.
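To make the typed-load behavior concrete, here is a CPU model of what texelFetch does for an R8_UINT texel buffer. It is a sketch under assumptions: the (0, 0, 1) fill for missing components and the all-zero out-of-bounds result are how I understand the Vulkan rules, but the exact out-of-bounds result depends on which robustness features are enabled. UVec4 and the function name are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for GLSL's uvec4.
struct UVec4
{
    uint32_t x, y, z, w;
};

// Hypothetical CPU model of texelFetch() on an R8_UINT texel buffer.
// The shader always gets four components back, whatever the format stores.
constexpr UVec4 texel_fetch_r8_uint(const uint8_t *data, size_t texel_count, size_t index)
{
    return index < texel_count
               ? UVec4{ data[index], 0u, 0u, 1u } // widen 8-bit to 32-bit, fill the rest
               : UVec4{ 0u, 0u, 0u, 0u };         // assumed robust out-of-bounds result
}
```

This is why the shader selects .x: only the first component carries data for a single-channel format, and the same shader code keeps working if the view is later changed to a wider format.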

s_mov_b32 s0, s3

; On RADV, descriptors live in a fixed 32-bit VA region:
; 0xffff8000'xxxxxxxx
; Since this is known by compiler, we only need to pass down 32-bits
; and synthesize the upper half with s_movk_i32 (0x8000 is sign-extended).
s_movk_i32 s3, 0x8000

; STORAGE_BUFFER and TEXEL_BUFFER are both 16 bytes.
; Both are loaded here in one go into scalar registers.
; Notice the s_load. These loads go into the constant cache,
; and are almost "free". The same load is shared for all threads
; in the subgroup.
s_load_dwordx8 s[8:15], s[2:3], null
v_lshl_add_u32 v0, s0, 4, v0

; Wait until the scalar load completes so we can use the descriptor.
s_waitcnt lgkmcnt(0)

; Typed load. The descriptor holds information about e.g. R8_UINT.
; Bounds checking is also free.
buffer_load_format_x v1, v0, s[12:15], 0 idxen
v_lshlrev_b32_e32 v0, 2, v0
s_waitcnt vmcnt(0)

; Store, but uses a descriptor instead.
; This is automatically bounds checked and is free on AMD.
buffer_store_dword v1, v0, s[8:11], 0 offen
s_endpgm
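The pointer synthesis at the top of that listing is easy to replicate on the CPU. A minimal sketch, assuming the 0xffff8000'xxxxxxxx region described in the comments; the function names are made up, with the first one modelling only the sign-extension part of s_movk_i32:

```cpp
#include <cstdint>

// Models s_movk_i32: a 16-bit immediate is sign-extended to 32 bits.
constexpr uint32_t movk_i32(uint16_t imm16)
{
    return (imm16 & 0x8000u) ? (0xffff0000u | imm16) : imm16;
}

// Hypothetical helper: rebuild the full 64-bit descriptor set address from
// the 32 bits that were actually passed to the shader.
constexpr uint64_t descriptor_set_va(uint32_t low32)
{
    return (static_cast<uint64_t>(movk_i32(0x8000)) << 32) | low32;
}
```

So a single SGPR of input plus one cheap scalar instruction is enough to produce the 64-bit address that s_load_dwordx8 reads from.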

The basic Vulkan binding model

Granite implements a rather old school binding model, but I think this model is overall very easy to understand and use. More modern bindless design is introduced later when it becomes relevant.

In the shader, I declare things like layout(set = 0, binding = 1). In Granite, this simply means that we have to bind a resource of the appropriate type before dispatching, e.g.:

 cmd->set_buffer_view(/* set */ 0, /* binding */ 1, *buffer_view);

This papers over a ton of concepts, and makes it very easy and convenient to program. In reality, there are a lot of API objects in play here. When the compiler sees layout(set = 0, binding = 1) for example, it needs to check the provided VkPipelineLayout given to pipeline creation. A set denotes a group of resources that are bound together as one contiguous entity. In the ISA, set = 0 is determined to be initialized in a certain scalar register:

s_load_dwordx8 s[8:15], s[2:3], null

The VkPipelineLayout also contains information about what e.g. binding = 1 means. In this case, the driver happened to decide that binding = 0 is at offset 0, and binding = 1 is at offset 16. Since these descriptors are adjacent in memory we got a lucky optimization where we load 32 bytes at once.
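The layout the driver picked here can be written down as a few lines of arithmetic. This is a sketch of this particular case only, assuming 16-byte descriptors packed back to back as seen in the ISA; other drivers and descriptor types will differ, and the names are hypothetical.

```cpp
#include <cstdint>

// Assumption taken from the ISA above: both descriptor types are 16 bytes.
constexpr uint32_t kDescriptorSize = 16;

// Byte offset of a binding inside the descriptor set.
constexpr uint32_t binding_offset(uint32_t binding)
{
    return binding * kDescriptorSize;
}

// Total bytes the shader has to load to fetch every descriptor in the set.
constexpr uint32_t set_size(uint32_t binding_count)
{
    return binding_count * kDescriptorSize;
}

// First SGPR holding a binding when the set is loaded starting at base_sgpr.
// Each SGPR is one dword (4 bytes).
constexpr uint32_t first_sgpr(uint32_t base_sgpr, uint32_t binding)
{
    return base_sgpr + binding_offset(binding) / 4;
}
```

Two bindings give 32 bytes, i.e. 8 dwords, which is why a single s_load_dwordx8 into s[8:15] covers both, with binding 0 landing in s[8:11] and binding 1 in s[12:15].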

On the API side, we need a compatible VkPipelineLayout object when recording the command buffer to ensure that everything lines up. Granite does this automatically, through shader reflection, which synthesizes a working layout for us.

Based on the contained VkDescriptorSetLayout inside the pipeline layout, it knows how to allocate a VkDescriptorSet from a VkDescriptorPool and write descriptors to it. Then it can bind the descriptor set to the command buffer before dispatching. We can see all of this in effect in the capture. Turn off the Filter and we get:

The descriptor set is updated, then later bound. In reality vkCmdBindDescriptorSets is just a 32-bit push constant, which the shader ends up reading in the s2 register (the lower half of the s[2:3] pair used by s_load_dwordx8 above).

Deeper understanding with VK_EXT_descriptor_buffer

Managing descriptors is always a point of contention in Vulkan programming if you’re writing the raw API. There’s a ton of concepts to juggle and it’s mostly pretty dull stuff.

As an extension to the original old school model I outlined above, it’s possible to treat a descriptor set as raw memory, which gets rid of a ton of jank. Granite supports this model if you opt in to it. Change the sample to use it and recapture:

 if (!ctx.init_instance_and_device(
   nullptr, 0, nullptr, 0,
   Vulkan::CONTEXT_CREATION_ENABLE_ROBUSTNESS_2_BIT |
   Vulkan::CONTEXT_CREATION_ENABLE_DESCRIPTOR_BUFFER_BIT))
   return EXIT_FAILURE;

Make sure to make a release build of the test and not a debug build, otherwise descriptor buffers are disabled. Note that this requires a recent build of RenderDoc. The latest stable v1.40 release supports descriptor buffers.

Now we explicitly tell the driver that the descriptor set lives at offset 0 from the bound buffer. If we then inspect the bound descriptor buffer …

Now we can see the raw guts of the storage buffer and texel buffer being encoded. You can even see the 0x40 and 0x10 being encoded there which corresponds to the sizes of the descriptors.

Porting ShaderToy shaders to compute

To get something interesting on screen to end this bringup exercise, we could port some shadertoy shaders. These are super convenient since many of them don’t require anything fancy to run like external textures or anything. I picked some shadertoy arbitrarily. Store this to mandelbulb.glsl, and then we replace our shader with a mandelbulb.comp that calls the shadertoy code:

#version 460

// We're writing 2D images now, so this makes more sense.
layout(local_size_x = 8, local_size_y = 8) in;

layout(set = 0, binding = 0) writeonly uniform image2D output_image;

// Constants used by shadertoys.
layout(push_constant) uniform Registers
{
  vec2 iResolution;
  float iTime;
};

#include "mandelbulb.glsl"

void main()
{
  vec4 color;
  mainImage(color, vec2(gl_GlobalInvocationID.xy) + 0.5);

  // Stores the result to texture.
  imageStore(output_image, ivec2(gl_GlobalInvocationID.xy), color);
}

On the API side, we simply need to create a storage texture and bind it to the shader.

Just with this simple setup, you can go completely nuts and play around with the more math-heavy side of graphics if you want.

Where to go from here?

From here, I think the natural evolution is to learn about:

  • Atomics
    • Living in a world without mutexes: lockless programming with millions of threads
  • Shared memory
  • Subgroup operations
    • Case study: Scalarization
    • Case study: Bindless and non-uniform indexing of descriptors
  • Texture sampling and mip-mapping
    • The bread and butter of graphics
    • Case study: do some simple image processing with simple filters
  • Memory coherency and how to communicate with other workgroups
    • Case study: Single pass down-sampling
  • If relevant, start porting over the code to more shading languages
  • API synchronization and how to keep CPU and GPU pipelined
  • … and maybe only then start looking at getting some images on screen (with compute)
  • Bring up your own Vulkan code from scratch to get rid of the training wheels and make sure you understand how everything comes together

After that, it’s a matter of learning the common algorithms that show up all the time, like parallel scans, classification, binning, etc, etc. This naturally leads to indirect dispatches, and once these concepts are in place, we can design a very simple compute shader rasterizer that renders some simple glTF models. Only when those concepts land do we consider the graphics pipeline.

I designed my own ridiculously fast game streaming video codec – PyroWave

Streaming gameplay from one machine to another over a network is a reasonably popular use case these days. These use cases demand very, very low latency. Every millisecond counts here. We need to:

  • Send controller input from machine A to B over network
  • B renders a frame on the GPU
  • B encodes the frame into a bitstream
  • B sends the result over a network to A
  • A decodes the bitstream
  • A displays image on screen
  • Dopamine is released in target brain

Every step in this chain adds latency and we want to minimize this as much as possible. The go-to solution here is GPU accelerated video compression using whatever codec you have available, usually H.264, HEVC or if you’re really fancy, AV1. Ideally, we want all of this to complete in roughly 20 ms.

To make this use case work well, we have to strangle the codecs quite a bit. Modern video codecs love latency since it allows tricks like flexible rate control and B-frames. That is what allows these codecs to operate at ridiculous compression ratios, but we have to throw away most of these tricks now. Since the codec cannot add latency, and we’re working on a fixed bit-rate budget (the ethernet cable or WiFi antenna), we’re left with:

  • Hard-capped constant bit rate
    • There is no buffer that can soak up variable bit rate
  • Infinite GOP P-frames or intra-refresh
    • Either choice deals with packet loss differently

When game streaming, the expectation is that we have a lot of bandwidth available. Streaming locally on a LAN in particular, bandwidth is basically free. Gigabit ethernet is ancient technology and hundreds of megabits over WiFi is no problem either. This shifts priorities a little bit for me at least.

Back in my student days, I designed a very simple low-complexity video codec for my master thesis, and after fiddling with Vulkan video and PyroFling for a while, that old itch was scratched again. I wanted to see what would happen if I designed a codec with laser focus on local streaming with the absolute lowest possible latency. What could go wrong?

Throwing out motion prediction – intra-only

This is the grug-brained approach to video, but it’s not as silly as it sounds. Bit-rates explode of course, but we gain:

  • Excellent error resilience
    • Even on local WiFi streaming to a handheld device, this does matter quite a lot, at least in my experience
  • Simplicity
    • Duh
  • Consistent quality
    • With CBR, video quality is heavily dependent on how good of a job motion estimation can do

Intra-only has use cases in digital cinema (motion JPEG2000) and more professionally oriented applications where these concerns are likely more important than squeezing bandwidth. We’re now working at 100+ Mbit/s instead of ~10-20 Mbit/s, so streaming over the internet is no longer feasible outside of peer-to-peer with fiber links. For reference, raw 1080p60 with 4:2:0 chroma subsampling is in the range of 1.5 Gbit/s, and it only gets worse from there.
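The raw figure is easy to check by hand. A quick sketch, assuming 8 bits per sample: with 4:2:0, each pixel carries one luma sample plus two chroma samples shared between four pixels, which works out to 12 bits per pixel. The function name is made up.

```cpp
#include <cstdint>

// Raw bit rate of 8-bit 4:2:0 video.
// Per pixel: 8 bits luma + 2 chroma planes * 8 bits / 4 pixels = 12 bits.
constexpr uint64_t raw_420_bits_per_second(uint64_t width, uint64_t height, uint64_t fps)
{
    return width * height * 12 * fps;
}
```

For 1920x1080 at 60 fps this gives 1,492,992,000 bits per second, so a 100 Mbit/s stream corresponds to a compression ratio of roughly 15:1.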

Throwing out entropy coding

Entropy coding is an absolute nightmare for parallelization, which means encoding solely on the GPU with compute shaders becomes an extremely painful affair. Let’s just throw that out and see how far we get. Gotta go fast!

There are codecs in this domain too, but it’s getting very specialized at this point. In the professional broadcasting space, there are codecs designed to squeeze more video through existing infrastructure with “zero” lag and minimal hardware cost. My master thesis was about this, for example. A more consumer oriented example is VESA display stream compression (I’m not sure if it does entropy coding, but the compression ratios are small enough I doubt it). There isn’t much readily available software in this domain, it’s generally all implemented in tiny ASICs. If FFmpeg doesn’t support it, it doesn’t exist for mere mortals.

Discrete Wavelet Transforms

While modern codecs are all block-based Discrete Cosine Transform (DCT) + a million hacks on top to make it good, there is an alternative that tried its best in the 90s to replace DCTs, but kinda failed in the end. This video is a nice explanation of some of the lore: https://www.youtube.com/watch?v=UGXeRx0Tic4. DWT-based compression has a niche today, but only in intra video compression. It’s a shame, because it’s quite elegant. See also https://en.wikipedia.org/wiki/Discrete_wavelet_transform.

A graphics programmer will be familiar with this structure immediately, because this is just good old mip-maps with spice. Effectively, we downsample images, and also compute the “error” between the high-res picture and low-res picture. With signal processing lenses on, we can say it’s a critically sampled filter-bank. After processing N pixels, we obtain N / 2 low-pass and N / 2 high-pass pixels. The filters designed to do this are very particular (I really don’t know or care how they were made), but in practice each one is just a convolution kernel, nothing too wild. The number of levels can vary but I chose 5 levels of decomposition.
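To show the N in, N / 2 + N / 2 out structure in code, here is one level of a 1D 5/3 wavelet written as integer lifting steps. Note the swap: this is the reversible integer variant known from JPEG 2000, chosen because it is easy to verify, and not the floating-point kernel a GPU codec would use. All names are made up for illustration.

```cpp
#include <cstddef>

template <size_t N>
struct Bands
{
    int low[N / 2];  // low-pass samples: the "downsampled image"
    int high[N / 2]; // high-pass samples: the "error" against it
};

template <size_t N>
struct Signal
{
    int v[N];
};

// Floor division for positive divisors; plain '/' truncates toward zero.
constexpr int floor_div(int a, int b)
{
    return a >= 0 ? a / b : -((-a + b - 1) / b);
}

template <size_t N>
constexpr Bands<N> dwt53_forward(const int (&x)[N])
{
    static_assert(N >= 2 && N % 2 == 0, "sketch expects an even length");
    Bands<N> out{};
    // Predict: each odd sample minus the average of its even neighbours.
    for (size_t i = 0; i < N / 2; i++)
    {
        int right = (2 * i + 2 < N) ? x[2 * i + 2] : x[N - 2]; // mirror at the edge
        out.high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    // Update: even samples get smoothed by the neighbouring high-pass values.
    for (size_t i = 0; i < N / 2; i++)
    {
        int prev = (i > 0) ? out.high[i - 1] : out.high[0]; // mirror at the edge
        out.low[i] = x[2 * i] + floor_div(prev + out.high[i] + 2, 4);
    }
    return out;
}

template <size_t N>
constexpr Signal<N> dwt53_inverse(const Bands<N> &b)
{
    Signal<N> out{};
    // Undo the update step first to recover the even samples ...
    for (size_t i = 0; i < N / 2; i++)
    {
        int prev = (i > 0) ? b.high[i - 1] : b.high[0];
        out.v[2 * i] = b.low[i] - floor_div(prev + b.high[i] + 2, 4);
    }
    // ... then undo the predict step to recover the odd samples.
    for (size_t i = 0; i < N / 2; i++)
    {
        int right = (2 * i + 2 < N) ? out.v[2 * i + 2] : out.v[N - 2];
        out.v[2 * i + 1] = b.high[i] + floor_div(out.v[2 * i] + right, 2);
    }
    return out;
}
```

Smooth input produces high-pass values at or near zero, which is what makes the later quantization step effective. A full 2D decomposition would run this along rows and columns, then recurse on the low-pass band for each additional level.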

Once the image is filtered into different bands, the values are quantized. Quantizing wavelets is a little tricky since we need to consider that during reconstruction, the filters have different gains. For the CDF 9/7 filter, high-pass is attenuated by 6 dB, and there are other effects when upsampling the lower resolution bands (zero-insertion). Rather than sweating out new graphs, I’ll just copy paste from my thesis. CDF 9/7 has very similar looking spectrum to the 5/3 I used here.

After normalizing the noise power, higher frequency bands can be quantized much harder than low-frequency bands. This exploits human psychovisual effects. This effect is used during rate control, which is another interesting problem. In the end, the higher frequency bands quantize to zero for most of the frame, with bits being allocated to critical regions of the image.

Classic artifacts of wavelets

The JPEG blocking artifact is infamous. The typical failure mode of wavelets is that all high-pass information is quantized to 0, even where it shouldn’t be. This leads to blurring and, if severe, ringing artifacts. Given how blurry games these days can be with TAA, maybe this simply isn’t all that noticeable? 😀 Modern problems require modern solutions.

Packing coefficients into blocks really fast

Fiddling with this part of the codec was the thing that took the longest, but I think I landed on something alright eventually.

The basic block is 32×32 coefficients. This forms a standalone unit of the bitstream that can be decoded independently. If there is packet loss, we can error correct by simply assuming all coefficients are zero. This leads to a tiny blur somewhere random in the frame which is likely not even going to be perceptible.

The 32×32 block is further broken down into 8×8 blocks, which are then broken down into 4×2 blocks. This design is optimized for the GPU’s hierarchy of threads:

  • 1 thread: 4×2 coefficients
  • Cluster of threads (subgroup): 8×8 block
  • Workgroup: 32×32 block (128 threads)
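The thread counts in that hierarchy follow directly from the block sizes. A tiny sketch of the arithmetic, with made-up names:

```cpp
// Each thread owns one 4x2 group of coefficients.
constexpr unsigned kCoefficientsPerThread = 4 * 2;

// Number of threads needed to cover a block of coefficients.
constexpr unsigned threads_for_block(unsigned width, unsigned height)
{
    return (width * height) / kCoefficientsPerThread;
}

// Number of 8x8 blocks inside one 32x32 block.
constexpr unsigned k8x8BlocksPer32x32 = (32 / 8) * (32 / 8);
```

An 8×8 block needs 8 threads, and 16 such blocks make up the 32×32 block, giving the 128 threads per workgroup listed above.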

8 coefficients per thread is deliberately chosen so that we can be byte oriented. Vulkan widely supports 8-bit storage of SSBOs, so I rely on that. We absolutely cannot be in a situation where we do bit fiddling on memory. That makes GPUs sad.

Like most wavelet codecs, I went with bit-plane encoding, but rather than employing a highly complicated (and terribly slow) entropy coding scheme, the bit-planes are just emitted raw as-is. I did this in my master thesis project and I found it surprisingly effective. The number of bits per coefficient is signaled at a 4×2 block level. I did some experiments on these block sizes and 4×2 was the right tradeoff. Using subgroup operations and some light prefix sums across the workgroup, it’s very efficient to decode and encode data this way. For non-zero coefficients, sign bits are tightly packed to the end of the 32×32 block. This was mildly tricky to implement, but not too bad.
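Here is a small sketch of why 8 coefficients per thread is convenient: every bit-plane of a 4×2 block is exactly one byte. The bit order within the byte and the function names are my own assumptions for illustration; the real layout is whatever the bitstream draft specifies.

```cpp
#include <cstdint>

// Number of magnitude bit-planes needed to represent the largest
// coefficient in a 4x2 block. Zero means the block is entirely empty.
constexpr unsigned bits_needed(const uint16_t (&magnitude)[8])
{
    unsigned max_magnitude = 0;
    for (unsigned i = 0; i < 8; i++)
        if (magnitude[i] > max_magnitude)
            max_magnitude = magnitude[i];

    unsigned bits = 0;
    while (max_magnitude)
    {
        bits++;
        max_magnitude >>= 1;
    }
    return bits;
}

// Gather bit `plane` of all 8 coefficients into one byte.
// Assumed order: coefficient i lands in bit i.
constexpr uint8_t extract_plane(const uint16_t (&magnitude)[8], unsigned plane)
{
    unsigned byte = 0;
    for (unsigned i = 0; i < 8; i++)
        byte |= ((magnitude[i] >> plane) & 1u) << i;
    return static_cast<uint8_t>(byte);
}

// One sign bit is only needed for coefficients that are not zero.
constexpr unsigned sign_bits_needed(const uint16_t (&magnitude)[8])
{
    unsigned count = 0;
    for (unsigned i = 0; i < 8; i++)
        count += magnitude[i] != 0 ? 1u : 0u;
    return count;
}
```

A block whose largest magnitude is 5 costs 3 bytes of plane data plus one sign bit per non-zero coefficient, and a block of all zeros costs nothing beyond its signaled bit count.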

The details are in my draft of the bitstream.

Accurate and fast rate control

In this style of compression, rate control is extremely important. We have a fixed (but huge) budget we have to meet. Most video codecs struggle with this requirement since the number of bits we get out of entropy coding is not easily knowable before we have actually done the compression. There is usually a lot of slack available to codecs when operating under normal VBR constraints. If a frame overshoots by 30%, we can amortize that over a few frames no problem, but that slack does not exist here since we’re assuming zero buffering latency.

Without entropy coding, we can trivialize the problem. For every 32×32 block, I test what happens if I throw away 1, 2, 3, … bits. I measure psychovisually weighted error (MSE) and bit-cost from that and store it in a buffer for later.

During RD analysis I can loosely sort the decisions to throw away bits by order of least distortion per bit saved. After the required number of bits have been saved through some prefix summing, we have achieved a roughly optimal rate distortion across the entire image in fixed time.

In the final pass, every 32×32 block checks how many bits to throw away and packs out the final bit-stream to a buffer. The result is guaranteed to be within the rate limit we set, usually ~10-20 bytes under the target.
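The core idea of the analysis can be sketched as a serial greedy loop on the CPU. This is a stand-in, not the GPU implementation: the real thing uses a loose sort and prefix sums to run in parallel, whereas this version simply picks the cheapest next step one at a time. The fixed sizes, struct names and numbers are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kBlocks = 3;   // tiny example; a real frame has thousands
constexpr size_t kMaxSteps = 3; // how many extra bits we consider discarding

// Cost of discarding one more bit in a block, as measured up front.
struct Step
{
    uint32_t distortion; // added (weighted) error
    uint32_t bits_saved; // bits removed from the payload
};

struct Result
{
    uint32_t bits_dropped[kBlocks]; // final decision per block
    uint32_t bits_saved;
    uint32_t distortion;
};

constexpr Result rate_control(const Step (&steps)[kBlocks][kMaxSteps], uint32_t bits_to_save)
{
    Result r{};
    while (r.bits_saved < bits_to_save)
    {
        // Find the block whose next step has the least distortion per bit saved.
        size_t best = kBlocks;
        for (size_t b = 0; b < kBlocks; b++)
        {
            uint32_t n = r.bits_dropped[b];
            if (n == kMaxSteps || steps[b][n].bits_saved == 0)
                continue;
            if (best == kBlocks)
            {
                best = b;
                continue;
            }
            const Step &candidate = steps[b][n];
            const Step &current = steps[best][r.bits_dropped[best]];
            // Compare the two ratios by cross-multiplying to stay in integers.
            if (uint64_t(candidate.distortion) * current.bits_saved <
                uint64_t(current.distortion) * candidate.bits_saved)
                best = b;
        }
        if (best == kBlocks)
            break; // nothing left to discard

        const Step &chosen = steps[best][r.bits_dropped[best]];
        r.bits_saved += chosen.bits_saved;
        r.distortion += chosen.distortion;
        r.bits_dropped[best]++;
    }
    return r;
}
```

Since the cost of every step is known exactly before anything is packed, the loop can stop as soon as the budget is met, and the final pass only has to read bits_dropped for its block.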

Being able to rate limit like this is a common strength of wavelet codecs. Most of them end up iterating from most significant to least significant bit-plane and can stop encoding when rate limit is met, which is pretty cool, but also horribly slow …

Gotta go ridiculously fast 😀

So … is it fast? I think so. Here’s a 1080p 4:2:0 encode and decode of Expedition 33, which I found to be on the “brutal” end for image compression. Lots of green foliage and a lot of TAA noise is quite hard to encode.

0.13 ms on a RX 9070 XT on RADV. Decoding is also very fast. Under 100 microseconds. I don’t think anything else even comes close. The DWT pass was quite heavily optimized. It’s one of the few times where I found that packed FP16 math actually helped a lot, even on a beast like RDNA4. The quantizer pass does the most work of all the passes, and it took some effort to optimize too. Doing DWT in FP16 does have a knock-on effect on the maximum quality metrics we can achieve though.

Encoding more “normal” games, the quant + analysis pass has an easier time. 80 microsecond encode is pretty good.

Here’s a 4K 4:2:0 encode of the infamous ParkJoy scene.

0.25 ms, showing that 1080p struggles a bit to keep the GPU fully occupied. An interesting data point is that transferring a 4K RGBA8 image over the PCI-e bus is far slower than compressing it on the GPU like this, followed by copying over the compressed payload. Maybe there is a really cursed use case here …

I think this is an order of magnitude faster than even dedicated hardware codecs on GPUs. This performance improvement translates directly to lower latency (less time to encode and less time to decode), so maybe I’m onto something here.

Power consumption on the Steam Deck when decoding is also barely measurable. Not sure it’s less than the hardware video decoder, but I actually wouldn’t be surprised if it were.

Quality comparisons

Given how niche and esoteric this codec is, it’s hard to find any actual competing codecs to compare against. Given the domain is game streaming the only real alternative is to test against the GPU vendors’ encoders with H.264/HEVC/AV1 codecs in FFmpeg. NVENC is the obvious one to test here. VAAPI is also an option but at least FFmpeg’s implementation of VAAPI fails to meet CBR targets which is cheating and broken for this particular use case. It’s possible I held it wrong, but I’m not going to try debugging that.

Above 200 Mbit/s at 60 fps, which is about 1.5 bpp, I find it hard to spot any compression artifacts without side-by-side comparisons and zooming in. For something as simple as this codec, that’s quite neat. For objective metrics, I made use of https://github.com/psy-ex/metrics.

To even begin to compare a trivial codec like this against these codecs is a little silly, but we can level the playing field a bit by putting these codecs under the same harsh restrictions that we have in PyroWave:

  • Intra-only
  • CBR with hard cap
    • Encoder is not allowed any slack, which should make the rate control really sweat
  • Fastest modes (not sure it matters that much to intra-only)

Example command line here:

ffmpeg -y -i input.y4m -b:v 100000k -c:v hevc_nvenc -preset p1 -tune ull -g 1 -rc cbr -bufsize 1667k out.mkv
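The -bufsize value in that command appears to correspond to a single frame’s worth of bits at 60 fps, which is what leaves the encoder no slack. A quick sketch of the arithmetic, with a made-up function name (FFmpeg’s k suffix means 1000):

```cpp
#include <cstdint>

// Bits available to a single frame when nothing may be carried over
// between frames, rounded up to a whole kbit.
constexpr uint64_t frame_budget_kbit(uint64_t bitrate_kbit, uint64_t fps)
{
    return (bitrate_kbit + fps - 1) / fps;
}
```

At 100000k and 60 fps this gives 1667 kbit per frame, or roughly 208 kB.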

No one in their right mind would stream like this, but let’s try anyway. 🙂

The video clips from the games are 5s clips I captured myself to raw NV12 video. I don’t think it’s super useful to upload those. My hacked up scripts to generate the graphs can be found here for reference.

I ran the NVENC tests on an RTX 4070 on 575 drivers.

ParkJoy

I included this as a baseline since this sequence has been seared into the mind of every video engineer out there (I think?) … This clip is 50 fps, but since my test script is hard-coded for 1080p60, I hex-edited the .y4m :’)

O_O. One thing to note here is that AV1/HEVC rate control kinda fails in this scenario. It ends up using less than the allotted budget, probably because it has to be conservative to meet the ridiculous hard-capped CBR. The graphs are done using the final encoded size however.

… VMAF, are you drunk?

Back to reality.

The more typical metrics look more like what I would expect. XPSNR is supposed to be a weighted PSNR that takes psychovisual effects into account, but I have no idea if it’s a good objective metric.

Expedition 33

Quite hard to encode. It was this game that actually gave me the last push to make this codec since even at 50 Mbit/s with motion estimation, I recall some sections giving the encoder real trouble. Bumping bit rates just never cleaned things up properly for whatever reason …

I don’t know why, but VMAF really likes PyroWave.

Stellar Blade

Surprisingly easy to code. Must be the blurred backgrounds in play.

The use of FP16 kinda limits how high the PSNR can go. This is way beyond transparency, so, whatever.

Street Fighter 6

This scene is arguably a good argument for 4:4:4 …

FF VII Rebirth

More foliage, which I expected to be kinda hard. The game’s presentation is very soft and it shows in the compression rates 🙂

VMAF really seems like a joke metric for these use cases …

Another example of PSNR flattening off due to lowered internal precision.

Conclusion

I’m quite happy with this as-is, and having a 100% DIY streaming solution for my own use is fun.

Conquering FidelityFX FSR4 – Enabling the pretty pixels on Linux through maniacal persistence

As AMD’s RDNA4 GPUs released, FSR 4 was released alongside them to much fanfare. It’s a massive leap in quality over FSR 3.1, and with FSR 4 moving to a machine learning model instead of an analytical model, FSR 3.1 marks the farewell to fully open and grok-able upscalers. It’s truly a shame, but the graphics world cares little about such sentimentality when pretty pixels are at stake.

FSR 4 is not actually released yet to developers for integration in engines, but exists as a driver toggle that can be enabled for certain FSR 3.1 games. The FSR 3.1 implementation had the foresight to let the implementation peek into the driver DLLs and fish out a replacement implementation. This is of course a horrible mess from any compatibility perspective, but we just have to make it work.

From day 1 of RDNA4’s release, there’s been a lot of pressure from the Linux gaming community to get this working on Linux as well. Buying an RDNA4 GPU would be far less enticing if there’s no way to get FSR 4 working after all …

First battle – how is FSR 4 even invoked?

Given the (currently) proprietary nature of FSR 4, we could easily have been in the situation of DLSS where it’s literally impossible to re-implement it. DLSS uses interop with CUDA for example and there’s 0% chance anyone outside of NVIDIA can deal with that. Fortunately NVIDIA provides the shims required to make DLSS work on Proton these days, but we’re on our own at this time for FSR 4.

It all started with an issue made on vkd3d-proton’s tracker asking for FSR 4 support.

Somehow, with the OptiScaler project, they had got to the point where vkd3d-proton failed to compile compute shaders. This was very encouraging, because it means that FSR 4 goes through D3D12 APIs somehow. Of course, D3D12’s way of vendor extensions is the most disgusting thing ever devised, but we’ll get to that later …

The flow of things seems to be:

  • FSR 3.1 DLL tries to open the AMD d3d12 driver DLL (amdxc64.dll) and queries for a COM interface.
  • Presumably, after checking that FSR 4 is enabled in control panel and checking if the .exe is allowed to promote, it loads amdxcffx64.dll, which contains the actual implementation.
  • amdxcffx64.dll creates normal D3D12 compute shaders against the supplied ID3D12Device (phew!), with a metric ton of undocumented AGS magic opcodes. Until games start shipping their own FSR 4 implementation, I’d expect that users need to copy over that DLL from an AMD driver install somehow, but that’s outside the scope of my work.
  • amdxcffx64.dll also seems to call back into the driver DLL to ask for which configuration to use. Someone else managed to patch this check out and eventually figure out how to implement this part in a custom driver shim.

With undocumented opcodes, I really didn’t feel like trying anything. Attempting to reverse something like that is just too much work. There was high risk of spending weeks on something that didn’t work out, but someone found a very handy file checked into the open source LLPC repos from AMD. This file is far from complete, but it’s a solid start.

At this point it was clear that FSR 4 is based on the WMMA (wave matrix multiply accumulate) instructions found on RDNA3 and 4 GPUs. This is encouraging since we have VK_KHR_cooperative_matrix in Vulkan which maps directly to WMMA. D3D12 was supposed to get this feature in SM 6.8, but it was dropped for some inexplicable reason. It’s unknown if FSR 4 could have used standard WMMA opcodes in DXIL if we went down that timeline. Certainly would have saved me a lot of pain and suffering …

Dumping shaders

Next step was being able to capture the shaders in question. Ratchet & Clank – Rift Apart was on the list of supported games from AMD and was the smallest install I had readily available, so I fired it up with RGP on Windows and managed to observe WMMA opcodes. Encouraging!

Next step was dumping the actual DXIL shaders. However, it seems like the driver blocks FSR 4 if RenderDoc is attached (or some other unknown weirdness occurs), so a different strategy was necessary.

First attempt was to take a FSR 3.1 application without anti-tampering and hook that somehow. The FSR 3.1 demo app in the SDK was suitable for this task. The driver refused to use FSR 4 here, but when I renamed the demo .exe to RiftApart.exe it worked. quack.exe lives on!

Now that I had DXIL and some example ISA to go along with it, it was looking possible to slowly piece together how it all works.

Deciphering AGS opcodes

Shader extensions in D3D are a disgusting mess, and this is roughly how they work.

  • Declare a magic UAV at some magic register + space combo
  • Emit a ton of oddly encoded atomic compare exchanges to that UAV
    • DXC has no idea what any of this means, but it must emit the DXIL as is
    • Compare exchange is likely chosen because it’s never okay to reorder, merge or eliminate those operations
  • Driver compiler recognizes these magic back-doors, consumes the magic stream of atomic compare exchanges, and translates that into something that makes sense

dxil-spirv in vkd3d-proton already had some code to deal with this as Mortal Kombat 11 has some AGS shaders for 64-bit atomics (DXIL, but before SM 6.6).

RWByteAddressBuffer MAGIC : register(u0, space2147420894);

uint Code(uint opcode, uint opcodePhase, uint immediateData) 
{ 
   return (MagicCode << MagicCodeShift) | 
          ((immediateData & DataMask) << DataShift) | 
          ((opcodePhase & OpcodePhaseMask) << OpcodePhaseShift) | 
          ((opcode & OpcodeMask) << OpcodeShift); 
} 

uint AGSMagic(uint code, uint arg0, uint arg1) 
{ 
       uint ret; 
       MAGIC.InterlockedCompareExchange(code, arg0, arg1, ret); 
       return ret; 
} 

uint AGSMagic(uint opcode, uint phase, uint imm, uint arg0, uint arg1) 
{ 
       return AGSMagic(Code(opcode, phase, imm), arg0, arg1); 
}
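On the translation side, every compare-exchange on the magic UAV has to be unpacked again before any pattern matching can happen. Here is a minimal C++ sketch of that round trip. The shift and mask values are what I recall from AMD's public AGS shader intrinsics header, so treat them as assumptions; `AgsInstruction`, `ags_encode` and `ags_decode` are hypothetical names and not dxil-spirv's actual code.

```cpp
#include <cstdint>

// ASSUMPTION: field layout recalled from AMD's public AGS intrinsics header.
// Verify against the real header before relying on any of these values.
constexpr uint32_t kMagicCode = 0x5, kMagicCodeShift = 28, kMagicCodeMask = 0xf;
constexpr uint32_t kOpcodePhaseShift = 24, kOpcodePhaseMask = 0x3;
constexpr uint32_t kDataShift = 8, kDataMask = 0xffff;
constexpr uint32_t kOpcodeShift = 0, kOpcodeMask = 0xff;

// Hypothetical container for one decoded magic instruction.
struct AgsInstruction {
    uint32_t opcode;
    uint32_t phase;
    uint32_t immediate;
};

// Mirrors Code() from the HLSL above.
constexpr uint32_t ags_encode(AgsInstruction inst) {
    return (kMagicCode << kMagicCodeShift) |
           ((inst.immediate & kDataMask) << kDataShift) |
           ((inst.phase & kOpcodePhaseMask) << kOpcodePhaseShift) |
           ((inst.opcode & kOpcodeMask) << kOpcodeShift);
}

// Unpacks the first compare-exchange operand.
// Returns false if the word does not carry the magic code.
inline bool ags_decode(uint32_t code, AgsInstruction &out) {
    if (((code >> kMagicCodeShift) & kMagicCodeMask) != kMagicCode)
        return false;
    out.opcode = (code >> kOpcodeShift) & kOpcodeMask;
    out.phase = (code >> kOpcodePhaseShift) & kOpcodePhaseMask;
    out.immediate = (code >> kDataShift) & kDataMask;
    return true;
}
```

A translator would call something like `ags_decode` on each atomic in order and group the results by opcode and phase to recover one logical instruction.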

Every WMMA opcode is translated to a ton of these magic instructions back-to-back. A maximum of 21 (!) in fact. dxil-spirv has to pattern match more or less.

The exact details of how WMMA is represented with these opcodes isn’t super exciting, but for testing this functionality, I implemented a header.

It seems like a wave matrix is represented with 8 uints. This tracks, as for 16×16 matrix and wave32 + FP32, you need 256 bits per lane in the worst case.

struct WMMA_Matrix 
{ 
       uint v[8]; 
};
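The 8-uint figure is just arithmetic on the worst case. A small C++ sketch, assuming the matrix is spread evenly across the wave with no replication or padding (the helper names are made up for illustration):

```cpp
#include <cstdint>

// Bits each lane holds when a rows x cols matrix is split evenly over a wave.
constexpr uint32_t bits_per_lane(uint32_t rows, uint32_t cols,
                                 uint32_t element_bits, uint32_t wave_size) {
    return rows * cols * element_bits / wave_size;
}

// Same figure, rounded up to whole 32-bit registers.
constexpr uint32_t uints_per_lane(uint32_t rows, uint32_t cols,
                                  uint32_t element_bits, uint32_t wave_size) {
    return (bits_per_lane(rows, cols, element_bits, wave_size) + 31u) / 32u;
}

// 16x16 FP32 on wave32 is the worst case: 256 bits, i.e. uint v[8].
static_assert(bits_per_lane(16, 16, 32, 32) == 256, "worst case is 256 bits");
static_assert(uints_per_lane(16, 16, 32, 32) == 8, "matches uint v[8]");
```

By the same arithmetic a 16x16 8-bit matrix would only need 2 uints per lane, so most of the struct is padding for the smaller types.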

Here’s how WMMA matmul can be represented for example:

WMMA_Matrix WMMA_MatMulAcc(WaveMatrixOpcode op, 
 WMMA_Matrix A, WMMA_Matrix B, WMMA_Matrix C) 
{ 
 // A matrix 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 0, WaveMatrixRegType_A_TempReg), 
   A.v[0], A.v[1]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 0, WaveMatrixRegType_A_TempReg), 
   A.v[2], A.v[3]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 1, WaveMatrixRegType_A_TempReg), 
   A.v[4], A.v[5]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 1, WaveMatrixRegType_A_TempReg), 
   A.v[6], A.v[7]); 

 // B matrix 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 0, WaveMatrixRegType_B_TempReg), 
   B.v[0], B.v[1]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 0, WaveMatrixRegType_B_TempReg), 
   B.v[2], B.v[3]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 1, WaveMatrixRegType_B_TempReg), 
   B.v[4], B.v[5]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 1, WaveMatrixRegType_B_TempReg), 
   B.v[6], B.v[7]); 

 // C matrix 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 0, WaveMatrixRegType_Accumulator_TempReg), 
   C.v[0], C.v[1]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 0, WaveMatrixRegType_Accumulator_TempReg), 
   C.v[2], C.v[3]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(0, 1, WaveMatrixRegType_Accumulator_TempReg), 
   C.v[4], C.v[5]); 
 AGSMagic(WaveMatrixMulAcc, 0, 
   MatrixIO(1, 1, WaveMatrixRegType_Accumulator_TempReg), 
   C.v[6], C.v[7]); 

 // Configure type 
 AGSMagic(WaveMatrixMulAcc, 1, 
   int(op) << int(WaveMatrixOpcode_OpsShift), 0, 0); 

 // Read output 
 WMMA_Matrix ret; 
 ret.v[0] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(0, 0, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[1] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(1, 0, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[2] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(2, 0, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[3] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(3, 0, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[4] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(0, 1, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[5] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(1, 1, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[6] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(2, 1, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 ret.v[7] = AGSMagic(WaveMatrixMulAcc, 2, 
   MatrixIO(3, 1, WaveMatrixRegType_RetVal_Reg), 0, 0); 
 return ret; 
}

Fortunately, FSR 4 never tries to be clever with the WMMA_Matrix elements. It seems possible to pass in whatever random uint you want, but the shaders follow a strict pattern so dxil-spirv can do type promotion from uint to OpTypeCooperativeMatrixKHR on the first element and ignore the rest.

It took many days of agonizing trial and error, but eventually I managed to put together a test suite that exercises all the opcodes that I found in the DXIL files I dumped.

Tough corner cases

Ideally, there would be a straightforward mapping to KHR_cooperative_matrix, but that’s not really the case.

FP8 (E4M3)

FSR 4 is heavily reliant on FP8 WMMA. This is an exclusive RDNA4 feature. RDNA3 has WMMA, but only FP16. There is currently no SPIR-V support for Float8 either (but given that BFloat16 just released it’s not a stretch to assume something will happen in this area at some point).

To make something that is actually compliant with Vulkan at this point, I implemented emulation of FP8.

Converting FP8 to FP16 is fairly straight forward. While we don’t have float8 yet, we have uint8. Doing element-wise conversions like this is not strictly well defined, since the wave layout of different types can change, but it works fine in practice.

coopmat<float16_t, gl_ScopeSubgroup, 16u, 16u,
        gl_MatrixUseA> 
CoopMatFP8toFP16(
  coopmat<uint8_t, gl_ScopeSubgroup, 16u, 16u,
          gl_MatrixUseA> coop_input) 
{ 
 coopmat<float16_t, gl_ScopeSubgroup,
         16u, 16u, gl_MatrixUseA> coop_output;

 for (int i = 0; i < coop_input.length(); i++) 
 { 
   int16_t v = (int16_t(int8_t(coop_input[i])) << 7s) & (-16385s);
   coop_output[i] = int16BitsToFloat16(v); 
 }
 // Handles denorm correctly.
 // Ignores NaN, but who cares.
 // There is no Inf in E4M3.
 return coop_output * float16_t(256.0); 
}

I’m quite happy with this bit-hackery. Sign-extend, shift, bit-and, and an fmul.
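The trick is easy to check on the CPU. The sketch below replays the same sign-extend, shift and mask in scalar C++ and compares it against a textbook E4M3 decode (bias 7, no Inf). The function names are hypothetical, the half-float decode only handles what the hack can produce, and NaN encodings are deliberately treated as ordinary numbers, matching the "who cares" above.

```cpp
#include <cmath>
#include <cstdint>

// Textbook E4M3 decode: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// NaN encodings are not special-cased.
inline float decode_e4m3_reference(uint8_t bits) {
    int exponent = (bits >> 3) & 0xf;
    int mantissa = bits & 0x7;
    float magnitude = exponent == 0
        ? std::ldexp(float(mantissa), -9)                  // m/8 * 2^-6
        : std::ldexp(float(8 + mantissa), exponent - 10);  // (1 + m/8) * 2^(e-7)
    return (bits & 0x80) ? -magnitude : magnitude;
}

// Decodes FP16 bits for exponent fields 0..30 (no Inf/NaN handling).
inline float decode_fp16_bits(uint16_t bits) {
    int exponent = (bits >> 10) & 0x1f;
    int mantissa = bits & 0x3ff;
    float magnitude = exponent == 0
        ? std::ldexp(float(mantissa), -24)                    // m/1024 * 2^-14
        : std::ldexp(float(1024 + mantissa), exponent - 25);  // (1 + m/1024) * 2^(e-15)
    return (bits & 0x8000) ? -magnitude : magnitude;
}

// Scalar replay of the GLSL loop body plus the final multiply.
inline float decode_e4m3_bithack(uint8_t bits) {
    // Sign-extend to 16 bits, then shift in unsigned arithmetic.
    uint16_t extended = uint16_t(int16_t(int8_t(bits)));
    // 0xBFFF is -16385 as a 16-bit pattern: it clears the duplicated sign in bit 14.
    uint16_t half_bits = uint16_t(uint16_t(extended << 7) & 0xBFFFu);
    // The exponent bias differs by 15 - 7 = 8, hence the scale by 2^8.
    return decode_fp16_bits(half_bits) * 256.0f;
}
```

Looping over all 256 byte values and comparing the two functions is enough to cover the whole input space, denormals included.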

FSR 4 also relies on FP32 -> FP8 conversions to quantize accumulation results back to FP8 for storage between stages. This is … significantly more terrible to emulate. Doing accurate RTE with denorm handling in soft-float is GPU sadism. It explodes driver compile times and runtime performance is borderline unusable as a result.
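For testing the conversion in the other direction, a slow but easy-to-trust CPU reference helps. The sketch below is not the soft-float code path described above; it simply searches all finite E4M3 encodings for the nearest value and breaks ties towards the even encoding, which is what round-to-nearest-even means here. Saturating out-of-range inputs to ±448 and mapping NaN to 0x7f are assumptions of this sketch, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Magnitude of a non-negative E4M3 encoding (bias 7, 3 mantissa bits).
inline float e4m3_magnitude(uint8_t code) {
    int exponent = (code >> 3) & 0xf;
    int mantissa = code & 0x7;
    return exponent == 0
        ? std::ldexp(float(mantissa), -9)
        : std::ldexp(float(8 + mantissa), exponent - 10);
}

// Brute-force FP32 -> E4M3 with round-to-nearest-even.
// ASSUMPTIONS: overflow saturates to +/-448 (0x7e), NaN becomes 0x7f.
inline uint8_t float_to_e4m3_reference(float value) {
    if (std::isnan(value))
        return 0x7f;
    uint8_t sign = std::signbit(value) ? 0x80 : 0x00;
    float magnitude = std::fmin(std::fabs(value), 448.0f);

    uint8_t best = 0;
    float best_error = magnitude;  // distance to code 0, which decodes to 0.0
    for (int code = 1; code <= 0x7e; code++) {
        float error = std::fabs(e4m3_magnitude(uint8_t(code)) - magnitude);
        // Adjacent codes decode to adjacent values, so on an exact tie the
        // even code is the one with an even mantissa LSB.
        if (error < best_error || (error == best_error && (code & 1) == 0)) {
            best = uint8_t(code);
            best_error = error;
        }
    }
    return uint8_t(sign | best);
}
```

Doing the same thing per element in a shader without a lookup or native instruction is where the pain comes from: rounding, denormals and saturation all need their own branches or bit tricks.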

8-bit Accumulation matrix conversions

In many places, we need to handle loading 8-bit matrices, which are converted to FP32 accumulators and back. Vulkan can support this, but it relies on drivers exposing it via the physical device query. No driver I know of exposes 8-bit accumulator support in any operations, which means we’re forced to go out of spec. With some light tweaks to RADV, it works as expected however. A driver should be able to expose 8-bit accumulator types and do the trunc/extend as needed. It’s somewhat awkward that there is no way to expose format support separately, but it is what it is.

Converting types with Use conversion

In several places, the shaders need to convert e.g. Accumulation matrices to B matrices. This is a use case not covered by KHR_cooperative_matrix. The universal workaround is to roundtrip store/load via LDS, but that’s kind of horrible for perf. I ended up with some hacky code paths for now:

  • On RDNA4, Accum layout is basically same as B layout, so I can abuse implementation specific behavior by copying over elements one by one into a new coopmat with different type.
  • On RDNA3, this doesn’t work since len(B) != len(Accum).
  • NV_cooperative_matrix2 actually has a feature bit to support this exact use case without hackery, so I can take advantage of that when RADV implements support for it.

The debugging process

After implementing all the opcodes and making all my tests pass, it was time to throw this at the wall. (Screenshots from TkG, pardon the French)

… fun. First process was to figure out if there was something I may have missed about opcode behavior. For SSBOs and LDS, I simply assumed that 4 byte alignment would be fine, then I found:

coopMatLoad(_86, _39._m0, ((30976u * _76) + _79) + gl_WorkGroupID.z, 16u, gl_CooperativeMatrixLayoutColumnMajor);

1 byte aligned coopmat (8-bit matrix), yikes … Technically out of spec for cooperative_matrix, but nothing we can do about that. AMD deals with this just fine. In the original code, I had used indexing into a u32 SSBO and just divided the byte offset by 4, but obviously that breaks. I added support for 8-bit SSBO aliases in dxil-spirv, updated test suite and we get:

Still not right. Eventually I found some questionable looking code with LDS load-store. The strides couldn’t possibly be right. It turns out that offset/stride for LDS is in terms of u32 words, not bytes (?!). This detail wasn’t caught in the test suite because the bug cancels itself out across a store followed by a load. Fortunately, this quirk made my life easier in dxil-spirv, since there’s no easy way of emitting 8-bit LDS aliases.

Things were looking good, but it was far from done. There was still intense shimmering and ghosting, which couldn’t be observed from screenshots and I couldn’t figure it out by simply staring at code. It was time to bring out the big guns.

Build a tiny FSR 4 test bench from scratch

I needed a way to directly compare outputs from native and vkd3d-proton, to be able to isolate exactly where implementations diverged. Fortunately, since we’re not getting blocked by the driver when trying to capture anymore, I captured a RenderDoc frame from the 3.1 demo.

Fortunately the inputs and outputs are very simple.

  • A pre-pass that reads textures, and does very light WMMA work at the end
  • A bunch of raw passes that spam WMMA like no tomorrow, likely to implement the ML network
  • A final post-pass that synthesizes the final image with more WMMA work of course

The middle passes were very simple from a resource binding standpoint:

  • One big weight buffer
  • One big scratch buffer

In RenderDoc, I dumped the buffer contents to disk, and built a small D3D12 test app that invokes the shaders with the buffers in question. After every dispatch, the app dumps the scratch buffer’s contents to disk. Running it against both the native D3D12 driver and vkd3d-proton then shows where things diverge.
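The comparison step itself can be tiny. A hedged sketch of what it might look like, assuming each dispatch’s scratch buffer has already been read back into a byte vector on both sides (the names are hypothetical, not taken from the actual test app):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Sentinel returned when the two dumps are byte-for-byte identical.
constexpr size_t kNoDivergence = std::numeric_limits<size_t>::max();

// Returns the byte offset of the first difference between two dumps.
// A length mismatch counts as a difference at the end of the shorter dump.
inline size_t first_divergence(const std::vector<uint8_t> &native_dump,
                               const std::vector<uint8_t> &translated_dump) {
    size_t common = native_dump.size() < translated_dump.size()
        ? native_dump.size() : translated_dump.size();
    for (size_t i = 0; i < common; i++)
        if (native_dump[i] != translated_dump[i])
            return i;
    return native_dump.size() == translated_dump.size() ? kNoDivergence : common;
}
```

The first dispatch that reports an offset is the shader to stare at, and the offset hints at which part of the scratch buffer went wrong.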

Surely the FSR 4 shaders cannot be bugged, right?

Turns out, yes, they can be 🙂 Many of the shaders only allocate 256 bytes of LDS, yet they actually need 512. Classic undefined behavior. The reason this “happens” to work on native is that AMD allocates LDS space with 512 byte granularity. However, dxil-spirv also emits some LDS to deal with matrix transpositions and it ended up clobbering the AGS shader’s LDS space …

One disgusting workaround later …

// Workaround for bugged WMMA shaders.
// The shaders rely on AMD aligning LDS size to 512 bytes.
// This avoids overflow spilling into LDSTranspose area by mistake, which breaks some shaders.

if (
  address_space == DXIL::AddressSpace::GroupShared &&
  shader_analysis.require_wmma)
{
   // ... Pad groupshared array to 512 byte.
}

and games were rendering correctly. FP16 path on RDNA4 conquered.
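The padding itself is a plain round-up. As a sketch, with the 512 byte granularity taken from the comment above and a hypothetical helper name:

```cpp
#include <cstdint>

// Granularity the bugged shaders silently rely on, per the workaround comment.
constexpr uint32_t kAssumedLdsGranularity = 512;

// Rounds a declared groupshared size up to the next multiple of the granularity.
// Only valid for power-of-two granularities.
constexpr uint32_t pad_lds_size(uint32_t declared_bytes) {
    return (declared_bytes + kAssumedLdsGranularity - 1u) &
           ~(kAssumedLdsGranularity - 1u);
}
```

With that, a shader declaring 256 bytes ends up owning 512, so anything dxil-spirv allocates afterwards starts past the region the shader overflows into.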

Performance?

Absolute garbage, as expected. At 1440p on a 9070 XT, native takes about 0.85 ms and my implementation took about 3 ms.

Going beyond

Can we implement FP8?

RADV obviously cannot ship FP8 before there is a Vulkan / SPIR-V spec, but with the power of open source and courage, why not experiment. Nothing is stopping us from just emitting:

OpTypeFloat 8

and see what happens. Georg Lehmann brought up FP8 support in NIR and ACO enough to support FP8 WMMA. Hacking FP8 support into dxil-spirv was quite straight forward and done in an hour. Getting the test suite to pass was smooth and easy, but … the real battle was yet to come.

vkd3d-proton bug or RADV?

Games were completely broken in the FP8 path. Fortunately, this difference reproduced in my test bench. The real issue now was bisecting the shader itself to figure out where the shaders diverge. These shaders are not small. They’re full of incomprehensible ML gibberish, so the only solution I could come up with was capturing both FP16 and FP8 paths and debug printing side-by-side. Fortunately, RenderDoc makes this super easy.

First, I had to hack together FP8 support in SPIRV-Cross and glslang so that roundtripping could work.

Eventually, I found the divergent spot, and Georg narrowed it down to broken FP8 vectorization in ACO. Once this was fixed, FP8 was up and running. Runtime was now down to 1.3 ms.

Get rid of silly LDS roundtrips

FSR 4 really likes to convert Accumulator to B matrices, but on RDNA4, the layouts match (at least for 8-bit), so until we have an NV_cooperative_matrix2 implementation, I pretended it worked by copying elements instead, and runtime went down to about 1 ms. RADV codegen for coopmat is currently very naive, and the buffer loading code in particular is extremely inefficient, but despite that, we’re pretty close to native here. Now that there is a good use case for cooperative matrix, I’m sure it will get optimized eventually.

At this point, FP8 path is fully functional and performant enough, but of course it needs building random Mesa branches and enabling hacked up code paths in vkd3d-proton.

RDNA3?

RDNA3 is not officially supported at the moment, but given I already went through the pain of emulating FP8, there’s no reason it cannot work on RDNA3. Given the terrible performance I got in FP16 emulation, I can understand why RDNA3 is not supported though … FSR 4 requires a lot of WMMA brute force to work, and RDNA3’s lesser WMMA grunt is simply not strong enough. Maybe it would work better if a dedicated FP16 model is designed, but that’s not on me to figure out.

It took a good while to debug this, since once again the test suite was running fine …

LDS roundtrip required

Unfortunately, RDNA3 is quite strange when it comes to WMMA layouts. Accumulator has 8 elements, but A and B matrices have 16 for some reason. NV_cooperative_matrix2 will help here for sure.

Final shader bug

After fixing the LDS roundtrip, the test bench passed, but games still looked completely broken on RDNA3. This narrowed down the problem to either the pre-pass or post-pass. With a dual-GPU setup and captures of the FSR 3.1 demo open side by side on RDNA3 and RDNA4, I finally narrowed it down to questionable shader code that unnecessarily relies on implementation defined behavior.

// Rewritten for clarity

float tmp[8]; 
tmp[0u] = uintBitsToFloat(_30._m0[384u].x); 
tmp[1u] = uintBitsToFloat(_30._m0[384u].y); 
tmp[2u] = uintBitsToFloat(_30._m0[384u].z); 
tmp[3u] = uintBitsToFloat(_30._m0[384u].w); 
tmp[4u] = uintBitsToFloat(_30._m0[385u].x); 
tmp[5u] = uintBitsToFloat(_30._m0[385u].y); 
tmp[6u] = uintBitsToFloat(_30._m0[385u].z); 
tmp[7u] = uintBitsToFloat(_30._m0[385u].w);

for (int i = 0; i < mat.length(); i++)
  mat[i] += tmp[i];

storeToLDS(mat);

// Read LDS and write to image.
LDS[8 * SubgroupInvocationID + {4, 5, 6, 7}]

This is not well behaved cooperative matrix code since it relies on the register layout. RDNA3 and 4 actually differ. The columns are interleaved in very different ways.

I found a workaround which can be applied to RDNA3 wave32.

for (int i = 0; i < mat.length(); i++)
  mat[i] += tmp[((i << 1u) & 7u) + (gl_SubgroupInvocationID >> 4u)];
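One way to read the remap is that tmp[] is really indexed by matrix row modulo 8. The C++ sketch below checks that reading under two layout assumptions that are mine and not stated in the shader: that on RDNA3 wave32, accumulator element i of a lane holds row 2 * i + (lane >> 4), and that on RDNA4 it holds row i + 8 * (lane >> 4). All function names are hypothetical.

```cpp
#include <cstdint>

// Index computed by the workaround above.
constexpr uint32_t remapped_index(uint32_t element, uint32_t lane) {
    return ((element << 1u) & 7u) + (lane >> 4u);
}

// ASSUMED RDNA3 wave32 accumulator layout: rows interleaved across lane halves.
constexpr uint32_t assumed_rdna3_row(uint32_t element, uint32_t lane) {
    return 2u * element + (lane >> 4u);
}

// ASSUMED RDNA4 wave32 accumulator layout: each lane half owns 8 contiguous rows.
constexpr uint32_t assumed_rdna4_row(uint32_t element, uint32_t lane) {
    return element + 8u * (lane >> 4u);
}

// True if "tmp[row & 7]" explains both the original tmp[i] on RDNA4
// and the remapped index on RDNA3, for every element and lane.
inline bool remap_matches_row_indexing() {
    for (uint32_t lane = 0; lane < 32; lane++) {
        for (uint32_t element = 0; element < 8; element++) {
            if ((assumed_rdna4_row(element, lane) & 7u) != element)
                return false;
            if ((assumed_rdna3_row(element, lane) & 7u) != remapped_index(element, lane))
                return false;
        }
    }
    return true;
}
```

If those layout assumptions hold, the original shader only works where element i happens to line up with row i modulo 8, which is exactly the kind of register layout dependence cooperative matrix code is not supposed to have.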

That was the best I could do without resorting to full shader replacement. The actual fix would be for the shader to just perform this addition after loading from LDS, like:

storeToLDS(mat);

float tmp[8]; 
tmp[0u] = uintBitsToFloat(_30._m0[384u].x); 
tmp[1u] = uintBitsToFloat(_30._m0[384u].y); 
tmp[2u] = uintBitsToFloat(_30._m0[384u].z); 
tmp[3u] = uintBitsToFloat(_30._m0[384u].w); 
tmp[4u] = uintBitsToFloat(_30._m0[385u].x); 
tmp[5u] = uintBitsToFloat(_30._m0[385u].y); 
tmp[6u] = uintBitsToFloat(_30._m0[385u].z); 
tmp[7u] = uintBitsToFloat(_30._m0[385u].w);

// Read LDS and write to image.
LDS[8 * SubgroupInvocationID + {4, 5, 6, 7}] + tmp[{4, 5, 6, 7}]

This would at least be portable.

With this, RDNA 3 can do FSR 4 on vkd3d-proton if you’re willing to take a massive dump on performance.

Conclusion

I was fearing getting FSR 4 up and running would take a year, but here we are. Lots of different people in the community ended up contributing to this in smaller ways to unblock the debugging process.

There probably won’t be a straight forward way to make use of this work until FSR 4 is released in an official SDK, FP8 actually lands in Vulkan, etc, etc, so I’ll leave the end-user side of things out of this blog.

Graphics programming like it’s 2000 – An esoteric introduction to PlayStation 2 graphics – Part 1

Graphics programming in 2025 can be confusing and exhausting. Let’s travel back to a simpler time. Imagine it’s 2000 again and we’re anticipating what will turn out to be the most successful game console of all time. In our reverie, we have acquired a virtual development kit from the future to get ahead of the curve.

Like many others, we must do our taxes and start with Hello Triangle. However, this Hello Triangle will likely be the strangest Hello Triangle yet.

Like any graphics chip – even to this day – the GS chews through a sequence of commands and spits out pixels. The chip itself is based around the idea of writing to various hardware registers to trigger work. Everything from drawing triangles to copying image data around is done by poking the right registers in the right order. To automate this process of peeking and poking hardware registers, the front-end is responsible for reading a command stream and tickling the registers.

To get graphics on the screen, our goal will be to prepare a packet of data that a hypothetical GS can process.

FILE *file = fopen("dump.gs", "wb");
if (!file)
    return 1;

Where we’re going we need no pesky API.

First, we need to program some HW registers:

struct {
    GIFTagBits tag;
    PackedADBits prmode;
    PackedADBits frame;
    PackedADBits scissor;
};

The GIFTag tells the hardware how to interpret the packet, which is followed by 3 Address + Data packets that tickle the hardware register of our choosing.

.tag = {
  // Loops once to program 3 registers
  .NLOOP = 1,
  // End of packet
  .EOP = 1,
  // 128-bit form
  .FLG = GIFTagBits::PACKED,
  // Three registers per loop
  .NREG = 3,
  // Up to 16 x 4 bits to program 16
  // different registers in one go.
  // A_D is a general "poke" interface
  // that can access any HW register.
  // 0x111 splats the bits.
  .REGS = int(GIFAddr::A_D) * 0x111,
},
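The REGS field is just a list of 4-bit register descriptors packed from the lowest nibble up, which is why multiplying by 0x111 repeats one descriptor three times. A small C++ sketch of that packing follows. The numeric descriptor values are the ones commonly documented for the GIF and are not listed in this post, so treat them as assumptions; `GifAddrSketch` and `pack_regs` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// ASSUMED descriptor values; only the three used in this post are listed.
enum class GifAddrSketch : uint64_t { RGBAQ = 0x1, XYZ2 = 0x5, A_D = 0xe };

// Packs up to 16 descriptors, 4 bits each, first register in the lowest nibble.
inline uint64_t pack_regs(std::initializer_list<GifAddrSketch> regs) {
    assert(regs.size() <= 16);
    uint64_t packed = 0;
    unsigned shift = 0;
    for (GifAddrSketch reg : regs) {
        packed |= uint64_t(reg) << shift;
        shift += 4;
    }
    return packed;
}
```

With this, `pack_regs({A_D, A_D, A_D})` gives the same bits as `int(A_D) * 0x111`, and a list of one RGBAQ followed by two XYZ2 entries matches the `RGBAQ | (XYZ2 * 0x110)` form used for sprites.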

PRMODE: Programs global settings like texture on/off, fogging on/off, blending on/off, etc. We just need to turn Gouraud shading on, i.e., color is interpolated across the triangle.

.prmode = {
  // Gouraud shading
  .data = Reg64<PRIMBits>({ .IIP = 1 }).bits,
  .ADDR = uint8_t(RegisterAddr::PRMODE),
},

FRAME: Program where the frame buffer is in VRAM. There is no height. That’s what scissor is for.

.frame = {
  // Programs the frame buffer
  // with 32-bit color.
  .data = Reg64<FRAMEBits>({
    .FBP = fb_address / 8192,
    .FBW = fb_width / 64,
    .PSM = PSMCT32 }).bits,
  .ADDR = uint8_t(RegisterAddr::FRAME_1),
},

SCISSOR: Set the scissor rect.

.scissor = {
  .data = Reg64<SCISSORBits>({
    .SCAX0 = 0,
    .SCAX1 = fb_width - 1,
    .SCAY0 = 0,
    .SCAY1 = fb_height - 1 }).bits,
  .ADDR = uint8_t(RegisterAddr::SCISSOR_1),
},

This now forms a packet and we can write that out to file.

Time for a new packet. We need to clear the frame buffer to some aesthetically pleasing color. SPRITE primitive to the rescue.

struct {
    GIFTagBits tag;
    PackedRGBAQBits rgba;
    PackedXYZBits xyz0;
    PackedXYZBits xyz1;
};

Unlike those silly modern GPUs we have a straight forward quad primitive here. It takes two points – meaning we cannot freely rotate sprites 90 or 270 degrees this way – but we have triangles for those edge cases.

The GIFTag programs primitive list of SPRITE and sets it up so that we interpret 3 registers as RGBA color followed by XYZ. Writing to XYZ “kicks” the vertex. Sounds familiar? glVertex3f in hardware? Yup, yup!

.tag = {
    // One primitive.
    .NLOOP = 1,
    .EOP = 1,

    // Begin a new primitive sequence.
    .PRE = 1,
    .PRIM = Reg64<PRIMBits>({
       .PRIM = int(PRIMType::Sprite) }).words[0],

    .FLG = GIFTagBits::PACKED,
    .NREG = 3,
    .REGS =
       int(GIFAddr::RGBAQ) |
       (int(GIFAddr::XYZ2) * 0x110),
},
.rgba = {
    .R = 0x20,
    .G = 0x30,
    .B = 0x40,
    .A = 0xff,
},
.xyz0 = {
    // Top-left coordinate in 12.4 fixed point.
    .X = 0 << 4,
    .Y = 0 << 4,
    .Z = 0,
},
.xyz1 = {
    // Bottom-right coordinate in 12.4 fixed point.
    .X = fb_width << 4,
    .Y = fb_height << 4,
    .Z = 0,
},
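The `<< 4` in these coordinates is the conversion to 12.4 fixed point: 12 integer bits and 4 fractional bits, so one unit is 1/16 of a pixel. A minimal C++ sketch for fractional positions, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Converts a pixel coordinate to unsigned 12.4 fixed point, rounding to the
// nearest 1/16 pixel. The input must fit in 16 bits after scaling.
inline uint16_t to_fixed_12_4(double pixels) {
    assert(pixels >= 0.0 && pixels <= 4095.9375);
    return uint16_t(std::lround(pixels * 16.0));
}

// Converts a 12.4 fixed point value back to pixels.
inline double from_fixed_12_4(uint16_t fixed) {
    return fixed / 16.0;
}
```

For whole pixels this agrees with the shift, so `to_fixed_12_4(300.0)` equals `300 << 4`.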

Then the final packet for our triangle:

struct PackedVertex {
    PackedRGBAQBits rgba;
    PackedXYZBits xyz;
};

const struct {
    GIFTagBits tag;
    PackedVertex verts[3];
};

Now we just need to program the hardware to read RGBA + XYZ in a loop 3 times and we can draw a triangle:

.tag = {
    // Three vertices
    .NLOOP = 3,
    .EOP = 1,

    // Begin a new primitive sequence.
    .PRE = 1,
    .PRIM = Reg64<PRIMBits>({
       .PRIM = int(PRIMType::TriangleList) }).words[0],

    .FLG = GIFTagBits::PACKED,

    // Every loop writes RGBA, then kicks vertex
    .NREG = 2,
    .REGS =
       int(GIFAddr::RGBAQ) |
       (int(GIFAddr::XYZ2) * 0x10),
},
.verts = {
    {
       .rgba = { .R = 0xff },
       .xyz = { .X = 300 << 4, .Y = 100 << 4 },
    },
    {
       .rgba = { .G = 0xff },
       .xyz = { .X = 100 << 4, .Y = 400 << 4 },
    },
    {
       .rgba = { .B = 0xff },
       .xyz = { .X = 500 << 4, .Y = 400 << 4 },
    },
},

Now the triangle is in memory and we need to display its lovely pixels on screen. To do this we must program the CRTC. This is mostly boilerplate.

// Program special registers which control the CRTC,
// aka display controller.
PrivRegisterState priv = {};

// Only enable display circuit 1.
priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
// Just has to be 1. *shrug*
priv.pmode.CRTMD = 1;
// Normal NTSC 480i.
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.smode2.INT = 1;
// Effectively disables alpha blending against BG color.
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_ALP;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.pmode.ALP = 0xff;

// Program the framebuffer pointer.
priv.dispfb1.FBP = fb_address / 8192;
priv.dispfb1.FBW = fb_width / 64;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;

// Center the display area so it covers the full screen.
priv.display1.DX = 640; // Overscan centering.
priv.display1.DY = 50; // Overscan centering.
priv.display1.MAGH = 4 - 1; // 640 width framebuffer.
priv.display1.MAGV = 0; // No scaling vertically.
priv.display1.DW = (640 - 1) * SMODE1Bits::CLOCK_DIVIDER_COMPOSITE;
priv.display1.DH = 448 - 1;

Flush out all this to disk, load it up in parallel-gs-stream and presto:

Compile-able source code can be found here for reference:

https://gist.github.com/HansKristian-Work/b88066eb8f14be21277c550a6f775956

Bonus hackery

parallel-gs-stream can read from a mkfifo file, so you could technically open a file as a FIFO and animate a triangle by writing the SPRITE + TRIANGLE packets followed by a vsync packet in a loop. No need to complicate things.

Future entries

Stay tuned for simple texture mapping with perspective correction.

PlayStation 2 GS emulation – the final frontier of Vulkan compute emulation

As you may, or may not know, I wrote paraLLEl-RDP back in 2020. It aimed at implementing the N64 RDP in Vulkan compute. Lightning fast, and extremely accurate, plus the added support of up-scaling on top. I’m quite happy how it turned out. Of course, the extreme accuracy was due to Angrylion being used as reference and I could aim for bit-exactness against that implementation.

Since then, there’s been the lingering idea of doing the same thing, but for PlayStation 2. Until now, there’s really only been one implementation in town, GSdx, which has remained the state-of-the-art for 20 years.

paraLLEl-GS is actually not the first compute implementation of the PS2 GS. An attempt was made back in 2014 for OpenCL as far as I recall, but it was never completed. At the very least, I cannot find it in the current upstream repo anymore.

The argument for doing compute shader raster on PS2 is certainly weaker than on N64. Angrylion was – and is – extremely slow, and N64 is extremely sensitive to accuracy where hardware acceleration with graphics APIs is impossible without serious compromises. PCSX2 on the other hand has a well-optimized software renderer, and a pretty solid graphics-based renderer, but that doesn’t mean there aren’t issues. The software renderer does not support up-scaling for example, and there are a myriad bugs and glitches with the graphics-based renderer, especially with up-scaling. As we’ll see, the PS2 GS is quite the nightmare to emulate in its own way.

My main motivation here is basically “because I can”. I already had a project lying around that did “generic” compute shader rasterization. I figured that maybe we could retro-fit this to support PS2 rendering.

I didn’t work on this project alone. My colleague, Runar Heyer, helped out a great deal in the beginning to get this started, doing all the leg-work to study the PS2 from various resources, doing the initial prototype implementation and fleshing out the Vulkan GLSL to emulate PS2 shading. Eventually, we hit some serious roadblocks in debugging various games, and the project was put on ice for a while since I was too drained dealing with horrible D3D12 game debugging day in and day out. The last months haven’t been a constant fire fight, so I’ve finally had the mental energy to finish it.

My understanding of the GS is mostly based on what Runar figured out, and what I’ve seen by debugging games. The GSdx software renderer does not seem like it’s hardware bit-accurate, so we were constantly second-guessing things when trying to compare output. This caused a major problem when we had the idea of writing detailed tests that directly compared against GSdx software renderer, and the test-driven approach fell flat very quickly. As a result, paraLLEl-GS isn’t really aiming for bit-accuracy against hardware, but it tries hard to avoid obvious accuracy issues at the very least.

Basic GS overview

Again, this is based on my understanding, and it might not be correct. 😀

GS is a pixel-pushing monster

The GS is infamous for its insane fill-rate and bandwidth. It could push over a billion pixels per second (in theory at least) back in 2000 which was nuts. While the VRAM is quite small (4 MiB), it was designed to be continuously streamed into using the various DMA engines.

Given the extreme fill-rate requirements, we have to design our renderer accordingly.

GS pixel pipeline is very basic, but quirky

In many ways, the GS is actually simpler than N64 RDP. Single texture, and a single cycle combiner, where N64 had a two stage combiner + two stage blender. Whatever AA support is there is extremely basic as well, where N64 is delightfully esoteric. The parts of the pixel pipeline that are painful to implement with traditional graphics APIs are:

Blending goes beyond 1.0

Inherited from PS1, 0x80 is treated as 1.0, and it can go all the way up to 0xff (almost 2). Shifting by 7 is easier than dividing by 255 I suppose. I’ve seen some extremely ugly workarounds in PCSX2 before to try working around this since UNORM formats cannot support this as is. Textures are similar, where alpha > 1.0 is representable.

There is also wrapping logic that can be used for when colors or alpha goes above 0xFF.
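To make the 0x80 = 1.0 convention concrete, here is a C++ sketch of one blended channel. The formula ((a - b) * alpha >> 7) + d is the commonly documented shape of the GS blend equation and is an assumption here, as are the function names; the post only states the 0x80 convention, the shift by 7 and the existence of wrapping.

```cpp
#include <algorithm>
#include <cstdint>

// One channel of a fixed-point blend where alpha 0x80 means 1.0.
// ASSUMPTION: the result is ((a - b) * alpha >> 7) + d. The shift is
// arithmetic on negative values (guaranteed since C++20, universal in practice).
inline int blend_channel_raw(int a, int b, int alpha, int d) {
    return (((a - b) * alpha) >> 7) + d;
}

// Saturating variant: results outside 0..255 are clamped.
inline uint8_t blend_channel_clamped(int a, int b, int alpha, int d) {
    return uint8_t(std::min(std::max(blend_channel_raw(a, b, alpha, d), 0), 255));
}

// Wrapping variant: only the low 8 bits survive.
inline uint8_t blend_channel_wrapped(int a, int b, int alpha, int d) {
    return uint8_t(blend_channel_raw(a, b, alpha, d) & 0xff);
}
```

With alpha at 0x80 the source passes through unchanged, while alpha at 0xff nearly doubles it. That overshoot is what a UNORM render target cannot express.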

Destination alpha testing

The destination alpha can be used as a pseudo-stencil of sorts, and this is extremely painful without programmable blending. I suspect this was added as PS1 compatibility, since PS1 also had this strange feature.

Conditional blending

Based on the alpha, it’s possible to conditionally disable blending. Quite awkward without programmable blending … This is another PS1 compat feature. With PS1, it can be emulated by rendering every primitive twice with state changes in-between, but this quickly gets impractical with PS2.

Alpha correction

Before alpha is written out, it’s possible to OR in the MSB. Essentially forcing alpha to 1. It is not equivalent to alphaToOne however, since it’s a bit-wise OR of the MSB.
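The difference from alphaToOne shows up as soon as alpha is already above 1.0. A tiny sketch, assuming 8-bit alpha with the MSB at 0x80 (the function names are hypothetical):

```cpp
#include <cstdint>

// Alpha correction as described above: OR in the most significant bit.
constexpr uint8_t alpha_correct(uint8_t alpha) {
    return uint8_t(alpha | 0x80);
}

// What an alphaToOne-style replacement would do in the 0x80 = 1.0 convention.
constexpr uint8_t alpha_to_one(uint8_t) {
    return 0x80;
}
```

An input of 0x10 becomes 0x90 with the OR, not 0x80, and 0xff stays 0xff, so the low bits always survive.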

Alpha test can partially discard

A fun thing alpha tests can do is to partially discard. E.g. you can discard just color, but keep the depth write. Quite nutty.

AA1 – coverage-to-alpha – can control depth write per pixel

This is also kinda awkward. The only anti-alias PS2 has is AA1 which is a coverage-to-alpha feature. Supposedly, less than 100% coverage should disable depth writes (and blending is enabled), but the GSdx software renderer behavior here is extremely puzzling. I don’t really understand it yet.

32-bit fixed-point Z

I have yet to see any games actually using this, but technically, it has D32_UINT support. Fun! From what I could grasp, GSdx software renderer implements this with FP64 (one of the many reasons I refuse to believe GSdx is bit-accurate), but FP64 is completely impractical on GPUs. When I have to, I’ll implement this with fixed-point math. 24-bit Z and 16-bit should be fine with FP32 interpolation I think.

Pray you have programmable blending

If you’re on a pure TBDR GPU most of this is quite doable, but immediate mode desktop GPUs quickly degenerate into ROV or per-pixel barriers after every primitive to emulate programmable blending, both of which are horrifying for performance. Of course, with compute we can make our own TBDR to bypass all this. 🙂

D3D9-style raster rules

Primitives are fortunately provided in a plain form in clip-space. No awkward N64 edge equations here. The VU1 unit is supposed to do transforms and clipping, and emit various per-vertex attributes:

X/Y: 12.4 unsigned fixed-point
Z: 24-bit or 32-bit uint
FOG: 8-bit uint
RGBA: 8-bit, for per-vertex lighting
STQ: For perspective correct texturing with normalized coordinates. Q = 1 / w, S = s * Q, T = t * Q. Apparently the lower 8-bits of the mantissa are clipped away, so bfloat24? Q can be negative, which is always fun. No idea how this interacts with Inf and NaN …
UV: For non-perspective correct texturing. 12.4 fixed-point un-normalized.

  • Triangles are top-left raster, just like modern GPUs.
  • Pixel center is on integer coordinate, just like D3D9. (This is a common design mistake that D3D10+ and GL/Vulkan avoid).
  • Lines use Bresenham’s algorithm, which is not really feasible to upscale, so we have to fudge it with rect or parallelogram.
  • Points snap to nearest pixel. Unsure which rounding is used though … There is no interpolation ala gl_PointCoord.
  • Sprites are simple quads with two coordinates. STQ or UV can be interpolated and it seems to assume non-rotated coordinates. To support rotation, you’d need 3 coordinates to disambiguate.

All of this can be implemented fairly easily in normal graphics APIs, as long as we don’t consider upscaling. We have to rely on implementation details in GL and Vulkan, since these APIs don’t technically guarantee top-left raster rules.

Since X/Y is unsigned, there is an XY offset that can be applied to center the viewport where you want. This means the effective range of X/Y is +/- 4k pixels, a healthy guard band for 640×448 resolutions.
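As a rough sketch of how a signed pixel position ends up in an unsigned 12.4 register, assuming the offset is simply added before conversion. `encode_xy`, `decode_xy` and the 2048 pixel offset used below are illustrative choices, not values taken from the hardware documentation.

```cpp
#include <cassert>
#include <cstdint>

constexpr int SUBPIXEL_BITS = 4; // 12.4 fixed-point.

// Hypothetical helper: biases a signed pixel coordinate by offset_px
// and converts it to unsigned 12.4 fixed-point.
inline uint16_t encode_xy(float pixel, int offset_px)
{
    float fixed = (pixel + float(offset_px)) * float(1 << SUBPIXEL_BITS);
    assert(fixed >= 0.0f && fixed < 65536.0f);
    return uint16_t(int(fixed));
}

// Inverse of encode_xy, back to a signed pixel coordinate.
inline float decode_xy(uint16_t reg, int offset_px)
{
    return float(reg) / float(1 << SUBPIXEL_BITS) - float(offset_px);
}
```

With an offset of 2048, pixel 0 lands at 0x8000 and pixel 1.5 lands at 0x8018.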

Vertex queue

The GS feels very much like old school OpenGL 1.0 with glVertex3f and friends. It even supports TRIANGLE_FAN! Amazing … RGBA, STQ and various registers are set, and every XYZ register write forms a vertex “kick” which latches vertex state and advances the queue. An XYZ register write may also be a drawing kick, which draws a primitive if the vertex queue is sufficiently filled. The vertex queue is managed differently depending on the topology. The semantics here seem to be pretty straightforward where strip primitives shift the queue by one, and list primitives clear the queue. Triangle fans keep the first element in the queue.
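The queue semantics can be modelled in a few lines. This is a simplified sketch that only handles triangles and treats every kick as a drawing kick. `VertexQueue`, `Topology` and `assemble` are hypothetical names, and strip winding is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Topology { TriangleList, TriangleStrip, TriangleFan };
struct Triangle { int v0, v1, v2; };

// Hypothetical model of the vertex queue, tracking vertex ids only.
class VertexQueue
{
public:
    explicit VertexQueue(Topology topo_) : topo(topo_) {}

    void kick(int vertex, std::vector<Triangle> &out)
    {
        assert(count < queue.size());
        queue[count++] = vertex;
        if (count < 3)
            return; // Queue not sufficiently filled yet.

        out.push_back({ queue[0], queue[1], queue[2] });
        switch (topo)
        {
        case Topology::TriangleList: // Lists clear the queue.
            count = 0;
            break;
        case Topology::TriangleStrip: // Strips shift by one.
            queue[0] = queue[1];
            queue[1] = queue[2];
            count = 2;
            break;
        case Topology::TriangleFan: // Fans keep the first element.
            queue[1] = queue[2];
            count = 2;
            break;
        }
    }

private:
    Topology topo;
    std::array<int, 3> queue{};
    size_t count = 0;
};

inline std::vector<Triangle> assemble(Topology topo, int vertex_count)
{
    VertexQueue q(topo);
    std::vector<Triangle> tris;
    for (int v = 0; v < vertex_count; v++)
        q.kick(v, tris);
    return tris;
}
```

Feeding five vertices through a strip yields three triangles ending in (2, 3, 4), while a fan ends in (0, 3, 4).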

Fun swizzling formats

A clever idea is that while rendering to 24-bit color or 24-bit depth, there are 8 bits left unused in the MSBs. You can place textures there, because why not. 8H, 4HL, 4HH formats support 8-bit and 4-bit palettes nicely.

Pixel coordinates on PS2 are arranged into “pages”, which are 8 KiB, then subdivided into 32 blocks, and then, the smaller blocks are swizzled into a layout that fits well with a DDA-style renderer. E.g. for 32-bit RGBA, a page is 64×32 pixels, and 32 8×8 blocks are Z-order swizzled into that page.
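Here is a sketch of locating the page and block for a 32-bit RGBA pixel. The numbers (64×32 page, 8×8 blocks) come from the paragraph above, but the exact bit interleave in `morton_block_index` (X in the least significant bit) is an assumption on my part, and the swizzle of pixels inside a block is left out entirely. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t PAGE_WIDTH = 64;  // Pixels, for 32-bit RGBA.
constexpr uint32_t PAGE_HEIGHT = 32;
constexpr uint32_t BLOCK_DIM = 8;    // 8x8 pixel blocks.

// Z-order (Morton) index for an 8x4 grid of blocks.
// Assumed bit layout: bx0, by0, bx1, by1, bx2 from LSB to MSB.
inline uint32_t morton_block_index(uint32_t bx, uint32_t by)
{
    assert(bx < 8 && by < 4);
    return (bx & 1u) | ((by & 1u) << 1) | ((bx & 2u) << 1) |
           ((by & 2u) << 2) | ((bx & 4u) << 2);
}

struct BlockAddress
{
    uint32_t page;
    uint32_t block;
};

// width_in_pages is the buffer width in units of 64 pixels.
inline BlockAddress locate_block(uint32_t x, uint32_t y, uint32_t width_in_pages)
{
    uint32_t page = (y / PAGE_HEIGHT) * width_in_pages + (x / PAGE_WIDTH);
    uint32_t bx = (x % PAGE_WIDTH) / BLOCK_DIM;
    uint32_t by = (y % PAGE_HEIGHT) / BLOCK_DIM;
    return { page, morton_block_index(bx, by) };
}
```

For a 640 pixel wide buffer (10 pages per row), pixel (64, 32) is the first block of page 11.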

Framebuffer cache and texture cache

There is a dedicated cache for framebuffer rendering and textures, one page’s worth. Games often abuse this to perform feedback loops, where they render on top of the pixels being sampled from. This is the root cause of extreme pain. N64 avoided this problem by having explicit copies into TMEM (and not really having the bandwidth to do elaborate feedback effects), and other consoles rendered to embedded SRAM (ala a tiler GPU), so these feedbacks aren’t as painful, but the GS is complete YOLO. Dealing with this gracefully is probably the biggest challenge. Combined with the PS2 being a bandwidth monster, developers knew how to take advantage of copious blending and blurring passes …

Texturing

Texturing on the GS is both very familiar, and arcane.

On the plus side, the texel-center is at half-pixel, just like modern APIs. It seems like it has 4-bit sub-texel precision instead of 8 however. This is easily solved with some rounding. It also seems to have floor-rounding instead of nearest-rounding for bi-linear.

The bi-linear filter is a normal bi-linear. No weird 3-point N64 filter here.

On the weirder side, there are two special addressing modes.

REGION_CLAMP supports an arbitrary clamp inside a texture atlas (wouldn’t this be nice in normal graphics APIs? :D). It also works with REPEAT, so you can have REPEAT semantics on border, but then clamp slightly into the next “wrap”. This is trivial to emulate.

REGION_REPEAT is … worse. Here we can have custom bit-wise computation per coordinate. So something like u’ = (u & MASK) | FIX. This is done per-coordinate in bi-linear filtering, which is … painful, but solvable. This is another weird PS1 feature that was likely inherited for compatibility. At least on PS1, there was no bi-linear filtering to complicate things 🙂
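Both addressing modes boil down to very little code per coordinate. The sketch below uses integer texel coordinates, and the helper names are invented for illustration. `region_repeat_taps` shows why bi-linear is the awkward part: the two taps have to be wrapped independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// REGION_CLAMP: an arbitrary clamp window inside the texture.
inline int region_clamp(int u, int min_u, int max_u)
{
    assert(min_u <= max_u);
    return std::clamp(u, min_u, max_u);
}

// REGION_REPEAT: u' = (u & MASK) | FIX.
inline uint32_t region_repeat(uint32_t u, uint32_t mask, uint32_t fix)
{
    return (u & mask) | fix;
}

struct BilinearTaps
{
    uint32_t u0, u1;
};

// Each bi-linear tap gets its own bit-wise computation.
inline BilinearTaps region_repeat_taps(uint32_t u, uint32_t mask, uint32_t fix)
{
    return { region_repeat(u, mask, fix), region_repeat(u + 1u, mask, fix) };
}
```

With MASK = 0x0f and FIX = 0x40, the taps for u = 0x0f are 0x4f and 0x40, so the second tap wraps back to the start of the 16-texel window.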

Mip-mapping is also somewhat curious. Rather than relying on derivatives, the log2 of the interpolated Q factor, along with some scaling factors, is used to compute the LOD. This is quite clever, but I haven’t really seen any games use it. The down-side is that triangle-setup becomes rather complex if you want to account for correct tri-linear filtering, and it cannot support e.g. anisotropic filtering, but this is 2000, who cares! Not relying on derivatives is a huge boon for the compute implementation.
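One plausible reading of this, as a sketch only: take log2(1 / |Q|), scale it by a power of two and add a bias. `l_shift` and `k` are hypothetical stand-ins for the "scaling factors", and real hardware works in fixed point rather than floats.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical LOD computation from the interpolated Q factor.
// l_shift scales by a power of two, k is a constant bias.
inline float lod_from_q(float q, int l_shift, float k)
{
    assert(q != 0.0f);
    // Q = 1 / w, so log2(1 / |Q|) grows with distance.
    float lod = std::log2(1.0f / std::fabs(q));
    return std::ldexp(lod, l_shift) + k;
}
```

The point is that no neighboring pixels are needed, so every pixel can compute its LOD in isolation.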

Formats are always “normalized” to RGBA8_UNORM. 5551 format is expanded to 8888 without bit-replication. There is no RGBA4444 format.
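To spell out what "without bit-replication" means, here is a small sketch. The channel order (red in the low bits) and the way the 1-bit alpha selects between two 8-bit values are assumptions, hence the hypothetical `alpha_if_clear` / `alpha_if_set` parameters. `replicate_5_to_8` is only there for contrast.

```cpp
#include <cassert>
#include <cstdint>

struct RGBA8
{
    uint8_t r, g, b, a;
};

// Plain shift: the low 3 bits stay zero, so 0x1f becomes 0xf8.
inline RGBA8 expand_5551(uint16_t c, uint8_t alpha_if_clear, uint8_t alpha_if_set)
{
    RGBA8 out;
    out.r = uint8_t((c & 0x1f) << 3);
    out.g = uint8_t(((c >> 5) & 0x1f) << 3);
    out.b = uint8_t(((c >> 10) & 0x1f) << 3);
    out.a = (c & 0x8000) ? alpha_if_set : alpha_if_clear;
    return out;
}

// What bit-replication would have produced instead: 0x1f becomes 0xff.
inline uint8_t replicate_5_to_8(uint8_t v)
{
    assert(v < 32);
    return uint8_t((v << 3) | (v >> 2));
}
```

So pure white in 5551 never reaches 0xff after expansion.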

It’s quite feasible to implement the texturing with plain bindless.

CLUT

This is a 1 KiB cache that holds the current palette. There is an explicit copy step from VRAM into that CLUT cache before it can be used. Why hello there, N64 TMEM!

The CLUT is organized such that it can hold one full 256 color palette in 32-bit colors. On the other end, it can hold 32 palettes of 16 colors at 16 bpp.

TEXFLUSH

There is an explicit command that functions like a “sync and invalidate texture cache”. In the beginning I was hoping to rely on this to guide the hazard tracking, but oh how naive I was. In the end, I simply had to ignore TEXFLUSH. Basically, there are two styles of caching we could take with GS.

With “maximal” caching, we can assume that frame buffer caches and texture caches are infinitely large. The only way a hazard needs to be considered is after an explicit flush. This … breaks hard. Either games forget to use TEXFLUSH (because it happened to work on real hardware), or they TEXFLUSH way too much.

With “minimal” caching, we assume there is no caching and hazards are tracked directly. Some edge case handling is considered for feedback loops.

I went with “minimal”, and I believe GSdx did too.

Poking registers with style – GIF

The way to interact with the GS hardware is through the GIF, which is basically a unit that reads data and pokes the correct hardware registers. At the start of a GIF packet, there is a header which configures which registers should be written to, and how many “loops” there are. This maps very well to mesh rendering. We can consider something like one “loop” being:

  • Write RGBA vertex color
  • Write texture coordinate
  • Write position with draw kick

And if we have 300 vertices to render, we’d use 300 loops. State registers can be poked through the Address + Data pair, which just encodes target register + 64-bit payload. It’s possible to render this way too of course, but it’s just inefficient.
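Conceptually, a packet with N registers and M loops expands into N × M register writes. Here is a simplified model of that expansion, where each payload is already reduced to a single 64-bit value. `RegisterWrite` and `expand_loops` are hypothetical, and the real packed data layout is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RegisterWrite
{
    uint8_t reg;      // Target register id (illustrative values).
    uint64_t payload; // Data for that register.
};

// Expands "regs repeated num_loops times" into a flat list of writes.
inline std::vector<RegisterWrite> expand_loops(const std::vector<uint8_t> &regs,
                                               uint32_t num_loops,
                                               const std::vector<uint64_t> &data)
{
    assert(data.size() == regs.size() * num_loops);
    std::vector<RegisterWrite> writes;
    writes.reserve(data.size());
    for (uint32_t loop = 0; loop < num_loops; loop++)
        for (size_t r = 0; r < regs.size(); r++)
            writes.push_back({ regs[r], data[loop * regs.size() + r] });
    return writes;
}
```

Three loops over two registers produce six writes, alternating between the two targets.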

Textures are uploaded through the same mechanism. Various state registers are written to set up transfer destinations, formats, etc, and a special register is nudged to transfer 64-bit at a time to VRAM.

Hello Trongle – GS

If you missed the brain-dead simplicity of OpenGL 1.0, this is the API for you! 😀

For testing purposes, I added a tool to generate a .gs dump format that PCSX2 can consume. This is handy for comparing implementation behavior.

First, we program the frame buffer and scissor:

TESTBits test = {};
test.ZTE = TESTBits::ZTE_ENABLED;
test.ZTST = TESTBits::ZTST_GREATER; // Inverse Z, LESS is not supported.
iface.write_register(RegisterAddr::TEST_1, test);

FRAMEBits frame = {};
frame.FBP = 0x0 / PAGE_ALIGNMENT_BYTES;
frame.PSM = PSMCT32;
frame.FBW = 640 / BUFFER_WIDTH_SCALE;
iface.write_register(RegisterAddr::FRAME_1, frame);

ZBUFBits zbuf = {};
zbuf.ZMSK = 0; // Enable Z-write
zbuf.ZBP = 0x118000 / PAGE_ALIGNMENT_BYTES;
iface.write_register(RegisterAddr::ZBUF_1, zbuf);

SCISSORBits scissor = {};
scissor.SCAX0 = 0;
scissor.SCAY0 = 0;
scissor.SCAX1 = 640 - 1;
scissor.SCAY1 = 448 - 1;
iface.write_register(RegisterAddr::SCISSOR_1, scissor);

Then we nudge some registers to draw:

struct Vertex
{
    PackedRGBAQBits rgbaq;
    PackedXYZBits xyz;
} vertices[3] = {};

for (auto &vert : vertices)
{
   vert.rgbaq.A = 0x80;
   vert.xyz.Z = 1;
}

vertices[0].rgbaq.R = 0xff;
vertices[1].rgbaq.G = 0xff;
vertices[2].rgbaq.B = 0xff;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[2].xyz.X = p2.x << SUBPIXEL_BITS;
vertices[2].xyz.Y = p2.y << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 0; // Turn off texturing.
prim.IIP = 1; // Interpolate RGBA (Gouraud shading)
prim.PRIM = int(PRIMType::TriangleList);

static const GIFAddr addr[] = { GIFAddr::RGBAQ, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

This draws a triangle. We provide coordinates directly in screen-space.

And finally, we need to program the CRTC. Most of this is just copy-pasta from whatever games tend to do.

auto &priv = iface.get_priv_register_state();

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_ALP;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0x0;
priv.bgcolor.G = 0x0;
priv.bgcolor.B = 0x0;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.pmode.ALP = 0xff;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

dump.write_vsync(0, iface);
dump.write_vsync(1, iface);

When the GS is dumped, we can load it up in PCSX2 and voila:

And here’s the same .gs dump played through parallel-gs-replayer with RenderDoc. For debugging, I’ve spent a lot of time making it reasonably convenient. The images are debug storage images where I can store before and after color, depth, debug values for interpolants, depth testing state, etc, etc. It’s super handy to narrow down problem cases. The render pass can be split into 1 or more triangle chunks as needed.

To add some textures, and flex the capabilities of the CRTC a bit, we can try uploading a texture:

int w, h, chan;
auto *buf = stbi_load("/tmp/test.png", &w, &h, &chan, 4);
iface.write_image_upload(0x300000, PSMCT32, w, h, buf,
                         w * h * sizeof(uint32_t));
stbi_image_free(buf);

TEX0Bits tex0 = {};
tex0.PSM = PSMCT32;
tex0.TBP0 = 0x300000 / BLOCK_ALIGNMENT_BYTES;
tex0.TBW = (w + BUFFER_WIDTH_SCALE - 1) / BUFFER_WIDTH_SCALE;
tex0.TW = Util::floor_log2(w - 1) + 1;
tex0.TH = Util::floor_log2(h - 1) + 1;
tex0.TFX = COMBINER_DECAL;
tex0.TCC = 1; // Use texture alpha as blend alpha
iface.write_register(RegisterAddr::TEX0_1, tex0);

TEX1Bits tex1 = {};
tex1.MMIN = TEX1Bits::LINEAR;
tex1.MMAG = TEX1Bits::LINEAR;
iface.write_register(RegisterAddr::TEX1_1, tex1);

CLAMPBits clamp = {};
clamp.WMS = CLAMPBits::REGION_CLAMP;
clamp.WMT = CLAMPBits::REGION_CLAMP;
clamp.MINU = 0;
clamp.MAXU = w - 1;
clamp.MINV = 0;
clamp.MAXV = h - 1;
iface.write_register(RegisterAddr::CLAMP_1, clamp);

While PS2 requires POT sizes for textures, REGION_CLAMP is handy for NPOT. Super useful for texture atlases.
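The TW / TH expressions in the snippet above round the size up to the next power of two in log2 space. Below is a standalone sketch of that computation. `floor_log2` here is my own stand-in and not the `Util::floor_log2` used in the snippet, and the special case for size 1 is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a floor(log2(v)) helper. Undefined for v == 0.
inline uint32_t floor_log2(uint32_t v)
{
    assert(v != 0);
    uint32_t result = 0;
    while (v >>= 1)
        result++;
    return result;
}

// log2 of the smallest power of two >= size.
inline uint32_t pot_size_log2(uint32_t size)
{
    return size <= 1 ? 0 : floor_log2(size - 1) + 1;
}
```

A 256×179 image ends up with TW = 8 and TH = 8, i.e. a 256×256 texture where REGION_CLAMP hides the unused rows.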

struct Vertex
{
    PackedUVBits uv;
    PackedXYZBits xyz;
} vertices[2] = {};

for (auto &vert : vertices)
    vert.xyz.Z = 1;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[1].uv.U = w << SUBPIXEL_BITS;
vertices[1].uv.V = h << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 1; // Turn on texturing.
prim.IIP = 0;
prim.FST = 1; // Use unnormalized coordinates.
prim.PRIM = int(PRIMType::Sprite);

static const GIFAddr addr[] = { GIFAddr::UV, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

Here we render a sprite with un-normalized coordinates.

Finally, we use the CRTC to do blending against white background.

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_CIRCUIT1;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0xff;
priv.bgcolor.G = 0xff;
priv.bgcolor.B = 0xff;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

Glorious 256×179 logo 😀

Implementation details

The rendering pipeline

Before we get into the page tracker, it’s useful to define a rendering pipeline where synchronization is implied between each stage.

  • Synchronize CPU copy of VRAM to GPU. This is mostly unused, but happens for save state load, or similar
  • Upload data to VRAM (or perform local-to-local copy)
  • Update CLUT cache from VRAM
  • Unswizzle VRAM into VkImages that can be sampled directly, and handle palettes as needed, sampling from CLUT cache
  • Perform rendering
  • Synchronize GPU copy of VRAM back to CPU. This will be useful for readbacks. Then CPU should be able to unswizzle directly from a HOST_CACHED_BIT buffer as needed

This pipeline matches what we expect a game to do over and over:

  • Upload texture to VRAM
  • Upload palette to VRAM
  • Update CLUT cache
  • Draw with texture
    • Trigger unswizzle from VRAM into VkImage if needed
    • Begins building a “render pass”, a batch of primitives

When there are no backwards hazards here, we can happily keep batching and defer any synchronization. This is critical to get any performance out of this style of renderer.

Some common hazards here include:

Copy to VRAM which was already written by copy

This is often a false positive, but we cannot track per-byte. This becomes a simple copy barrier and we move on.

Copy to VRAM where a texture was sampled from, or CLUT cache read from

Since the GS has a tiny 4 MiB VRAM, it’s very common that textures are continuously streamed in, sampled from, and thrown away. When this is detected, we have to submit all VRAM copy work, all texture unswizzle work and then begin a new batch. Primitive batches are not disrupted.

This means we’ll often see:

  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Rendering

Sample texture that was rendered to

Similar, but here we need to flush out everything. This basically breaks the render pass and we start another one. Too many of these are problematic for performance obviously.

Copy to VRAM where rendering happened

Basically same as sampling textures, this is a full flush.

Other hazards are ignored, since they are implicitly handled by our pipeline.
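The four cases above can be summarized as a small decision table. This is just a restatement of the text in code form. The enum and struct names are hypothetical and do not come from the actual implementation.

```cpp
#include <cassert>

enum class Hazard
{
    CopyOverCopy,           // Copy to VRAM already written by copy.
    CopyOverSampledTexture, // Copy to VRAM a texture or CLUT read from.
    SampleRenderedTexture,  // Sample texture that was rendered to.
    CopyOverRendered        // Copy to VRAM where rendering happened.
};

struct FlushAction
{
    bool copy_barrier;         // Plain barrier between copies.
    bool submit_copy_and_unswizzle; // Submit copies + unswizzles, new batch.
    bool break_render_pass;    // Full flush, primitive batch is split.
};

inline FlushAction resolve_hazard(Hazard hazard)
{
    switch (hazard)
    {
    case Hazard::CopyOverCopy:
        return { true, false, false };
    case Hazard::CopyOverSampledTexture:
        return { false, true, false }; // Primitive batches are not disrupted.
    case Hazard::SampleRenderedTexture:
    case Hazard::CopyOverRendered:
        return { false, true, true };
    }
    assert(false && "unhandled hazard");
    return {};
}
```

Only the last two cases cost a render pass break, which is why they are the ones to watch for performance.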

Page tracker

Arguably, the hardest part of GS emulation is dealing with hazards. VRAM is read and written to with reckless abandon and any potential read-after-write or write-after-write hazard needs to be dealt with. We cannot rely on any game doing this for us, since PS2 GS just deals with sync in most cases, and TEXFLUSH is the only real command games will use (or forget to use).

Tracking per byte is ridiculous, so my solution is to first subdivide the 4 MiB VRAM into pages. A page is the unit for frame buffers and depth buffers, so it is the most meaningful place to start.

PageState

On page granularity, we track:

  • Pending frame buffer write?
  • Pending frame buffer read? (read-only depth)

Textures and VRAM copies have 256 byte alignment, and to avoid a ton of false positives, we need to track on a per-block basis. There are 32 blocks per page, so a u32 bit-mask is okay.

  • VRAM copy writes
  • VRAM copy reads
  • Pending read into CLUT cache or VkImage
  • Blocks which have been clobbered by any write, on next texture cache invalidate, throw away images that overlap

As mentioned earlier, there are also cases where you can render to 24-bit color, while sampling from the upper 8-bits without hazard. We need to optimize for that case too, so there is also:

  • A write mask for framebuffers
  • A read mask for textures

In the example above, FB write mask is 0xffffff and texture cache mask is 0xff000000. No overlap, no invalidate 😀
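A stripped-down sketch of what such per-page state and its checks might look like. The field and function names are invented for illustration, and the real tracker carries more state than this.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t PAGE_SIZE_BYTES = 8 * 1024;
constexpr uint32_t BLOCK_SIZE_BYTES = 256; // 32 blocks per page.

// Hypothetical per-page tracking state.
struct PageState
{
    uint32_t copy_write_blocks = 0;   // One bit per 256 byte block.
    uint32_t texture_read_blocks = 0; // Pending read into CLUT / VkImage.
    uint32_t fb_write_mask = 0;       // Bits of each pixel written by rendering.
    uint32_t texture_read_mask = 0;   // Bits of each pixel read by textures.
    bool fb_write_pending = false;
};

// Bit-mask of blocks touched by the byte range [begin, end) in one page.
inline uint32_t block_mask(uint32_t begin, uint32_t end)
{
    assert(begin < end && end <= PAGE_SIZE_BYTES);
    uint32_t first = begin / BLOCK_SIZE_BYTES;
    uint32_t last = (end - 1) / BLOCK_SIZE_BYTES;
    uint32_t count = last - first + 1;
    uint32_t bits = count >= 32 ? 0xffffffffu : ((1u << count) - 1u);
    return bits << first;
}

// Sampling only conflicts with rendering if the bit masks overlap.
inline bool sampling_hits_fb_write(const PageState &page)
{
    return page.fb_write_pending &&
           (page.fb_write_mask & page.texture_read_mask) != 0;
}

// A copy only conflicts with texture reads on overlapping blocks.
inline bool copy_hits_texture_read(const PageState &page, uint32_t copy_blocks)
{
    return (page.texture_read_blocks & copy_blocks) != 0;
}
```

With `fb_write_mask = 0xffffff` and `texture_read_mask = 0xff000000`, `sampling_hits_fb_write` returns false, which is exactly the 24-bit color plus upper 8-bit texture case.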

For host access, there are also timeline semaphore values per page. These values state which sync point to wait for if the host desires mapped read or mapped write access. Mapped write access may require more sync than mapped read if there are pending reads on that page.

Caching textures

Every page contains a list of VkImages which have been associated with it. When a page’s textures have been invalidated, the image is destroyed and has to be unswizzled again from VRAM.

There is a one-to-many relationship with textures and pages. A texture may span more than one page, and it’s enough that only one page is clobbered before the texture is invalidated.

Overall, there are a lot of micro-details here, but the important thing to note here is that conservative and simple tracking will not work on PS2 games. Tracking at a 256 byte block level and considering write/read masks is critical.

Special cases

There are various situations where we may have false positives due to how textures work. Since textures are POT sized, it’s fairly common for e.g. a 512×448 texture of a render target to be programmed as a 512×512 texture. The unused region should ideally be clamped out with REGION_CLAMP, but most games don’t. A render target might occupy those unused pages. As long as the game’s UV coordinates don’t extend into the unused red zone, there are no hazards, but this is very painful to track. We would have to analyze every single primitive to detect if it’s sampling into the red zone.

As a workaround, we ignore any potential hazard in that red zone, and just pray that a game isn’t somehow relying on ridiculous spooky-action-at-a-distance hazards to work in the game’s favor.

There are more spicy special cases, especially with texture sampling feedback, but that will be for later.

Updating CLUT in a batched way

Since we want to batch texture uploads, we have to batch CLUT uploads too. To make this work, we have 1024 copies of CLUT, a ring buffer of snapshots.

One workgroup loops through the updates and writes them to an SSBO. I did a similar thing for N64 RDP’s TMEM update, where TMEM was instanced. Fortunately, CLUT update is far simpler than TMEM update.

shared uint tmp_clut[512];

// ...

// Copy from previous instance to allow a
// CLUT entry to be partially overwritten and used later
uint read_index = registers.read_index * CLUT_SIZE_16 +
    gl_LocalInvocationIndex;
tmp_clut[gl_LocalInvocationIndex] =
    uint(clut16.data[read_index]);
tmp_clut[gl_LocalInvocationIndex + 256u] =
    uint(clut16.data[read_index + 256u]);
barrier();

for (uint i = 0; i < registers.clut_count; i++)
{
  // ...
  if (active_lane)
  {
    // update tmp_clut. If 256 color, all threads participate.
    // 16 color update is a partial update.
  }

  // Flush current CLUT state to SSBO.
  barrier();
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex]);
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16 + 256u] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex + 256u]);
  barrier();
}

One potential optimization is that for 256 color / 32 bpp updates, we can parallelize the CLUT update, since nothing from previous iterations will be preserved, but the CLUT update time is tiny anyway.
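The same idea can be written as a sequential CPU-side model: every update starts from the previous CLUT contents, applies a (possibly partial) write and stores a full snapshot. `ClutUpdate` and `build_snapshots` are hypothetical names, and the ring buffer indexing is left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t CLUT_ENTRIES_16 = 512; // 1 KiB worth of 16-bit entries.
using Clut = std::array<uint16_t, CLUT_ENTRIES_16>;

// Hypothetical update record: overwrite entries starting at offset.
struct ClutUpdate
{
    uint32_t offset;
    std::vector<uint16_t> entries;
};

// Produces one full CLUT snapshot per update, in order.
inline std::vector<Clut> build_snapshots(const Clut &initial,
                                         const std::vector<ClutUpdate> &updates)
{
    std::vector<Clut> snapshots;
    Clut current = initial; // Carried over between updates.
    for (const auto &update : updates)
    {
        assert(update.offset + update.entries.size() <= CLUT_ENTRIES_16);
        for (size_t i = 0; i < update.entries.size(); i++)
            current[update.offset + i] = update.entries[i];
        snapshots.push_back(current);
    }
    return snapshots;
}
```

A later snapshot still contains the entries written by earlier partial updates, which is the reason the updates are processed in order.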

Unswizzling textures from VRAM

Since this is Vulkan, we can just allocate a new VkImage, suballocate it from VkDeviceMemory and blast it with a compute shader.

Using Vulkan’s specialization constants, we specialize the texture format and all the swizzling logic becomes straightforward code.

REGION_REPEAT shenanigans are also resolved here, so that the ubershader doesn’t have to consider that case and do manual bilinear filtering.

Even for render targets, we roundtrip through the VRAM SSBO. There is not really a point going to the length of trying to forward render targets into textures. Way too many bugs to squash and edge cases to think about.

Triangle setup and binning

Like paraLLEl-RDP, paraLLEl-GS is a tile-based renderer. Before binning can happen, we need triangle setup. As inputs, we provide attributes in three arrays.

Position
struct VertexPosition
{
  ivec2 pos;
  float z;     // TODO: Should be uint for 32-bit Z.
  int padding; // Free real-estate?
};
Per-Vertex attributes
struct VertexAttribute
{
  vec2 st;
  float q;
  uint rgba; // unpackUnorm4x8
  float fog; // overkill, but would be padding anyway
  u16vec2 uv;
};
Per-primitive attributes
struct PrimitiveAttribute
{
  i16vec4 bb; // Scissor
  // Index into state UBO, as well as misc state bits.
  uint state;
  // Texture state which should be scalarized. Affects code paths.
  // Also holds the texture index (for bindless).
  uint tex;
  // Texture state like lod scaling factors, etc.
  // Does not affect code paths.
  uint tex2;  
  uint alpha; // AFIX / AREF
  uint fbmsk;
  uint fogcol;
};

For rasterization, we have a straightforward barycentric-based rasterizer. It is heavily inspired by https://fgiesen.wordpress.com/2011/07/06/a-trip-through-the-graphics-pipeline-2011-part-6/, which in turn is based on A Parallel Algorithm for Polygon Rasterization (Pineda, 1988) and describes the “standard” way to write a rasterizer with parallel hardware. Of course, the PS2 GS is DDA, i.e. a scanline rasterizer, but in practice, this is just a question of nudging ULPs of precision, and since I’m not aware of a bit-exact description of the GS’s DDA, this is fine. paraLLEl-RDP implements the raw DDA form for example. It’s certainly possible if we have to.

As an extension to a straight-forward triangle rasterizer, I also need to support parallelograms. This is used to implement wide-lines and sprites. Especially wide-line is kinda questionable, but I’m not sure it’s possible to fully solve up-scaling + Bresenham in the general case. At least I haven’t run into a case where this really matters.

Evaluating coverage and barycentric I/J turns into something like this:

bool evaluate_coverage_single(PrimitiveSetup setup,
  bool parallelogram, 
  ivec2 parallelogram_offset,
  ivec2 coord, inout float i, inout float j)
{
  int a = idot3(setup.a, coord);
  int b = idot3(setup.b, coord);
  int c = idot3(setup.c, coord);

  precise float i_result = float(b) * setup.inv_area + setup.error_i;
  precise float j_result = float(c) * setup.inv_area + setup.error_j;
  i = i_result;
  j = j_result;

  if (parallelogram && a < 0)
  {
    b += a + parallelogram_offset.x;
    c += a + parallelogram_offset.y;
    a = 0;
  }

  return all(greaterThanEqual(ivec3(a, b, c), ivec3(0)));
}

inv_area is computed in a custom fixed-point RCP, which is ~24.0 bit accurate. Using the standard GPU RCP would be bad since it’s just ~22.5 bit accurate and not consistent across implementations. There is no reason to skimp on reproducibility and accuracy, since we’re not doing work per-pixel.

error_i and error_j terms are caused by the downsampling of the edge equations and tie-break rules. As a side effect of the GS’s [-4k, +4k] pixel range, the range of the cross-product requires 33-bit in signed integers. By downsampling a bit, we can get 32-bit integer math to work just fine with 8 sub-pixel accuracy for super-sampling / multi-sampling. Theoretically, this means our upper up-sampling limit is 8×8, but that’s ridiculous anyway, so we’re good here.

The parallelogram offsets are very small numbers meant to nudge the tie-break rules in our favor as needed. The exact details of the implementation escape me. I wrote that code years ago. It’s not very hard to derive however.
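For readers who haven’t written one of these, here is a bare-bones CPU sketch of the edge-equation setup and coverage test for a plain triangle. It deliberately leaves out tie-break rules, the error terms, downsampling, the custom RCP and parallelograms, and assumes the winding gives a positive area. All names besides `idot3` are made up.

```cpp
#include <cassert>

struct IVec2 { int x, y; };
struct IVec3 { int x, y, z; };

struct PrimitiveSetup
{
    IVec3 a, b, c;   // Edge equations, one per vertex weight.
    float inv_area;
};

inline int idot3(IVec3 e, IVec2 p)
{
    return e.x * p.x + e.y * p.y + e.z;
}

// Edge function for the edge p -> q:
// E(x, y) = dx * (y - p.y) - dy * (x - p.x).
inline IVec3 make_edge(IVec2 p, IVec2 q)
{
    int dx = q.x - p.x;
    int dy = q.y - p.y;
    return { -dy, dx, dy * p.x - dx * p.y };
}

inline PrimitiveSetup setup_triangle(IVec2 v0, IVec2 v1, IVec2 v2)
{
    PrimitiveSetup setup;
    setup.a = make_edge(v1, v2); // Weight of v0.
    setup.b = make_edge(v2, v0); // Weight of v1, becomes I.
    setup.c = make_edge(v0, v1); // Weight of v2, becomes J.
    int area = idot3(setup.a, v0);
    assert(area > 0); // Degenerate / flipped triangles not handled here.
    setup.inv_area = 1.0f / float(area);
    return setup;
}

inline bool evaluate_coverage(const PrimitiveSetup &setup, IVec2 coord,
                              float &i, float &j)
{
    int a = idot3(setup.a, coord);
    int b = idot3(setup.b, coord);
    int c = idot3(setup.c, coord);
    i = float(b) * setup.inv_area;
    j = float(c) * setup.inv_area;
    return a >= 0 && b >= 0 && c >= 0;
}
```

An attribute is then interpolated as `attr0 + i * (attr1 - attr0) + j * (attr2 - attr0)`.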

Every primitive gets a struct of transformed attributes as well. This is only read if we actually end up shading a primitive, so it’s important to keep this separate to avoid polluting caches with too much garbage.

struct TransformedAttributes
{
  vec4 stqf0;
  vec4 stqf1;
  vec4 stqf2;
  uint rgba0;
  uint rgba1;
  uint rgba2;
  uint padding;
  vec4 st_bb;
};

Using I/J like this will lead to small inaccuracies when interpolating primitives which expect to land exactly on the top-left corner of a texel with NEAREST filtering. To combat this, a tiny epsilon offset is used when snapping texture coordinates. Very YOLO, but what can you do. As far as I know, hardware behavior is sub-texel floor, not sub-texel round.

precise vec2 uv_1 = uv * scale_1;

// Want a soft-floor here, not round behavior.
const float UV_EPSILON_PRE_SNAP = 1.0 / 16.0;
// We need to bias less than 1 / 512th texel, so that linear filter will RTE to correct subpixel.
// This is a 1 / 1024th pixel bias to counter-act any non-POT inv_scale_1 causing a round-down event.
const float UV_EPSILON_POST_SNAP = 16.0 / 1024.0;

if (sampler_clamp_s)
  uv_1.x = texture_clamp(uv_1.x, region_coords.xz, LOD_1);
if (sampler_clamp_t)
  uv_1.y = texture_clamp(uv_1.y, region_coords.yw, LOD_1);

// Avoid micro-precision issues with UV and flooring + nearest.
// Exact rounding on hardware is somewhat unclear.
// SotC requires exact rounding precision and is hit particularly bad.
// If the epsilon is too high, then FF X save screen is screwed over,
// so ... uh, ye.
// We likely need a more principled approach that is actually HW accurate in fixed point.
uv_1 = (floor(uv_1 * 16.0 + UV_EPSILON_PRE_SNAP) + UV_EPSILON_POST_SNAP) *
       inv_scale_1 * 0.0625;
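To see what the two epsilons do, here is a direct C++ port of the final snapping expression from the shader, for a single coordinate. `snap_uv` is a name I made up, and the port is only meant for playing with values on the CPU.

```cpp
#include <cassert>
#include <cmath>

constexpr float UV_EPSILON_PRE_SNAP = 1.0f / 16.0f;
constexpr float UV_EPSILON_POST_SNAP = 16.0f / 1024.0f;

// uv_scaled is the coordinate in texels, inv_scale is 1 / texture size.
inline float snap_uv(float uv_scaled, float inv_scale)
{
    assert(inv_scale > 0.0f);
    // Snap to 1/16th texel with a soft floor, then nudge slightly up.
    float snapped = std::floor(uv_scaled * 16.0f + UV_EPSILON_PRE_SNAP);
    return (snapped + UV_EPSILON_POST_SNAP) * inv_scale * 0.0625f;
}
```

With `inv_scale = 1`, a coordinate of 3.9999 that "should" have been 4.0 gets bumped to just above 4.0, while 3.99 still floors to the 3.9375 sub-texel step.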

Binning

This is mostly uninteresting. Every NxN pixel block gets an array of u16 primitive indices to shade. This makes the maximum number of primitives per render pass 64k, but that’s enough for PS2 games. Most games I’ve seen so far tend to be between 10k and 30k primitives for the “main” render pass. I haven’t tested the real juggernauts of primitive grunt yet, but even so, having to do a little bit of incremental rendering isn’t a big deal.

NxN is usually 32×32, but it can be dynamically changed depending on how heavy the geometry load is. For large resolutions and high primitive counts, the binning and memory cost is unacceptable if the tile size is just 16×16 for example. One subgroup is responsible for iterating through all primitives in a block.
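As a rough CPU-side illustration of coarse binning, the sketch below bins by bounding box only, which is more conservative than a real primitive-vs-tile test. `BoundingBox` and `bin_primitives` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BoundingBox
{
    int x0, y0, x1, y1; // Inclusive pixel bounds.
};

// Returns one list of u16 primitive indices per tile, row-major.
inline std::vector<std::vector<uint16_t>> bin_primitives(
    const std::vector<BoundingBox> &prims, int width, int height, int tile_size)
{
    assert(tile_size > 0 && width > 0 && height > 0);
    int tiles_x = (width + tile_size - 1) / tile_size;
    int tiles_y = (height + tile_size - 1) / tile_size;
    std::vector<std::vector<uint16_t>> bins(size_t(tiles_x) * size_t(tiles_y));

    // u16 indices cap a pass at 64k primitives.
    for (size_t i = 0; i < prims.size() && i < 65536; i++)
    {
        const BoundingBox &bb = prims[i];
        if (bb.x1 < 0 || bb.y1 < 0 || bb.x0 >= width || bb.y0 >= height)
            continue; // Fully outside the framebuffer.

        int tx0 = std::max(bb.x0, 0) / tile_size;
        int ty0 = std::max(bb.y0, 0) / tile_size;
        int tx1 = std::min(bb.x1, width - 1) / tile_size;
        int ty1 = std::min(bb.y1, height - 1) / tile_size;

        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bins[size_t(ty) * size_t(tiles_x) + size_t(tx)].push_back(uint16_t(i));
    }
    return bins;
}
```

Halving the tile size quadruples the number of bins, which is where the memory and binning cost comes from.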

Since binning and triangle setup are state-less, triangle-setup and binning for back-to-back passes are batched up nicely to avoid lots of silly barriers.

The ubershader

A key difference between N64 and PS2 is fill-rate and per-pixel complexity. For N64, the ideal approach is to specialize the rasterizing shader, write out per-pixel color + depth + coverage + etc, then merge that data in a much simpler ubershader that only needs to consider depth and blend state rather than full texturing state and combiner state. This is very bandwidth intensive on the GPU, but the alternative is the slowest ubershader written by man. We’re saved by the fact that N64 fill-rate is abysmal. Check out this video by Kaze to see how horrible it is.

The GS is a quite different beast. Fill-rate is very high, and per-pixel complexity is fairly low, so a pure ubershader is viable. We can also rely on bindless this time around too, so texturing complexity becomes a fraction of what I had to deal with on N64.

Fine-grained binning

Every tile is 4×4, 4×8 and 8×8 for subgroup sizes 16, 32 and 64 respectively. For super-sampling it’s even smaller (it’s 4×4 / 4×8 / 8×8 in the higher resolution domain instead).

In the outer loop, we pull in up to SubgroupSize’s worth of primitives, and bin them in parallel.

for (int i = 0; i < tile.coarse_primitive_count;
     i += int(gl_SubgroupSize))
{
  int prim_index = i + int(gl_SubgroupInvocationID);
  bool is_last_iteration = i + int(gl_SubgroupSize) >= 
                           tile.coarse_primitive_count;

  // Bin primitives to tile.
  bool binned_to_tile = false;
  uint bin_primitive_index;
  if (prim_index < tile.coarse_primitive_count)
  {
    bin_primitive_index = 
      uint(coarse_primitive_list.data[
           tile.coarse_primitive_list_offset + prim_index]);
    binned_to_tile = primitive_intersects_tile(bin_primitive_index);
  }

  // Iterate per binned primitive, do per pixel work now.
  // Scalar loop.
  uvec4 work_ballot = subgroupBallot(binned_to_tile);

In the inner loop, we can do a scalarized loop which checks coverage per-pixel, one primitive at a time.

// Scalar data
uint bit = subgroupBallotFindLSB(work_ballot);

if (gl_SubgroupSize == 64)
{
  if (bit >= 32)
    work_ballot.y &= work_ballot.y - 1;
  else
    work_ballot.x &= work_ballot.x - 1;
}
else
{
  work_ballot.x &= work_ballot.x - 1;
}

shade_primitive_index = subgroupShuffle(bin_primitive_index, bit);
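The bit twiddling above is the classic "find lowest set bit, then clear it" iteration. Here is a CPU emulation over a 64-lane ballot stored as two 32-bit words. `Ballot`, `find_lsb` and `drain_ballot` are stand-ins I made up, not GLSL built-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Ballot
{
    uint32_t x, y; // Lanes 0-31 and lanes 32-63.
};

// Index of the lowest set bit. Undefined for v == 0.
inline uint32_t find_lsb(uint32_t v)
{
    assert(v != 0);
    uint32_t n = 0;
    while ((v & 1u) == 0u)
    {
        v >>= 1;
        n++;
    }
    return n;
}

// Visits every set lane in ascending order.
inline std::vector<uint32_t> drain_ballot(Ballot ballot)
{
    std::vector<uint32_t> lanes;
    while (ballot.x != 0 || ballot.y != 0)
    {
        uint32_t bit = ballot.x != 0 ? find_lsb(ballot.x)
                                     : 32u + find_lsb(ballot.y);
        // v & (v - 1) clears the lowest set bit.
        if (bit >= 32)
            ballot.y &= ballot.y - 1;
        else
            ballot.x &= ballot.x - 1;
        lanes.push_back(bit);
    }
    return lanes;
}
```

Each visited lane index is what gets fed to the shuffle to fetch that lane’s binned primitive.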

Early Z

We can take advantage of early-Z testing of course, but we have to be careful if there are rasterized pixels we haven’t resolved yet, and there are Z-writes in flight. In this case we have to defer to late Z to perform test.

// We might have to remove opaque flag.
bool pending_z_write_can_affect_result =
  (pixel.request.z_test || !pixel.request.z_write) &&
  pending_shade_request.z_write;

if (pending_z_write_can_affect_result)
{
  // Demote the pixel to late-Z,
  // it's no longer opaque and we cannot discard earlier pixels.
  // We need to somehow observe the previous results.
  pixel.opaque = false;
}

Deferred on-tile shading

Since we’re an uber-shader, all pixels are “on-chip”, i.e. in registers, so we can take advantage of culling pixels that won’t be visible anyway. The basic idea here is that after rasterization, if a pixel is considered opaque, it will simply replace the shading request that exists for that framebuffer coordinate. It won’t be visible at all anyway.

Lazy pixel shading

We only need to perform shading when we really have to, i.e., we’re shading a pixel that depends on the previous pixel’s results. This can happen for e.g. alpha test (if test fails, we preserve existing data), color write masks, or of course, alpha blending.

If our pixel remains opaque, we can just kill the pending pixel shade request. Very nice indeed. The gain here wasn’t as amazing as I had hoped since PS2 games love blending, but it helps culling out a lot of shading work.

if (pixel.request.coverage > 0)
{
  need_flush = !pixel.opaque && pending_shade_request.coverage > 0;

  // If there is no hazard, we can overwrite the pending pixel.
  // If not, defer the update until we run a loop iteration.
  if (!need_flush)
  {
    set_pending_shade_request(pixel.request, shade_primitive_index);
    pixel.request.coverage = 0;
    pixel.request.z_write = false;
  }
}

If any pixel in the subgroup needs a flush, we flush for all of them. It’s just as fast to resolve all pixels anyway.

// Scalar branch
if (subgroupAny(need_flush))
{
  shade_resolve();
  if (has_work && pixel.request.coverage > 0)
    set_pending_shade_request(pixel.request, shade_primitive_index);
}

The resolve is a straightforward waterfall loop that stays in uniform control flow to be well defined on devices without maximal reconvergence support.

while (subgroupAny(has_work))
{
  if (has_work)
  {
    uint state_index =
      subgroupBroadcastFirst(pending_shade_request.state);
    uint tex = subgroupBroadcastFirst(prim_tex);
    if (state_index == pending_shade_request.state && prim_tex == tex)
    {
      has_work = false;
      shade_resolve(pending_primitive_index, state_index, tex);
    }
  }
}

This scalarization ensures that all branches on things like alpha test mode, blend modes, etc, are purely scalar, and GPUs like that. Scalarizing on the texture index is technically not that critical, but it means we end up hitting the same branches for filtering modes, UBOs for scaling factors are loaded uniformly, etc.

When everything is done, the resulting framebuffer color and depth is written out to SSBO. GPU bandwidth is kept to a minimum, just like a normal TBDR renderer.

Super-sampling

Just implementing single sampled rendering isn’t enough for this renderer to be really useful. The software renderer is certainly quite fast, but not fast enough to keep up with intense super-sampling. We can fix that now.

For e.g. 8x SSAA, we keep 10 versions of VRAM on the GPU.

  • 1 copy represents the single-sampled VRAM. It holds the resolved result of the super-samples.
  • 1 copy represents the reference value for single-sampled VRAM. This allows us to track when we should discard the super-samples and splat the single sample to all. This can happen if someone copies to VRAM over a render target for whatever reason.
  • 8 copies which each represent the super-samples. Technically, we can reconstruct a higher resolution image from these samples if we really want to, but only the CRTC could easily do that.

When rendering super-sampled, we load the single-sampled VRAM and reference. If they match, we load the super-sampled version. This is important for cases where we’re doing incremental rendering.
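
The load decision can be sketched on the CPU like this. This is only an illustration of the rule described above, not the renderer's actual code; `PixelCopies`, `load_samples` and the 32-bit pixel type are assumptions made for the example.

```cpp
#include <array>
#include <cstdint>

constexpr int kNumSamples = 8; // 8x SSAA, as in the example above

// The copies of VRAM kept for one pixel (hypothetical layout).
struct PixelCopies
{
    uint32_t single_sampled;
    uint32_t reference;
    std::array<uint32_t, kNumSamples> super_samples;
};

// If the single-sampled value still matches the reference, the super-samples
// are valid. Otherwise something wrote the single-sampled VRAM behind our
// back, so the single sample is splatted to every super-sample.
std::array<uint32_t, kNumSamples> load_samples(const PixelCopies &p)
{
    if (p.single_sampled == p.reference)
        return p.super_samples;

    std::array<uint32_t, kNumSamples> splat;
    splat.fill(p.single_sampled);
    return splat;
}
```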

On tile completion we use clustered subgroup ops to do multi-sample resolve, then write out the super-samples, and the two single-sampled copies.

uvec4 ballot_color = subgroupBallot(fb_color_dirty);
uvec4 ballot_depth = subgroupBallot(fb_depth_dirty);

// No need to mask, we only care about valid ballot for the
// first sample we write-back.
if (NUM_SAMPLES >= 16)
{
  ballot_color |= ballot_color >> 8u;
  ballot_depth |= ballot_depth >> 8u;
}

if (NUM_SAMPLES >= 8)
{
  ballot_color |= ballot_color >> 4u;
  ballot_depth |= ballot_depth >> 4u;
}

if (NUM_SAMPLES >= 4)
{
  ballot_color |= ballot_color >> 2u;
  ballot_depth |= ballot_depth >> 2u;
}

ballot_color |= ballot_color >> 1u;
ballot_depth |= ballot_depth >> 1u;

// GLSL does not accept cluster reduction as spec constant.
if (NUM_SAMPLES == 16)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 16) / 16.0);
else if (NUM_SAMPLES == 8)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 8) / 8.0);
else if (NUM_SAMPLES == 4)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 4) / 4.0);
else
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 2) / 2.0);

fb_color_dirty = subgroupInverseBallot(ballot_color);
fb_depth_dirty = subgroupInverseBallot(ballot_depth);
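
The shift-and-OR cascade is easy to verify on the CPU. The sketch below emulates a single 32-bit ballot component; `reduce_dirty_ballot` and `cluster_dirty` are hypothetical names, and it assumes each cluster of `num_samples` lanes starts at a lane index that is a multiple of `num_samples`.

```cpp
#include <cstdint>

// After the cascade, the lowest lane of each cluster holds the OR of
// every lane in that cluster. Higher lanes hold partial results, which
// is why the shader comment says no masking is needed.
uint32_t reduce_dirty_ballot(uint32_t ballot, unsigned num_samples)
{
    if (num_samples >= 16)
        ballot |= ballot >> 8u;
    if (num_samples >= 8)
        ballot |= ballot >> 4u;
    if (num_samples >= 4)
        ballot |= ballot >> 2u;
    ballot |= ballot >> 1u;
    return ballot;
}

// Reads the dirty flag at the first lane of a cluster.
bool cluster_dirty(uint32_t reduced, unsigned cluster, unsigned num_samples)
{
    return ((reduced >> (cluster * num_samples)) & 1u) != 0;
}
```

With 4 samples, a single dirty lane 6 marks cluster 1 (lanes 4 to 7) as dirty and leaves cluster 0 clean.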

The main advantage of super-sampling over straight up-scaling is that up-scaling will still have jagged edges, and super-sampling retains a coherent visual look where 3D elements have similar resolution as UI elements. One of my pet peeves is when UI elements have a significantly different resolution from 3D objects and textures. HD texture packs can of course alleviate that, but that’s a very different beast.

Super-sampling also lends itself very well to CRT post-processing shading, which is also a nice bonus.

Dealing with super-sampling artifacts

It’s a fact of life that super-sampling always introduces horrible artifacts if not handled with utmost care. Mitigating this is arguably easier with software renderers over traditional graphics APIs, since we’re not limited by the fixed function interpolators. These tricks won’t make it perfect by any means, but it greatly mitigates jank in my experience, and I already fixed many upscaling bugs that GSdx Vulkan backend does not solve as we shall see later.

Sprite primitives should always render at single-rate

Sprites are always UI elements or similar, and games do not expect us to up-scale them. Doing so either results in artifacts where we sample outside the intended rect, or we risk overblurring the image if bilinear filtering is used.

The trick here is just to force-snap the pixel coordinate we use when rasterizing and interpolating. This is very inefficient of course, but UI shouldn’t take up the entire screen. And if it does (like in a menu), the GPU load is tiny anyway.

const uint SNAP_RASTER_BIT = (1u << STATE_BIT_SNAP_RASTER);
const uint SNAP_ATTR_BIT = (1u << STATE_BIT_SNAP_ATTRIBUTE);

if (SUPER_SAMPLE && (prim_state & SNAP_RASTER_BIT) != 0)
  fb_pixel = tile.fb_pixel_single_rate;

res.request.coverage = evaluate_coverage(
  prim, fb_pixel, i, j,
  res.request.multisample, SAMPLING_RATE_DIM_LOG2);

Flat primitives should interpolate at single-pixel coordinate

Going further, we can demote SSAA interpolation to MSAA center interpolation dynamically. Many UI elements are unfortunately rendered with normal triangles, so we have to be a bit more careful. This snap only affects attribute interpolation, not Z of course.

res.request.st_bb = false;
if (SUPER_SAMPLE &&
    (prim_state & (SNAP_RASTER_BIT | SNAP_ATTR_BIT)) == SNAP_ATTR_BIT)
{
  vec2 snap_ij = evaluate_barycentric_ij(
    prim.b, prim.c, prim.inv_area,
    prim.error_i, prim.error_j, tile.fb_pixel_single_rate,
    SAMPLING_RATE_DIM_LOG2);

  i = snap_ij.x;
  j = snap_ij.y;
  res.request.st_bb = true;
}

Here, we snap interpolation to the top-left pixel. This fixes any artifacts for primitives which align their rendering to a pixel center, but some games are mis-aligned, so this snapping can cause texture coordinates to go outside the expected area. To clean this up, we compute a bounding box of final texture coordinates. Adding bounding boxes can technically cause notorious block-edge artifacts, but that was mostly a thing on PS1 since emulators like to convert nearest sampling to bilinear.

The heuristic for this is fairly simple. If perspective is used and all vertices in a triangle have the exact same Q, we assume it’s a flat UI primitive. The primitive’s Z coordinates must also match. This is done during triangle setup on the GPU. There can of course be false positives here, but they should be rare. In my experience this hack works well enough in the games I tried.
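
As a rough sketch, the check could look something like this. `SetupVertex` and `is_flat_ui_primitive` are hypothetical names, and treating non-perspective primitives as trivially passing the Q test is an assumption of this example, since the text above only covers the perspective case.

```cpp
#include <array>
#include <cstdint>

// Only the fields the heuristic looks at (hypothetical layout).
struct SetupVertex
{
    float q;
    uint32_t z;
};

bool is_flat_ui_primitive(const std::array<SetupVertex, 3> &v, bool perspective)
{
    // Assumption: without perspective there is no Q to compare,
    // so only the Z test applies.
    bool same_q = !perspective || (v[0].q == v[1].q && v[1].q == v[2].q);
    bool same_z = v[0].z == v[1].z && v[1].z == v[2].z;
    return same_q && same_z;
}
```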

Results

Here’s a good example of up-sampling going awry in PCSX2. This is with Vulkan backend:

Notice the bloom on the glass being mis-aligned and a subtle (?) rectangular pattern being overlaid over the image. This is caused by a post-processing pass rendering in a page-like pattern, presumably to optimize for GS caching behavior.

 

With 8x SSAA in paraLLEl-GS it looks like this instead. There is FSR1 post-upscale in effect here which changes the look a bit, but the usual trappings of bad upscale cannot be observed here. This is another reason to super-sample; texture mis-alignment has a tendency to fix itself.

Also, if you’re staring at the perf numbers, this is RX 7600 in a low power state :’)

Typical UI issues can be seen in games as well. Here’s native resolution:

and 4x upscale, which … does not look acceptable.

This UI is tricky to render in upscaled mode, since it uses triangles, but the MSAA snap trick above works well and avoids all artifacts. With straight upscale, this is hard to achieve in normal graphics APIs since you’d need interpolateAtOffset beyond 0.5 pixels, which isn’t supported. Perhaps you could do custom interpolation with derivatives or something like that, but either way, this glitch can be avoided. The core message is basically to never upscale UI beyond plain nearest neighbor integer scale. It just looks bad.

There are cases where PCSX2 asks for high blending accuracy. One example is MGS2, and I found a spot where GPU perf is murdered. My desktop GPU cannot keep 60 FPS here at 4x upscale. PCSX2 asks you to turn up blend-accuracy for this game, but …

What happens here is we hit the programmable blending path with barrier between every primitive. Ouch! This wouldn’t be bad for the tiler mobile GPUs, but for a desktop GPU, it is where perf goes to die. The shader in question does subpassLoad and does programmable blending as expected. Barrier, tiny triangle, barrier, tiny triangle, hnnnnnnng.

paraLLEl-GS on the other hand always runs with 100% blend accuracy (assuming no bugs of course). Here’s 16xSSAA (equivalent to 4x upscale). This is just 25 W and 17% GPU utilization on RX 7600. Not bad.

Other difficult cases include texture sampling feedback. One particular case I found was in Valkyrie Profile 2.

This game has a case where it’s sampling its own pixel’s alpha as a palette index. Quirky as all hell, and similar to MGS2 there’s a barrier between every pixel.

In paraLLEl-GS, this case is detected, and we emit a magical texture index, which resolves to just looking at in-register framebuffer color instead. Programmable blending go brr. These cases have to be checked per primitive, which is quite rough on CPU time, but it is what it is. If we don’t hit the good path, GPU performance completely tanks.

The trick here is to analyze the effective UV coordinates, and see if UV == framebuffer position. If we fall off this path, we have to go via texture uploads, which is bad.

ivec2 uv0_delta = uv0 - pos[0].pos;
ivec2 uv1_delta = uv1 - pos[1].pos;
ivec2 min_delta = min(uv0_delta, uv1_delta);
ivec2 max_delta = max(uv0_delta, uv1_delta);

if (!quad)
{
  ivec2 uv2_delta = uv2 - pos[2].pos;
  min_delta = min(min_delta, uv2_delta);
  max_delta = max(max_delta, uv2_delta);
}

int min_delta2 = min(min_delta.x, min_delta.y);
int max_delta2 = max(max_delta.x, max_delta.y);

// The UV offset must be in range of [0, 2^SUBPIXEL_BITS - 1].
// This guarantees snapping with NEAREST.
// 8 is ideal. That means pixel centers during interpolation
// will land exactly in the center of the texel.
// In theory we could allow LINEAR if uv delta was
// exactly 8 for all vertices.
if (min_delta2 < 0 || max_delta2 >= (1 << SUBPIXEL_BITS))
  return ColorFeedbackMode::Sliced;

// Perf go brrrrrrr.
return ColorFeedbackMode::Pixel;

if (feedback_mode == ColorFeedbackMode::Pixel)
{
  mark_render_pass_has_texture_feedback(ctx.tex0.desc);
  // Special index indicating on-tile feedback.
  // We could add a different sentinel for depth feedback.
  // 1024 CLUT instances and 32 sub-banks. Fits in 15 bits.
  // Use bit 15 MSB to mark feedback texture.
  return (1u << (TEX_TEXTURE_INDEX_BITS - 1u)) |
         (render_pass.clut_instance * 32 + uint32_t(ctx.tex0.desc.CSA));
}
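
To make the bit budget concrete, here is a small sketch of the packing. It assumes `TEX_TEXTURE_INDEX_BITS` is 16 (consistent with the "bit 15" comment) and that there are at most 1024 CLUT instances with 32 sub-banks each; the helper names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kTexIndexBits = 16;                         // assumed value
constexpr uint32_t kFeedbackBit = 1u << (kTexIndexBits - 1u);  // bit 15

// 1024 instances * 32 sub-banks = 32768 slots, i.e. exactly 15 bits.
uint32_t pack_feedback_index(uint32_t clut_instance, uint32_t csa)
{
    return kFeedbackBit | (clut_instance * 32u + csa);
}

bool is_feedback_index(uint32_t index)
{
    return (index & kFeedbackBit) != 0;
}

uint32_t feedback_clut_slot(uint32_t index)
{
    return index & (kFeedbackBit - 1u);
}
```

The largest CLUT slot (instance 1023, sub-bank 31) packs to 0xFFFF, so the flag bit and the payload never collide.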

It’s comfortably full-speed on PCSX2 here, despite the copious number of barriers, but paraLLEl-GS is reasonably close perf-wise, actually. At 8x SSAA.

Overall, we get away with 18 render pass barriers instead of 500+ which was the case without this optimization. You may notice the interlacing artifacts on the swirlies. Silly game has a progressive scan output, but downsamples it on its own to a field before hitting CRTC, hnnnnng 🙁 Redirecting framebuffer locations in CRTC might work as a per-game hack, but either way, I still need to consider a better de-interlacer. Some games actually render explicitly in fields (640×224), which is very annoying.

This scene in the MGS2 intro also exposes some funny edge cases with sampling.

To get the camo effect, it’s sampling its own framebuffer as a texture, with overlapping coordinates, but not pixel aligned, so this raises some serious questions about caching behavior. PCSX2 doesn’t seem to add any barriers here, and I kinda had to do the same thing. It looks fine to me compared to software renderer at least.

if (feedback_mode == ColorFeedbackMode::Sliced)
{
  // If game explicitly clamps the rect to a small region,
  // it's likely doing well-defined feedbacks.
  // E.g. Tales of Abyss main menu ping-pong blurs.
  // This code is quite flawed,
  // and I'm not sure what the correct solution is yet.
  if (desc.clamp.desc.WMS == CLAMPBits::REGION_CLAMP &&
      desc.clamp.desc.WMT == CLAMPBits::REGION_CLAMP)
  {
    ivec4 clamped_uv_bb(
      int(desc.clamp.desc.MINU),
      int(desc.clamp.desc.MINV),
      int(desc.clamp.desc.MAXU),
      int(desc.clamp.desc.MAXV));

    ivec4 hazard_bb(
      std::max<int>(clamped_uv_bb.x, bb.x),
      std::max<int>(clamped_uv_bb.y, bb.y),
      std::min<int>(clamped_uv_bb.z, bb.z),
      std::min<int>(clamped_uv_bb.w, bb.w));

    cache_texture = hazard_bb.x > hazard_bb.z ||
                    hazard_bb.y > hazard_bb.w;
  }
  else
  {
    // Questionable,
    // but it seems almost impossible to do this correctly and fast.
    // Need to emulate the PS2 texture cache exactly,
    // which is just insane.
    // This should be fine.
    cache_texture = false;
  }
}

If we’re in a mode where texture points directly to the frame buffer we should relax the hazard tracking a bit to avoid 2000+ barriers. This is clearly spooky since Tales of Abyss’s bloom effect as shown earlier depends on this to be well behaved, but in that case, at least it uses REGION_CLAMP to explicitly mark the ping-pong behavior. I’m not sure what the proper solution is here.

The only plausible solution to true bit-accuracy with real hardware is to emulate the caches directly, one pixel at a time. You can kiss performance goodbye in that case.

One of the worst stress tests I’ve found so far has to be Shadow of the Colossus. Just in the intro, we can make the GPU kneel down to 24 FPS with maximum blend accuracy on PCSX2, at just 2x upscale! Even with normal blending accuracy, it is extremely heavy during the intro cinematic.

At 8x SSAA, perf is still looking pretty good for paraLLEl-GS, but it’s clearly sweating now.

We’re actually still CPU bound on the geometry processing. Optimizing the CPU code hasn’t been a huge priority yet. There’s unfortunately a lot of code that has to run per-primitive, where hazards can happen around every corner that has to be dealt with somehow. I do some obvious optimizations, but it’s obviously not as well-oiled as PCSX2 in that regard.

Deck?

It seems fast enough to comfortably do 4x SSAA. Maybe not in SotC, but … hey. 😀

What now?

For now, the only real way to test this is through GS dumps. There’s a hack-patch for PCSX2 that lets you dump out a raw GS trace, which can be replayed. This works via mkfifo as a crude hack to test in real-time, but some kind of integration into an emulator needs to happen at some point if this is to turn into something that’s useful for end users.

There’s guaranteed to be a million bugs lurking since the PS2 library is ridiculously large and there’s only so much I can be arsed to test myself. At least, paraLLEl-GS has now become my preferred way to play PS2 games, so I can say mission complete.

A potential use case comes from its standalone library nature: it may be useful as a very old-school rendering API for the old greybeards around that still yearn for the days of PS2 programming for whatever reason :p

Real-time video streaming experiments with forward error correction

As I previously discussed in my PyroFling post about real-time video streaming, one challenge I mentioned was related to error correction. Using UDP, packet loss is inevitable, so there are two approaches to reduce streaming jank:

  • Error masking – hallucinate missed frames
  • Forward error correction (FEC) – add redundancy to avoid dropped packets

Re-sending packets is a waste of time in a low latency environment like this, so we can ignore that. If re-sending was okay, I’d just use TCP and forget about all of this anyway.

With intra-refresh, error masking is half decent, so I wanted to focus on FEC. Error correction is its own field of study, but I didn’t have the time to actually study the field. State-of-the-art error correction is extremely advanced, complex to implement and IP encumbered (*ahem*, RaptorQ, *ahem*), but I evaluated some less recent approaches.

Understanding the data

Which FEC mechanism we choose needs to take the input data into consideration.

N bytes split into 1024 byte sub-packets

A video packet is a variable number of bytes every frame, which I split up into 1024 bytes (+ header) each. A successful transmission only happens when all N bytes are received successfully. Sending partially valid data to the video decoder is likely going to result in horrible things happening, so if even one sub-packet is dropped, I have to drop the full frame.
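
A minimal sketch of the split, leaving out the header. The zero padding of the last sub-packet is an assumption of this example (it keeps all sub-packets the same size, which is convenient for XOR-based FEC); `split_frame` and `num_sub_packets` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSubPacketSize = 1024;

// ceil(N / 1024)
std::size_t num_sub_packets(std::size_t frame_bytes)
{
    return (frame_bytes + kSubPacketSize - 1) / kSubPacketSize;
}

std::vector<std::vector<uint8_t>> split_frame(const std::vector<uint8_t> &frame)
{
    std::vector<std::vector<uint8_t>> packets(num_sub_packets(frame.size()));
    for (std::size_t i = 0; i < packets.size(); i++)
    {
        std::size_t begin = i * kSubPacketSize;
        std::size_t end = std::min(begin + kSubPacketSize, frame.size());
        packets[i].assign(frame.begin() + begin, frame.begin() + end);
        // Assumption: pad the tail with zeroes to a full sub-packet.
        packets[i].resize(kSubPacketSize, 0);
    }
    return packets;
}
```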

Some error correction schemes rely on fixed block lengths, which isn’t ideal for our variable length input. The classic example everyone taking classes on the subject learns is the Hamming (7, 4) code, but this code is better suited for noisy analog channels where we don’t know if any bit was actually received correctly. What we really want is a method that takes extra knowledge about packet loss into account.

Erasure channel

Sending UDP packets over the internet functions like an erasure channel. At the receiver, we know if data was missed. Corrupt packets are dropped by the network (random bit-errors causing CRC check to pass is theoretically possible I suppose, but I don’t consider that).

Small block size

Since a packet is received all or nothing, we’re actually error correcting in a vectorized fashion. The message we’re looking at error correcting is byte n for all sub-packets P in [0, ceil(N / 1024)), where N is the video packet size in bytes. The error correction algorithm will perform the exact same operation for every byte n in a given packet.

Another way of looking at this is to consider every packet a single 8192-bit number, but that’s a very mathematical way of looking at it. Either way, given a typical 10 mbit/s stream at 60 fps, a frame averages about 20.8 kB (10,000,000 / 8 / 60 bytes), so we expect about 20 sub-packets per video frame. Some frames will be very small, and some will be larger.

Flexible redundancy ratios and block sizes

Some block codes like Reed-Solomon are well known and very powerful, but seemed a bit too rigid in their block structure. They also have block lengths that seem better suited to bit-streams.

Being able to adapt how much FEC is used is quite useful in a dynamic system such as streaming. A Compact Disc (which uses Reed-Solomon) has to bake in a fixed amount of error correction, but with streaming, a feedback channel can let us dynamically adjust the amount of error correction used as needed. I quickly rejected these codes.

The most basic FEC – YOLO XOR

I don’t think this code has a formal name, but understanding it is the foundation for the upcoming section. Given N packets, just take the XOR of all the packets and send that as one FEC block. If there is 1 packet loss, we can recover it by taking the XOR of all received packets and the FEC block. For small video frames with a small number of packets, this is actually the method I went with.
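
A minimal sketch of the idea, assuming all packets have already been padded to the same size. `xor_packets` and `recover_missing` are hypothetical names, not PyroFling's API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Packet = std::vector<uint8_t>;

// XOR of all given packets; every packet is assumed to have the same size.
// Running this over all data packets produces the single FEC block.
Packet xor_packets(const std::vector<Packet> &packets)
{
    Packet result(packets.empty() ? 0 : packets[0].size(), 0);
    for (const Packet &p : packets)
        for (std::size_t i = 0; i < result.size(); i++)
            result[i] ^= p[i];
    return result;
}

// With exactly one data packet missing, XOR-ing the survivors with the
// FEC block cancels everything except the missing packet.
Packet recover_missing(std::vector<Packet> survivors, const Packet &fec)
{
    survivors.push_back(fec);
    return xor_packets(survivors);
}
```

This works because XOR is its own inverse: every received packet appears twice (once directly, once inside the FEC block) and cancels out.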

The downside of course is that there is no obvious way to recover more than 1 packet. I found that spurious packet loss could have 2 or 3 drops in some cases, especially in very large video frames that span up to 100 sub-packets, so this approach was too naive for me.

Fountain codes

While looking around, I ran into a very clever scheme called a fountain code, in particular, the Luby Transform. There is a nice YouTube video explaining it. I also dug up an old textbook from my university studies, where Chapter 50 is dedicated to this topic with more mathematical rigor.

This method has some nice properties that are well suited for network transmission:

  • Designed for erasure channels (e.g. IP networks)
  • Flexible FEC ratios
  • Receiver can complete decode after receiving enough data packets (with some major caveats)

A fountain code is called so because the encoder can spit out an arbitrary number of packets. There is no fixed block structure, and the process is pseudo-random. As long as the encoder and decoder agree on a seed for the process, very little side channel data needs to be communicated.

The algorithm is essentially YOLO XOR, with a lot of statistical tweaks. First, we consider the degree d of a packet, which is the number of blocks we take the XOR of when generating a packet. A degree of 1 is the base case where we send a block as-is, and a degree of ceil(N / 1024) is taking the XOR of everything (i.e. the YOLO XOR case).

The packets chosen for XOR-ing are randomized. On the receiver end, we look at all our received packets, and if we find a case where we have a packet with degree d, and d – 1 of the packets have been recovered, we can recover the last one through … more XOR-ing. By recovering a new packet, this may cause other packets to reach this condition and the cycle continues. To kick-start this process, some packets with degree = 1 must be transmitted.
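
The receiver side can be sketched as a simple "peeling" loop. To keep it short, each block here is a single byte (the real thing applies the same operation to every byte of a sub-packet), and `CodedPacket` and `peel_decode` are hypothetical names rather than the actual implementation.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct CodedPacket
{
    std::vector<int> blocks; // indices XOR-ed together, degree = blocks.size()
    uint8_t value;           // XOR of those blocks
};

// Repeatedly looks for packets where all but one block is known and
// recovers the last one. Blocks that stay unknown are left empty.
std::vector<std::optional<uint8_t>> peel_decode(
    const std::vector<CodedPacket> &packets, int k)
{
    std::vector<std::optional<uint8_t>> out(k);
    bool progress = true;
    while (progress)
    {
        progress = false;
        for (const CodedPacket &p : packets)
        {
            uint8_t v = p.value;
            int missing = -1;
            int missing_count = 0;
            for (int b : p.blocks)
            {
                if (out[b])
                    v ^= *out[b]; // fold out already recovered blocks
                else
                {
                    missing = b;
                    missing_count++;
                }
            }
            if (missing_count == 1)
            {
                out[missing] = v;
                progress = true;
            }
        }
    }
    return out;
}
```

Without at least one degree-1 packet (or an equivalent chain), the loop never gets started and nothing is recovered, which is the kick-start problem mentioned above.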

To make this work well, the literature describes a very particular distribution for d to minimize the expected redundancy. I implemented all of this, but I found some unfortunate practical problems. It is (very) possible I had bugs of course, but debugging completely random processes is not very fun and it’s not like I had a reference result to compare against.

Non-deterministic amount of packets needed to complete decode

Given the completely random process, it’s unbounded how many packets have to be encoded to actually be able to decode, even with no packet loss. Studying the literature, the examples I found seemed to assume a very large number of blocks K. K would be 10000 for example, and as K increases, the variance of redundancy ratios decreases. For my example of K = 20, the algorithm seemed to collapse. Occasionally, I needed 2 or 3x redundancy to complete the decode, which is obviously unacceptable.

Painful statistical modelling

The statistical distribution for the degree factor d depends on the number of blocks to send, K. This value changes every frame. K can be arbitrarily large, so computing LUTs got awkward.

Smol brain LT code

Following the enlightened example of grug brain, I massively simplified the LT code down to something that maybe isn’t as theoretically good, but it worked very well in practice for my particular needs, where packet loss ratios are fairly low and K is low.

This basically boils down to a heavily “rigged” LT, but otherwise the encoder and decoder do not really change.

Send all packets as-is first (d = 1)

This is an obvious thing to do. If there is no packet loss, we guarantee that we can start decoding immediately (good for latency). There is no randomness in this process.

Fixed degree factor

I found that a fixed d factor of K / 2 worked well. For odd K, alternate d between ceil(K / 2) and floor(K / 2). For large K, clamping the factor d to something reasonable like 64 worked well too.

Mirror selection of blocks

For every pair of FEC blocks, we want to ensure that the data blocks selected for XOR are the complement of each other. This guarantees that by receiving a pair of FEC blocks, we will always be able to correct one missed packet, since the missing block is covered by exactly one of the two. With 50% probability, we can recover two packet losses. The odd/even split above is designed to make sure that an odd/even pair always covers all K blocks. As the number of blocks increases, we’ll be able to recover more losses (with lower and lower probability).
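
One way to sketch the mirrored selection is to shuffle the block indices and cut the shuffled list in two. This is an illustration under assumptions, not the actual implementation: `mirrored_fec_pair` is a hypothetical name, the seed handling is made up, and the clamp on d for large K is left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns the block indices XOR-ed into the two FEC blocks of one pair.
// The first gets ceil(K / 2) blocks, the second gets the remaining
// floor(K / 2), so together they cover every block exactly once.
std::pair<std::vector<int>, std::vector<int>> mirrored_fec_pair(int k, uint32_t seed)
{
    std::vector<int> order(k);
    std::iota(order.begin(), order.end(), 0);

    // Encoder and decoder only need to agree on the seed.
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    int d = (k + 1) / 2;
    std::vector<int> first(order.begin(), order.begin() + d);
    std::vector<int> second(order.begin() + d, order.end());
    std::sort(first.begin(), first.end());
    std::sort(second.begin(), second.end());
    return { first, second };
}
```

Two lost packets are recoverable from one pair only when they land in different halves, which matches the roughly 50% figure above.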

Results

Given a fixed number of data packets, and N randomly lost packets, we can observe how well this FEC recovers data. The recovery rate for 1 lost packet is 100% due to our design, so that’s not interesting. XOR degree factor is 20.

With larger data blocks, and more FEC blocks to match the redundancy ratio, recovery ratio improves, but beyond 4 losses, the codec starts collapsing. That’s fine. I haven’t had too many issues with burst losses like these. XOR degree factor is 40. There’s some interesting stair-stepping here, which might be caused by the mirroring. This suggests we get most bang for our bandwidth by using odd number of FEC blocks.

It’s possible to tweak things however. Using a smaller degree is good when using a lot of FEC packets and more errors are expected. Maybe it’s possible to use a blend of high degree and low degree packets as well (basically the entire point of LT), but this kind of tweaking can be left to another time. PyroFling’s expectation is low number of losses per video frame and simplicity beats theoretical performance.

If we add e.g. 0.5% random, uncorrelated packet loss ratio, a degree factor of d = N / 2 seems much better.

For larger data sets, the number of expected losses starts increasing, so degree factors seem to prefer d = 100 over d = 200. For larger video frames, we’re far more likely to encounter at least one packet loss, so that’s why loss ratio for 0 FEC packets approaches 100%.

Is this even good?

Compared to the state of the art, this is likely far from optimal, but this was good enough for my uses and here’s the latest log from a 4-hour play session between Trondheim and Bergen:

2322932 complete, 54 dropped video, 9683 FEC recovered

~99.4% of dropped packets were avoided (9683 recovered out of 9683 + 54 = 9737 that would otherwise have been dropped). This was at 25% FEC redundancy rate. Every dropped video packet is disruptive and lasts many frames, so this improvement was transformative.

This isn’t an academic project, so I don’t really care about comparing against a million different FEC algorithms. 🙂

UDP pacing to reduce bursty drops

A common technique in UDP streaming is to not send all packets immediately, but pace them over an interval. Sending over the full frame interval increases latency by quite a bit, but pacing a stream to a max instantaneous rate of e.g. 60 mbit/s worked alright. The added latency is only a few ms for a ~10-15 mbit/s stream, which is acceptable. Since I’m on Linux, and I was lazy, I rely on the kernel to do this automatically:

sudo tc qdisc add dev $iface root fq maxrate 60000000
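
To sanity check the "few ms" figure, here is a back-of-the-envelope sketch. It assumes every frame is exactly average-sized, which real video frames are not; `pacing_delay_ms` is a hypothetical helper.

```cpp
// Time needed to push one average-sized frame through the pacer.
double pacing_delay_ms(double stream_bits_per_second,
                       double frames_per_second,
                       double max_rate_bits_per_second)
{
    double bits_per_frame = stream_bits_per_second / frames_per_second;
    return 1000.0 * bits_per_frame / max_rate_bits_per_second;
}
```

For a 10 mbit/s stream at 60 fps paced to 60 mbit/s this gives about 2.8 ms, and about 4.2 ms for 15 mbit/s. Larger-than-average frames take proportionally longer.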

Conclusion

Hopefully this demonstrates a simple FEC that is fairly accessible. The last piece of the PyroFling adventure will be to finally tackle Vulkan Video encode.

Modernizing Granite’s mesh rendering

Granite’s renderer is currently quite old school. It was written with 2017 mobile hardware in mind after all. Little to no indirect drawing, a bindful material system, etc. We’ve all seen that a hundred times before. Granite’s niche ended up exploring esoteric use cases, not high-end rendering, so it was never a big priority for me to fix that.

Now that mesh shading has started shipping and is somewhat proven in the wild with several games shipping UE5 Nanite, and Alan Wake II – which all rely on mesh shaders to not run horribly slow – it was time to make a more serious push towards rewriting the entire renderer in Granite. This has been a slow burn project that’s been haunting me for almost half a year at this point. I haven’t really had the energy to rewrite a ton of code like this in my spare time, but as always, holidays tend to give me some energy for these things. Video shenanigans have also kept me distracted this fall.

I’m still not done with this rewrite, but enough things have fallen into place, that I think it’s time to write down my findings so far.

Design requirements

Reasonable fallbacks

I had some goals for this new method. Unlike UE5 Nanite and Alan Wake II, I don’t want to hard-require actual VK_EXT_mesh_shader support to run acceptably. Just thinking in terms of meshlets should benefit us in plain multi-draw-indirect (MDI) as well. For various mobile hardware that doesn’t support MDI well (or at all …), I’d also like a fallback path that ends up using regular direct draws. That fallback path is necessary to evaluate performance uplift as well.

What Nanite does to fallback

This is something to avoid. Nanite relies heavily on rendering primitive IDs to a visibility buffer, where attributes are resolved later. In the primary compute software rasterizer, this becomes a 64-bit atomic, and in the mesh shader fallback, a single primitive ID is exported to fragment stage as a per-primitive varying, where fragment shader just does the atomic (no render targets, super fun to debug …). The problem here is that per-primitive varyings don’t exist in the classic vertex -> fragment pipeline. There are two obvious alternatives to work around this:

  • Geometry shaders. Pass-through mode can potentially be used if all the stars align on supported hardware, but using geometry shaders should revoke your graphics programmer’s license.
  • Unroll a meshlet into a non-indexed draw. Duplicate primitive ID into 3 vertices. Use flat shading to pull in the primitive ID.
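
The second alternative can be sketched on the CPU as follows. This is only an illustration of the general unrolling idea, not Nanite's actual code; `UnrolledVertex` and `unroll_meshlet` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct UnrolledVertex
{
    uint32_t vertex_index; // which meshlet vertex to fetch attributes from
    uint32_t primitive_id; // duplicated into all 3 corners, read as flat
};

// Turns an indexed triangle list into a non-indexed one where every
// corner carries its primitive ID.
std::vector<UnrolledVertex> unroll_meshlet(const std::vector<uint8_t> &indices,
                                           uint32_t base_primitive_id)
{
    std::vector<UnrolledVertex> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3)
    {
        uint32_t prim = base_primitive_id + uint32_t(i / 3);
        for (std::size_t c = 0; c < 3; c++)
            out.push_back({ indices[i + c], prim });
    }
    return out;
}
```

Every primitive now costs three vertex invocations with no reuse between triangles, which is where the extra vertex work comes from.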

From my spelunking in various shipped titles, Nanite does the latter, and fallback rendering performance is halved as a result (!). Depending on the game, meshlet fallbacks are either very common or very rare, so real world impact is scene and resolution dependent, but Immortals of Aveum lost 5-15% FPS when I tested it.

The Visibility Buffer: A Cache-Friendly Approach to Deferred Shading suggests rendering out a visibility G-Buffer using InstanceID (fed through some mechanism) and SV_PrimitiveID, which might be worth exploring at some point. I’m not sure why Nanite did not go that route. It seems like it would have avoided the duplicated vertices.

Alan Wake II?

Mesh shaders are basically a hard requirement for this game. It will technically boot without mesh shader support, but the game gives you a stern warning about performance, and they are not kidding. I haven’t dug into what the fallback is doing, but I’ve seen people posting videos demonstrating sub-10 FPS on a 1080 Ti. Given the abysmal performance, I wouldn’t be surprised if they just disabled all culling and draw everything in the fallback.

A compressed runtime meshlet format

While studying https://github.com/zeux/meshoptimizer I found support for compressed meshes, a format that was turned into a glTF EXT. It seems to be designed for decompressing on CPU (completely serial algorithm), which was not all that exciting for me, but this sparked an idea. What if I could decompress meshlets on the GPU instead? There are two ways this can be useful:

  • Would it be fast enough to decompress inline inside the mesh shader? This can potentially save a lot of read bandwidth during rendering and save precious VRAM.
  • Bandwidth amplifier on asset loading time. Only the compressed meshlet format needs to go over PCI-e wire, and we decompress directly into VRAM. Similar idea to GDeflate and other compression formats, except I should be able to come up with something that is way faster than a general purpose algorithm and also give decent compression ratios.

I haven’t seen any ready-to-go implementation of this yet, so I figured this would be my starting point for the renderer. Always nice to have an excuse to write some cursed compute shaders.

Adapting to implementations

One annoying problem with mesh shading is that different vendors have very different fast paths through their hardware. There is no single implementation that fits all. I’ve spent some time testing various parameters and observing what makes NV and AMD go fast w.r.t. mesh shaders, with questionable results. I believe this is the number 1 reason mesh shaders are still considered a niche feature.

Since we’re baking meshlets offline, the format itself must be able to adapt to implementations that prefer 32/64/128/256 primitive meshlets. It must also adapt nicely to MultiDrawIndirect-style rendering.

Random-access

It should be efficient to decode meshlets in parallel, and in complete isolation.

The format

I went through some (read: way too many) design iterations before landing on this design.

256 vert/prim meshlets

Going wide means we get lower culling overhead and emitting larger MDI calls avoids us getting completely bottlenecked on command stream frontend churn. I tried going lower than 256, but performance suffered greatly. 256 seemed like a good compromise. With 256 prim/verts, we can use 8-bit index buffers as well, which saves quite a lot of memory.

Sublets – 8×32 grouping

Considering the various hardware implementations, very few will be happy with full, fat 256 primitive meshlets. To remedy this, the encoding is grouped in units of 32 – a “sublet” – where we can shade the 8 groups independently, or have larger workgroups that shade multiple sublets together. This kind of flexibility is key to being performance portable. At runtime we can specialize our shaders to fit whatever hardware we’re targeting.

Using grouping of 32 is core to the format as well, since we can exploit NV warps being 32-wide and force Wave32 on RDNA hardware to get subgroup accelerated mesh shading.

Format header

// Can point to mmap-ed file.
struct MeshView
{
    const FormatHeader *format_header;
    const Bound *bounds;
    const Bound *bounds_256; // Used to cull in units of 256 prims
    const Stream *streams;
    const uint32_t *payload;
    uint32_t total_primitives;
    uint32_t total_vertices;
    uint32_t num_bounds;
    uint32_t num_bounds_256;
};
struct FormatHeader
{
    MeshStyle style;
    uint32_t stream_count;
    uint32_t meshlet_count;
    uint32_t payload_size_words;
};

The style signals type of mesh. This is naturally engine specific.

  • Wireframe: A pure position + index buffer
  • Textured: Adds UV + Normal + Tangent
  • Skinned: Adds bone indices and weights on top

A stream is 32 values encoded in some way.

enum class StreamType
{
    Primitive = 0,
    Position,
    NormalTangentOct8,
    UV,
    BoneIndices,
    BoneWeights,
};

Each meshlet has stream_count number of Stream headers. The indexing is trivial:

streams[RuntimeHeader::stream_offset + int(StreamType)]

// 16 bytes
struct Stream
{
    union
    {
       uint32_t base_value[2];
       struct { uint32_t prim_count; uint32_t vert_count; } counts;
    } u;
    uint32_t bits;
    uint32_t offset_in_words;
};

This is where things get a bit more interesting. I ended up supporting some encoding styles that are tailored for the various attribute formats.

Encoding attributes

There are two parts to this problem. First is to decide on some N-bit fixed point values, and then find the most efficient way to pull those bits from a buffer. I went through several iterations on the actual bit-stuffing.

Base value + DELTA encoding

A base value is encoded in Stream::base_value, and the decoded bits are an offset from the base. To start approaching speed-of-light decoding, this is about as fancy as we can do it.

I went through various iterations of this model. The first idea had a predictive encoding between neighbor values, where subgroup scan operations were used to complete the decode, but it was too slow in practice, and didn’t really improve bit rates at all.
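To make the base + delta idea concrete, here is a minimal CPU-side sketch of one 32-value stream. The names (DeltaStream, encode_delta_stream, decode_delta) are made up for illustration, and the deltas are kept in a plain array rather than bit-packed, so this only models the arithmetic and not the actual payload layout.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU-side model of one 32-value stream:
// a base value plus fixed-width unsigned deltas.
struct DeltaStream
{
    uint32_t base;                   // Smallest value in the stream.
    uint32_t bits;                   // Bits needed to store the largest delta.
    std::array<uint32_t, 32> deltas; // value[i] - base, not bit-packed here.
};

// Number of bits needed to represent v (0 needs 0 bits).
inline uint32_t bits_for(uint32_t v)
{
    uint32_t bits = 0;
    while (v != 0)
    {
        bits++;
        v >>= 1;
    }
    return bits;
}

inline DeltaStream encode_delta_stream(const std::array<uint32_t, 32> &values)
{
    DeltaStream s = {};
    s.base = *std::min_element(values.begin(), values.end());
    uint32_t max_delta = 0;
    for (std::size_t i = 0; i < values.size(); i++)
    {
        s.deltas[i] = values[i] - s.base;
        max_delta = std::max(max_delta, s.deltas[i]);
    }
    s.bits = bits_for(max_delta);
    return s;
}

// Decoding is a single add per value; no neighbor lane is involved.
inline uint32_t decode_delta(const DeltaStream &s, uint32_t lane)
{
    assert(lane < 32);
    return s.base + s.deltas[lane];
}
```

Since every lane decodes independently, no scan across the subgroup is needed, which is the property the predictive variant gave up.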

Index buffer

Since the sublet is just 32-wide, we can encode with 5-bit indices, i.e. 15 bits / primitive. There is no real reason to use delta encoding here, so instead of storing base values in the stream header, I opted to use those bits to encode primitive/vertex counts.

Position

This is decoded to 3×16-bit SINT. The shared exponent is stored in top 16 bits of Stream::bits.

vec3 position = ldexp(vec3(i16vec3(decoded)), exponent);

This facilitates arbitrary quantization as well.
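As a rough illustration of the shared-exponent scheme, the sketch below quantizes floats to 16-bit signed integers and decodes them with ldexp, mirroring the shader line above. The encoder side (choose_exponent, quantize) is my own assumption about how such an exponent could be picked; the post only shows the decode.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed encoder policy: pick the smallest exponent so that
// max_abs * 2^-exponent still fits in a signed 16-bit integer.
inline int choose_exponent(float max_abs)
{
    int e = -24; // Arbitrary lower bound for this sketch.
    while (std::ldexp(max_abs, -e) > 32767.0f)
        e++;
    return e;
}

// Fixed-point quantization with round-to-nearest.
inline int16_t quantize(float v, int exponent)
{
    return static_cast<int16_t>(std::lround(std::ldexp(v, -exponent)));
}

// Same operation as the shader: ldexp(float(decoded), exponent).
inline float dequantize(int16_t q, int exponent)
{
    return std::ldexp(static_cast<float>(q), exponent);
}
```

Raising the exponent by one halves the precision and frees up one bit of range per component, which is the knob "arbitrary quantization" refers to.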

UV

Similar idea as position, but 2×16-bit SINT. After decoding similar to position, a simple fixup is made to cater to typical UVs which lie in range of [0, +1], not [-1, +1].

vec2 uv = 0.5 * ldexp(vec2(i16vec2(decoded)), exponent) + 0.5;

Normal/Tangent

Encoded as 4×8-bit SNORM. Normal (XY) and Tangent (ZW) are encoded with Octahedral encoding from meshoptimizer library.

To encode the sign of tangent, Stream::bits stores 2 bits, which signals one of three modes:

  • Uniform W = -1
  • Uniform W = +1
  • LSB of decoded W encodes tangent W. Tangent’s second component loses 1 bit of precision.

Bone index / weight

Basically same as Normal/Tangent, but ignore tangent sign handling.

First (failed?) idea – bitplane encoding

For a long time, I was pursuing bitplane encoding, which is one of the simplest ways to encode variable bitrates. We can encode 1 bit for 32 values by packing them in one u32. To speed up decoding further, I aimed to pack everything into 128-bit aligned loads. This avoids having to wait for tiny, dependent 32-bit loads.

For example, for index buffers:

uint meshlet_decode_index_buffer(
   uint stream_index, uint chunk_index,
   int lane_index)
{
    uint offset_in_b128 =
      meshlet_streams.data[stream_index].offset_in_b128;

    // Fixed 5-bit encoding.
    offset_in_b128 += 4 * chunk_index;

    // Scalar load. 64 bytes in one go.
    uvec4 p0 = payload.data[offset_in_b128 + 0];
    uvec4 p1 = payload.data[offset_in_b128 + 1];
    uvec4 p2 = payload.data[offset_in_b128 + 2];
    uvec4 p3 = payload.data[offset_in_b128 + 3];

    uint indices = 0;

    indices |= bitfieldExtract(p0.x, lane_index, 1) << 0u;
    indices |= bitfieldExtract(p0.y, lane_index, 1) << 1u;
    indices |= bitfieldExtract(p0.z, lane_index, 1) << 2u;
    indices |= bitfieldExtract(p0.w, lane_index, 1) << 3u;

    indices |= bitfieldExtract(p1.x, lane_index, 1) << 8u;
    indices |= bitfieldExtract(p1.y, lane_index, 1) << 9u;
    indices |= bitfieldExtract(p1.z, lane_index, 1) << 10u;
    indices |= bitfieldExtract(p1.w, lane_index, 1) << 11u;

    indices |= bitfieldExtract(p2.x, lane_index, 1) << 16u;
    indices |= bitfieldExtract(p2.y, lane_index, 1) << 17u;
    indices |= bitfieldExtract(p2.z, lane_index, 1) << 18u;
    indices |= bitfieldExtract(p2.w, lane_index, 1) << 19u;

    indices |= bitfieldExtract(p3.x, lane_index, 1) << 4u;
    indices |= bitfieldExtract(p3.y, lane_index, 1) << 12u;
    indices |= bitfieldExtract(p3.z, lane_index, 1) << 20u;

    return indices;
}

On Deck, this ends up looking like

s_buffer_load_dwordx4 x 4
v_bfe_u32 x 15
v_lshl_or_b32 x 15

Thinking about ALU and loads in terms of scalar and vectors can greatly help AMD performance when done right, so this approach felt natural.

For variable bit rates, I’d have code like:

if (bits & 4) { unroll_4_bits_bit_plane(); }
if (bits & 2) { unroll_2_bits_bit_plane(); }
if (bits & 1) { unroll_1_bit_bit_plane(); }

However, I abandoned this idea, since while favoring SMEM so heavily, the VALU with all the bitfield ops wasn’t exactly amazing for perf. I’m still just clocking out one bit per operation here. AMD performance was quite alright compared to what I ended up with in the end, but NVIDIA performance was abysmal, so I went back to the drawing board, and ended up with the absolute simplest solution that would work.
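For reference, the bitplane layout can be modelled on the CPU like this. It is a generic sketch (one u32 per bit plane, lane i owning bit i of each plane) and does not reproduce the byte-interleaved 128-bit packing used in the shader above; the function names are made up.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Bit plane p holds bit p of all 32 values; lane i owns bit i of each plane.
template <unsigned Bits>
std::array<uint32_t, Bits> bitplane_encode(const std::array<uint32_t, 32> &values)
{
    static_assert(Bits >= 1 && Bits < 32, "sketch supports 1..31 bit planes");
    std::array<uint32_t, Bits> planes = {};
    for (unsigned lane = 0; lane < 32; lane++)
    {
        assert((values[lane] >> Bits) == 0); // Value must fit in Bits bits.
        for (unsigned p = 0; p < Bits; p++)
            planes[p] |= ((values[lane] >> p) & 1u) << lane;
    }
    return planes;
}

template <unsigned Bits>
uint32_t bitplane_decode(const std::array<uint32_t, Bits> &planes, unsigned lane)
{
    uint32_t v = 0;
    // One extract + shift/or per bit, matching the "one bit per operation" cost.
    for (unsigned p = 0; p < Bits; p++)
        v |= ((planes[p] >> lane) & 1u) << p;
    return v;
}
```

The decode loop makes the cost model obvious: an N-bit value always costs N extract/insert pairs per lane, regardless of how cheap the loads are.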

Tightly packed bits

This idea is to just literally pack bits together, clearly a revolutionary idea that no one has ever done before. A VMEM load or two per thread, then some shifts should be all that is needed to move the components into place.

E.g. for index buffers:

uvec3 meshlet_decode_index_buffer(uint stream_index,
   uint chunk_index,
   int lane_index)
{
  uint offset_in_words = 
    meshlet_streams.data[stream_index].offset_in_words;
  return meshlet_decode3(offset_in_words, lane_index, 5);
}

For the actual decode I figured it would be pretty fast if all the shifts could be done in 64-bit. At least AMD has native instructions for that.

uvec3 meshlet_decode3(uint offset_in_words,
   uint index,
   uint bit_count)
{
    const uint num_components = 3;
    uint start_bit = index * bit_count * num_components;
    uint start_word = offset_in_words + start_bit / 32u;
    start_bit &= 31u;
    uint word0 = payload.data[start_word];
    uint word1 = payload.data[start_word + 1u];
    uvec3 v;

    uint64_t word = packUint2x32(uvec2(word0, word1));
    v.x = uint(word >> start_bit);
    start_bit += bit_count;
    v.y = uint(word >> start_bit);
    start_bit += bit_count;
    v.z = uint(word >> start_bit);
    return bitfieldExtract(v, 0, int(bit_count));
}

There is one detail here. For 13, 14 and 15 bit components with uvec3 decode, more than two u32 words may be needed, so in this case the encoder must choose 16 bits. (16-bit works due to alignment.) This only comes up in position encoding, and the encoder can easily ensure that 12-bit deltas are enough, quantizing a bit more as necessary.
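The bit-width restriction can be checked mechanically. Below is a CPU-side sketch with a hypothetical packer (pack3), a decoder that follows the same two-word, 64-bit shift structure as meshlet_decode3, and a helper that tests whether all 32 lanes fit their three components inside a 64-bit window. It is meant as a reference model, not the exporter's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoder: append 3 components of bit_count bits each,
// tightly packed, least significant bit first.
inline void pack3(std::vector<uint32_t> &words, uint32_t index,
                  uint32_t bit_count, const std::array<uint32_t, 3> &v)
{
    assert(bit_count >= 1 && bit_count <= 16);
    for (uint32_t c = 0; c < 3; c++)
    {
        assert((v[c] >> bit_count) == 0);
        uint32_t bit = (index * 3 + c) * bit_count;
        for (uint32_t b = 0; b < bit_count; b++, bit++)
        {
            if (words.size() <= bit / 32)
                words.resize(bit / 32 + 1, 0);
            words[bit / 32] |= ((v[c] >> b) & 1u) << (bit % 32);
        }
    }
}

// Same structure as the shader: load two words, shift in 64-bit, mask.
inline std::array<uint32_t, 3> decode3(const std::vector<uint32_t> &words,
                                       uint32_t index, uint32_t bit_count)
{
    uint32_t start_bit = index * bit_count * 3;
    uint32_t start_word = start_bit / 32;
    start_bit &= 31u;
    // The shader reads two words unconditionally; here a read past the
    // end is treated as zero padding.
    uint64_t lo = words[start_word];
    uint64_t hi = start_word + 1 < words.size() ? words[start_word + 1] : 0;
    uint64_t word = lo | (hi << 32);
    uint32_t mask = (1u << bit_count) - 1u;
    std::array<uint32_t, 3> v = {};
    for (uint32_t c = 0; c < 3; c++)
        v[c] = uint32_t(word >> (start_bit + c * bit_count)) & mask;
    return v;
}

// True if every one of the 32 lanes finds all 3 components
// inside the two loaded words.
inline bool fits_in_two_words(uint32_t bit_count)
{
    for (uint32_t index = 0; index < 32; index++)
    {
        uint32_t start_bit = (index * bit_count * 3) & 31u;
        if (start_bit + 3 * bit_count > 64)
            return false;
    }
    return true;
}
```

Running fits_in_two_words over 1 to 16 bits flags exactly 13, 14 and 15 as unsafe. 12 and 16 bits survive because their per-lane stride (36 and 48 bits) keeps the start offset aligned to 4 and 16 bits respectively.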

Mapping to MDI

Every 256-wide meshlet can turn into an indexed draw call with VK_INDEX_TYPE_UINT8_EXT, which is nice for saving VRAM. The “task shader” becomes a compute shader that dumps out a big multi-draw indirect buffer. The DrawIndex builtin in Vulkan ends up replacing WorkGroupID in mesh shader for pulling in per-meshlet data.
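A CPU-side model of that "task shader" might look like the sketch below. The command struct mirrors the field order of Vulkan's VkDrawIndexedIndirectCommand but is redeclared so the snippet builds without Vulkan headers; DecodedMeshlet and build_mdi are hypothetical names, and the real version runs as a compute shader appending to a GPU buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order as VkDrawIndexedIndirectCommand, redeclared here
// so the sketch compiles without Vulkan headers.
struct DrawIndexedIndirectCommand
{
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t vertexOffset;
    uint32_t firstInstance;
};

// Hypothetical per-meshlet record for a decoded, 8-bit indexed meshlet.
struct DecodedMeshlet
{
    uint32_t primitive_count; // At most 256.
    uint32_t first_index;     // Offset into the shared 8-bit index buffer.
    int32_t vertex_offset;    // Added to every 8-bit index by the draw.
    bool visible;             // Result of frustum / cone culling.
};

// One indexed draw per surviving meshlet.
inline std::vector<DrawIndexedIndirectCommand>
build_mdi(const std::vector<DecodedMeshlet> &meshlets)
{
    std::vector<DrawIndexedIndirectCommand> draws;
    for (const DecodedMeshlet &m : meshlets)
    {
        if (!m.visible)
            continue;
        assert(m.primitive_count <= 256);
        draws.push_back({ m.primitive_count * 3u, 1u, m.first_index,
                          m.vertex_offset, 0u });
    }
    return draws;
}
```

Because culled meshlets are skipped, the position of a command in the output is what the vertex shader later sees as its draw index, so any per-meshlet side data has to be compacted in the same order.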

Performance sanity check

Before going further with mesh shading fun, it’s important to validate performance. I needed at least a ballpark idea of how many primitives could be pumped through the GPU with a good old vkCmdDrawIndexed and the MDI method where one draw call is one meshlet. This was then to be compared against a straightforward mesh shader.

Zeux’s Niagara renderer helpfully has a simple OBJ for us to play with.

When exported to the new meshlet format it looks like:

[INFO]: Stream 0: 54332 bytes. (Index) 15 bits / primitive
[INFO]: Stream 1: 75060 bytes. (Position) ~25 bits / pos
[INFO]: Stream 2: 70668 bytes. (Normal/Tangent) ~23.8 bits / N + T + sign
[INFO]: Total encoded vertices: 23738 // Vertex duplication :(
[INFO]: Average radius 0.037 (908 bounds) // 32-wide meshlet
[INFO]: Average cutoff 0.253 (908 bounds)
[INFO]: Average radius 0.114 (114 bounds) // 256-wide meshlet
[INFO]: Average cutoff 0.697 (114 bounds)
// Backface cone culling isn't amazing for larger meshlets.
[INFO]: Exported meshlet:
[INFO]: 908 meshlets
[INFO]: 200060 payload bytes
[INFO]: 86832 total indices
[INFO]: 14856 total attributes
[INFO]: 703872 uncompressed bytes

One annoying thing about meshlets is attribute duplication when one vertex is reused across meshlets, and using tiny 32-wide meshlets makes this way worse. Add padding on top for encode and the compression ratio isn’t that amazing anymore. The primitive to vertex ratio is ~1.95 here which is really solid, but turning things into meshlets tends to converge to ~1.0.
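The ratios can be reproduced from the export log with a bit of arithmetic. The sketch below assumes that "total attributes" is the number of unique input vertices and "Total encoded vertices" is the count after meshlet duplication; that reading is mine, though it does reproduce the ~1.95 figure.

```cpp
#include <cassert>
#include <cstdint>

// Numbers copied from the export log.
constexpr uint32_t total_indices = 86832;
constexpr uint32_t unique_vertices = 14856;  // "total attributes" (assumed meaning)
constexpr uint32_t encoded_vertices = 23738; // After duplication across meshlets.
constexpr uint32_t payload_bytes = 200060;
constexpr uint32_t uncompressed_bytes = 703872;

constexpr uint32_t primitives = total_indices / 3;
static_assert(primitives == 28944, "three indices per triangle");

constexpr double ratio(uint32_t a, uint32_t b)
{
    return double(a) / double(b);
}

// Primitive-to-vertex ratio of the input mesh, roughly 1.95.
constexpr double pv_input = ratio(primitives, unique_vertices);
// Primitive-to-vertex ratio after splitting into 32-wide meshlets, roughly 1.22.
constexpr double pv_meshlets = ratio(primitives, encoded_vertices);
// Overall size reduction of the encoded payload, roughly 3.5x.
constexpr double compression = ratio(uncompressed_bytes, payload_bytes);
```

So the 32-wide split costs about 60% more vertices than the indexed input mesh, which eats into the compression ratio.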

I tried different sublet sizes, but NVIDIA performance collapsed when I didn’t use 32-wide sublets, and going to 64 primitive / 32 vertex only marginally helped P/V ratios. AMD runtime performance did not like that in my testing (~30% throughput loss), so 32/32 it is!

After writing this section, AMD released a blog post suggesting that the 2N/N structure is actually good, but I couldn’t replicate that in my testing at least and I don’t have the energy anymore to rewrite everything (again) to test that.

Test scene

The classic “instance the same mesh a million times” strategy. This was tested on RTX 3070 (AMD numbers to follow, there are way more permutations to test there …). The mesh is instanced in a 13x13x13 grid. Here we’re throwing 63.59 million triangles at the GPU in one go.

Spam vkCmdDrawIndexed with no culling

5.5 ms

layout(location = 0) in vec3 POS;
layout(location = 1) in mediump vec3 NORMAL;
layout(location = 2) in mediump vec4 TANGENT;
layout(location = 3) in vec2 UV;

layout(location = 0) out mediump vec3 vNormal;
layout(location = 1) out mediump vec4 vTangent;
layout(location = 2) out vec2 vUV;

// The most basic vertex shader.
void main()
{
  vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
  vNormal = mat3(M) * NORMAL;
  vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
  vUV = UV;
  gl_Position = VP * vec4(world_pos, 1.0);
}

With per-object frustum culling

This is the most basic thing to do, so for reference.

4.3 ms

One massive MDI

Here we’re just doing basic frustum culling of meshlets as well as back-face cone culling and emitting one draw call per meshlet that passes test.

3.9 ms

Significantly more geometry is rejected now due to back-face cull and tighter frustum cull, but performance isn’t that much better. Once we start considering occlusion culling, this should turn into a major win over normal draw calls. In this path, we have a bit more indirection in the vertex shader, so that probably accounts for some loss as well.

void main()
{
    // Need to index now, but shouldn't be a problem on desktop hardware.
    mat4 M = transforms.data[draw_info.data[gl_DrawIDARB].node_offset];

    vec3 world_pos = (M * vec4(POS, 1.0)).xyz;
    vNormal = mat3(M) * NORMAL;
    vTangent = vec4(mat3(M) * TANGENT.xyz, TANGENT.w);
    vUV = UV;

    // Need to pass down extra data to sample materials, etc.
    // Fragment shader cannot read gl_DrawIDARB.
    vMaterialID = draw_info.data[gl_DrawIDARB].material_index;

    gl_Position = VP * vec4(world_pos, 1.0);
}

Meshlet – Encoded payload

Here, the meshlet will read directly from the encoded payload, and decode inline in the shader. No per-primitive culling is performed.

4.1 ms

Meshlet – Decoded payload

4.0 ms

We’re at the point where we are bound on fixed function throughput. Encoded and Decoded paths are basically both hitting the limit of how much data we can pump to the rasterizer.

Per-primitive culling

To actually make good use of mesh shading, we need to consider per-primitive culling. For this section, I’ll be assuming a subgroup size of 32, and a meshlet size of 32. There are other code paths for larger workgroups, which require some use of groupshared memory, but that’s not very exciting for this discussion.

The gist of this idea was implemented in https://gpuopen.com/geometryfx/. Various AMD drivers adopted the idea as well to perform magic driver culling, but the code here isn’t based on any other code in particular.

Doing back-face culling correctly

This is tricky, but we only need to be conservative, not exact. We can only reject when we know for sure the primitive is not visible.

Perspective divide and clip codes

The first step is to do W divide per vertex and study how that vertex clips against the X, Y, and W planes. We don’t really care about Z. Near-plane clip is covered by negative W tests, and far plane should be covered by simple frustum test, assuming we have a far plane at all.

vec2 c = clip_pos.xy / clip_pos.w;

uint clip_code = clip_pos.w <= 0.0 ? CLIP_CODE_NEGATIVE_W : 0;
if (any(greaterThan(abs(c), vec2(4.0))))
    clip_code |= CLIP_CODE_INACCURATE;
if (c.x <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_X;
if (c.y <= -1.0)
    clip_code |= CLIP_CODE_NEGATIVE_Y;
if (c.x >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_X;
if (c.y >= 1.0)
    clip_code |= CLIP_CODE_POSITIVE_Y;

vec2 window = roundEven(c * viewport.zw + viewport.xy);

There are things to unpack here. The INACCURATE clip code is used to denote a problem where we might start to run into accuracy issues when converting to fixed point, or GPUs might start doing clipping due to guard band exhaustion. I picked the value arbitrarily.

The window coordinate is then computed by simulating the fixed point window coordinate snapping done by real GPUs. Any GPU supporting DirectX will have a very precise way of doing this, so this should be okay in practice. Vulkan also exposes the number of sub-pixel bits in the viewport transform. On all GPUs I know of, this is 8. DirectX mandates exactly 8.

vec4 viewport =
    float(1 << 8 /* shader assumes 8 */) *
        vec4(cmd->get_viewport().x +
               0.5f * cmd->get_viewport().width - 0.5f,
             cmd->get_viewport().y +
               0.5f * cmd->get_viewport().height - 0.5f,
             0.5f * cmd->get_viewport().width,
             0.5f * cmd->get_viewport().height) -
             vec4(1.0f, 1.0f, 0.0f, 0.0f);

This particular way of doing it comes into play later when discussing micro-poly rejection. One thing to note here is that Vulkan clip-to-window coordinate transform does not flip Y-sign. D3D does however, so beware.
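To see what the scale and bias actually do, here is a CPU-side sketch of the same transform. Vec2, Vec4, make_viewport and snap_to_window are illustrative names; std::nearbyint stands in for roundEven, which holds under the default round-to-nearest-even floating-point mode.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec4 { float x, y, z, w; };

// Scale and bias in sub-pixel units (8 sub-pixel bits), matching the
// host-side expression above: .xy is the bias, .zw is the scale.
inline Vec4 make_viewport(float x, float y, float width, float height)
{
    const float sub = float(1 << 8);
    return { sub * (x + 0.5f * width - 0.5f) - 1.0f,
             sub * (y + 0.5f * height - 0.5f) - 1.0f,
             sub * 0.5f * width,
             sub * 0.5f * height };
}

// Perspective divide followed by snapping to the sub-pixel grid.
inline Vec2 snap_to_window(Vec4 clip_pos, Vec4 viewport)
{
    float cx = clip_pos.x / clip_pos.w;
    float cy = clip_pos.y / clip_pos.w;
    // nearbyint rounds half to even in the default rounding mode.
    return { std::nearbyint(cx * viewport.z + viewport.x),
             std::nearbyint(cy * viewport.w + viewport.y) };
}
```

With a 1024x1024 viewport at the origin, the center of pixel 0 (NDC -1023/1024) lands on sub-pixel coordinate -1: pixel centers sit on multiples of 256, then everything is nudged down by one sub-pixel.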

Shuffle clip codes and window coordinates

void meshlet_emit_primitive(uvec3 prim, vec4 clip_pos, vec4 viewport)
{
  // ...
  vec2 window = roundEven(c * viewport.zw + viewport.xy);

  // vertex ID maps to gl_SubgroupInvocationID
  // Fall back to groupshared as necessary
  vec2 window_a = subgroupShuffle(window, prim.x);
  vec2 window_b = subgroupShuffle(window, prim.y);
  vec2 window_c = subgroupShuffle(window, prim.z);
  uint code_a = subgroupShuffle(clip_code, prim.x);
  uint code_b = subgroupShuffle(clip_code, prim.y);
  uint code_c = subgroupShuffle(clip_code, prim.z);
}

Early reject or accept

Based on clip codes we can immediately accept or reject primitives.

uint or_code = code_a | code_b | code_c;
uint and_code = code_a & code_b & code_c;
bool culled_planes = (and_code & CLIP_CODE_PLANES) != 0;
bool is_active_prim = false;

if (!culled_planes)
{
    is_active_prim =
        (or_code & (CLIP_CODE_INACCURATE |
                    CLIP_CODE_NEGATIVE_W)) != 0;

    if (!is_active_prim)
        is_active_prim = cull_triangle(window_a,
                                       window_b,
                                       window_c);
}

  • If all three vertices are outside one of the clip planes, reject immediately
  • If any vertex is considered inaccurate, accept immediately
  • If one or two of the vertices have negative W, we have clipping. Our math won’t work, so accept immediately. (If all three vertices have negative W, the first test rejects).
  • Perform actual back-face cull.
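The four rules above can be exercised on the CPU with a small sketch. The bit values of the clip codes and the exact contents of CLIP_CODE_PLANES are not given in the post, so the ones here are assumptions; CLIP_CODE_PLANES has to include the negative-W bit for the "all three behind" case to reject as described.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Bit assignments are hypothetical; only the names come from the shader code.
enum : uint32_t
{
    CLIP_CODE_INACCURATE = 1u << 0,
    CLIP_CODE_NEGATIVE_W = 1u << 1,
    CLIP_CODE_NEGATIVE_X = 1u << 2,
    CLIP_CODE_NEGATIVE_Y = 1u << 3,
    CLIP_CODE_POSITIVE_X = 1u << 4,
    CLIP_CODE_POSITIVE_Y = 1u << 5,
    // Assumed: every plane a whole triangle can be outside of, including W.
    CLIP_CODE_PLANES = CLIP_CODE_NEGATIVE_W | CLIP_CODE_NEGATIVE_X |
                       CLIP_CODE_NEGATIVE_Y | CLIP_CODE_POSITIVE_X |
                       CLIP_CODE_POSITIVE_Y,
};

// Per-vertex clip code, following the shader snippet earlier in this section.
inline uint32_t compute_clip_code(float x, float y, float w)
{
    float cx = x / w;
    float cy = y / w;
    uint32_t code = w <= 0.0f ? CLIP_CODE_NEGATIVE_W : 0u;
    if (std::fabs(cx) > 4.0f || std::fabs(cy) > 4.0f)
        code |= CLIP_CODE_INACCURATE;
    if (cx <= -1.0f) code |= CLIP_CODE_NEGATIVE_X;
    if (cy <= -1.0f) code |= CLIP_CODE_NEGATIVE_Y;
    if (cx >= 1.0f) code |= CLIP_CODE_POSITIVE_X;
    if (cy >= 1.0f) code |= CLIP_CODE_POSITIVE_Y;
    return code;
}

enum class Verdict { Reject, Accept, NeedsBackFaceTest };

// Per-primitive decision from the three vertex clip codes.
inline Verdict classify(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t or_code = a | b | c;
    uint32_t and_code = a & b & c;
    if ((and_code & CLIP_CODE_PLANES) != 0)
        return Verdict::Reject; // All three outside the same plane.
    if ((or_code & (CLIP_CODE_INACCURATE | CLIP_CODE_NEGATIVE_W)) != 0)
        return Verdict::Accept; // Our math is not trustworthy here.
    return Verdict::NeedsBackFaceTest;
}
```

Note that the order of the two tests matters: a triangle that is far off-screen on one side is both "inaccurate" and fully outside one plane, and it should still be rejected.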

Actual back-face cull

bool cull_triangle(vec2 a, vec2 b, vec2 c)
{
  precise vec2 ab = b - a;
  precise vec2 ac = c - a;

  // This is 100% accurate as long as the primitive
  // is no larger than ~4k subpixels, i.e. 16x16 pixels.
  // Normally, we'd be able to do a strict GT test, but the GE test
  // is conservative, even with FP error in play.

  // Depending on your engine and API conventions, swap these two.
  precise float pos_area = ab.y * ac.x;
  precise float neg_area = ab.x * ac.y;

  // If the pos value is (-2^24, +2^24),
  // the FP math is exact,
  // if not, we have to be conservative.
  // Less-than check is there to ensure that 1.0 delta
  // in neg_area *will* resolve to a different value.
  bool active_primitive;
  if (abs(pos_area) < 16777216.0)
    active_primitive = pos_area > neg_area;
  else
    active_primitive = pos_area >= neg_area;

  return active_primitive;
}

To compute winding, we need a 2D cross product. While noodling with this code, I noticed that we can still do it in FP32 instead of full 64-bit integer math. We’re working with integer-rounded values here, so based on the magnitudes involved we can pick the exact GT test when the products are known to be exact. If we risk FP rounding error, we fall back to the GE test. If the two rounded products then compare unequal, their ordering is still trustworthy, so a rejected primitive really has negative area. If they compare equal, the true area could have been positive with the intermediate values rounding to the same result, so the primitive is kept.
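A direct C++ port makes it easy to poke at the edge cases. This sketch keeps the shader's (somewhat misleading) function name and returns true when the primitive stays active; the Vec2 type is a stand-in. Inputs are expected to be the integer-valued snapped window coordinates.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Returns true if the primitive is kept, false if it is back-face culled.
inline bool cull_triangle(Vec2 a, Vec2 b, Vec2 c)
{
    float ab_x = b.x - a.x, ab_y = b.y - a.y;
    float ac_x = c.x - a.x, ac_y = c.y - a.y;

    // Two halves of the 2D cross product. Which one counts as
    // "positive" depends on engine and API conventions.
    float pos_area = ab_y * ac_x;
    float neg_area = ab_x * ac_y;

    // Below 2^24 the products of integer inputs are exact in FP32,
    // so the strict comparison is exact too.
    if (std::fabs(pos_area) < 16777216.0f)
        return pos_area > neg_area;

    // Otherwise rounding may have collapsed two different products
    // into the same float, so ties are kept to stay conservative.
    return pos_area >= neg_area;
}
```

A zero-area triangle with small coordinates is culled by the strict test, while a degenerate triangle with huge edges is conservatively kept since the equality could be a rounding artifact.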

3.3 ms

Culling primitives helped as expected. Less pressure on the fixed function units.

Micro-poly rejection

Given how pathologically geometry dense this scene is, we expect that most primitives never trigger the rasterizer at all.

If we can prove that the bounding box of the primitive lands between two pixel grids, we can reject it since it will never have coverage.

if (active_primitive)
{
    // Micropoly test.
    const int SUBPIXEL_BITS = 8;
    vec2 lo = floor(ldexp(min(min(a, b), c), ivec2(-SUBPIXEL_BITS)));
    vec2 hi = floor(ldexp(max(max(a, b), c), ivec2(-SUBPIXEL_BITS)));
    active_primitive = all(notEqual(lo, hi));
}

There is a lot to unpack in this code. If we re-examine the viewport transform:

vec4 viewport = float(1 << 8 /* shader assumes 8 */) *
  vec4(cmd->get_viewport().x +
        0.5f * cmd->get_viewport().width - 0.5f,
      cmd->get_viewport().y +
        0.5f * cmd->get_viewport().height - 0.5f,
      0.5f * cmd->get_viewport().width,
      0.5f * cmd->get_viewport().height) -
      vec4(1.0f, 1.0f, 0.0f, 0.0f);

First, we need to shift by 0.5 pixels. The rasterization test happens at the center of a pixel, and it’s more convenient to sample at integer points. Then, due to top-left rasterization rules on all desktop GPUs (a DirectX requirement), we shift the result by one sub-pixel. This ensures that should a primitive have a bounding box of [1.0, 2.0], we will consider it for rasterization, but [1.0 + 1.0 / 256.0, 2.0] will not. Top-left rules are not technically guaranteed in Vulkan however (it just has to have some rule), so if you’re paranoid, increase the upper bound by one sub-pixel.
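The two bounding-box cases from the paragraph above can be checked with a small CPU-side port of the test. to_subpixel is a hypothetical helper that takes a coordinate in center-shifted pixel space (integers are pixel centers) and applies the 8-bit snap plus the one sub-pixel nudge; in the real code that work is folded into the viewport transform.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };

// Hypothetical helper: center-shifted pixel coordinate -> snapped
// sub-pixel coordinate, including the one sub-pixel top-left nudge.
inline float to_subpixel(float px)
{
    return std::nearbyint(px * 256.0f) - 1.0f;
}

// Returns true if the bounding box crosses a sample point in both axes.
inline bool survives_micropoly(Vec2 a, Vec2 b, Vec2 c)
{
    const int SUBPIXEL_BITS = 8;
    float lo_x = std::floor(std::ldexp(std::min({ a.x, b.x, c.x }), -SUBPIXEL_BITS));
    float lo_y = std::floor(std::ldexp(std::min({ a.y, b.y, c.y }), -SUBPIXEL_BITS));
    float hi_x = std::floor(std::ldexp(std::max({ a.x, b.x, c.x }), -SUBPIXEL_BITS));
    float hi_y = std::floor(std::ldexp(std::max({ a.y, b.y, c.y }), -SUBPIXEL_BITS));
    // If lo and hi land in the same pixel cell on either axis,
    // no sample point can be covered.
    return lo_x != hi_x && lo_y != hi_y;
}
```

A box of [1.0, 2.0] on both axes maps to sub-pixel coordinates 255 and 511, which floor to different cells, so it survives. Starting the X range at 1.0 + 1.0 / 256.0 maps to 256 and 511, both in cell 1, so it is rejected.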

1.9 ms

Now we’re only submitting 1.2 M primitives to the rasterizer, which is pretty cool, given that we started with 63.59 M potential primitives. Of course, this is a contrived example with ridiculous micro-poly issues.

We’re actually at the point here where reporting the invocation stats (one atomic per workgroup) becomes a performance problem, so turning that off:

1.65 ms

With inline decoding there’s some extra overhead, but we’re still well ahead:

2.5 ms

Build active vertex / primitive masks

This is quite straightforward. Once we have the counts, SetMeshOutputCounts is called and we can compute the packed output indices with a mask and popcount.

uint vert_mask = 0u;
if (is_active_prim)
    vert_mask = (1u << prim.x) | (1u << prim.y) | (1u << prim.z);

uvec4 prim_ballot = subgroupBallot(is_active_prim);

shared_active_prim_offset = subgroupBallotExclusiveBitCount(prim_ballot);
shared_active_vert_mask = subgroupOr(vert_mask);

shared_active_prim_count_total = subgroupBallotBitCount(prim_ballot);
shared_active_vert_count_total = bitCount(shared_active_vert_mask);
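The subgroup operations are easy to emulate on the CPU for a 32-lane sublet, which helps when sanity checking the compaction logic. Compaction, build_masks and compacted_index are illustrative names; the loop plays the role of subgroupBallot and subgroupOr, and compacted_index is the "mask and popcount" step.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

struct Compaction
{
    uint32_t prim_ballot; // Bit i set if lane i holds an active primitive.
    uint32_t vert_mask;   // Bit v set if vertex v is used by an active primitive.
    uint32_t prim_count;
    uint32_t vert_count;
};

inline uint32_t popcount32(uint32_t v)
{
    return uint32_t(std::bitset<32>(v).count());
}

inline Compaction build_masks(const std::array<std::array<uint8_t, 3>, 32> &prims,
                              const std::array<bool, 32> &active)
{
    Compaction c = {};
    for (uint32_t lane = 0; lane < 32; lane++)
    {
        if (!active[lane])
            continue;
        c.prim_ballot |= 1u << lane; // Emulates subgroupBallot.
        for (uint8_t v : prims[lane])
        {
            assert(v < 32);
            c.vert_mask |= 1u << v;  // Emulates subgroupOr of per-lane masks.
        }
    }
    c.prim_count = popcount32(c.prim_ballot);
    c.vert_count = popcount32(c.vert_mask);
    return c;
}

// Packed output slot for a lane or vertex: number of set bits below it.
inline uint32_t compacted_index(uint32_t mask, uint32_t lane)
{
    assert(lane < 32);
    return popcount32(mask & ((1u << lane) - 1u));
}
```

Active primitives also need their three vertex indices remapped through compacted_index(vert_mask, v) before being written out, since the surviving vertices are renumbered.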

Special magic NVIDIA optimization

Can we improve things from here? On NVIDIA, yes. NVIDIA seems to under-dimension the shader export buffers in their hardware compared to peak triangle throughput, and their developer documentation on the topic suggests:

  • Replace attributes with barycentrics and allowing the Pixel Shader to fetch and interpolate the attributes

Using VK_KHR_fragment_shader_barycentrics we can write code like:

// Mesh output
layout(location = 0) flat out uint vVertexID[];
layout(location = 1) perprimitiveEXT out uint vTransformIndex[];

// Fragment
layout(location = 0) pervertexEXT in uint vVertexID[];
layout(location = 1) perprimitiveEXT flat in uint vTransformIndex;

// Fetch vertex IDs
uint va = vVertexID[0];
uint vb = vVertexID[1];
uint vc = vVertexID[2];

// Load attributes from memory directly
uint na = attr.data[va].n;
uint nb = attr.data[vb].n;
uint nc = attr.data[vc].n;

// Interpolate by hand
mediump vec3 normal = gl_BaryCoordEXT.x * decode_rgb10a2(na) +
    gl_BaryCoordEXT.y * decode_rgb10a2(nb) +
    gl_BaryCoordEXT.z * decode_rgb10a2(nc);

// Have to transform normals and tangents as necessary.
// Need to pass down some way to load transforms.
normal = mat3(transforms.data[vTransformIndex]) * normal;
normal = normalize(normal);

1.0 ms

Quite the dramatic gain! Nsight Graphics suggests we’re finally SM bound at this point (> 80% utilization), where we used to be ISBE bound (primitive / varying allocation). An alternative that I assume would work just as well is to pass down a primitive ID to a G-buffer similar to Nanite.

There are a lot of caveats with this approach however, and I don’t think I will pursue it:

  • Moves a ton of extra work to fragment stage
    • I’m not aiming for Nanite-style micro-poly hell here, so doing work per-vertex seems better than per-fragment
    • This result isn’t representative of a real scene where fragment shader load would be far more significant
  • Incompatible with encoded meshlet scheme
    • It is possible to decode individual values, but it sure is a lot of dependent memory loads to extract a single value
  • Very awkward to write shader code like this at scale
    • Probably need some kind of meta compiler that can generate code, but that’s a rabbit hole I’m not going down
    • Need fallbacks, barycentrics is a very modern feature
  • Makes skinning even more annoying
    • Loading multiple matrices with fully dynamic index in fragment shader does not scream performance, then combine that with having to compute motion vectors on top …
  • Only seems to help throughput on NVIDIA
  • We’re already way ahead of MDI anyway

Either way, this result was useful to observe.

AMD

Steam Deck

Before running the numbers, we have to consider that the RADV driver already does some mesh shader optimizations for us automatically. The NGG geometry pipeline automatically converts vertex shading workloads into pseudo-meshlets, and RADV also does primitive culling in the driver-generated shader.

To get the raw baseline, we’ll first consider the tests without that path, so we can see how well RADV’s own culling is doing. The legacy vertex path is completely gone on RDNA3 as far as I know, so these tests have to be done on RDNA2.

No culling, plain vkCmdDrawIndexed, RADV_DEBUG=nongg

Even locked to 1600 MHz (peak), GPU is still just consuming 5.5 W. We’re 100% bound on fixed function logic here, the shader cores are sleeping.

44.3 ms

Basic frustum culling

As expected, performance scales as we cull. Still 5.5 W.

27.9 ms

NGG path, no primitive culling, RADV_DEBUG=nonggc

Not too much changed in performance here. We’re still bound on the same fixed function units pumping invisible primitives through.

28.4 ms

Actual RADV path

When we don’t cripple RADV, we get a lot of benefit from driver culling. GPU hits 12.1 W now.

9.6 ms

MDI

Slight win.

8.9 ms

Forcing Wave32 in mesh shaders

Using Vulkan 1.3’s subgroup size control feature, we can force RDNA2 to execute in Wave32 mode. This requires support in

 VkShaderStageFlags requiredSubgroupSizeStages;

The Deck drivers and upstream Mesa now ship support for requiredSize in task/mesh shaders, which is very handy. AMD’s Windows drivers and AMDVLK/amdgpu-pro do not, however 🙁. It’s possible Wave32 isn’t the best idea for AMD mesh shaders in the first place; it’s just that the format favors Wave32, so I enable it if I can.

Testing various parameters

While NVIDIA really likes 32/32 (anything else I tried fell off the perf cliff), AMD should in theory favor larger workgroups. However, it’s not that easy in practice, as I found.

Decoded meshlet – Wave32 – N/N prim/vert

  • 32/32: 9.3 ms
  • 64/64: 10.5 ms
  • 128/128: 11.2 ms
  • 256/256: 12.8 ms

These results are … surprising.

Encoded meshlet – Wave32 N/N prim/vert

  • 32/32: 10.7 ms
  • 64/64: 11.8 ms
  • 128/128: 12.7 ms
  • 256/256: 14.7 ms

Apparently Deck (or RDNA2 in general) likes small meshlets?

Wave64?

No meaningful difference in performance on Deck.

VertexID passthrough?

No meaningful difference either. This is a very NVIDIA-centric optimization I think.

A note on LocalInvocation output

In Vulkan, there are some properties that AMD sets for mesh shaders.

VkBool32 prefersLocalInvocationVertexOutput;
VkBool32 prefersLocalInvocationPrimitiveOutput;

This means that we should write outputs using LocalInvocationIndex, which corresponds to how RDNA hardware works. Each thread can export one primitive and one vertex and the thread index corresponds to primitive index / vertex index. Due to culling and compaction, we will have to roundtrip through groupshared memory somehow to satisfy this.

For the encoded representation, I found that it’s actually faster to ignore this suggestion, but for the decoded representation, we can just send the vertex IDs through groupshared, and do split vertex / attribute shading. E.g.:

if (meshlet_lane_has_active_vert())
{
    uint out_vert_index = meshlet_compacted_vertex_output();
    uint vert_id = meshlet.vertex_offset + linear_index;

    shared_clip_pos[out_vert_index] = clip_pos;
    shared_attr_index[out_vert_index] = vert_id;
}

barrier();

if (gl_LocalInvocationIndex < shared_active_vert_count_total)
{
    TexturedAttr a =
      attr.data[shared_attr_index[gl_LocalInvocationIndex]];
    mediump vec3 n = unpack_bgr10a2(a.n).xyz;
    mediump vec4 t = unpack_bgr10a2(a.t);
    gl_MeshVerticesEXT[gl_LocalInvocationIndex].gl_Position =
      shared_clip_pos[gl_LocalInvocationIndex];
    vUV[gl_LocalInvocationIndex] = a.uv;
    vNormal[gl_LocalInvocationIndex] = mat3(M) * n;
    vTangent[gl_LocalInvocationIndex] = vec4(mat3(M) * t.xyz, t.w);
}

Only computing visible attributes is a very common optimization in GPUs in general and RADV’s NGG implementation does it roughly like this.

Either way, we’re not actually beating the driver-based meshlet culling on Deck. It’s more or less already doing this work for us. Given how close the results are, it’s possible we’re still bound on something that’s not raw compute. On the positive side, the cost of using encoded representation is very small here, and saving RAM for meshes is always nice.

Already, the permutation hell is starting to become a problem. It’s getting quite obvious why mesh shaders haven’t taken off yet 🙂

RX 7600 numbers

Data dump section incoming …

NGG culling seems obsolete now?

By default RADV disables NGG culling on RDNA3, because apparently it has a much stronger fixed function culling in hardware now. I tried forcing it on with RADV_DEBUG=nggc, but found no uplift in performance for normal vertex shaders. Curious. With no culling, the shader is completely export bound.

Forcing NGG culling on still doesn’t help much. The culling path takes as much time as the other; the instruction latencies are just spread around more.

RADV

  • vkCmdDrawIndexed, no frustum culling: 5.9 ms
  • With frustum cull: 3.7 ms
  • MDI: 5.0 ms

Wave32 – Meshlet

  • Encoded – 32/32: 3.3 ms
  • Encoded – 64/64: 2.5 ms
  • Encoded – 128/128: 2.7 ms
  • Encoded – 256/256: 2.9 ms
  • Decoded – 32/32: 3.3 ms
  • Decoded – 64/64: 2.4 ms
  • Decoded – 128/128: 2.6 ms
  • Decoded – 256/256: 2.7 ms

Wave64 – Meshlet

  • Encoded – 64/64: 2.4 ms
  • Encoded – 128/128: 2.6 ms
  • Encoded – 256/256: 2.7 ms
  • Decoded – 64/64: 2.2 ms
  • Decoded – 128/128: 2.5 ms
  • Decoded – 256/256: 2.7 ms

Wave64 mode is doing quite well here. From what I understand, RADV hasn’t fully taken advantage of the dual-issue instructions in RDNA3 specifically yet, which is important for Wave32 performance, so that might be a plausible explanation.

There was also no meaningful difference in doing VertexID passthrough.

It’s not exactly easy to deduce anything meaningful out of these numbers, other than 32/32 being bad on RDNA3, while good on RDNA2 (Deck)?

AMD doesn’t seem to like the smaller 256 primitive draws on the larger desktop GPUs. I tried 512 and 1024 as a quick test and that improved throughput considerably. Still, with finer-grained culling in place, it should be a significant win.

amdgpu-pro / proprietary (Linux)

Since we cannot request specific subgroup size, the driver is free to pick Wave32 or Wave64 as it pleases, so I cannot test the difference. It won’t hit the subgroup optimized paths however.

  • vkCmdDrawIndexed, no culling: 6.2 ms
  • With frustum cull: 4.0 ms
  • MDI: 5.3 ms
  • Meshlet – Encoded – 32/32: 2.5 ms
  • Meshlet – Encoded – 64/64: 2.6 ms
  • Meshlet – Encoded – 128/128: 2.7 ms
  • Meshlet – Encoded – 256/256: 2.6 ms
  • Meshlet – Decoded – 32/32: 2.1 ms
  • Meshlet – Decoded – 64/64: 2.1 ms
  • Meshlet – Decoded – 128/128: 2.1 ms
  • Meshlet – Decoded – 256/256: 2.1 ms

I also did some quick spot checks on AMDVLK, and the numbers are very similar.

The proprietary driver is doing quite well here in mesh shaders. On desktop, we can get significant wins on both RADV and proprietary with mesh shaders, which is nice to see.

It seems like the AMD Windows driver skipped NGG culling on RDNA3 as well. Performance is basically the same.

Task shader woes

The job of task shaders is to generate mesh shader work on the fly. In principle this is nicer than indirect rendering with mesh shaders for two reasons:

  • No need to allocate temporary memory to hold the indirect draws
  • No need to add extra compute passes with barriers

However, it turns out that this shader stage is even more vendor specific when it comes to tuning for performance. So far, no game I know of has actually shipped with task shaders (or the D3D12 equivalent amplification shader), and I think I now understand why.

The basic task unit I settled on was:

struct TaskInfo
{
    uint32_t aabb_instance;  // AABB, for top-level culling
    uint32_t node_instance;  // Affine transform
    uint32_t material_index; // To be eventually forwarded to fragment
    uint32_t mesh_index_count;
    // Encodes count [1, 32] in lower bits.
    // Mesh index is aligned to 32.
    uint32_t occluder_state_offset;
    // For two-phase occlusion state (for later)
};

An array of these is prepared on CPU. Each scene entity translates to one or more TaskInfos. Those are batched up into one big buffer, and off we go.
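As a rough sketch of how mesh_index_count can carry two values at once: since the mesh index is aligned to 32, its low 5 bits are always zero and can hold the count. The helper names and the choice of storing count - 1 are assumptions for illustration, not necessarily the exact layout Granite uses.

```cpp
#include <cstdint>

// Hypothetical helpers. Assumes mesh_index is a multiple of 32, so its low
// 5 bits are free to hold (count - 1) for a count in [1, 32].
constexpr uint32_t pack_mesh_index_count(uint32_t mesh_index, uint32_t count)
{
    return mesh_index | (count - 1u);
}

// Recovers the 32-aligned index of the first meshlet.
constexpr uint32_t unpack_mesh_index(uint32_t packed)
{
    return packed & ~31u;
}

// Recovers the number of meshlets, in [1, 32].
constexpr uint32_t unpack_meshlet_count(uint32_t packed)
{
    return (packed & 31u) + 1u;
}
```

Storing count - 1 is what lets a full group of 32 meshlets fit in 5 bits.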

The logical task shader for me was to have N = 32 threads which test the AABBs of N tasks in parallel. For the tasks that pass the test, test 32 meshlets in parallel. This makes it so the task workgroup can emit up to 32 × 32 = 1024 meshlets.

When I tried this on NVIDIA however …

18.8 ms

10x slowdown … The NVIDIA docs do mention that large outputs are bad, but I didn’t expect it to be this bad:

Avoid large outputs from the amplification shader, as this can incur a significant performance penalty. Generally, we encourage a flexible implementation that allows for fine-tuning. With that in mind, there are a number of generic factors that impact performance:

  • Size of the payloads. The AS payload should preferably stay below 108 bytes, but if that is not possible, then keep it at least under 236 bytes.

If we remove all support for hierarchical culling, the task shader runs alright again. 1 thread emits 0 or 1 meshlet. However, this means a lot of threads dedicated to culling, but it’s similar in performance to plain indirect mesh shading.

AMD, however, is a completely different story. Task shaders are implemented by essentially emitting a bunch of tiny indirect mesh shader dispatches anyway, so the usefulness of task shaders on AMD is questionable from a performance point of view. While I was writing this blog, AMD released a new blog on the topic, how convenient!

When I tried NV-style task shader on AMD, performance suffered quite a lot.

However, the only thing that gets us to max perf on both AMD and NV is to forget about task shaders and go with vkCmdDrawMeshTasksIndirectCountEXT instead. While the optimal task shader path for each vendor gets close to indirect mesh shading, having a universal fast path is good for my sanity. The task shader loss was about 10% for me even in ideal situations on both vendors, which isn’t great. As rejection ratios increase, this loss grows even more. This kind of occupancy looks way better 🙂

The reason for using multi-indirect-count is to deal with the limitation that we can only submit about 64k workgroups in any dimension, similar to compute. This makes 1D atomic increments awkward, since we’ll easily blow past the 64k limit. One alternative is to take another tiny compute pass that prepares a multi-indirect draw, but that’s not really needed. Compute shader code like this works too:

// global_offset = atomicAdd() in thread 0

if (gl_LocalInvocationIndex == 0 && draw_count != 0)
{
  uint max_global_offset = global_offset + draw_count - 1;
  // Meshlet style.
  // Only guaranteed to get 0xffff meshlets,
  // so use 32k as cutoff for easy math.
  // Allocate the 2D draws in-place, avoiding an extra barrier.
  uint multi_draw_index = max_global_offset / 0x8000u;
  uint local_draw_index = max_global_offset & 0x7fffu;
  const int INC_OFFSET = NUM_CHUNK_WORKGROUPS == 1 ? 0 : 1;
  atomicMax(output_draws.count[1], multi_draw_index + 1);
  atomicMax(output_draws.count[
    2 + 3 * multi_draw_index + INC_OFFSET],
    local_draw_index + 1);

  if (local_draw_index <= draw_count)
  {
    // This is the thread that takes us over the threshold.
    output_draws.count[
      2 + 3 * multi_draw_index + 1 - INC_OFFSET] =
      NUM_CHUNK_WORKGROUPS;
    output_draws.count[2 + 3 * multi_draw_index + 2] = 1;
  }

  // Wrapped around, make sure last bucket sees 32k meshlets.
  if (multi_draw_index != 0 && local_draw_index < draw_count)
  {
    atomicMax(output_draws.count[
      2 + 3 * (multi_draw_index - 1) +
      INC_OFFSET], 0x8000u);
  }
}

This prepares a bunch of (8, 32k, 1) dispatches that are processed in one go. No chance to observe a bunch of dead dispatches back-to-back like task shaders can cause. In the mesh shader, we can use DrawIndex to offset the WorkGroupID by the appropriate amount (yay, Vulkan). A dispatchX count of 8 is to shade the full 256-wide meshlet through 8x 32-wide workgroups. As the workgroup size increases to handle more sublets per group, dispatchX count decreases similarly.
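To sanity check the bucket math, here is a CPU port of the snippet above where the atomics become plain operations and each call stands in for thread 0 of one culling workgroup. The buffer layout (count[0] as the global meshlet counter, count[1] as the draw count, then 3 words per draw) and the MAX_DRAWS cap are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t NUM_CHUNK_WORKGROUPS = 8;
constexpr uint32_t INC_OFFSET = NUM_CHUNK_WORKGROUPS == 1 ? 0 : 1;
constexpr uint32_t MAX_DRAWS = 16; // Arbitrary cap for the sketch.

// Assumed layout: [0] meshlet counter, [1] draw count, then (x, y, z) per draw.
struct OutputDraws
{
    std::vector<uint32_t> count = std::vector<uint32_t>(2 + 3 * MAX_DRAWS, 0u);
};

// Stand-in for atomicMax() on a single-threaded CPU.
inline void atomic_max(uint32_t &dst, uint32_t value)
{
    dst = std::max(dst, value);
}

// One workgroup reporting draw_count surviving meshlets.
void allocate_draws(OutputDraws &output_draws, uint32_t draw_count)
{
    if (draw_count == 0)
        return;

    uint32_t global_offset = output_draws.count[0]; // atomicAdd()
    output_draws.count[0] += draw_count;

    uint32_t max_global_offset = global_offset + draw_count - 1;
    uint32_t multi_draw_index = max_global_offset / 0x8000u;
    uint32_t local_draw_index = max_global_offset & 0x7fffu;

    atomic_max(output_draws.count[1], multi_draw_index + 1);
    atomic_max(output_draws.count[2 + 3 * multi_draw_index + INC_OFFSET],
               local_draw_index + 1);

    if (local_draw_index <= draw_count)
    {
        // First allocation to land in this bucket fills in the fixed sizes.
        output_draws.count[2 + 3 * multi_draw_index + 1 - INC_OFFSET] =
            NUM_CHUNK_WORKGROUPS;
        output_draws.count[2 + 3 * multi_draw_index + 2] = 1;
    }

    // Wrapped around, make sure the previous bucket sees 32k meshlets.
    if (multi_draw_index != 0 && local_draw_index < draw_count)
    {
        atomic_max(output_draws.count[2 + 3 * (multi_draw_index - 1) + INC_OFFSET],
                   0x8000u);
    }
}
```

Feeding this 100 workgroups of 400 meshlets each (40000 total) ends with two draws: (8, 32768, 1) for the full bucket and (8, 7232, 1) for the remainder.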

Occlusion culling

To complete the meshlet renderer, we need to consider occlusion culling. The go-to technique these days is two-phase occlusion culling with a HiZ depth buffer.

Basic gist is to keep track of which meshlets are considered visible. This requires persistent storage of 1 bit per unit of visibility. Each pass in the renderer needs to keep track of its own bit-array. E.g. shadow passes have different visibility compared to main scene rendering.

For Granite, I went with an approach where 1 TaskInfo points to one uint32_t bitmask. Each of the 32 meshlets within the TaskInfo gets 1 bit. This makes the hierarchical culling nice too, since we can just test for visibility != 0 on the entire word. Nifty!
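A minimal CPU-side model of that bit layout, just to show why the hierarchical test falls out for free. The struct and method names are made up for this sketch and are not Granite's API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model: one 32-bit word per TaskInfo, one bit per meshlet.
struct VisibilityBuffer
{
    std::vector<uint32_t> words;

    // Hierarchical test: any visible meshlet in the task at all?
    bool task_visible(uint32_t task) const
    {
        return words[task] != 0u;
    }

    bool meshlet_visible(uint32_t task, uint32_t meshlet) const
    {
        return ((words[task] >> meshlet) & 1u) != 0u;
    }

    void set_meshlet(uint32_t task, uint32_t meshlet, bool visible)
    {
        uint32_t bit = 1u << meshlet;
        if (visible)
            words[task] |= bit;
        else
            words[task] &= ~bit;
    }
};
```

On the GPU the same word can be loaded once per task and tested per lane, which maps nicely onto 32 threads handling 32 meshlets.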

First phase

Here we render all objects which were considered visible last frame. It’s extremely likely that whatever was visible last frame is visible this frame, unless there was a full camera cut or similar. It’s important that we’re actually rendering to the framebuffer now. In theory, we’d be done rendering now if there were no changes to camera or objects in the scene.

HiZ pass

Based on the objects we drew in phase 1, build a HiZ depth map. This topic is actually kinda tricky. Building the mip-chain in one pass is great for performance, but causes some problems. With NPOT textures and single pass, there is no obvious way to create a functional HiZ, and the go-to shader for this, FidelityFX SPD, doesn’t support that use case.

The problem is that the size of mip-maps rounds down, so if we have a 7×7 texture, LOD 1 is 3×3 and LOD 2 is 1×1. In LOD 2, we will be able to query a 4×4 depth region, but the edge pixels are forgotten.

The “obvious” workaround is to pad the texture to POT, but that is a horrible waste of VRAM. The solution I went with instead was to fold in the neighbors as the mips are reduced. This makes it so that the edge pixels in each LOD also remember depth information for pixels which were truncated away due to NPOT rounding.
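The folding idea can be sketched on the CPU one mip level at a time. This is not the single-pass shader, only the reduction rule, and it assumes the HiZ stores the farthest depth as the larger value (which matches the max() in the culling code later on). DepthMip and reduce_hiz are names invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct DepthMip
{
    int width = 0;
    int height = 0;
    std::vector<float> depth; // Row-major, farthest depth per texel.
};

// Reduces one level. Sizes round down, so the last texel in each dimension
// also folds in the row/column that would otherwise be truncated away.
DepthMip reduce_hiz(const DepthMip &src)
{
    DepthMip dst;
    dst.width = std::max(src.width / 2, 1);
    dst.height = std::max(src.height / 2, 1);
    dst.depth.resize(static_cast<std::size_t>(dst.width) * dst.height);

    for (int y = 0; y < dst.height; y++)
    {
        for (int x = 0; x < dst.width; x++)
        {
            int x0 = 2 * x;
            int y0 = 2 * y;
            // Normally a 2x2 footprint, up to 3 wide/tall on the last texel.
            int x1 = (x == dst.width - 1) ? src.width - 1 : x0 + 1;
            int y1 = (y == dst.height - 1) ? src.height - 1 : y0 + 1;

            float d = src.depth[y0 * src.width + x0];
            for (int sy = y0; sy <= y1; sy++)
                for (int sx = x0; sx <= x1; sx++)
                    d = std::max(d, src.depth[sy * src.width + sx]);

            dst.depth[y * dst.width + x] = d;
        }
    }
    return dst;
}
```

With a 7×7 input where only the bottom-right pixel is far away, that depth survives into the 3×3 level and then the 1×1 level instead of being rounded out of existence.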

I rolled a custom HiZ shader similar to SPD with some extra subgroup shenanigans because why not (SubgroupShuffleXor with 4 and 8).

Second phase

In this pass we submit for rendering any object which became visible this frame, i.e. the visibility bit was not set, but it passed occlusion test now. Again, if camera did not change, and objects did not move, then nothing should be rendered here.

However, we still have to test every object, in order to update the visibility buffer for next frame. We don’t want visibility to remain sticky, unless we have dedicated proxy geometry to serve as occluders (might still be a thing if game needs to handle camera cuts without large jumps in rendering time).
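Stripped of all GPU details, the bookkeeping for the two phases looks roughly like this. It is a simplified sketch: visibility is kept as one bool per meshlet rather than packed bits, and passes_occlusion_test is a placeholder for the real culling against the HiZ built after phase 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct TwoPhaseResult
{
    std::vector<uint32_t> phase1; // Drawn before the HiZ is built.
    std::vector<uint32_t> phase2; // Newly visible, drawn after.
};

// visible holds last frame's result and is updated in place for next frame.
TwoPhaseResult cull_two_phase(
    std::vector<bool> &visible,
    const std::function<bool(uint32_t)> &passes_occlusion_test)
{
    TwoPhaseResult result;

    // Phase 1: draw whatever was visible last frame.
    for (std::size_t i = 0; i < visible.size(); i++)
        if (visible[i])
            result.phase1.push_back(static_cast<uint32_t>(i));

    // (The HiZ would be built from phase 1 depth here.)

    // Phase 2: test everything, draw only what became visible,
    // and overwrite the state so visibility does not stay sticky.
    for (std::size_t i = 0; i < visible.size(); i++)
    {
        bool now_visible = passes_occlusion_test(static_cast<uint32_t>(i));
        if (now_visible && !visible[i])
            result.phase2.push_back(static_cast<uint32_t>(i));
        visible[i] = now_visible;
    }

    return result;
}
```

If nothing moved, phase 2 draws nothing and the visibility state comes out unchanged.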

In this pass we can cull meshlet bounds against the HiZ.

Because I cannot be arsed to make a fancy SVG for this, the math to compute a tight AABB bound for a sphere is straightforward once the geometry is understood.

The gist is to figure out the angle, then rotate the (X, W) vector with positive and negative angles. X / W becomes the projected lower or upper bound. Y bounds are computed separately.

vec2 project_sphere_flat(float view_xy, float view_z, float radius)
{
    float len = length(vec2(view_xy, view_z));
    float sin_xy = radius / len;

    float cos_xy = sqrt(1.0 - sin_xy * sin_xy);
    vec2 rot_lo = mat2(cos_xy, sin_xy, -sin_xy, cos_xy) *
      vec2(view_xy, view_z);
    vec2 rot_hi = mat2(cos_xy, -sin_xy, +sin_xy, cos_xy) *
      vec2(view_xy, view_z);

    return vec2(rot_lo.x / rot_lo.y, rot_hi.x / rot_hi.y);
}
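For a quick numeric check, here is a direct C++ port with the column-major mat2 multiplies expanded by hand. The Bounds struct is only there for the sketch, and like the GLSL it assumes the camera is outside the sphere (len > radius).

```cpp
#include <cmath>

struct Bounds
{
    float lo;
    float hi;
};

// Port of the GLSL above. Assumes len > radius, i.e. camera outside sphere.
Bounds project_sphere_flat(float view_xy, float view_z, float radius)
{
    float len = std::sqrt(view_xy * view_xy + view_z * view_z);
    float sin_xy = radius / len;
    float cos_xy = std::sqrt(1.0f - sin_xy * sin_xy);

    // mat2(cos, sin, -sin, cos) * v, columns are (cos, sin) and (-sin, cos).
    float lo_x = cos_xy * view_xy - sin_xy * view_z;
    float lo_w = sin_xy * view_xy + cos_xy * view_z;

    // mat2(cos, -sin, sin, cos) * v, the opposite rotation.
    float hi_x = cos_xy * view_xy + sin_xy * view_z;
    float hi_w = -sin_xy * view_xy + cos_xy * view_z;

    return { lo_x / lo_w, hi_x / hi_w };
}
```

A unit sphere centered at (0, 2) subtends 30 degrees to either side, so the bounds come out as ±tan(30°) ≈ ±0.577. The naive (x ± r) / z would give ±0.5, which is too tight and would clip the silhouette.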

The math is done in view space where the sphere is still a sphere, which is then scaled to window coordinates afterwards. To make the math easier to work with, I use a modified view space in this code where +Y is down and +Z is in view direction.

bool hiz_cull(vec2 view_range_x, vec2 view_range_y, float closest_z)
// view_range_x: .x -> lower bound, .y -> upper bound
// view_range_y: same
// closest_z: linear depth. ViewZ - Radius for a sphere

First, convert to integer coordinates.

// Viewport scale first applies any projection scale in X/Y
// (without Y flip).
// The scale also does viewport size / 2 and then
// offsets into integer window coordinates.

vec2 range_x = view_range_x *
  frustum.viewport_scale_bias.x +
  frustum.viewport_scale_bias.z;
vec2 range_y = view_range_y *
  frustum.viewport_scale_bias.y +
  frustum.viewport_scale_bias.w;

ivec2 ix = ivec2(range_x);
ivec2 iy = ivec2(range_y);

ix.x = clamp(ix.x, 0, frustum.hiz_resolution.x - 1);
ix.y = clamp(ix.y, ix.x, frustum.hiz_resolution.x - 1);
iy.x = clamp(iy.x, 0, frustum.hiz_resolution.y - 1);
iy.y = clamp(iy.y, iy.x, frustum.hiz_resolution.y - 1);
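One plausible way to build viewport_scale_bias on the CPU, going only by the comments above: fold the projection scale and half the resolution into the scale, and use half the resolution as the bias. The helper names and parameter meanings are assumptions, not the actual Granite code.

```cpp
struct ViewportScaleBias
{
    float x, y, z, w; // x/y: scale, z/w: bias.
};

// Hypothetical helper. proj_scale_x/y are the X/Y scale terms of the
// projection (no Y flip), width/height the HiZ LOD 0 resolution.
ViewportScaleBias make_viewport_scale_bias(float proj_scale_x, float proj_scale_y,
                                           float width, float height)
{
    return { proj_scale_x * 0.5f * width,
             proj_scale_y * 0.5f * height,
             0.5f * width,
             0.5f * height };
}

// Maps a projected view-space ratio (X / Z or Y / Z) to window coordinates.
float to_window(float view_ratio, float scale, float bias)
{
    return view_ratio * scale + bias;
}
```

With a projection scale of 1 and a 1920 wide target, ratios of -1, 0 and +1 land at 0, 960 and 1920 before clamping.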

Figure out a LOD where we only have to sample a 2×2 footprint. findMSB to the rescue.

int max_delta = max(ix.y - ix.x, iy.y - iy.x);
int lod = min(findMSB(max_delta - 1) + 1, frustum.hiz_max_lod);
ivec2 lod_max_coord = max(frustum.hiz_resolution >> lod, ivec2(1)) - 1;

// Clamp to size of the actual LOD.
ix = min(ix >> lod, lod_max_coord.xx);
iy = min(iy >> lod, lod_max_coord.yy);
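The LOD pick is easier to trust with numbers plugged in, so here is a small C++ port. find_msb emulates GLSL findMSB() for signed integers (which returns -1 for both 0 and -1), and select_hiz_lod is a name invented for this sketch.

```cpp
#include <algorithm>

// GLSL findMSB() for signed ints: index of the most significant 1 bit for
// positive values, of the most significant 0 bit for negative values,
// and -1 for 0 and -1.
int find_msb(int v)
{
    unsigned int u = v < 0 ? ~static_cast<unsigned int>(v)
                           : static_cast<unsigned int>(v);
    int msb = -1;
    while (u != 0u)
    {
        u >>= 1;
        msb++;
    }
    return msb;
}

// Smallest LOD where the integer range spans at most two texels per axis.
int select_hiz_lod(int lo_x, int hi_x, int lo_y, int hi_y, int hiz_max_lod)
{
    int max_delta = std::max(hi_x - lo_x, hi_y - lo_y);
    return std::min(find_msb(max_delta - 1) + 1, hiz_max_lod);
}
```

For example, an X range of 100 to 105 has a delta of 5, which selects LOD 3. There, 100 >> 3 = 12 and 105 >> 3 = 13, so the range covers two adjacent texels and a 2×2 footprint is enough.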

And finally, sample:

ivec2 hiz_coord = ivec2(ix.x, iy.x);

float d = texelFetch(uHiZDepth, hiz_coord, lod).x;
bool nx = ix.y != ix.x;
bool ny = iy.y != iy.x;

if (nx)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 0)).x);

if (ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(0, 1)).x);

if (nx && ny)
    d = max(d, texelFetchOffset(uHiZDepth,
      hiz_coord, lod,
      ivec2(1, 1)).x);

return closest_z < d;

Trying to get up-close, it’s quite effective.

Without culling:

With two-phase:

As the culling becomes more extreme, GPU go brrrrr. Mostly just bound on HiZ pass and culling passes now which can probably be tuned a lot more.

Conclusion

I’ve spent way too much time on this now, and I just need to stop endlessly tuning various parameters. This is the true curse of mesh shaders, there’s always something to tweak. Given the performance I’m getting, I can call this a success, even if there might be some wins left on the table by tweaking some more. Now I just need to take a long break from mesh shaders before I actually rewrite the renderer to use this new code … And maybe one day I can even think about how to deal with LODs, then I would truly have Nanite at home!

The “compression” format ended up being something that can barely be called a compression format. To chase decode performance of tens of billions of primitives per second though, I suppose that’s just how it is.