Why coverage based pixel art filtering is terrible

I recently made a blog talking about how to properly filter pixel art in 3D, where I chose a cosine kernel as our low-pass. Please read through this to get a better idea of what I’m discussing here.

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

In this blog, I would like to analyze some alternative filters which we could potentially use, and try to explore the quality of the other methods from a signal processing standpoint.

“Ideal” sinc

The theoretical low-pass filter. Just as a reference.

Cosine

The filter kernel I chose as my filter.

d/dx (smoothstep)

I mentioned smoothstep in the previous post. smoothstep was mentioned as the integral of the filter kernel in many implementations, so to get the underlying filter to analyze, we take the derivative of smoothstep. Smoothstep is defined as 3x^2 – 2x^3. Its derivative is 6x – 6x^2, and if we normalize it to [-1, 1] range instead of [0, 1], we get:

Triangle

The LINEAR filter. This is essentially the case we would get if we upscale pixel art a lot with NEAREST, then filter the result.

Rect

This is the worst low-pass filter ever as I mentioned in the last post. Interestingly, this is the filter we would use for coverage-based pixel art filtering if the texture and screen were aligned. (I don’t expect the result for rotated textures to be much different, but the rigorous analysis becomes a bit too hard for me). If our input signal is turned into a series of rects (pixel extent) ahead of time (which is what we do in our implementation), our coverage filter is basically the equivalent of convolving a rect against that signal. Basically, rect is the filter we would end up with if we cranked SSAA up to infinity.

One filter which implements this is (from what I can tell): https://github.com/libretro/slang-shaders/blob/master/retro/shaders/pixellate.slang

Windowed sinc (Lanczos2)

Lanczos2 is a fairly popular filter for image scaling. I’ve included it here to have some reference on how these filters behave in frequency space.

Filter response

Now, it’s time to drone on about the results. All of this on frequency response, stop bands, blah blah, is probably going to be too pedantic for purposes of graphics, but we can, so why not.

So, first we see the response around 0.5 in this graph. This is the Nyquist frequency. Our ideal sinc (or well – to be pedantic – a 16 lobe Hamming windowed sinc) falls immediately after 0.5. This is a completely impractical filter obviously.

Another consideration is how fast we reach the “stop band”, i.e., when the filter response has rolled off completely. Cosine and smoothstep are better here, while Lanczos2 is a bit slower to roll off. This is because of the window function. Applying a window function is the same as “blurring” frequency space, so the perfect sinc is smearing out too far. Linear and Rect don’t hit their first low-point until twice the Nyquist :(.

As for stop-band attenuation, windowed sinc is the clear winner, but again, this filter is impractical for our purposes (analytic windowed sinc integral with negative lobes breaking bilinear optimization? just no). Cosine seems to have a slight edge over smoothstep, with about 2 dB improvement. 2 dB is quite significant (For reference, halving root-mean-square error is ~6 dB). Triangle kernel (LINEAR filter) is about 3 dB worse again, and rect sniffs glue alone in the corner. Triangle and rect look very similar because a triangle kernel is just two rects convolved with each other, and thus its frequency response is squared (the more you know).

Conclusion

Cosine is a solid filter for what it’s trying to do, given our constraints on needing to analytically integrate whatever filter kernel we choose. The quality should be very similar to smoothstep, but I think I’ll consider cosine the winner here with 2 dB better stop-band attenuation. It also has slightly better response in the pass-band, which will preserve sharpness slightly better. This should make sense, because the cosine kernel has a higher peak and rolls of faster to zero than the smoothstep kernel. This frequency response can be expanded or contracted based on how we multiply the “d” parameter. Multiply it by 2 for example, and we have the equivalent of LOD bias +1.

Basically, coverage/area based filtering is pretty terrible from a signal processing standpoint. Rect is a terrible filter, and you should avoid it if you can.

Reference script

I captured this with Octave using this script for reference:

cosine_kernel = pi/4 * cos(pi/2 * (-1024:1024) / 1024);
windowed_sinc = sinc((-2048:2048) / 1024) .* sinc((-2048:2048) / 2048);
perfect_windowed_sinc = sinc((-64*1024:64*1024) / 1024) .* hamming(128 * 1024 + 1)';
rect_kernel = ones(1, 1024);
linear_kernel = conv(rect_kernel, rect_kernel) / 1024;

ramp = (0:2048) / 2048;
smoothstep_kernel = 3 * ramp - 3 * (ramp .* ramp);

cosine_fft = fft(cosine_kernel, 1024 * 1024)(1 : 16 * 1024);
windowed_fft = fft(windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
perfect_windowed_fft = fft(perfect_windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
rect_fft = fft(rect_kernel, 1024 * 1024)(1 : 16 * 1024);
linear_fft = fft(linear_kernel, 1024 * 1024)(1 : 16 * 1024);
smoothstep_fft = fft(smoothstep_kernel, 1024 * 1024)(1 : 16 * 1024);

offset = 20.0 * log10(1024);

cosine_fft = 20.0 * log10(abs(cosine_fft)) - offset;
windowed_fft = 20.0 * log10(abs(windowed_fft)) - offset;
perfect_windowed_fft = 20.0 * log10(abs(perfect_windowed_fft)) - offset;
rect_fft = 20.0 * log10(abs(rect_fft)) - offset;
linear_fft = 20.0 * log10(abs(linear_fft)) - offset;
smoothstep_fft = 20.0 * log10(abs(smoothstep_fft)) - offset;

x = linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 16 * 1024);

figure
plot(x, cosine_fft, 'r', x, windowed_fft, 'g', x, rect_fft, 'b', x, linear_fft, 'b--', x, smoothstep_fft, 'k', x, perfect_windowed_fft, 'c-')
legend('cosine', 'lanczos2', 'rect', 'triangle', 'smoothstep', 'ideal sinc')
xlabel('frequency / sampling rate')
ylabel('Filter response (dB)')
xticks(linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 65))

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

Recently, I’ve been playing Octopath Traveler, and I’m very disappointed with the poor texture filtering seen in this game. It is PS1-level, with severe shimmering artifacts, which ruins the nice pixel art. Tastefully merging retro pixel art with 3D environment is a very cool aesthetic to me, so I want it to look right. Taking a signal processing approach to the problem, I wanted to see if I could solve the issue in a mathematically sound way.

Correctly filtering pixel art is a challenge, especially in a 3D environment, because none of the GPU hardware assisted filtering methods work well. We have two GPU filters available to us:

  • NEAREST/POINT: Razor sharp filtering, but exhibits severe aliasing artifacts.
  • LINEAR: Smooth, but way too blurry.

Our goal is to preserve the pixellated nature of the textures, yet have an alias-free result. A classic solution for this problem is to pre-scale the texture by some integer multiple with NEAREST, then sample this texture with LINEAR filtering. While this works reasonably well, it costs a lot of extra VRAM and bandwidth to sample huge textures which mostly contain duplicate pixels. Duplicating pixels also puts a maximum level on how sharp the pixels can become. On top of that, straight LINEAR still has some level of aliasing, as LINEAR is not a very good low-pass filter with its triangular kernel.

A flawed assumption of texture sampling using LINEAR is also that each sample point in the texture is basically a dirac delta function, i.e. the sample value only exists right at the texel center, and thus has an infinitesimal area. The pixel art assumption is that each texel has proper area, the sample point exists across the entire texel. For bandlimited signals, i.e. “natural” images, the dirac delta assumption is how you sample, but pixel art is not a sampled bandlimited signal.

Filtering textures in 3D means we need to work well in many different scenarios:

  • Magnification
  • Minification
  • Rotation
  • Scale
  • Anisotropy / uneven scaling (look at the texture at an angle)

For minification, we are going to rely on the existing texture filtering hardware to do mip-mapping and anisotropic filtering for us. For minification, we are going to assume it is a good enough approximation to the true solution. What we want to focus on is magnification, as this is where the blurring issues with LINEAR become obvious.

Aliasing from hell

It’s even worse in motion.

Zoomed in 4x with pixel duplication.

Blurry mess

Straight LINEAR, in linear space.

Zoomed in 4x with pixel duplication.

Smooth and sharp

Here’s my method.

Zoomed in 4x with pixel duplication.

Bandlimited sampling

To understand the derivation, we must understand what bandlimited sampling means. All signals (audio or images) have a frequency response. Nyquist’s theorem tells us that if we sample at some frequency F, there must be no frequency component over F / 2 in the signal, or it aliases. (F / 2 is also called the Nyquist frequency.) Our input signal, i.e. a series of dirac delta functions, has an infinite frequency response. Therefore, we must convolve a low-pass filter on that signal to reject as much energy above the Nyquist frequency as possible before we sample it again. In LINEAR sampling, a triangular kernel is applied, which acts as a low-pass filter, but it is not very good at removing high frequencies. NEAREST is effectively just applying a rect filter, which is the worst low-pass filter you can have.

Triangle kernel:

Rect kernel:

For completeness, there exists a perfect low-pass filter, sinc. However, it is purely theoretical as its width is infinitely large, a common approach is to limit the extent of the sinc using a cleverly chosen window function, but now we lose the perfect low-pass. We can get as close as we wish to the perfect low pass by using a larger windowing function, but it’s rarely practical to go this far except for image/video resizing, where we are able to use separable filtering in multiple passes with precomputed filter coefficients. We will not have that luxury for texture filtering in a 3D setting.

We also get negative filter coefficients, which usually manifests itself as “ringing” or “haloing” when filtering graphics. This is something we would like to avoid when filtering pixel art as well. We also need to avoid negative values in the kernel because we will need it for a crucial optimization later.

We will just need to come to terms that we can never get perfect bandlimiting, and we will need to be practical about choosing our filter, hence pseudo-bandlimited.

Picking a practical filter kernel

To sample pixel art, we actually need to apply two filters at once, which complicates things. First, we need to apply a rect kernel (NEAREST filter) to give our texel some proper area. We then apply a low-pass filter which will aim to band-limit the rect. Applying two filters after each other is the same as convolving them together. Convolution is an integral, so now we have some constraints on our filter kernel, because it needs to be cheap to analytically integrate. LUTs will be too costly and annoying to use.

A key point is that the low-pass filter kernel needs to adapt to our sampling rate of the texture. Basically, we need to be band-limited in screen space, not texture space. If we have a filter kernel

we should be able to change it to

where d is the screenspace partial derivative in either X or Y.

// Get derivatives in texel space.
// Need a non-zero derivative since we need to divide it later.
vec2 d = max(fwidth(uv) * texture_size, 1.0 / 256.0);

For our filter kernel, we could pick between:

  • Triangle (piece-wise integration, annoying)
  • Rect (lol)
  • Polynomial (Easy to integrate)
  • Cosine (Easy to integrate)
  • Gaussian (No analytic integration possible)
  • Windowed sinc (negative lobes and super difficult integral, no thanks)

I chose a cosine kernel. If we think about Taylor expansions, a polynomial kernel and cosine is basically in the same ballpark. Cosine is not a perfect low-pass by any means, but it’s pretty good for our purpose here.

The cosine kernel will be

The normalization factor is to make sure the area of the kernel is 1. The rect kernel is

To get our filter with a given d, we will convolve.

The integration boundaries need to be limited to the range of W and R. If we solve this, we get

The brackets in the integration range denote a signed saturate, i.e. clamp(x, -1, 1).

Here’s how the kernel look for different d values:

d = 1:

d = 0.5:

d = 0.25:

d = 0.1:

Nice, so as we can see, the filter kernel starts off fairly smooth, but sharpens into a smoothly rolling off rect as we sample with a higher and higher resolution.

The 2×2 filter (d <= 0.5)

From the filter kernel above, we can see that if d <= 0.5 (LOD = -1), the extent of the filter kernel is 1 pixel. This means we can implement the kernel by a simple 2×2 kernel, or as we shall see, a single bilinear filter tap. For the implementation, we are going to assume we are sampling between two texels, where x is the phase, in range 0 to 1.

We can implement this as a simple lerp. Instead of evaluating two sines, we’re just going for one which implements the transition from 0 to 1.

This is very similar to a smoothstep, which explains why smoothstep techniques for this kind of filtering works so well. sin might be rather expensive on your GPU targets, so instead we can use a simple Taylor expansion to get a very good approximation to the sine

In fact, if we use smoothstep, we would get a filter kernel which is the derivative of smoothstep. Now, if we compute the result for both and X and Y dimensions, we will have lerp factors for both dimensions. Since our filter weights are all positive, and our kernel is separable we can make use of bilinear filtering to get the correct result.

// Get base pixel and phase, range [0, 1).
vec2 pixel = uv * texture_size - 0.5;
vec2 base_pixel = floor(pixel);
vec2 phase = pixel - base_pixel;

// We can resolve the filter by just sampling a single 2x2 block.

mediump vec2 shift = 0.5 + 0.5 * taylor_sin(PI_half * clamp((phase - 0.5) / min(d, vec2(0.5)), -1.0, 1.0));
vec2 filter_uv = (base_pixel + 0.5 + shift) * inverse_texture_size;

As d increases, we should no longer use our filter since we can only support up to d = 0.5, so we implement something ala trilinear filtering where we lerp between our ideal LOD = -1, and LOD = 0, where we fully sample with a normal trilinear/anisotropy filter. This implementation will only require two bilinear texture lookups, one for the d <= 0.5 sampling, and one for the d > 0.5 sampling. These two results need to be lerped.

The 4×4 filter (d <= 1.5)

Now, I’m heading into more interesting territory. While the good old smoothstep with fwidth is a well known hack, it cannot deal with larger kernels than 2×2. Using our success with replacing 4 filter taps with a single bilinear we’re going to continue implementing a 4×4 kernel with 4 bilinear taps. If we support a 4×4 filter kernel we can have our nice filter even for some slight minification as well. It’s going to require a lot of ALU, so we can split the implementation into the case where d <= 0.5, use the single tap, d <= 1.5, use this 4 tap method, otherwise, just sample normally. If we remove this case, we effectively have a “speed hack” mode for slower devices.

The filter coefficients for each element in the 4×4 grid will be

 

u and v are the phase variables in range [0, 1] as mentioned earlier. Since the filter is separable, we can compute X and Y separately and perform an outer product to complete the kernel.

To evaluate four F values, we only need to compute 5 sines (or Taylor approximations), not 8, because F(u) shares a value with F(u + 1), and so on. For d = 1.5, the filter kernel for one dimension will look like

Since we compute X and Y separate, we end up with a cost of 10 Taylor approximations per pixel, ouch, but GPUs crunch this like butter.

Now, each 2×2 block of this kernel can be replaced with one bilinear lookup and a weight.

// Given weights, compute a bilinear filter which implements the weight.
// All weights are known to be non-negative, and separable.
mediump vec3 compute_uv_phase_weight(mediump vec2 weights_u, mediump vec2 weights_v)
{
	// The sum of a bilinear sample has combined weight of 1, we will need to adjust the resulting sample
	// to match our actual weight sum.
	mediump float w = dot(weights_u.xyxy, weights_v.xxyy);
	mediump float x = weights_u.y / max(weights_u.x + weights_u.y, 0.001);
	mediump float y = weights_v.y / max(weights_v.x + weights_v.y, 0.001);
	return vec3(x, y, w);
}

Is 4×4 worth it?

The difference between 4×4 and 2×2 is very subtle and hard to show with still images. The difference manifests itself around LOD -0.5 to 0.5, i.e. close to 1:1 sampling. 2×2 is sharper with more aliasing while the 4×4 kernel remains a little blurrier but alias-free. For most use cases, I expect the 2×2-only method to be good enough, i.e. the good old “smoothstep/sine & fwidth”, but now I’ve come to that conclusion through math and not random graphics hackery.

Performance

A quick and dirty check on RX 470 @ 1440p

  • Full screen with all pixels hitting 4×4 sampling case: 441.44 µs
  • Full screen with all pixels hitting 2×2 sampling case: 207.68 µs

Anisotropy

We naturally get support for anisotropy because we compute different filters for X and Y dimensions. However, once max(d.x, d.y) is too large, we must fall back to normal sampling. Either a very blurry trilinear or actual anisotropic filtering in hardware.

Here’s a shot where we gradually go from crisp texels to normal, very blurry trilinear.

The green region is where d <= 0.5, 1 bilinear fetch, blue is where d < 0.5, d <= 1.5, 4 taps, red region is regular trilinear fallback. The blue region will fade towards the normal trilinear fallback to avoid any abrupt artifacts.

Code

The complete shader implementation (reuseable GLSL header) can be found in: https://github.com/Themaister/Granite/blob/master/assets/shaders/inc/bandlimited_pixel_filter.h
A test project to play around with the filter: https://github.com/Themaister/Granite/blob/master/tests/bandlimited_pixel_test.cpp

Go forth and filter your pixels correctly.

 

Improving VK_KHR_display in Mesa – or, let’s make DRM better!

VK_KHR_display was recently added to Mesa, and I was very excited. Finally, we could get direct-to-display, lowest possible latency in Vulkan for programs like RetroArch (VK_KHR_display is not just for VR!). We have had support for KMS/GBM for a long time, and it works really well when you want to get the most direct access to the display at the cost of convenience. Since we have full control of how page flips happen we avoid many of the pitfalls with compositors. When they work well, you should get optimal conditions in full-screen, but it’s very hard to guarantee. We have no control if X or Wayland choose to go into a direct-to-display mode without compositors when we fullscreen the window surface. VK_KHR_display or the EGL equivalent is also the only realistic way to get good display performance on more embedded Linux systems which are fairly popular for emulation boxes. There’s no need for X or Wayland getting in the way when you have a dedicated, 10-foot UI setup.

We can also control how much total buffering we want. Sometimes we want 2 buffers where CPU emulation + GPU rendering needs to complete in 16.66 ms (great for retro console emulation), and sometimes we want 3 buffers where CPU, GPU and display can overlap (usually the case when running GPU-intensive emulation). Controlling this is almost impossible to do reliably with compositors. We synchronize vkAcquireNextImageKHR using a VkFence instead of a VkSemaphore. Most implementations do not actually support async AcquireNextImageKHR, and certainly not Mesa’s implementation of VK_KHR_display.

In EGL, we made use of GBM to allocate DRM buffers directly and pumped through our own page flips using the super low-level DRM API. In Vulkan however, we cannot go that low-level as we go through VK_KHR_display. There are tradeoffs. Nvidia for example supports VK_KHR_display on Linux, but not GBM. VK_KHR_display is a cleaner abstraction than raw DRM.

I had some issues with Mesa’s implementation of VK_KHR_display however, so I tried to fix them.

Lack of MAILBOX or IMMEDIATE present modes

This is the first glaring omission. In emulators, a key feature is being able to fast-forward. Usually, we also want to fast-forward completely seamlessly. Are you’re playing an old console RPG and want to make random encounters less dull? Just hit fast forward and blast through. Unfortunately, the Mesa implementation of VK_KHR_display does not support the display modes which facilitate this use case. In GL, we trivially support his by hitting glXSwapInterval(0) or 1 and it should “just work”.

MAILBOX is preferred because it is tear-free, but IMMEDIATE is a good fallback too.

So, I tried patching^H^H^Hhacking this up in Mesa.

MAILBOX

Basically, the FIFO implementation revolves around using drmModePageFlip. This queues up a framebuffer to be display on the next VBlank, if the buffer DRM is rendering to has completed its rendering. Apparently, the kernel tracks images on DRM, and whatever semaphores you pass in to vkQueuePresent don’t seem to matter at all.

Now, one glaring problem with drmModePageFlip is that once you have queued up a flip, you cannot queue up another one until it has completed on the next vblank. The page flip itself can be polled through the DRM FD. The page flip event will say when the page flip happened, and which image was flipped in.

After the page flip, the VK_KHR_display has a thread which checks if there are any queued up frames which can be setup with drmModePageFlip, and that keeps the FIFO queue going. If there are multiple frames queued up, the first image queued by the application is selected for the next page flip. So, I implemented MAILBOX using a really basic idea: When queuing up for a new page flip, pick the latest image to be queued by the application. Other frames which were queued up were transitioned back into the IDLE state, because they would never end up being displayed anyways. This worked wonderfully for my use case. Hundreds of FPS without tearing achieved.

I also implemented a better AcquireNextImageKHR. If there are multiple IDLE state images, I picked the image which was presented earliest. This way, you get a round-robin-like model, whereas the old model just picked the first IDLE frame. This might work just fine for FIFO, but not for MAILBOX.

Another thing I did was to force at least 4 images for MAILBOX. Unlike the Wayland implementation, I didn’t force a 4-deep swapchain in minImageCount. It’s perfectly fine for a swapchain to return more images than what is being requested. This is probably an API wart. It is a bit awkward to not have different minImageCount queries per present mode …

IMMEDIATE

For this mode, you can have tearing, and I found a particular flag in the DRM API. You can pass down DRM_MODE_PAGE_FLIP_ASYNC flag to drmModePageFlip, and the page flip will just happen when it happens. Unfortunately, this only worked as expected on AMD. Apparently, Intel cannot support this flip mode.

Lack of seamless transition between present modes

One annoying aspect of Vulkan WSI is that you cannot just change the present interval on the fly. Once you have a swapchain you cannot just call glXSwapInterval or similar to get vsync or no vsync. What you need to do in this case is to create a new VkSwapchainKHR, setting oldSwapchain to hopefully reuse the images in the old swapchain, and then, delete the old swapchain. Assuming a decent implementation, you should be getting a seamless transition over to the new present mode.

Unfortunately, this did not work well. First of all, the VK_KHR_display implementation did not bother with oldSwapchain. When deleting the old swapchain after creating a new one, the screen would black out for a while, before starting up again, usually 3-4 seconds, kinda like doing a full mode change. I tracked this down to drmModeRmFB which deletes the old framebuffer references. Apparently, if you delete a framebuffer which is being displayed in DRM, the screen blacks out. This kind of makes sense, as there is nothing to display anymore. However, for toggling vsync-state this is just unacceptable.

The patching of this got a bit hairier.

I ended up with a scheme which can “steal” images from oldSwapchain assuming the formats and dimensions match up. I tried to pilfer the displayed image especially, because its framebuffer cannot be allowed to die or we get a black screen. (Note to self, there might be a ton of edge cases here with synchronization …) To make sure I don’t get stale page flip events, I block to make sure any pending flip has completed before continuing. The reason for this is to avoid a race condition where I end up freeing the oldSwapchain before the page flip handler. The page flip handler has a reference to the swapchain it came from. After being pilfered from, the old swapchain is considered dead, and I return VK_ERROR_OUT_OF_DATE_KHR on any following requests to acquire from it.

This combined with my crude MAILBOX implementation allowed me to get seamless transitions between tear-free fast-forward and butter smooth vsync gameplay with very low latency.

Doing MAILBOX properly in DRM

Unfortunately, the MAILBOX implementation I made is just a crude hack, and is not a valid implementation. Effectively, when you queue up a page flip in DRM, you’re expecting what is going to be the next image to display at the next vblank, long before that vblank actually occurs. This increases latency by quite a lot, but you also risk overshooting quite a lot, where the GPU cannot complete the frame in time, and you cannot flip anything on the next vblank. This is just dumb.

True MAILBOX needs to be handled in the VBlank handler inside DRM somehow, and DRM has no API available which can support this present mode. The way I expect this to work is:

  • drmModePageFlip can be called multiple times, up to N times. If you pass down a flag called say, DRM_MODE_PAGE_FLIP_REPLACE. This would allow multiple pending page flips to be in flight, and not return EBUSY if a pending flip is active.
  • In the vblank handler, the latest entry in the queue, whose rendering is also complete on the GPU, is selected. This frame is programmed into the display controller.
  • The earlier frames in the queue which were available to be presented, but were not selected for page flip, will be reported through a “discard” event, similar to the PAGE_FLIP event. This allows the WSI implementation to release the images to the application.
  • The current page flip callback will report the actual image which was selected for presentation.

Some further improvements to this scheme could be that discard events are returned as early as possible. If a swapchain image completes in the middle of the frame, it knows that it can “discard” earlier completed frames ahead of time. This way, we avoid stalling the GPU at DisplayRate * (SwapchainImages – 1) frame rates, which can happen if we render all available swapchain images, and we need to wait for next vblank to discard all frames.

This is way beyond me unfortunately, but it sounds like a very valuable thing to have. It seems like I will need to maintain my own hacky patch set on top of Mesa for now. 🙂