An unusual recompiler experiment – MIPS to LLVM IR – Part 1

While not graphics per se, recompilers in emulation are an interesting topic. They are notoriously complex and difficult to implement, but promise incredible performance improvements. While emulating weaker hardware is served well by interpreters, more powerful hardware needs recompilers to shine. Starting with the PlayStation 1 era of hardware, recompilers are common, and beyond that, recompilers are required to have any hope of reaching real-time performance.

This is a multi-part post. I’m not sure how many posts it will take, but there’s a lot of stuff to write about. In this round, I’ll introduce the experiment, we’ll parse a MIPS ELF file, and make it ready for execution in our emulated Linux environment.

What is the goal of a recompiler?

The main purpose of a recompiler is to look at the foreign machine code an application is running, and convert it to equivalent machine code for the hardware you are running on.

An interpreter, by contrast, looks at instructions one at a time and performs the action each one requires, which wastes a lot of work decoding instructions and branching dynamically to whatever code snippet needs to execute.
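
To illustrate why, here's a minimal decode-and-dispatch core. The opcode encoding is real MIPS, but the rest is a toy sketch of mine, not actual emulator code:

#include <cstdint>

struct CPUState
{
	uint32_t pc = 0;
	uint32_t regs[32] = {};
};

// Every single instruction pays for a load, a decode, and a
// hard-to-predict indirect branch.
void step(CPUState &s, uint32_t instr)
{
	switch (instr >> 26) // MIPS keeps the primary opcode in the top 6 bits
	{
	case 0x09: // ADDIU rt, rs, imm (sign-extended immediate)
		s.regs[(instr >> 16) & 31] = s.regs[(instr >> 21) & 31] + int16_t(instr & 0xffff);
		break;
	// ... and dozens more cases ...
	default:
		break;
	}
	s.pc += 4;
}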

Isn’t this what Java and .NET runtimes do?

Basically, yes. Just replace “foreign machine code” with “byte code”.

The portability problem

Since a recompiler needs to target the raw machine code of the hardware you’re running on, this is a serious hazard for portability. Typical recompilers aiming to be portable need to write backends for all kinds of machines you want to run on, and deal with operating-system specific ABIs along the way. Today, the most relevant targets would be:

  • x86
  • x86-64/amd64
  • ARMv7 (older mobile devices)
  • AArch64 (newer mobile devices)

To target something exotic, like homebrew on consoles, you might have recompilers targeting PowerPC, which was also very popular for two console generations.

This is too much, so I’ve been interested in the prospect of leveraging LLVM instead. If I can just target LLVM IR, LLVM can generate machine code at runtime (ORC JIT) for me, targeting any of these architectures. It is also a good excuse to learn some LLVM internals. We’ll be generating LLVM code with the C++ API and passing it along to the JIT runtime to generate machine code on the fly.
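
To give a flavor of what that looks like, here is a minimal sketch using the LLJIT flavor of ORC. The ORC interfaces have churned between LLVM releases, so treat this as approximate rather than the exact code we'll end up with:

#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Support/TargetSelect.h"

using namespace llvm;
using namespace llvm::orc;

int main()
{
	InitializeNativeTarget();
	InitializeNativeTargetAsmPrinter();

	auto context = std::make_unique<LLVMContext>();
	auto module = std::make_unique<Module>("mips", *context);

	// Build: int add_one(int x) { return x + 1; }
	IRBuilder<> builder(*context);
	auto *func_type = FunctionType::get(builder.getInt32Ty(), { builder.getInt32Ty() }, false);
	auto *func = Function::Create(func_type, Function::ExternalLinkage, "add_one", module.get());
	builder.SetInsertPoint(BasicBlock::Create(*context, "entry", func));
	builder.CreateRet(builder.CreateAdd(&*func->arg_begin(), builder.getInt32(1)));

	// Hand the module to ORC, then pull out a callable native pointer.
	auto jit = cantFail(LLJITBuilder().create());
	cantFail(jit->addIRModule(ThreadSafeModule(std::move(module), std::move(context))));
	auto *add_one = reinterpret_cast<int (*)(int)>(cantFail(jit->lookup("add_one")).getAddress());
	return add_one(41); // 42
}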

Ahead-of-time recompilation and re-targeting?

If we target LLVM, we could in theory dump all LLVM IR code we have encountered to disk, optimize the code more aggressively (maybe some LTO), and build everything into a single dynamic library for any target we’d like. Once we have warmed up all the known code paths we should be able to avoid almost all run-time recompilation on subsequent runs. I wonder how practical it is, but it’s something I’d like to experiment with.

Patching speed critical sections with native C?

If we can dump LLVM IR to disk, it doesn’t seem too far-fetched to replace functions at known addresses with our own native versions written in C or something.

Recompilation efficiency

A big advantage of having a dedicated recompiler is how quickly the code can be generated as it barely needs to qualify as a compiler to get the job done. LLVM is a complex beast which needs to target all sorts of use cases, and using it as a just-in-time compiler like this is going to create some performance issues. It will be interesting to see how slow it is in practice.

Why MIPS?

MIPS is found in lots of gaming console hardware from the 90s and early 00s.

  • PlayStation 1
  • Nintendo 64 (CPU and RSP)
  • PlayStation 2
  • PlayStation Portable

MIPS is also a very simple ISA to understand and get started with. The core, original MIPS I ISA is probably the simplest practical ISA to learn, and it’s often part of the curriculum when studying for an electronics degree. As part of my undergrad digital design course, we hacked on a trivial MIPS CPU core in VHDL, which was very fun and educational.

What should we emulate?

I felt like doing something I could get results from quickly, so rather than emulating a full game console, I wanted to try pretending to be a MIPS Linux kernel, running fully fleshed-out MIPS ELF binaries compiled with GCC 8.2 backed by glibc and libstdc++. That way I could build my way up from running trivial test cases written in assembly without any run-time all the way to emulating complicated programs using STL and the modern C/C++ run-times. We can also get a much better picture of performance differences between a natively compiled C++ application compared to a recompiled one.

The MIPS we’re going to target is a 32-bit MIPS I, with whatever extra instructions we need when running real applications. GCC can target MIPS I with -march=mips1, but there are a few instructions from MIPS II and up which GCC will use anyway to run any glibc application, due to a couple of fundamental features:

  • RDHWR – Reads a hardware register, used to query thread local storage (TLS). To run any non-trivial C program, we need this because of errno, which is defined by POSIX to be thread local.
  • LL/SC – Load-linked and store-conditional serve as the backbone for atomic operations. glibc needs them in a few places. We’re not going to bother with threading, so we can trivially implement them as a normal load/store, as sketched below.
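
A sketch of how trivial the single-threaded LL/SC story gets (the names here are illustrative, not the real implementation):

#include <cstdint>

struct MIPSCore
{
	uint32_t load32(uint32_t addr);
	void store32(uint32_t addr, uint32_t value);

	// With only one guest thread, LL degenerates to a plain load ...
	uint32_t op_ll(uint32_t addr)
	{
		return load32(addr);
	}

	// ... and SC to a plain store which always "succeeds".
	uint32_t op_sc(uint32_t addr, uint32_t value)
	{
		store32(addr, value);
		return 1; // SC writes 1 to rt on success
	}
};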

As for endianness, MIPS can support both little and big-endian. Little is the easiest to start with since it matches our target hardware, but we’ll want to support big-endian as well, as it is the default, and the only MIPS endianness I know of in the wild.

User-space Linux application – ELF parsing and syscalls

Our program will need to read an ELF file, set up a virtual 32-bit address space and begin execution. We will need to implement the various Linux syscalls required to host a real application. Common Linux syscalls like:

  • open
  • read
  • write
  • exit
  • kill
  • llseek
  • mmap
  • munmap
  • brk

… and so on will be enough to get us past the printf stage. Fortunately, we do not have to concern ourselves with interrupts, CPU exceptions, memory-mapped I/O and other complicated things a game console emulator would have to deal with.
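
The syscall handler itself boils down to a big switch. A rough sketch follows; the syscall numbers are the real MIPS o32 ones (4000 plus the classic number), while guest_ptr() is a hypothetical helper translating a guest address to a host pointer:

#include <cstdint>
#include <unistd.h>

void *guest_ptr(uint32_t addr); // hypothetical address space translation

// On MIPS o32, the syscall number arrives in $v0, arguments in $a0-$a2,
// the result goes back in $v0, and $a3 flags an error.
uint32_t handle_syscall(uint32_t num, uint32_t a0, uint32_t a1, uint32_t a2)
{
	switch (num)
	{
	case 4001: // exit
		_exit(int(a0));
	case 4003: // read
		return uint32_t(read(int(a0), guest_ptr(a1), a2));
	case 4004: // write
		return uint32_t(write(int(a0), guest_ptr(a1), a2));
	default:
		return uint32_t(-1); // report failure for anything unimplemented
	}
}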

Step #0 – Getting a MIPS cross compiler

The first step is getting some code to compile to MIPS. This took a surprisingly long time, as the PKGBUILDs for Arch Linux did not work properly due to some weird incompatibilities. To cut a long story short, I made some PKGBUILDs for Arch Linux which worked for me to create fully functional cross-compilers for both little and big-endian 32-bit MIPS: https://github.com/Themaister/MIPS-Toolchain-PKGBUILD/

To build a little-endian binary once you build the toolchain:

mipsel-linux-gnu-gcc -o test test.c -static -march=mips1 -O2

Or, via a CMake toolchain file:

# CMake toolchain file for little endian.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR mipsel)
set(CMAKE_C_COMPILER mipsel-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER mipsel-linux-gnu-g++)
set(CMAKE_C_FLAGS "-mgp32 -march=mips1")
set(CMAKE_CXX_FLAGS "-mgp32 -march=mips1")
set(CMAKE_ASM_FLAGS "-mgp32 -march=mips1")
set(CMAKE_C_LINK_FLAGS "-static")
set(CMAKE_CXX_LINK_FLAGS "-static")

The -static flag is important as we do not want to have to deal with dynamically loading the C runtime in our ELF loader. Fortunately, static libgcc/libstdc++ seems to work just fine for our purpose here.

Step #1 – Parsing an ELF file

Before starting, I did not know much about ELF (Executable and Linkable Format). It is the executable format used on Linux and many other systems, and it is surprisingly simple when we just want to run a statically linked executable like this. It is helpful to use the readelf tool (mipsel-linux-gnu-readelf) to study the ELF, as this will help us understand what’s going on.

# mipsel-linux-gnu-readelf -h mips.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0x400630
  Start of program headers:          52 (bytes into file)
  Start of section headers:          628984 (bytes into file)
  Flags:                             0x1007, noreorder, pic, cpic, o32, mips1
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 37

These are the first few bytes of the binary. The structure is defined in the system header “elf.h” on Linux, which is very handy when we want to write a parser. There are a few things we care about here (see the sketch below):

  • We can verify that we have a MIPS binary, and in which endianness.
  • What the entry point address is, i.e. where we start executing.
  • How to find the program headers.
  • How to find the section headers.
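
A sketch of what the validation looks like with the structures from elf.h (my illustration, not the exact code):

#include <elf.h>
#include <cstring>

// Verify we were handed a 32-bit, little-endian, executable MIPS ELF.
bool validate_ehdr(const Elf32_Ehdr &ehdr)
{
	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
		return false;
	if (ehdr.e_ident[EI_CLASS] != ELFCLASS32)
		return false;
	if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB) // ELFDATA2MSB for big-endian
		return false;
	return ehdr.e_machine == EM_MIPS && ehdr.e_type == ET_EXEC;
}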

The program headers

The program headers contain information about where data is located in the binary and what to do with it. We only care about LOAD and TLS blobs.

# mipsel-linux-gnu-readelf -l mips.elf                                            

Elf file type is EXEC (Executable file)
Entry point 0x400630
There are 7 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  ABIFLAGS       0x000138 0x00400138 0x00400138 0x00018 0x00018 R   0x8
  REGINFO        0x000150 0x00400150 0x00400150 0x00018 0x00018 R   0x4
  LOAD           0x000000 0x00400000 0x00400000 0x813d4 0x813d4 R E 0x10000
  LOAD           0x082000 0x00492000 0x00492000 0x041e0 0x04f9c RW  0x10000
  NOTE           0x000114 0x00400114 0x00400114 0x00020 0x00020 R   0x4
  NOTE           0x000168 0x00400168 0x00400168 0x00024 0x00024 R   0x4
  TLS            0x0820b0 0x004920b0 0x004920b0 0x00010 0x00030 R   0x4

...

We see two LOAD blocks. One has the read/execute flags set; this is our code segment, which we load from Offset in the ELF into virtual address VirtAddr. The read/write one is the data segment, where global variables live. Note that FileSiz may be != MemSiz. This is to support zero-initialized global variables: they simply need memory allocated to them.

TLS is a bit special: it marks which global data needs to be thread local. Any data here needs to be copied out to a new allocation per thread if we create a new one (we won’t, but we still need to implement it).
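
Loading boils down to walking the program headers and copying LOAD segments into place. A sketch, assuming a hypothetical alloc_at() which hands out backing memory in the guest address space (byte swapping of the header fields for big-endian ELFs omitted):

#include <elf.h>
#include <cstdint>
#include <cstring>

uint8_t *alloc_at(uint32_t vaddr, uint32_t size); // hypothetical

void load_segments(const uint8_t *elf_data, const Elf32_Ehdr &ehdr)
{
	auto *phdrs = reinterpret_cast<const Elf32_Phdr *>(elf_data + ehdr.e_phoff);
	for (unsigned i = 0; i < ehdr.e_phnum; i++)
	{
		const Elf32_Phdr &phdr = phdrs[i];
		if (phdr.p_type != PT_LOAD)
			continue;
		uint8_t *dst = alloc_at(phdr.p_vaddr, phdr.p_memsz);
		memcpy(dst, elf_data + phdr.p_offset, phdr.p_filesz);
		// FileSiz < MemSiz: the tail is zero-initialized globals (.bss).
		memset(dst + phdr.p_filesz, 0, phdr.p_memsz - phdr.p_filesz);
	}
}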

The section headers

The section headers don’t seem vital to execution, but they contain the symbol table, which is useful for debugging.

The virtual address space

Linux has a virtual address space, so we need to copy all relevant data out to our own virtualized address space. This is simply represented as a page table (4k pages) spanning the entire 32-bit address space. Through our syscall emulation, applications/glibc can use mmap() to either allocate memory dynamically or memory map files. We cannot assume a fixed address space layout if we want to emulate Linux binaries.

Having an address space like this means all memory access becomes indirect. This will certainly be a performance problem. There might be a better way if we abuse mmap(), but that sounds very hard.
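
The indirection looks roughly like this bare-bones sketch; the real VirtualAddressSpace also has to track mmap() regions and allocate pages lazily:

#include <cstdint>
#include <cstring>

class VirtualAddressSpace
{
public:
	enum { PageSizeLog2 = 12, PageSize = 1 << PageSizeLog2 }; // 4k pages

	uint32_t load32(uint32_t addr) const
	{
		uint32_t v;
		memcpy(&v, pages[addr >> PageSizeLog2] + (addr & (PageSize - 1)), sizeof(v));
		return v;
	}

private:
	// One host pointer per guest page, covering the full 32-bit space.
	// (In practice this table would live on the heap.)
	uint8_t *pages[1u << (32 - PageSizeLog2)] = {};
};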

Setting up the stack

Before we can call the entry point we must set up a stack. Normally, we would think this stack only contains “argc” and “argv”, à la:

int main(int argc, char **argv)

but we actually need to pass a lot more information. All of this extra information is used by the C runtime entry point, which is called __start on MIPS. We allocate the stack space at the top of the virtual address space and push some data to the stack. The $sp register will point to:

  • argc
  • argv argument #0 (char *)
  • argv argument #1
  • NULL // Terminates argv
  • environment variable #0 (char *, pointing to a "KEY=VALUE" string)
  • environment variable #1
  • NULL // Terminates envp

The environment variables are passed in like this, and the C runtime will parse them. However, there is more data we need to pass on the stack on Linux, which was rather surprising. glibc will crash deep into its initialization if this is not done properly.

// ELF AUXV, see <elf.h>
stack_data.push_back(AT_PHDR);
stack_data.push_back(misc.phdr_addr);
stack_data.push_back(AT_PHENT);
stack_data.push_back(ehdr.e_phentsize);
stack_data.push_back(AT_PHNUM);
stack_data.push_back(ehdr.e_phnum);
stack_data.push_back(AT_PAGESZ);
stack_data.push_back(VirtualAddressSpace::PageSize);
stack_data.push_back(AT_ENTRY);
stack_data.push_back(ehdr.e_entry);
stack_data.push_back(AT_UID);
stack_data.push_back(getuid());
stack_data.push_back(AT_EUID);
stack_data.push_back(geteuid());
stack_data.push_back(AT_GID);
stack_data.push_back(getgid());
stack_data.push_back(AT_EGID);
stack_data.push_back(getegid());
stack_data.push_back(AT_RANDOM);
stack_data.push_back(stack_top); // Just point to something. glibc needs this.
stack_data.push_back(AT_NULL);

glibc needs to have a pointer to its own ELF headers and some other information like user IDs, page sizes, and some other things. The headers are used to set up TLS properly. It also needs a random number created by the Linux kernel, which it uses in early initialization to set up stack protection canaries.

Now, everything is set up, and we can start executing … I mean generating some LLVM IR … in part 2.

Experimenting with VK_GOOGLE_display_timing, taking control over the swap chain

Recently, I’ve been experimenting with the VK_GOOGLE_display_timing extension. Not many people seem to have tried it yet, and I am more interested than most in pristine swap chain performance (The state of Window System Integration (WSI) in Vulkan for retro emulators, Improving VK_KHR_display in Mesa – or, let’s make DRM better!), so when I learned there was an experimental patch set for Mesa (X11/DRM) from Keith Packard, I had to try it out. My experience here reflects whatever I got running by rebasing this patch set, so YMMV.

Croteam’s presentation from GDC is a very good read to understand how important this stuff is for normal gaming (Article, PDF)

The extension supports a few critical components we have been missing for years in graphics APIs:

  • When presenting with vkQueuePresentKHR, specify that an image cannot be presented on screen before some wall time (CLOCK_MONOTONIC on Linux/Android). This enables two critical features: SwapInterval > 1 (e.g. locked 30 FPS), and proper audio/video sync for video players, since we can schedule frames to be presented at very specific timestamps (subject to rounding to the next VBlank). VDPAU and similar presentation APIs have long supported things like this, and it’s critical for non-interactive applications like media players.
  • Feedback about what actually happened a few frames after the fact. We get an accurate timestamp when our image was actually flipped on screen, as well as reports about how early it could have been presented, and even how much GPU processing margin we had.
  • Query in nanoseconds how long the display refresh interval is. This is critical since monitors are never exactly 60 FPS, but rather something odd like 59.9524532452 or 59.9723453245. Normally, graphics APIs will just report 60 and call it a day, but I’d like to tune my frame loop against the correct refresh rate. Small differences like these matter, because if you naively assume 60 without any feedback, you will face dropped or duped frames when the rounding error eventually hits you. I’ve grown allergic to dropped or duped frames, so this is not acceptable for me.

Using this feature set, we can do a lot of interesting things and discover some interesting behavior. As you might expect from the extension name, GOOGLE_display_timing ships on a few Android devices as well, so I’ve tested two implementations.

Use case #1: Monitoring display latency

So, for the purposes of this discussion, the latency we will measure is the time it takes from input being sampled until the frame buffer is scanned out to the display. We obviously cannot monitor the latency of the display itself without special equipment (and there’s nothing we can do about that), so the latency would be:

  • Polling input <– Start timer
  • Running application logic
  • Building rendering commands
  • Execute frame on GPU
  • Wait for FIFO queue to flip image on-screen <– This is the lag which is very hard to quantify without this extension!

We’re going to call AcquireNextImageKHR, submit some work, and call QueuePresentKHR in a loop, using VkSemaphore to synchronize, and let’s see how much input latency we get. The way we do this with display_timing is fairly simple. In vkQueuePresentKHR we pass in:

typedef struct VkPresentTimeGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
} VkPresentTimeGOOGLE;

We pass in some unique ID we want to poll later, and desiredPresentTime. Since we only care about latency here, we pass in 0, which means just present ASAP (in FIFO fashion without tearing of course). Later, we can call:

vkGetPastPresentationTimingGOOGLE

to get a report on what actually happened, here’s the data you get:

typedef struct VkPastPresentationTimingGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
    uint64_t    actualPresentTime;
    uint64_t    earliestPresentTime;
    uint64_t    presentMargin;
} VkPastPresentationTimingGOOGLE;

This data tells you, in wall time, when the frame was actually flipped on screen (actualPresentTime), when it could potentially have been flipped on screen (earliestPresentTime), and presentMargin, meaning how much earlier the GPU completed rendering compared to earliestPresentTime.

To estimate total latency we’ll compute actualPresentTime - CLOCK_MONOTONIC sampled when polling input. actualPresentTime is defined to use the CLOCK_MONOTONIC base on Linux and Android. This is powerful stuff, so let’s see what happens.
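
Wired together, the loop looks something like this sketch (assuming the usual swap chain boilerplate is in place, with frame_id as our monotonically increasing present ID):

// At present time: tag the frame with an ID.
VkPresentTimeGOOGLE present_time = { frame_id, 0 }; // desiredPresentTime = 0: present ASAP
VkPresentTimesInfoGOOGLE times_info = { VK_STRUCTURE_TYPE_PRESENT_TIMES_INFO_GOOGLE };
times_info.swapchainCount = 1;
times_info.pTimes = &present_time;

VkPresentInfoKHR present_info = { VK_STRUCTURE_TYPE_PRESENT_INFO_KHR };
present_info.pNext = &times_info;
// ... swap chain, image index and wait semaphores as usual ...
vkQueuePresentKHR(queue, &present_info);

// Some frames later: drain the timing reports with the usual two-call idiom.
uint32_t count = 0;
vkGetPastPresentationTimingGOOGLE(device, swapchain, &count, nullptr);
std::vector<VkPastPresentationTimingGOOGLE> timings(count);
vkGetPastPresentationTimingGOOGLE(device, swapchain, &count, timings.data());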

X11, full-screen, 2 image FIFO

The interesting thing we observe here is about 17 ms total latency. My monitor is 60 FPS, so that’s one frame of total latency, which means we have a true flip mode. Great! The reason we get one frame total is that X11’s implementation in Mesa is a synchronous acquire: we cannot get a new image from AcquireNextImageKHR until something else has actually flipped on screen. 2-image FIFO on Xorg isn’t all that practical however. We’re forced to eat a full drain of the GPU when calling AcquireNextImageKHR in this case (unless we do some heroics to break the bubble), which may or may not be a problem, but for GPU-intensive workloads this is bad, and probably not recommended.

X11, windowed, 2 image FIFO

In windowed mode, we observe ~33 ms, or 2 frames. It’s clear that Xorg adds a frame of latency to the mix here. Likely, we’re seeing a blit-style compositor frame being added in the middle. It’s great that we can actually measure this stuff, because what happens after a present is usually pretty opaque and hard to reason about without good tools.

X11, 3 image FIFO

As we’d expect, the observed latency is simply one frame longer than the 2-image cases, for both windowed and full-screen: ~33 ms, or two frames, in full-screen.

X11, presentation timing latency

Another latency we need to consider is how long it takes to get back a past presentation timing. On the Mesa implementation, it is very tight, matching the overall latency, since we have a synchronous AcquireNextImage. For 2-image FIFO full-screen, for example, we know this information just one frame after we submitted a present. Nice! For triple buffering it takes 2 frames to get the information back, etc.

X11, frame time jitter

Generally, the deltas between actualPresentTime values are rock stable, showing +/- 1 microsecond or so of jitter. I think it’s probably using the DRM flip timestamps which come straight from the kernel. In my experience those are about this accurate, and more than good enough.

Android 8.0, 3 image FIFO

Android is a little strange: we observe ~45 ms latency, about 2.7 frames. This suggests there is actually some kind of asynchronous acquire going on here. If I add a fence to vkAcquireNextImageKHR to force a synchronous acquire, I get ~33 ms latency, as we got with Xorg. Not sure why it’s not ~3 frames of latency … maybe it depends on GPU rendering times somehow.

Android 8.0, 2 image FIFO

We now have ~28 ms, about 1.7 frames. If we try the fence mechanism to get a synchronous acquire, we drop down to a stuttering mess with ~33 ms latency. Apparently, getting one frame of latency is impossible, so it consistently misses a frame. I didn’t really expect this to work, but it was nice to have tested it. At this latency, I get a lot of stuttering despite a trivial GPU load. Triple buffering is probably a good idea on Android …

Android 8.0, presentation timing latency

This is rather disappointing: no matter what I do, it takes 5 frames for presentation timing results to trickle back to the application. 🙁 This will make it harder to adapt dynamically to frame drops later.

Android 8.0, frame time jitter

This one is also a bit worrying. The deltas between actualPresentTime values hover around the right target, but show a jitter of about +/- 0.3 ms. This leads me to think the presentation timing is not tied to a kernel timestamp derived directly from an IRQ, but rather an arbitrary user-space timestamp.

Use case #2: Adaptive low-latency tuning

Sometimes, we have little to no GPU workload, but we want to achieve sub-frame latencies. One use case here is retro emulation which might have GPU workloads close to just a few blits, so we want to squeeze the latency as much as we can if possible.

To do this we want to monitor the time it takes to make the GPU frame buffers ready for presentation, then try to lock our frame so we start it at estimatedFuturePresentTime - appCPUandGPUProcessingTime - safetyMargin. The best we can currently do is ugly sleeping with clock_nanosleep and TIMER_ABSTIME, but at least we have a very good estimate of what that timestamp is now.

E.g., when rendering some trivial stuff which only takes, say, 1 ms in total for the CPU -> GPU pipeline, I add in some safety margin, say 4 ms, and should then be able to sleep until 5 ms before the next frame needs to be scanned out. This seems to work just fine on Xorg in fullscreen, which is pretty cool. What makes this so nice is that we can dynamically observe how much latency we need to reach the deadline in time. While we could have used GPU timestamps, that gets hairy because we would need to correlate GPU timestamps with CPU time, and the presentation engine might need some buffering before an image is actually ready to be presented, so using presentMargin is the correct way.
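
The sleep itself is straightforward; it's the target timestamp that took an extension to obtain. A sketch:

#include <cstdint>
#include <time.h>

// Sleep until an absolute CLOCK_MONOTONIC timestamp in nanoseconds.
static void sleep_until_ns(uint64_t target_ns)
{
	struct timespec ts;
	ts.tv_sec = time_t(target_ns / 1000000000ull);
	ts.tv_nsec = long(target_ns % 1000000000ull);
	clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, nullptr);
}

// Usage: sleep_until_ns(estimated_future_present_time - processing_time - safety_margin);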

As you’d expect, this doesn’t work on Android, because SurfaceFlinger apparently forces some buffering. presentMargin seems kinda broken as well on Android, so it’s not something I can rely on.

Use case #3: Adaptive locked 60 or 30 FPS

Variable FPS game loops are surprisingly complicated as discussed in the Croteam GDC talk. If we have a v-synced display at 60 FPS, the game should have a frame time locked to N * refreshDuration.

Instead of sampling timing deltas on the CPU, which is incredibly brittle for frame pacing, we can take a better approach, where we try to lock our frame time to N * refreshDuration + driftCompensation. Based on observations over time we can see if we should drop our fixed rendering rate or increase it. This allows for butter smooth rendering devoid of jitter, but we can still adapt to how fast the GPU can render.

The driftCompensation term is something I added so we can deal with a dropped frame here and there and combat natural drift over time, letting us slowly sync back up with real wall time. For example, if we let the frame time jitter by up to a few percent, we can catch up quickly after any lost frame, and the overall animation speed remains constant over time. This might not be suitable for all types of content (fixed frame-rate, pixel-artsy 2D stuff), but I think techniques like these can work well.

For adaptive refresh rate, we could for example have heuristics like:

  • If we observe at least N frame drops in the last M frames, increase the swap interval, since it’s better to have steady low FPS than a jumpy, jittery mess.
  • If we observe that earliestPresentTime is consistently lower than actualPresentTime and presentMargin is decent, it’s a sign we should bump up the frame rate.

The way to implement a custom swap interval with this extension is fairly simple: we just specify desiredPresentTime with a cadence of two or more refreshDurations instead of one refreshDuration.

Getting the refresh interval

We can observe the refresh interval by calling

vkGetRefreshCycleDurationGOOGLE

Normally, I would expect this to just work, but on Xorg it seems this is learned over time by looking at presentation timestamps, so we basically need to wait some frames before we can observe the true refresh cycle duration. Android doesn’t seem to have this problem.
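
The query itself is a one-liner:

VkRefreshCycleDurationGOOGLE refresh = {};
vkGetRefreshCycleDurationGOOGLE(device, swapchain, &refresh);
// refresh.refreshDuration is the refresh interval in nanoseconds.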

An immediate question is of course how all this would work on a variable refresh rate display …

Important note on rounding issues

Just naively doing a targetPresentTime = someOlderObservedPresentationTime + frameDelta * refreshDuration calculation will give you trouble. The spec says that the display controller cannot present before desiredPresentTime, so due to rounding issues we might effectively say, “don’t present until frame 105.00000001”, and the driver will say, “oh, I’ll wait an extra frame until frame 106, since 105 is before 105.00000001!”. This is a bit icky.

The solution seems to be just subtracting a few ms off desiredPresentTime to be safe, but I don’t actually know what drivers expect here. Using absolute nanoseconds like this gets tricky very quickly.
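
Putting the cadence and safety margin together, the scheduling might look like this sketch (observed is a VkPastPresentationTimingGOOGLE we got back earlier, refresh is the queried refresh cycle, and the 0.5 ms margin is my arbitrary pick):

const uint64_t margin_ns = 500000; // shave off 0.5 ms to dodge the rounding issue
uint64_t frames_elapsed = (frame_id - observed.presentID) * swap_interval;
uint64_t target_ns = observed.actualPresentTime +
                     frames_elapsed * refresh.refreshDuration - margin_ns;
VkPresentTimeGOOGLE present_time = { frame_id, target_ns };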

Discussion

For a future multi-vendor extension, some things should probably be reconsidered.

Absolute vs relative time?

The current model of absolute timings in nanoseconds is simple in theory, but it does get a bit tricky since we need to compensate for drift, estimate future timings with rounding considerations, and so on. Being able to specify relative timings between presents would simplify the cases where we just want to fiddle with the present interval and don’t really care exactly when the image comes on screen. At the same time, relative timings would complicate cases where we want to present at exactly some time (e.g. a video player).

Time vs frame counters?

On fixed refresh rate displays, it seems a bit weird to use time rather than frame counters (media stream counters on X11). Counters are much easier to reason about, and the current X11 implementation translates time back and forth to these counters anyway.

The argument against frame counters would be variable refresh rate displays I suppose.

Just support all the things?

The best extension would probably support all variants, i.e.:

  • Let you use relative or absolute timing
  • Let you specify time in frame counters or absolute time
  • Let you query variable refresh rate vs fixed refresh rate

Implementation issues

I had a few issues on Android 8.0 which I haven’t been able to figure out. The Mesa implementation I had working was basically flawless in my book, sans the X11 issue of not knowing the correct refresh rate ahead of time.

presentMargin is really weird

I don’t think presentMargin is working as intended. Many times when you drop a frame, you can observe that presentMargin becomes “negative” as an int64_t, but the member is uint64_t, so it’s a bizarre overflow instead. The numbers I get back don’t really make sense anyway, so I doubt it’s tied to a GPU timestamp or anything like that.

earliestPresentTime is never lower than actualPresentTime

This makes it impossible to use adaptive locked frame rates, since we cannot ever safely bump the frame rate.

Randomly dropping frames with almost no GPU load

SurfaceFlinger seems to have a very strange tendency to just drop frames randomly, even with a very low target refresh rate and a GPU time of around 1 ms per frame. I wonder if this is related to rounding errors or not, but it is a bit disturbing. Touching the screen helps, which might be a clue pointing to power management somehow, but it can still drop frames randomly even if I touch the screen all the time.

Here’s an APK if anyone wants to test. It’s based on this test app: https://github.com/Themaister/Granite/blob/master/tests/display_timing.cpp

http://themaister.net/tmp/display-timing-test.apk

The screen flashes red when a frame drops; otherwise it should be a smooth scrolling quad moving across the screen. It logs output with:

adb logcat -c && adb logcat -s Granite

and I often get output like:

https://gist.github.com/Themaister/873021cba88acb48e04c668eae3ab4e9

Conclusion

I think this is a really cool extension, and I hope to see it more widely available. My implementation of this can be found in Granite for reference:

https://github.com/Themaister/Granite/blob/master/vulkan/wsi_timing.cpp

The state of Window System Integration (WSI) in Vulkan for retro emulators

In the world of graphics programming, interacting with the windowing system is not exactly the most riveting subject, but it is of critical importance to the usability of applications. Tearing, skipped frames, judder, etc, are all issues which stem from poor window system code. Over the years, it has been an ongoing frustration that good windowing system integration is something we just cannot rely on. For whatever reason, implementations and operating systems always find a way to screw things up.

For emulation, perfection is the only thing good enough. It is immediately obvious when tearing or skipped frames are observed in retro emulation. These games work on a fixed time step, and we must obey, or the result is borderline unplayable. For “normal” games, it seems like this isn’t as much of a concern. Games are written around the assumption of variable frame rates, users can disable V-Sync (especially for fast-paced, reaction based games), etc, and variable refresh rate display standards were introduced to get the best of both worlds. The problems I’m going to explore in this post can often be glossed over in the common case.

Requirements for RetroArch:

  • Perfect, tear-free 1:1 frame mapping if game frame rate and monitor frame rate are close enough. Essentially VK_PRESENT_MODE_FIFO_KHR when it’s working as intended.
  • Ability to toggle between locked and unlocked (fast forward) frame rates, seamlessly. This is a very emulation specific requirement, but extremely important. Unlocked could be either MAILBOX or IMMEDIATE.
  • Consistent frame pacing in FIFO mode. If there is too much variation in frame times, this translates into variable input latency for fixed FPS content, which is not ideal. It also hurts rate control for audio, although the frame pacing can get rather bad before this becomes a real issue. There’s no reason why we should have more than a millisecond in jitter for low-GPU load scenarios.
  • Control over latency. The GPU load of retro emulation is usually quite insignificant, but we need full control over the swap chain, when swaps happen, and when we can begin a frame and poll inputs. Absurd amounts of effort have gone into reducing input latency by various developers, and all that effort consists of working around false assumptions of GPU drivers which have been optimising for “normal game” FPS rather than latency and predictability. In Vulkan, with more explicit control over the swap chain, we should be able to control this far better than we ever could in legacy APIs. However, discussing latency (by, like, measuring stuff with a high-speed camera) is outside the scope of this post. There are more fundamental issues I would like to cover instead.
  • In (exclusive) full screen modes, we should have a flipping model, and if I ask for double buffer, I should get just that.
  • Borderless windowed full screen mode is also important for casual play when minimum latency isn’t the highest priority.
  • Windowed mode should be tear-free, without stutter, but is allowed to have a bit more latency than fullscreen because of compositors.

I have had the “great pleasure” of fighting with many different WSI implementations in the RetroArch Vulkan backend. I would like to summarise these, starting from “best” to “worst” for dramatic effect. I’ll also discuss some heinous workarounds I have had to employ to work around the worst implementations.

How it should work

Vulkan WSI is a fairly explicit model compared to its predecessors. You can request the number of images there should be in the swap chain, and you acquire and present these images directly. There is no magic “framebuffer 0”, or a magic SwapBuffers call which buffers an unknown amount for you.

There is no direct way to toggle between FIFO and MAILBOX/IMMEDIATE on a swap chain. Instead, we need to create a new swap chain. According to the specification, there can only be one non-retired swap chain active at a time. We have the option of passing in our “oldSwapchain” when creating a new swap chain. My understanding of this is that we can “pass over” ownership from one swap chain to the next. Passing in oldSwapchain will effectively retire the swap chain as well. If we just change the present mode for the new swap chain, this should give us a seamless transition over to the new present mode. After we have created the new swap chain, we can delete the old one and query the new (or the same!) swap chain images.
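
In code, the handover looks roughly like this sketch, assuming the rest of the create info is filled in as for the original swap chain:

VkSwapchainCreateInfoKHR info = { VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR };
// ... surface, formats, extents, image count as before ...
info.presentMode = new_present_mode;
info.oldSwapchain = old_swapchain; // retires the old swap chain
VkSwapchainKHR new_swapchain;
vkCreateSwapchainKHR(device, &info, nullptr, &new_swapchain);
vkDestroySwapchainKHR(device, old_swapchain, nullptr);
// ... re-query swap chain images ...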

vkAcquireNextImageKHR will give us new images to render into, and it should be unblocked at V-Blank when new images are flipped onto the display. vkAcquireNextImageKHR is an asynchronous acquire operation, but for RetroArch I want to sync the start of the frame to V-Blank, so I just wait for the VkFence to signal anyway, forcing a synchronous acquire. Now, it turns out all the implementations I’ve tried so far implement a synchronous vkAcquireNextImageKHR anyway, so it doesn’t really matter.
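
Forcing the synchronous acquire is just a fence wait (sketch):

uint32_t index;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, acquire_semaphore, fence, &index);
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fence);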

vkQueuePresentKHR should queue a flip as we expect, but it shouldn’t block.

One of my gripes with WSI is that there is no way to tell if you’re going to get exclusive or borderless windowed modes. It seems to be driver magic that controls this unfortunately. I’m not even sure if this is an OS concept or not at this point.

The test matrix gets rather large, there are two OSes I test here:

  • Linux
  • Windows 10 (7 seems to behave very differently according to users, but I don’t have a setup for that)

Android should be on this list, but I haven’t tested that.

Surface types:

  • Wayland (on GNOME3)
  • X11/XCB (over Xorg and XWayland, on GNOME3)
  • Win32
  • KHR_display

Driver stacks:

  • Mesa – RADV / Anvil (AMD/Intel); the WSI code is shared and behaves the same
  • AMDVLK (AMD Linux open source)
  • Nvidia (closed source)
  • AMD (closed source)

I’m using the latest public drivers as of writing. Now, time for some “good, bad and ugly” tiering for dramatic effect.

The good

These are implementations I consider fully functioning without any serious flaws, at least for RetroArch.

Mesa – Wayland – Linux

Wayland on Mesa is so far the shining implementation of WSI in my book.

  • Can toggle seamlessly between present modes without any hiccup, missed frames or anything silly.
  • Frame pacing is excellent. About 2% frame time standard deviation (about 300 microseconds) in my measurements.
  • Supports MAILBOX for tear-free fast forward.

If I have to point out one flaw it’s the oddly reported minImageCount. It is 4 on Mesa Wayland, because of the MAILBOX implementation, which requires 4. However, even if I get 4 images in FIFO mode, it seems to only use 3 of them for some reason (no round-robin for you). I think this is a flawed implementation and minImageCount should be 2. It is perfectly valid for a WSI implementation to return more images than you request (Android seems to do this). It’s called “minImageCount” after all. But I think this highlights a Vulkan WSI flaw. minImageCount does not depend on presentMode!

I haven’t tested if it’s possible to get true double-buffer with page-flip on Mesa Wayland, but it should be fairly trivial to patch that.

Mesa – X11/XCB on Xorg – Linux

Xorg vsync has always been a serious pain point for me. It randomly works, or it doesn’t, depending on the phase of the moon. With DRI3 being used in Mesa’s implementation, it actually seems to work really well in both windowed and fullscreen. I would put it roughly on par with Wayland.

AMDVLK – Wayland – Linux (pending update to WSA and PAL)

I’ve submitted a lot of issues recently on their bugtracker to fix Wayland-related issues, and with the latest development branch I tried recently, AMDVLK now reached parity with Mesa’s Wayland implementation in quality. It’s all using the same low-level user space and kernel code thanks to the AMDGPU efforts, so this is to be expected, great stuff.

The bad

These are flawed, but useable.

Mesa – VK_KHR_display – Linux

The main issue is that fast-forward does not exist, i.e. no MAILBOX or IMMEDIATE. DRM does not support true MAILBOX as it stands, and IMMEDIATE is only conditionally supported (not implemented, though it could be added very easily), with only AMD seeming to support it. I ranted about this topic here: http://themaister.net/blog/2018/07/02/improving-vk_khr_display-in-mesa-or-lets-make-drm-better/

Nvidia – VK_KHR_display – Linux

Nvidia was the first vendor to support VK_KHR_display, which I applaud. VK_KHR_display is great for focused emulation boxes (although the only reason I assume we got any VK_KHR_display implementation on Linux was Steam VR). We never got this “direct to display” path working on their GL driver, but it works on Vulkan, yay.

The flaw with this implementation is that toggling fast forward works, but it causes a strange mode change, and there’s basically a short pause while it reprograms the DRM CRTC (or whatever it is doing). This is unacceptable, but at least the non-fast-forward experience is flawless in my book. Nvidia’s implementation can at least support MAILBOX and IMMEDIATE, so it is a bit better than Mesa’s KHR_display implementation.

Nvidia – XCB on Xorg – Linux

This one is rather disappointing. Windowed mode is a stuttering, tearing mess (but so is GL it seems). Full screen seems to start off tearing, but after a second or so it seems to figure out that it should move into a flip-like mode. Toggling presentation modes leads to a frame or two of corrupted garbage on screen, but it doesn’t trigger a mode change at least. Fully functional, but a bit rough around the edges.

AMDVLK – XCB on Xorg – Linux

Struggles with frame pacing and tearing, though this might have been fixed by now. I haven’t tested it that much, but I much prefer RADV on Xorg to this. After recent updates, AMDVLK’s Wayland backend is much better. The main reason to use this setup is Radeon Graphics Profiler.

The ugly

Broken and buggy for my use cases.

Mesa – XCB on Wayland (XWayland) – Linux

Pretty awful, and a stuttering mess. Poking at the source, there’s likely some kind of fallback path with fallback blits to deal with this, but stay far away from this. Also, in fullscreen, vkAcquireNextImageKHR seems to block indefinitely on XWayland for some reason. At least there is no reason why you need to subject yourself to this backend for WSI.

AMD – Windows 10

Windows WSI implementations seem to be their own kind of hell, but red beats green here. Windowed mode is just fine, and toggling presentation modes works without issue.

It’s fullscreen where we get a lot of problems. Toggling presentation modes throws the application out to desktop for 3 seconds, then you get the mode change. This is just broken. I found that it’s the deleting of the oldSwapchain which triggers the issue. Not even getting a few present operations through in the new swap chain will save you. There is no escape. Basically, the conclusion I came to is that we can never ever create a new swap chain on Windows, or we are screwed. So now, I had to come up with ways to workaround this. Fortunately, Vulkan is flexible enough with the threading model that we can do some nasty tricks.

AMD seems to support vkAcquireNextImageKHR with a timeout of 0, but some other vendor did not …

Nvidia – Windows 10

This is nightmare fuel for RetroArch. It behaves like the AMD driver, except some added fun:

  • vkAcquireNextImageKHR with timeout == 0 does not seem to be implemented. It will just block. 🙁
  • It’s impossible to create a swap chain in certain cases. maxImageExtent will be 0 when you minimize a window or alt+tab out of fullscreen, which leads to some rather interesting workarounds, because I need to allow a state where a swap chain does not exist yet, and avoid rendering to any swap chain related image while this goes on. No other vendor seems to have this behavior, but it is allowed by the specification, unfortunately.
  • Using oldSwapchain has been reported to break the driver, causing black screens. To work around this, I ifdef out oldSwapchain on Windows, and just destroy the swap chain before creating a new one. This breaks any hope of toggling presentation modes until this is fixed 🙁

Windows commonalities

  • Frame pacing in FIFO is great.
  • Windowed mode works just fine.
  • Fullscreen seems to be “exclusive” only. Alt-tabbing out of Vulkan is a rather sluggish operation, unlike borderless windowed which is designed to fix this.
  • Changing present modes in fullscreen is completely broken, and needs some serious workarounds.

The nasty workaround – MAILBOX emulation

To emulate fast forward, I ended up with a nasty hack for Windows. If fast forward is enabled in fullscreen mode, I spawn a thread which will do vkAcquireNextImageKHR in the background on-demand, and I’ll deal with the case where we haven’t acquired quite yet, by nooping out any access to the swap chain for that frame. The initial workaround for this was to just use timeout == 0, and avoid a dedicated thread, but … Nvidia’s implementation threw a wrench into that plan.

By doing it like this I can stay in FIFO present mode while faking a really terrible implementation of MAILBOX. Its performance isn’t that great, but at least I don’t get a 3 second delay just to trigger fast forward which should be instant in any sensible WSI implementation.
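
A very rough sketch of the shape of this hack; the Vulkan handles are assumed to exist as globals, and the real version needs more care around thread lifetime and semaphore reuse:

#include <atomic>
#include <thread>

std::atomic<bool> acquire_done{ false };

void kick_async_acquire()
{
	acquire_done = false;
	std::thread([]() {
		vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
		                      acquire_semaphore, VK_NULL_HANDLE, &image_index);
		acquire_done = true;
	}).detach();
}

// Per frame: if (!acquire_done), render as usual but noop out anything which
// touches swap chain images, i.e. the present is effectively dropped, MAILBOX-style.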

The Windows Vulkan experience in RetroArch should be not-so-terrible now, but know that this is only thanks to several weeks of banging my head against the wall.

I hope some IHVs take this into consideration and make sure that toggling presentation modes works properly. Someone out there cares at least. No vendor I have seen so far deals with oldSwapchain in any way. There is a reason it’s there!

Why coverage based pixel art filtering is terrible

I recently wrote a blog post about how to properly filter pixel art in 3D, where I chose a cosine kernel as the low-pass filter. Please read through it to get a better idea of what I’m discussing here.

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

In this post, I would like to analyze some alternative filters we could potentially use, and explore their quality from a signal processing standpoint.

“Ideal” sinc

The theoretical low-pass filter. Just as a reference.

Cosine

The filter kernel I chose as my filter.

d/dx (smoothstep)

I mentioned smoothstep in the previous post. smoothstep was mentioned as the integral of the filter kernel in many implementations, so to get the underlying filter to analyze, we take the derivative of smoothstep. Smoothstep is defined as 3x^2 - 2x^3. Its derivative is 6x - 6x^2, and if we normalize it to the [-1, 1] range instead of [0, 1] (keeping unit area), we get:

(3/4) * (1 - x^2)

Triangle

The LINEAR filter. This is essentially the case we would get if we upscale pixel art a lot with NEAREST, then filter the result.

Rect

This is the worst low-pass filter ever as I mentioned in the last post. Interestingly, this is the filter we would use for coverage-based pixel art filtering if the texture and screen were aligned. (I don’t expect the result for rotated textures to be much different, but the rigorous analysis becomes a bit too hard for me). If our input signal is turned into a series of rects (pixel extent) ahead of time (which is what we do in our implementation), our coverage filter is basically the equivalent of convolving a rect against that signal. Basically, rect is the filter we would end up with if we cranked SSAA up to infinity.

One filter which implements this is (from what I can tell): https://github.com/libretro/slang-shaders/blob/master/retro/shaders/pixellate.slang

Windowed sinc (Lanczos2)

Lanczos2 is a fairly popular filter for image scaling. I’ve included it here to have some reference on how these filters behave in frequency space.

Filter response

Now, it’s time to drone on about the results. All this talk of frequency response, stop bands, blah blah, is probably going to be too pedantic for the purposes of graphics, but we can, so why not.

So, first we see the response around 0.5 in this graph. This is the Nyquist frequency. Our ideal sinc (or, to be pedantic, a 16-lobe Hamming-windowed sinc) falls off immediately after 0.5. This is a completely impractical filter, obviously.

Another consideration is how fast we reach the “stop band”, i.e. where the filter response has rolled off completely. Cosine and smoothstep are better here, while Lanczos2 is a bit slower to roll off. This is because of the window function: applying a window function is the same as “blurring” frequency space, so the perfect sinc response gets smeared out too far. Linear and rect don’t hit their first low point until twice the Nyquist frequency. :(

As for stop-band attenuation, windowed sinc is the clear winner, but again, this filter is impractical for our purposes (an analytic windowed sinc integral with negative lobes breaking the bilinear optimization? just no). Cosine seems to have a slight edge over smoothstep, with about a 2 dB improvement. 2 dB is quite significant (for reference, halving root-mean-square error is ~6 dB). The triangle kernel (LINEAR filter) is about 3 dB worse again, and rect sniffs glue alone in the corner. Triangle and rect look very similar because a triangle kernel is just two rects convolved with each other, and thus its frequency response is the rect’s, squared (the more you know).

Conclusion

Cosine is a solid filter for what it’s trying to do, given our constraint of needing to analytically integrate whatever filter kernel we choose. The quality should be very similar to smoothstep, but I’ll consider cosine the winner here with 2 dB better stop-band attenuation. It also has slightly better response in the pass-band, which preserves sharpness slightly better. This should make sense, because the cosine kernel has a higher peak and rolls off faster to zero than the smoothstep kernel. This frequency response can be expanded or contracted based on how we scale the “d” parameter. Multiply it by 2, for example, and we have the equivalent of LOD bias +1.

Basically, coverage/area based filtering is pretty terrible from a signal processing standpoint. Rect is a terrible filter, and you should avoid it if you can.

Reference script

I captured this with Octave using this script for reference:

cosine_kernel = pi/4 * cos(pi/2 * (-1024:1024) / 1024);
windowed_sinc = sinc((-2048:2048) / 1024) .* sinc((-2048:2048) / 2048);
perfect_windowed_sinc = sinc((-64*1024:64*1024) / 1024) .* hamming(128 * 1024 + 1)';
rect_kernel = ones(1, 1024);
linear_kernel = conv(rect_kernel, rect_kernel) / 1024;

ramp = (0:2048) / 2048;
smoothstep_kernel = 3 * ramp - 3 * (ramp .* ramp);

cosine_fft = fft(cosine_kernel, 1024 * 1024)(1 : 16 * 1024);
windowed_fft = fft(windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
perfect_windowed_fft = fft(perfect_windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
rect_fft = fft(rect_kernel, 1024 * 1024)(1 : 16 * 1024);
linear_fft = fft(linear_kernel, 1024 * 1024)(1 : 16 * 1024);
smoothstep_fft = fft(smoothstep_kernel, 1024 * 1024)(1 : 16 * 1024);

offset = 20.0 * log10(1024);

cosine_fft = 20.0 * log10(abs(cosine_fft)) - offset;
windowed_fft = 20.0 * log10(abs(windowed_fft)) - offset;
perfect_windowed_fft = 20.0 * log10(abs(perfect_windowed_fft)) - offset;
rect_fft = 20.0 * log10(abs(rect_fft)) - offset;
linear_fft = 20.0 * log10(abs(linear_fft)) - offset;
smoothstep_fft = 20.0 * log10(abs(smoothstep_fft)) - offset;

x = linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 16 * 1024);

figure
plot(x, cosine_fft, 'r', x, windowed_fft, 'g', x, rect_fft, 'b', x, linear_fft, 'b--', x, smoothstep_fft, 'k', x, perfect_windowed_fft, 'c-')
legend('cosine', 'lanczos2', 'rect', 'triangle', 'smoothstep', 'ideal sinc')
xlabel('frequency / sampling rate')
ylabel('Filter response (dB)')
xticks(linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 65))

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

Recently, I’ve been playing Octopath Traveler, and I’m very disappointed with the poor texture filtering seen in this game. It is PS1-level, with severe shimmering artifacts, which ruins the nice pixel art. Tastefully merging retro pixel art with a 3D environment is a very cool aesthetic to me, so I want it to look right. Taking a signal processing approach to the problem, I wanted to see if I could solve the issue in a mathematically sound way.

Correctly filtering pixel art is a challenge, especially in a 3D environment, because none of the GPU hardware assisted filtering methods work well. We have two GPU filters available to us:

  • NEAREST/POINT: Razor sharp filtering, but exhibits severe aliasing artifacts.
  • LINEAR: Smooth, but way too blurry.

Our goal is to preserve the pixellated nature of the textures, yet have an alias-free result. A classic solution for this problem is to pre-scale the texture by some integer multiple with NEAREST, then sample this texture with LINEAR filtering. While this works reasonably well, it costs a lot of extra VRAM and bandwidth to sample huge textures which mostly contain duplicated pixels. Duplicating pixels also puts a cap on how sharp the pixels can become. On top of that, straight LINEAR still has some level of aliasing, as LINEAR is not a very good low-pass filter with its triangular kernel.

A flawed assumption of LINEAR texture sampling is also that each sample point in the texture is basically a Dirac delta function, i.e. the sample value only exists right at the texel center and has infinitesimal area. The pixel art assumption is that each texel has proper area: the sample value extends across the entire texel. For bandlimited signals, i.e. “natural” images, the Dirac delta assumption is how you sample, but pixel art is not a sampled bandlimited signal.

Filtering textures in 3D means we need to work well in many different scenarios:

  • Magnification
  • Minification
  • Rotation
  • Scale
  • Anisotropy / uneven scaling (look at the texture at an angle)

For minification, we are going to rely on the existing texture filtering hardware to do mip-mapping and anisotropic filtering for us, and assume that is a good enough approximation to the true solution. What we want to focus on is magnification, as this is where the blurring issues with LINEAR become obvious.

Aliasing from hell

It’s even worse in motion.

Zoomed in 4x with pixel duplication.

Blurry mess

Straight LINEAR, in linear space.

Zoomed in 4x with pixel duplication.

Smooth and sharp

Here’s my method.

Zoomed in 4x with pixel duplication.

Bandlimited sampling

To understand the derivation, we must understand what bandlimited sampling means. All signals (audio or images) have a frequency response. Nyquist’s theorem tells us that if we sample at some frequency F, there must be no frequency component above F / 2 in the signal, or it aliases. (F / 2 is also called the Nyquist frequency.) Our input signal, i.e. a series of Dirac delta functions, has an infinite frequency response. Therefore, we must convolve a low-pass filter with that signal to reject as much energy above the Nyquist frequency as possible before we sample it again. In LINEAR sampling, a triangular kernel is applied, which acts as a low-pass filter, but it is not very good at removing high frequencies. NEAREST effectively just applies a rect filter, which is the worst low-pass filter you can have.

Triangle kernel:

Rect kernel:

For completeness, there exists a perfect low-pass filter: sinc. However, it is purely theoretical, as its width is infinitely large. A common approach is to limit the extent of the sinc using a cleverly chosen window function, but then we lose the perfect low-pass. We can get as close as we wish to the perfect low-pass by using a larger window, but it’s rarely practical to go this far except for image/video resizing, where we can use separable filtering in multiple passes with precomputed filter coefficients. We will not have that luxury for texture filtering in a 3D setting.

Windowed sinc also has negative filter coefficients, which usually manifest as “ringing” or “haloing” when filtering graphics. This is something we would like to avoid when filtering pixel art as well. We also need to avoid negative values in the kernel because we will rely on non-negative weights for a crucial optimization later.

We will just need to come to terms with the fact that we can never get perfect bandlimiting, and be practical about choosing our filter, hence pseudo-bandlimited.

Picking a practical filter kernel

To sample pixel art, we actually need to apply two filters at once, which complicates things. First, we need to apply a rect kernel (NEAREST filter) to give our texel some proper area. We then apply a low-pass filter which will aim to band-limit the rect. Applying two filters after each other is the same as convolving them together. Convolution is an integral, so now we have some constraints on our filter kernel, because it needs to be cheap to analytically integrate. LUTs will be too costly and annoying to use.

A key point is that the low-pass filter kernel needs to adapt to our sampling rate of the texture. Basically, we need to be band-limited in screen space, not texture space. If we have a filter kernel

w(x)

we should be able to rescale it to

w(x / d) / d

where d is the screenspace partial derivative in either X or Y. (The 1 / d factor keeps the kernel area at 1.)

// Get derivatives in texel space.
// Need a non-zero derivative since we need to divide it later.
vec2 d = max(fwidth(uv) * texture_size, 1.0 / 256.0);

For our filter kernel, we could pick between:

  • Triangle (piece-wise integration, annoying)
  • Rect (lol)
  • Polynomial (Easy to integrate)
  • Cosine (Easy to integrate)
  • Gaussian (No analytic integration possible)
  • Windowed sinc (negative lobes and super difficult integral, no thanks)

I chose a cosine kernel. If we think about Taylor expansions, a polynomial kernel and a cosine are basically in the same ballpark. Cosine is not a perfect low-pass by any means, but it’s pretty good for our purpose here.

The cosine kernel will be

W(x) = (pi / 4) * cos((pi / 2) * x) for x in [-1, 1], 0 elsewhere

The normalization factor is to make sure the area of the kernel is 1. The rect kernel is

R(x) = 1 for x in [-1/2, 1/2], 0 elsewhere

To get our filter with a given d, we will convolve:

F(x) = integral of R(t) * (1 / d) * W((x - t) / d) dt

The integration boundaries need to be limited to the range where W and R are non-zero. If we solve this, we get

F(x) = (1/2) * (sin((pi / 2) * [(x + 1/2) / d]) - sin((pi / 2) * [(x - 1/2) / d]))

The brackets denote a signed saturate, i.e. clamp(x, -1, 1).

Here’s how the kernel looks for different d values:

d = 1:

d = 0.5:

d = 0.25:

d = 0.1:

Nice, so as we can see, the filter kernel starts off fairly smooth, but sharpens into a smoothly rolling off rect as we sample with a higher and higher resolution.

The 2×2 filter (d <= 0.5)

From the filter kernel above, we can see that if d <= 0.5 (LOD = -1), the extent of the filter kernel is 1 pixel. This means we can implement the filter with a simple 2×2 kernel, or as we shall see, a single bilinear filter tap. For the implementation, we are going to assume we are sampling between two texels, where x is the phase, in the range 0 to 1.

We can implement this as a simple lerp. Instead of evaluating two sines, we’re just going for one which implements the transition from 0 to 1.

This is very similar to a smoothstep, which explains why smoothstep techniques for this kind of filtering works so well. sin might be rather expensive on your GPU targets, so instead we can use a simple Taylor expansion to get a very good approximation to the sine

In fact, if we use smoothstep, we get a filter kernel which is the derivative of smoothstep. Now, if we compute the result for both the X and Y dimensions, we have lerp factors for both dimensions. Since our filter weights are all positive and our kernel is separable, we can make use of bilinear filtering to get the correct result.

// Get base pixel and phase, range [0, 1).
vec2 pixel = uv * texture_size - 0.5;
vec2 base_pixel = floor(pixel);
vec2 phase = pixel - base_pixel;

// We can resolve the filter by just sampling a single 2x2 block.

mediump vec2 shift = 0.5 + 0.5 * taylor_sin(PI_half * clamp((phase - 0.5) / min(d, vec2(0.5)), -1.0, 1.0));
vec2 filter_uv = (base_pixel + 0.5 + shift) * inverse_texture_size;

As d increases beyond 0.5 we can no longer use our filter, so we implement something à la trilinear filtering where we lerp between our ideal LOD = -1 and LOD = 0, where we fully sample with a normal trilinear/anisotropic filter. This implementation only requires two bilinear texture lookups: one for the d <= 0.5 sampling, and one for the d > 0.5 sampling. These two results need to be lerped.
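As a rough sketch of that blend (the exact lerp ramp and the names tex, d, filter_uv are assumptions here, not the code from the header):

// Fade from the bandlimited tap to a plain trilinear tap as
// max(d.x, d.y) goes from 0.5 to 1.0.
mediump float max_d = max(d.x, d.y);
mediump float blend = clamp(2.0 * max_d - 1.0, 0.0, 1.0);
mediump vec4 result = mix(texture(tex, filter_uv), texture(tex, uv), blend);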

The 4×4 filter (d <= 1.5)

Now I'm heading into more interesting territory. While the good old smoothstep with fwidth is a well-known hack, it cannot deal with kernels larger than 2×2. Building on our success with replacing 4 filter taps with a single bilinear tap, we're going to continue and implement a 4×4 kernel with 4 bilinear taps. If we support a 4×4 filter kernel, we can keep our nice filter even under slight minification. It's going to require a lot of ALU, so we split the implementation into three cases: where d <= 0.5, use the single tap; where d <= 1.5, use this 4-tap method; otherwise, just sample normally. If we remove the 4-tap case, we effectively have a “speed hack” mode for slower devices.

The filter coefficients for each element in the 4×4 grid will be the outer products of the one-dimensional weights

F(u + 1), F(u), F(u - 1), F(u - 2)

(and the same for v), where F is the convolved kernel from before and the four values correspond to texels at offsets -1, 0, 1 and 2 from the base texel. u and v are the phase variables in range [0, 1] as mentioned earlier. Since the filter is separable, we can compute X and Y separately and perform an outer product to complete the kernel.

To evaluate four F values, we only need to compute 5 sines (or Taylor approximations), not 8, because F(u) shares a sine term with F(u + 1), and so on. (The plot of the one-dimensional filter kernel for d = 1.5 is omitted here.)

Since we compute X and Y separately, we end up with a cost of 10 Taylor approximations per pixel. Ouch, but GPUs crunch this like butter.
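To make the shared-sine trick concrete, here is one way the four one-dimensional weights could be evaluated with 5 sines (a sketch; assumes taylor_sin overloads for vec4 and float):

// Evaluates F(u + 1), F(u), F(u - 1), F(u - 2) for texels at -1, 0, 1, 2.
// F(x) = 0.5 * (sin(PI_half * [(x + 0.5) / d]) - sin(PI_half * [(x - 0.5) / d]))
// where [] denotes clamp(..., -1, 1).
mediump vec4 compute_weights_1d(mediump float u, mediump float d)
{
	mediump vec4 hi = clamp((u + vec4(1.5, 0.5, -0.5, -1.5)) / d, -1.0, 1.0);
	mediump float lo = clamp((u - 2.5) / d, -1.0, 1.0);
	mediump vec4 s = taylor_sin(PI_half * hi);
	mediump float s_last = taylor_sin(PI_half * lo);
	// Adjacent F values share a sine term, so 5 evaluations cover 4 weights.
	return 0.5 * (s - vec4(s.yzw, s_last));
}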

Now, each 2×2 block of this kernel can be replaced with one bilinear lookup and a weight.

// Given weights, compute a bilinear filter which implements the weight.
// All weights are known to be non-negative, and separable.
mediump vec3 compute_uv_phase_weight(mediump vec2 weights_u, mediump vec2 weights_v)
{
	// The sum of a bilinear sample has combined weight of 1, we will need to adjust the resulting sample
	// to match our actual weight sum.
	mediump float w = dot(weights_u.xyxy, weights_v.xxyy);
	mediump float x = weights_u.y / max(weights_u.x + weights_u.y, 0.001);
	mediump float y = weights_v.y / max(weights_v.x + weights_v.y, 0.001);
	return vec3(x, y, w);
}
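Putting it together, the four taps could be resolved along these lines (again a sketch with assumed names; weights_u/weights_v hold the four 1D weights from above, ordered for texel offsets -1, 0, 1, 2):

// Each 2x2 block of the 4x4 grid becomes one bilinear tap. In each
// dimension, the two blocks start at base_pixel - 1 and base_pixel + 1.
mediump vec4 color = vec4(0.0);
for (int y = 0; y < 2; y++)
{
	for (int x = 0; x < 2; x++)
	{
		mediump vec3 pw = compute_uv_phase_weight(
				x == 0 ? weights_u.xy : weights_u.zw,
				y == 0 ? weights_v.xy : weights_v.zw);
		vec2 tap = base_pixel + vec2(x, y) * 2.0 - 1.0 + pw.xy + 0.5;
		color += pw.z * texture(tex, tap * inverse_texture_size);
	}
}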

Is 4×4 worth it?

The difference between 4×4 and 2×2 is very subtle and hard to show with still images. The difference manifests itself around LOD -0.5 to 0.5, i.e. close to 1:1 sampling. 2×2 is sharper with more aliasing while the 4×4 kernel remains a little blurrier but alias-free. For most use cases, I expect the 2×2-only method to be good enough, i.e. the good old “smoothstep/sine & fwidth”, but now I’ve come to that conclusion through math and not random graphics hackery.

Performance

A quick and dirty check on RX 470 @ 1440p

  • Full screen with all pixels hitting 4×4 sampling case: 441.44 µs
  • Full screen with all pixels hitting 2×2 sampling case: 207.68 µs

Anisotropy

We naturally get support for anisotropy because we compute different filters for X and Y dimensions. However, once max(d.x, d.y) is too large, we must fall back to normal sampling. Either a very blurry trilinear or actual anisotropic filtering in hardware.

Here’s a shot where we gradually go from crisp texels to normal, very blurry trilinear.

The green region is where d <= 0.5 (1 bilinear fetch), blue is where 0.5 < d <= 1.5 (4 taps), and the red region is the regular trilinear fallback. The blue region fades towards the normal trilinear fallback to avoid any abrupt artifacts.

Code

The complete shader implementation (reusable GLSL header) can be found in: https://github.com/Themaister/Granite/blob/master/assets/shaders/inc/bandlimited_pixel_filter.h
A test project to play around with the filter: https://github.com/Themaister/Granite/blob/master/tests/bandlimited_pixel_test.cpp

Go forth and filter your pixels correctly.

 

Improving VK_KHR_display in Mesa – or, let’s make DRM better!

VK_KHR_display was recently added to Mesa, and I was very excited. Finally, we could get direct-to-display, lowest possible latency in Vulkan for programs like RetroArch (VK_KHR_display is not just for VR!). We have had support for KMS/GBM for a long time, and it works really well when you want the most direct access to the display at the cost of convenience. Since we have full control of how page flips happen, we avoid many of the pitfalls of compositors. When compositors work well, you should get optimal conditions in full-screen, but that is very hard to guarantee; we have no control over whether X or Wayland chooses to go into a direct-to-display mode without compositing when we fullscreen the window surface. VK_KHR_display or the EGL equivalent is also the only realistic way to get good display performance on more embedded Linux systems, which are fairly popular for emulation boxes. There's no need for X or Wayland to get in the way when you have a dedicated, 10-foot UI setup.

We can also control how much total buffering we want. Sometimes we want 2 buffers where CPU emulation + GPU rendering needs to complete in 16.66 ms (great for retro console emulation), and sometimes we want 3 buffers where CPU, GPU and display can overlap (usually the case when running GPU-intensive emulation). Controlling this is almost impossible to do reliably with compositors. We synchronize vkAcquireNextImageKHR using a VkFence instead of a VkSemaphore. Most implementations do not actually support async AcquireNextImageKHR, and certainly not Mesa’s implementation of VK_KHR_display.

In EGL, we made use of GBM to allocate DRM buffers directly and pumped through our own page flips using the super low-level DRM API. In Vulkan however, we cannot go that low-level as we go through VK_KHR_display. There are tradeoffs. Nvidia for example supports VK_KHR_display on Linux, but not GBM. VK_KHR_display is a cleaner abstraction than raw DRM.

I had some issues with Mesa’s implementation of VK_KHR_display however, so I tried to fix them.

Lack of MAILBOX or IMMEDIATE present modes

This is the first glaring omission. In emulators, a key feature is being able to fast-forward. Usually, we also want to fast-forward completely seamlessly. Are you playing an old console RPG and want to make random encounters less dull? Just hit fast-forward and blast through. Unfortunately, the Mesa implementation of VK_KHR_display does not support the present modes which facilitate this use case. In GL, we trivially support this by setting glXSwapInterval(0) or (1) and it should “just work”.

MAILBOX is preferred because it is tear-free, but IMMEDIATE is a good fallback too.
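For reference, picking between these modes at swapchain creation could look something like this on the application side (a sketch, not RetroArch's actual code):

#include <algorithm>
#include <vector>
#include <vulkan/vulkan.h>

// Prefer MAILBOX for tear-free fast-forward, IMMEDIATE as a fallback,
// and FIFO (guaranteed to be supported) otherwise.
VkPresentModeKHR pick_present_mode(VkPhysicalDevice gpu, VkSurfaceKHR surface)
{
	uint32_t count = 0;
	vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, nullptr);
	std::vector<VkPresentModeKHR> modes(count);
	vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, modes.data());

	for (auto preferred : { VK_PRESENT_MODE_MAILBOX_KHR, VK_PRESENT_MODE_IMMEDIATE_KHR })
		if (std::find(modes.begin(), modes.end(), preferred) != modes.end())
			return preferred;
	return VK_PRESENT_MODE_FIFO_KHR;
}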

So, I tried patching^H^H^Hhacking this up in Mesa.

MAILBOX

Basically, the FIFO implementation revolves around using drmModePageFlip. This queues up a framebuffer to be displayed on the next vblank, provided rendering to that buffer has completed by then. Apparently, the kernel tracks rendering completion on DRM buffers by itself, and whatever semaphores you pass in to vkQueuePresent don't seem to matter at all.

Now, one glaring problem with drmModePageFlip is that once you have queued up a flip, you cannot queue up another one until it has completed on the next vblank. The page flip itself can be polled through the DRM FD. The page flip event will say when the page flip happened, and which image was flipped in.

After the page flip, the VK_KHR_display implementation has a thread which checks if there are any queued-up frames which can be set up with drmModePageFlip, and that keeps the FIFO queue going. If there are multiple frames queued up, the first image queued by the application is selected for the next page flip. So, I implemented MAILBOX using a really basic idea: when queuing up a new page flip, pick the latest image queued by the application. Other frames which were queued up are transitioned back into the IDLE state, because they would never end up being displayed anyway. This worked wonderfully for my use case. Hundreds of FPS without tearing achieved.
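Roughly, the pick-latest idea looks something like this (the types and names are my own sketch, not Mesa's actual code):

#include <deque>
#include <xf86drmMode.h>

enum class ImageState { Idle, Queued, Displayed };
struct Image { uint32_t fb_id; ImageState state; };

// Called when the previous flip has completed: flip the most recently
// queued image, and recycle older queued images back to IDLE since
// they will never be displayed.
void schedule_next_flip(int fd, uint32_t crtc_id, std::deque<Image *> &queued)
{
	if (queued.empty())
		return;
	Image *latest = queued.back();
	queued.pop_back();
	for (Image *img : queued)
		img->state = ImageState::Idle;
	queued.clear();
	drmModePageFlip(fd, crtc_id, latest->fb_id, DRM_MODE_PAGE_FLIP_EVENT, latest);
}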

I also implemented a better AcquireNextImageKHR. If there are multiple IDLE state images, I picked the image which was presented earliest. This way, you get a round-robin-like model, whereas the old model just picked the first IDLE frame. This might work just fine for FIFO, but not for MAILBOX.

Another thing I did was to force at least 4 images for MAILBOX. Unlike the Wayland implementation, I didn’t force a 4-deep swapchain in minImageCount. It’s perfectly fine for a swapchain to return more images than what is being requested. This is probably an API wart. It is a bit awkward to not have different minImageCount queries per present mode …

IMMEDIATE

This mode allows tearing, and there is a particular flag in the DRM API for it. You can pass the DRM_MODE_PAGE_FLIP_ASYNC flag down to drmModePageFlip, and the page flip will just happen when it happens. Unfortunately, this only worked as expected on AMD. Apparently, Intel hardware cannot support this flip mode.

Lack of seamless transition between present modes

One annoying aspect of Vulkan WSI is that you cannot just change the present interval on the fly. Once you have a swapchain, you cannot just call glXSwapInterval or similar to toggle vsync. Instead, you need to create a new VkSwapchainKHR, setting oldSwapchain to hopefully reuse the images in the old swapchain, and then delete the old swapchain. Assuming a decent implementation, you should get a seamless transition over to the new present mode.
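In code, toggling vsync therefore looks roughly like this (a sketch; the full create info and error handling are omitted):

// Recreate the swapchain with a new present mode, chaining oldSwapchain
// so a good implementation can reuse the images.
VkSwapchainCreateInfoKHR info = swapchain_create_info; // Filled in earlier.
info.presentMode = vsync ? VK_PRESENT_MODE_FIFO_KHR : VK_PRESENT_MODE_MAILBOX_KHR;
info.oldSwapchain = swapchain;

VkSwapchainKHR new_swapchain;
vkCreateSwapchainKHR(device, &info, nullptr, &new_swapchain);
// The old swapchain is retired by the create call, but we still have to destroy it.
vkDestroySwapchainKHR(device, swapchain, nullptr);
swapchain = new_swapchain;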

Unfortunately, this did not work well. First of all, the VK_KHR_display implementation did not bother with oldSwapchain. When deleting the old swapchain after creating a new one, the screen would black out for a while, usually 3-4 seconds, before starting up again, kinda like doing a full mode change. I tracked this down to drmModeRmFB, which deletes the old framebuffer references. Apparently, if you delete a framebuffer which is currently being displayed, the screen blacks out. This kind of makes sense, as there is nothing to display anymore, but for toggling vsync-state this is just unacceptable.

The patching of this got a bit hairier.

I ended up with a scheme which can “steal” images from oldSwapchain assuming the formats and dimensions match up. I tried to pilfer the displayed image especially, because its framebuffer cannot be allowed to die or we get a black screen. (Note to self, there might be a ton of edge cases here with synchronization …) To make sure I don’t get stale page flip events, I block to make sure any pending flip has completed before continuing. The reason for this is to avoid a race condition where I end up freeing the oldSwapchain before the page flip handler. The page flip handler has a reference to the swapchain it came from. After being pilfered from, the old swapchain is considered dead, and I return VK_ERROR_OUT_OF_DATE_KHR on any following requests to acquire from it.

This combined with my crude MAILBOX implementation allowed me to get seamless transitions between tear-free fast-forward and butter smooth vsync gameplay with very low latency.

Doing MAILBOX properly in DRM

Unfortunately, the MAILBOX implementation I made is just a crude hack, and not a valid implementation. Effectively, when you queue up a page flip in DRM, you're deciding what the next image to display will be long before the vblank actually occurs. This increases latency quite a lot, and you also risk overshooting: if the GPU cannot complete the frame in time, you cannot flip anything on the next vblank. This is just dumb.

True MAILBOX needs to be handled in the VBlank handler inside DRM somehow, and DRM has no API available which can support this present mode. The way I expect this to work is:

  • drmModePageFlip can be called multiple times, up to N times, if you pass down a flag called, say, DRM_MODE_PAGE_FLIP_REPLACE. This would allow multiple pending page flips to be in flight rather than returning EBUSY while a flip is pending.
  • In the vblank handler, the latest entry in the queue, whose rendering is also complete on the GPU, is selected. This frame is programmed into the display controller.
  • The earlier frames in the queue which were available to be presented, but were not selected for page flip, will be reported through a “discard” event, similar to the PAGE_FLIP event. This allows the WSI implementation to release the images to the application.
  • The current page flip callback will report the actual image which was selected for presentation.

Some further improvements to this scheme could be that discard events are returned as early as possible. If a swapchain image completes rendering in the middle of the frame, the kernel knows it can “discard” earlier completed frames ahead of time. This way, we avoid stalling the GPU at DisplayRate * (SwapchainImages – 1) frame rates, which can happen if we render all available swapchain images and need to wait for the next vblank to discard all frames.

This is way beyond me unfortunately, but it sounds like a very valuable thing to have. It seems like I will need to maintain my own hacky patch set on top of Mesa for now. 🙂

Render graphs and Vulkan — a deep dive

Modern graphics APIs such as Vulkan and D3D12 bring new challenges to engine developers. While the CPU overhead has dramatically been reduced by these APIs, it's clear that it is difficult to bridge the gap in terms of GPU performance when we are hitting the “good” paths of the driver, and we are GPU bound. OpenGL and D3D11 drivers (clearly) go to extreme lengths in order to improve GPU performance using all sorts of trickery. The cost we pay for this as developers is unpredictable performance and higher CPU overhead. Writing graphics backends has become more interesting again, as we are still figuring out how to build great rendering backends for these APIs which balance flexibility, performance and ease of use.

Last week I released my side-project, Granite, which is my take on a Vulkan rendering engine. While there are plenty of such projects out in the wild, all with their own merits, I would like to discuss my render graph implementation in particular.

The render graph implementation is inspired by Yuriy O'Donnell's GDC 2017 presentation: “FrameGraph: Extensible Rendering Architecture in Frostbite.” While that talk focuses on D3D12, I've implemented my own for Vulkan.

(Note: render graphs and frame graphs mean the same thing here. Also, if I mention Vulkan, it probably applies to D3D12 as well … maybe)

The problem

Render graphs fundamentally solve a very annoying problem in modern APIs. How do we deal with manual synchronization? Let’s go over the obvious alternatives.

Just-in-time synchronization

The most straightforward approach is basically doing synchronization at the last minute. Whenever we start rendering to a texture, bind a resource, or similar, we need to ask ourselves: “does this resource have pending work which needs to be synchronized?” If so, we need to somehow deal with it at the very last minute. This kind of tracking clearly becomes very painful, because we might read a resource 1000+ times while only writing it once. Multithreading becomes very painful too: what if two threads discover that a barrier is needed? One thread needs to “win”, and now we have a lot of useless cross-thread synchronization hassles to deal with.

It’s also not just execution itself we need to track, we also have the problem of image layouts and memory access in Vulkan. Using a resource in a particular way will require a specific image layout (or just GENERAL, but you might lose framebuffer compression!).

Essentially, if what we want is just-in-time automatic sync, we basically want OpenGL/D3D11 again. Drivers have already been optimized to death for this, so why do we want to reimplement it in a half-assed way?

Fully explicit synchronization

On the other side of the spectrum, the API abstraction we choose completely removes automatic synchronization, and the application needs to deal with every synchronization point manually. If you make a mistake, prepare for some “interesting” debugging sessions.

For simpler applications, this is fine, but once you start going down this route you quickly realize what a mess it turns into. Typically, your rendering pipeline will be compartmentalized into blocks — maybe you have the forward/deferred/whatever-is-cool-now renderer in one module, some post-processing passes scattered around in other modules, and maybe you drag in some feedback for reprojection steps. You add a new technique here and there, and then you realize you have to redo your synchronization strategy — again, and things turn sour.

Why does this happen?

Let’s write some pseudo-code for a dead-simple post-processing pass and think about it.

// When was the last time I read from this image? Probably last frame later in the post-chain ...
// We want to avoid write-after-read hazards.
// We're going to write the whole image,
// so we might as well transition from UNDEFINED to "discard" the previous content ...
// Ideally I would keep careful track of VkEvents from earlier frames, but that got so messy ...
// Where was this render target allocated from?
BeginRenderPass(RT = BloomThresholdBuffer)

// This image was probably written to in the previous pass, but who knows anymore.
BindTexture(HDR)

DrawMyQuad()
EndRenderPass()

These kinds of problems are typically solved with a big fat pipeline barrier. Pipeline barriers let you reason locally about global synchronization issues, but they’re not always the optimal way to do it.

// To be safe, wait for all fragment execution to complete, this takes care of write-after-read and syncing the HDR render pass ...
// Assuming they are never used in async compute ... hm, this will probably work fine for now.

PipelineBarrier(FRAGMENT -> FRAGMENT,
    RT layout: UNDEFINED -> COLOR_ATTACHMENT_OPTIMAL,
    RT srcAccess: 0 (write-after-read),
    RT dstAccess: COLOR_ATTACHMENT_WRITE_BIT,
    HDR layout: COLOR_ATTACHMENT_OPTIMAL -> SHADER_READ_ONLY,
    HDR srcAccess: COLOR_ATTACHMENT_WRITE_BIT,
    HDR dstAccess: SHADER_READ_BIT)

BeginRenderPass(...)

So we transitioned the HDR image because we assumed it was written by the previous pass, but maybe in the future you add a different pass in between which also transitions it … So now you still need to keep track of image layouts, bleh, but not the end of the world.

If you're only dealing with FRAGMENT -> FRAGMENT workloads, this is probably not so bad; there isn't all that much overlap which happens between render passes anyway. When you start throwing compute into the mix is when you start pulling your hair out, because you just can't slap pipeline barriers like this all over the place; you need some non-local knowledge about your frame in order to achieve optimal execution overlap. Plus, you might even need semaphores because you're doing async compute now in a different queue.

Render graph implementation

I’ll be mostly referring to these files: render_graph.hpp and render_graph.cpp.

Note: This is a huge brain dump. Try to follow along in the code while reading this, I’ll go through things in order.

Note #2: I use the terms “flush” and “invalidate” in the implementation. This is not Vulkan spec lingo. Vulkan uses the terms “make available” and “make visible” respectively. Flush refers to cache flushing, invalidate refers to cache invalidation.

The basic idea is that we have a “global” render graph. All components in the system which need to render stuff need to register with this render graph. We specify which passes we have, which resources go in, which resources are written and so on. This could be done once on application startup, once every frame, or however often you need. The main idea is that we form global knowledge of the entire frame and we can optimize accordingly at a higher level. Modules can reason locally about their inputs and outputs while allowing us to see the bigger picture, which solves a major issue we face when the backend API does not schedule automatically and deal with dependencies for us. The render graph can take care of barriers, layout transitions, semaphores, scheduling, etc.

Outputs from a render pass need some dimensions. Fairly straightforward.

Images:

struct AttachmentInfo
{
	SizeClass size_class = SizeClass::SwapchainRelative;
	float size_x = 1.0f;
	float size_y = 1.0f;
	VkFormat format = VK_FORMAT_UNDEFINED;
	std::string size_relative_name;
	unsigned samples = 1;
	unsigned levels = 1;
	unsigned layers = 1;
	bool persistent = true;
};

Buffers:

struct BufferInfo
{
	VkDeviceSize size = 0;
	VkBufferUsageFlags usage = 0;
	bool persistent = true;
};

These resources are then added to render passes.

// A deferred renderer setup

AttachmentInfo emissive, albedo, normal, pbr, depth; // Default is swapchain sized.
emissive.format = VK_FORMAT_B10G11R11_UFLOAT_PACK32;
albedo.format = VK_FORMAT_R8G8B8A8_SRGB;
normal.format = VK_FORMAT_A2B10G10R10_UNORM_PACK32;
pbr.format = VK_FORMAT_R8G8_UNORM;
depth.format = device.get_default_depth_stencil_format();

auto &gbuffer = graph.add_pass("gbuffer", VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT);
gbuffer.add_color_output("emissive", emissive);
gbuffer.add_color_output("albedo", albedo);
gbuffer.add_color_output("normal", normal);
gbuffer.add_color_output("pbr", pbr);
gbuffer.set_depth_stencil_output("depth", depth);

auto &lighting = graph.add_pass("lighting", VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT);
lighting.add_color_output("HDR", emissive, "emissive");
lighting.add_attachment_input("albedo");
lighting.add_attachment_input("normal");
lighting.add_attachment_input("pbr"));
lighting.add_attachment_input("depth");
lighting.set_depth_stencil_input("depth");

lighting.add_texture_input("shadow-main"); // Some external dependencies
lighting.add_texture_input("shadow-near");

Here we see three ways which a resource can be used in a render pass.

  • Write-only, the resource is fully written to. For render targets, loadOp = CLEAR or DONT_CARE.
  • Read-write, preserves some input, and writes on top, for render targets, loadOp = LOAD.
  • Read-only, duh.

The story is similar for compute. Here's an adaptive luminance update pass, done in async compute:

auto &adapt_pass = graph.add_pass("adapt-luminance", VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
adapt_pass.add_storage_output("average-luminance-updated", buffer_info, "average-luminance");
adapt_pass.add_texture_input("bloom-downsample-3");

The luminance buffer gets an RMW here, for example.

We also need some callbacks which can be called every frame to actually do some work, for gbuffer …

gbuffer.set_build_render_pass([this, type](Vulkan::CommandBuffer &cmd) {
	render_main_pass(cmd, cam.get_projection(), cam.get_view());
});

gbuffer.set_get_clear_depth_stencil([](VkClearDepthStencilValue *value) -> bool {
	if (value)
	{
		value->depth = 1.0f;
		value->stencil = 0;
	}
	return true; // CLEAR or DONT_CARE?
});

gbuffer.set_get_clear_color([](unsigned render_target_index, VkClearColorValue *value) -> bool {
	if (value)
	{
		value->float32[0] = 0.0f;
		value->float32[1] = 0.0f;
		value->float32[2] = 0.0f;
		value->float32[3] = 0.0f;
	}
	return true; // CLEAR or DONT_CARE?
});

The render graph is responsible for allocating the resources and driving these callbacks, and finally submitting this to the GPU in the proper order. To terminate this graph, we promote a particular resource as the “backbuffer”.

// This is pretty handy for ad-hoc debugging :P
const char *backbuffer_source = getenv("GRANITE_SURFACE");
graph.set_backbuffer_source(backbuffer_source ? backbuffer_source : "tonemapped");

Now let’s get into the actual implementation.

Time to bake!

Once we’ve set up the structures, we need to bake the render graph. This goes through a bunch of steps, each completing one piece of the puzzle …

Validate

Pretty straightforward, a quick sanity check to ensure that the data in the RenderPass structures makes sense.

One interesting thing here is that we can check if color input dimensions match color outputs. If they differ, we don't do a straight loadOp = LOAD, but we can do a scaled blit instead at the start of the render pass. This is super convenient for things like rendering the game at a lower resolution -> UI at native res. The loadOp in this case becomes DONT_CARE.

Traverse dependency graph

We have an acyclic graph (I hope … :D) of render passes now, which we need to flatten down into an array of render passes. The list we create will be a valid submission order if we were to submit every pass one after the other. This submission order might not be the most optimal, but we’ll get close later.

The algorithm here is straightforward. We traverse the graph bottom-up. Using recursion, push the pass index of all the passes which write to the backbuffer, then, for each of those passes, push the indices of the passes which write to their inputs … and so on until we reach the top leaves. This way, we ensure that if a pass A depends on pass B, pass B will always be found later than A in the list. Now, reverse the list, and prune duplicates.
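As a sketch of the idea (assumed names and data layout, not Granite's exact code):

#include <algorithm>
#include <vector>

struct PassInfo { std::vector<unsigned> input_resources; };
struct ResourceInfo { std::vector<unsigned> written_in_passes; };
std::vector<PassInfo> passes;        // Assumed global graph state.
std::vector<ResourceInfo> resources;

// Push a pass, then recurse into every pass which writes one of its
// inputs. Assumes the graph is acyclic.
static void traverse(unsigned pass_index, std::vector<unsigned> &order)
{
	order.push_back(pass_index);
	for (unsigned input : passes[pass_index].input_resources)
		for (unsigned writer : resources[input].written_in_passes)
			traverse(writer, order);
}

// Reverse so dependencies come first, then prune duplicates, keeping the
// first occurrence so every pass lands after everything it depends on.
std::vector<unsigned> flatten(const std::vector<unsigned> &backbuffer_writers)
{
	std::vector<unsigned> order;
	for (unsigned writer : backbuffer_writers)
		traverse(writer, order);
	std::reverse(order.begin(), order.end());

	std::vector<unsigned> flattened;
	for (unsigned pass : order)
		if (std::find(flattened.begin(), flattened.end(), pass) == flattened.end())
			flattened.push_back(pass);
	return flattened;
}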

We also register if a pass is a good “merge candidate” with another pass. For example, the lighting pass uses input attachments from gbuffer pass, and it shares some color/depth attachments … On tile-based architectures we can actually merge those passes without going to main memory using Vulkan’s multipass feature, so we keep this in mind for the reordering pass which comes after.

Render pass reordering

This is the first interesting step of the process. Ideally, we want a submission order which has optimal overlap between passes. If pass A writes some data, and pass B reads it, we want the maximum number of passes between A and B in order to minimize the number of “hard barriers”. This becomes our optimization metric.

The algorithm implemented is probably quite suboptimal in terms of CPU time, but it gets the job done. It looks through the list of passes not yet scheduled in, and tries to figure out the best one based on three criteria:

  • Do we have a merge candidate as determined by the dependency graph traversal step earlier? (Score: infinite)
  • What is the latest pass in the list of already scheduled passes which we need to wait for? (Score: number of passes which can overlap in-between)
  • Does scheduling this pass break the dependency chain? (If so, skip this pass).

Reading the code is probably more instructive, see RenderGraph::reorder_passes().

Another sneaky consideration which should be included is when the lighting pass depends on some resources, while the G-buffer pass doesn’t. This can break subpass merging, because we go through this scheduling process:

  • Schedule in G-buffer pass, it has no dependencies
  • Try to schedule in lighting pass, but whoops, we haven’t scheduled the shadow passes which we depend on yet … Oh well 🙂

The dirty solution to this was to lift dependencies from merge candidates to the first pass in the merge chain. Thus, the G-buffer pass will be scheduled after shadow passes, and it’s all good. A more clever scheduling algorithm might help here, but I’d like to keep it as simple as possible.

Logical-to-physical resource assignment

When we build our graph, we might have some read-modify-writes. For lighting pass, emissive goes in, HDR result goes out, but clearly, it’s really the same resource, we just have this abstraction to figure out the dependencies in a sensible way, give some descriptive names to resources, and avoid cycles. If we had multiple passes, all doing emissive -> emissive for example, we have no idea which pass comes first, they all depend on each other (?), and I’d rather not deal with potential cycles.

What we do now is assign a physical resource index to all resources, and alias resources which do read-modify-write. If we cannot alias for some reason, it’s a sign we have a very wonky submission order which tries to do reads concurrently with writes. The implementation just throws its hands in the air in that case. I don’t think this will happen with an acyclic graph, but I cannot prove it.

Logical-to-physical render pass assignment

Next, we try to merge adjacent render passes together. This is particularly important on tile-based renderers. We try to merge passes together if:

  • They are both graphics passes
  • They share some color/depth/input attachments
  • Not more than one unique depth/stencil attachment exists
  • Their dependencies can be implemented with BY_REGION_BIT, i.e. no “texture” dependency, which would require sampling from arbitrary locations.

Transient or physical image storage

Similar story as subpass merging, tile-based renderers can avoid allocating physical memory for the attachment if you never actually write to it (with storeOp = STORE)! This can save a lot of memory for deferred especially, but also for depth buffers if they are not used later in post for example.

A resource can be transient if:

  • It is used in a single physical render pass (i.e. it never needs storeOp = STORE)
  • It is invalidated at the start of the render pass (no loadOp = LOAD needed)

Build RenderPassInfo structures

Now, we have a clear view of all the passes, their dependencies and so on. It is time to make some render pass info structures.

This part of the implementation is very tied into how Granite’s Vulkan backend does things, but it closely mirrors the Vulkan API, so it shouldn’t be too weird. VkRenderPasses are generated on demand in the Vulkan backend, so we don’t do that here, but we could potentially bake that right now.

The actual image views will be assigned later (every frame actually), but subpass infos, number of color attachments, inputs, resolve attachments for MSAA, and so on can be done up front at least. We also build a list of which physical resource indices should be pulled in as attachments as well.

We also figure out which attachments need loadOp = CLEAR or DONT_CARE now by calling some callbacks. For attachments which have an input, just use loadOp = LOAD (or use scaled blits!). For storeOp we just say STORE always. Granite recognizes transient attachments internally, and forces storeOp = DONT_CARE for those attachments anyways.

Build barriers

It is time to start looking at barriers. For each pass, each resource goes through three stages:

  • Transition to the appropriate layout, caches need to be invalidated
  • Resource is used (read and/or writes happen)
  • The resource ends up in a new layout, with potential writes which need to be flushed later

For each pass we build a list of “invalidates” and “flushes”.

Inputs to a pass are placed in the invalidate bucket, outputs are placed in the flush bucket. Read-modify-write resources will get an entry in both buckets.

For example, if we want to read a texture in a pass we might add this invalidate barrier:

  • stages = FRAGMENT (or well, VERTEX, but I’d have to add extra stage flags to resource inputs)
  • access = SHADER_READ
  • layout = SHADER_READ_ONLY_OPTIMAL

For color outputs, we might say:

  • stages = COLOR_ATTACHMENT_OUTPUT
  • access = COLOR_ATTACHMENT_WRITE
  • layout = COLOR_ATTACHMENT_OPTIMAL

This tells the system that “hey, there are some pending writes in this stage, with this memory access which needs to be flushed with srcAccessMask. If you want to use this resource, sync with these things!”

We can also figure out a particular scenario here with render passes. If a resource is used as both input attachment and read-only depth attachment, we can set the layout to DEPTH_STENCIL_READ_ONLY_OPTIMAL. If color attachment is used also as an input attachment we can use GENERAL (programmable blending yo!), and similar for read-write depth/stencil with input attachment.

Build physical render pass barriers

Now, we have a complete view of each pass’ barriers, but what happens when we start to merge passes together? Multipass will likely perform some barriers internally as part of the render pass execution (think deferred shading), so we can omit some barriers here. These barriers will be resolved internally with VkSubpassDependency when we build the VkRenderPass later, so we can forget about all barriers which need to happen between subpasses.

What we are interested in is building invalidation barriers for the first pass a resource is used. For flush barriers we care about the last use of a resource.

Now, there are two cases we need to cover here to ensure that every pass can deal with synchronization before and after the pass executes.

Only invalidation barrier, no flush barrier

This is the case for read-only resources. We still need to guard ourselves against write-after-read hazards later. For example, what if the next pass starts to write to this resource? Clearly, we need to let other passes know that this pass needs to complete before we can start scribbling on a resource. The way this is implemented is by injecting a fake flush barrier with access = 0. access = 0 basically means: “there are no pending writes to be seen here!” This way we can have multiple passes back to back which all just read a resource. If the image layout stays the same and srcAccessMask is 0, we don’t need barriers.
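As a sketch of the bookkeeping (the struct and field names here are my own, not Granite's):

// Inside pass baking, for a read-only resource:
Barrier fake_flush = {};                  // Assumed bookkeeping struct.
fake_flush.resource = resource_index;
fake_flush.layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
fake_flush.stages = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
fake_flush.access = 0;                    // No pending writes, execution dependency only.
pass.flush_barriers.push_back(fake_flush);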

Only flush barrier, no invalidation barrier

This is typically the case for passes which are “write only”. This lets us know that before the pass begins we can discard the resource by transitioning from UNDEFINED. We still need an invalidation barrier however, because we need a layout transition to happen before we start the render pass and caches need to be invalidated, so we just inject an invalidate barrier here with same layout and access as the flush barrier.

Ignore barriers for transients/swapchain

You might notice that barriers for transients are just “dropped” for some reason. Granite internally uses external subpass dependencies to perform layout transitions on transient attachments, although this might be kind of redundant now with the render graph. The swapchain is similar. Granite internally uses subpass dependencies to transition the swapchain image to finalLayout = PRESENT_SRC_KHR when it is used in a render pass.

Render target aliasing

The final step in our baking process is to figure out if we can temporally alias resources in the graph. For example, we might have two or more resources which exist at completely different times in a frame. Consider a separable blur:

  • Render a frame (Buffer #0)
  • Blur horiz (Buffer #1)
  • Blur vert (Should ping-pong back to buffer #0)

When we specify this in the render graph we have 3 distinct resources, but clearly, the vertical blur render target can alias with the initial render target. I suggest looking at Frostbite's presentation on their results with aliasing; the memory savings are quite massive.

We could technically alias actual VkDeviceMemory here, but this implementation just tries to reuse VkImages and VkImageViews directly. I’m not sure if there is much to be gained by trying to suballocate directly from the dead corpses of other images and hope that it will work out. Something to look at if you’re really starved for memory I guess. The merit of aliasing image memory might be questionable, as VK_*_dedicated_allocation is a thing, so some implementation might prefer that you don’t alias. Some numbers and IHV guidance on this is clearly needed.

The algorithm is fairly straightforward. For each resource we figure out the first and last physical render pass where it is used. If we find another resource with the same dimensions/format, and their pass ranges do not overlap, presto, we can alias! We inject some information where we can transition “ownership” between resources.
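A sketch of the overlap test (assumed field names):

// Assumed per-resource baking info.
struct PhysicalResource
{
	unsigned first_pass, last_pass; // Physical render pass range of use.
	VkFormat format;
	unsigned width, height;
};

// Two resources can alias if their pass ranges are disjoint and the
// image dimensions/format match.
bool can_alias(const PhysicalResource &a, const PhysicalResource &b)
{
	bool disjoint = a.last_pass < b.first_pass || b.last_pass < a.first_pass;
	return disjoint && a.format == b.format &&
	       a.width == b.width && a.height == b.height;
}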

For example, if we have three resources:

  • Alias #0 is used in pass #1 and #2
  • Alias #1 is used in pass #5 and #7
  • Alias #2 is used in pass #8 and #11

At the end of pass #2, the barriers associated with Alias #0 are copied over to Alias #1, and the layout is forced to UNDEFINED. When we start pass #5, we will magically wait for pass #2 to complete before we transition the image to its new layout. Alias #1 hands over to alias #2 after pass #7 and so on. Pass #11 hands over control back to alias #0 in the next frame in a “ring”-like fashion.

Some caveats apply here. Some images might have “history” or “feedback” where each image actually has two instances of itself, one for current frame, and one for previous frame. These images should never alias with anything else. Also, transient images do not alias. Granite’s internal transient image allocator takes care of this aliasing internally, but again, with the render graph in place, that is kind of redundant now …

Another consideration is that adding aliasing might increase the number of barriers needed and reduce GPU throughput. Maybe the aliasing code needs to take extra barrier cost into consideration? Urk … At least if you know your VRAM size while baking, you have a pretty good idea of whether aliasing is actually worth it based on all the resources in the graph. Optimizing the dependency graph for maximum overlap also greatly reduces the opportunities for aliasing, so if we want to take memory into consideration, this algorithm could easily get far more involved …

Preparing resources for async compute

For async compute, resources might be accessed by both a graphics and a compute queue. If their queue families differ (ohai AMD), we have to decide if we want EXCLUSIVE or CONCURRENT queue access to these resources. For buffers, using CONCURRENT seems like an obvious choice, but it’s a bit more complicated with images. In the name of not making this horribly complicated, I went with CONCURRENT, but only for the resources which are truly needed in both compute and graphics passes. Dealing with EXCLUSIVE will be brutal, because now we have to consider read-after-read barriers as well and ping-pong ownership between two queue families 😀 (Oh dear)

Summary

A lot of stuff to go through, but now we have all the data structures in place to start pumping out frames.

The runtime

While baking is a very involved process, executing it is reasonably simple; we just need to track the state of all the resources we know about in the graph.

Each resource stores the following (summed up in a sketch after the list):

  • The last VkEvent. If we need to ask ourselves, “what do I need to wait for before I touch this resource”, this is it. I opted for VkEvent because it can express execution overlap, while pipeline barriers cannot.
  • The last VkSemaphore for graphics queue. If the resource is used in async compute, we use semaphores instead of VkEvents. Semaphores cannot be waited on multiple times, so we have a semaphore which can be waited on once in the graphics queue if needed.
  • The last VkSemaphore for compute queue. Same story, but for waiting in the compute queue once.
  • Flush stages (VkPipelineStageFlags), this contains the stages which we need to wait for (srcStageMask) if we need to wait for the resource.
  • Flush access (VkAccessFlags), this contains the srcAccessMask of memory we need to flush before we can use the resource.
  • Per-stage invalidation flags (VkAccessFlag for each pipeline stage). These bitmasks keep track of in which pipeline stages and access flags it is safe to use the resource. If we figure out that we have an invalidation barrier, but all the relevant stages and access bits are already good to go, we can drop the barrier altogether. This is great for cases where we read the same resource over and over, all in SHADER_READ_ONLY_OPTIMAL layout.
  • The current layout of the resource. This is currently stored inside the image handles themselves, but this might be a bit wonky if I add multithreading later …
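As a struct, the tracking state could look something like this (a sketch; Granite's actual layout differs):

#include <vulkan/vulkan.h>

struct ResourceState
{
	VkEvent event = VK_NULL_HANDLE;                  // What to wait for before touching the resource.
	VkSemaphore graphics_semaphore = VK_NULL_HANDLE; // One-shot wait in the graphics queue.
	VkSemaphore compute_semaphore = VK_NULL_HANDLE;  // One-shot wait in the compute queue.
	VkPipelineStageFlags flush_stages = 0;           // srcStageMask covering pending writes.
	VkAccessFlags flush_access = 0;                  // srcAccessMask of writes to make available.
	VkAccessFlags invalidated_in_stage[32] = {};     // Per-stage access already made visible.
	VkImageLayout layout = VK_IMAGE_LAYOUT_UNDEFINED;
};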

For each frame, we assign resources. At the very least we have to replace the swapchain image, but some images might have been assigned as “not persistent”, in which case we allocate a fresh resource every frame. This is useful for scenarios where we trade more memory usage (more copies in flight on the GPU) for removal of all cross-frame barriers. This is probably a terrible idea for large render targets, but small compute buffers of a few kB each? Duh. If we can kick off GPU work earlier, that’s probably a good thing.

If we allocate a new resource, all barrier state is cleared to its initial state.

Now, we get into pushing render passes out. The current implementation loops through all the passes and deals with barriers as they come up. If you interleave this loop hard enough, I'm sure you'll see some multithreading potential here 🙂

Check conditional execution

Some render passes do not need to be run this frame, and might only need to run if something happened (think shadow maps). Each pass has a callback which can determine this. If a pass is not executed, it does not need invalidation/flush barriers. We still need to hand over aliasing barriers, so just do that and go to next pass.

Handle discard barriers

If a pass has discard barriers, just set the current layout of the image to UNDEFINED. When we actually do the layout transition, we will have oldLayout = UNDEFINED.

Handle invalidate barriers

This part comes down to figuring out if we need to invalidate some caches, and potentially flush some caches as well. There are some things we have to check here:

  • Are there pending flushes?
  • Does the invalidate barrier need a different image layout than the current one?
  • Are there some caches which have not been flushed yet?

If the answer to any of these questions is yes, we need some kind of barrier. We implement this barrier in one of three ways:

  • vkCmdWaitEvents – If the resource has a pending VkEvent, along with appropriate VkBufferMemoryBarrier/VkImageMemoryBarrier.
  • vkQueueSubmit w/ semaphore wait. Granite takes care of adding semaphores at submit time. We push in a wait semaphore along with dstWaitStageMask which matches our invalidate barrier. If we also need a layout transition, we can add a vkCmdPipelineBarrier with srcStageMask = dstStageMask to latch onto the dstWaitStageMask … and keep the pipeline going. We generally do not need to deal with srcAccessMask if we waited on a semaphore, so usually this will just be forced to 0.
  • vkCmdPipelineBarrier(srcStage = TOP_OF_PIPE_BIT). This is used if the resource hasn’t been used before, and we just need to transition away from UNDEFINED layout.

The barriers are batched up as appropriate and submitted. Buffers are much simpler as they do not have layouts.

After invalidation, we mark the appropriate stages as properly invalidated. If we changed the layout or flushed memory access as part of this barrier, we first clear all the invalidation masks to 0.

Execute render passes

This is the easy part, just call begin/nextsubpass/end and fire off some callbacks to push the real graphics work. For compute, just drop the begin/end.

For graphics we might do some scaled blits at the beginning and some automatic mipmap generation at the end.

Handle flush barriers

This part is simpler. If there is at least one resource which is only used in a single queue, we signal a VkEvent here and assign it to all relevant resources. If we have at least one resource which is used cross-queue, we also signal two semaphores here (one for graphics, one for compute later …)

We also update the current layout, and mark flush stages/flush access for later use.

Alias handoff

If the resource is aliased, we now copy the barrier state of a resource over to its next alias, and force the layout to UNDEFINED.

Submission

The command buffer for each pass is now submitted to Granite. Granite tries to batch up command buffers until it needs to wait for a semaphore or signal one.

Scale to swapchain

After all the passes are done, we can inject a final blit to swapchain if the backbuffer resource dimensions do not match the actual swapchain. Otherwise, we alias those resources anyways, so no need for useless blitting passes.

Conclusion

Hopefully this was interesting. The word count of this post is close to 5K at this point, and the render graph is a 3 ksloc behemoth (sigh). I’m sure there are bugs (actually I found two in async compute while writing this), but I’m quite happy how this turned out.

Future goals might be trying to see if this can be made into a reusable, standalone library and getting some actual numbers.