September 2018 – Maister's Graphics Adventures

In the world of graphics programming, interacting with the windowing system is not exactly the most riveting subject, but it is of critical importance to the usability of applications. Tearing, skipped frames, judder, etc, are all issues which stem from poor window system code. Over the years, it has been an ongoing frustration that good windowing system integration is something we just cannot rely on. For whatever reason, implementations and operating systems always find a way to screw things up.

For emulation, perfection is the only thing good enough. It is immediately obvious when tearing or skipped frames are observed in retro emulation. These games work on a fixed time step, and we must obey, or the result is borderline unplayable. For “normal” games, it seems like this isn’t as much of a concern. Games are written around the assumption of variable frame rates, users can disable V-Sync (especially for fast-paced, reaction-based games), etc, and variable refresh rate display standards were introduced to get the best of both worlds. The problems I’m going to explore in this post can often be glossed over in the common case.

Requirements for RetroArch:

Perfect, tear-free 1:1 frame mapping if game frame rate and monitor frame rate are close enough. Essentially VK_PRESENT_MODE_FIFO_KHR when it’s working as intended.
Ability to toggle between locked and unlocked (fast forward) frame rates, seamlessly. This is a very emulation-specific requirement, but extremely important. Unlocked could be either MAILBOX or IMMEDIATE.
Consistent frame pacing in FIFO mode. If there is too much variation in frame times, this translates into variable input latency for fixed FPS content, which is not ideal. It also hurts rate control for audio, although the frame pacing can get rather bad before this becomes a real issue. There’s no reason why we should have more than a millisecond of jitter for low-GPU load scenarios.
Control over latency. The GPU load of retro emulation is usually quite insignificant, but we need full control over the swap chain, when swaps happen, and when we can begin a frame and poll inputs. An absurd amount of effort has gone into reducing input latency by various developers, and all that effort consists of working around false assumptions in GPU drivers, which have been optimising for “normal game” FPS rather than latency and predictability. In Vulkan, with more explicit control over the swap chain, we should be able to control this far better than we ever could in legacy APIs. However, discussing latency (by like, measuring stuff with a high-speed camera) is outside of the scope of this post. There are more fundamental issues I would like to cover instead.
In (exclusive) full screen modes, we should have a flipping model, and if I ask for double buffer, I should get just that.
Borderless windowed full screen mode is also important for casual play when minimum latency isn’t the highest priority.
Windowed mode should be tear-free, without stutter, but is allowed to have a bit more latency than fullscreen because of compositors.
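
The locked/unlocked requirement comes down to picking a present mode from whatever the surface reports. Here is a minimal Python sketch of that choice. The function name and the string constants are stand-ins for illustration (real code would compare VkPresentModeKHR values returned by vkGetPhysicalDeviceSurfacePresentModesKHR), and preferring MAILBOX over IMMEDIATE is an assumption based on MAILBOX being the tear-free option.

```python
FIFO = "FIFO"
MAILBOX = "MAILBOX"
IMMEDIATE = "IMMEDIATE"


def choose_present_mode(supported, fast_forward):
    """Pick a present mode for locked (FIFO) or unlocked (fast forward) play.

    `supported` is the set of modes the surface reports. The Vulkan
    specification requires FIFO to be supported, so it is the fallback.
    """
    if not fast_forward:
        return FIFO
    # Assumed preference order: MAILBOX is tear-free, IMMEDIATE may tear.
    for mode in (MAILBOX, IMMEDIATE):
        if mode in supported:
            return mode
    return FIFO  # No unlocked mode available, e.g. a FIFO-only surface.
```

On a FIFO-only surface the function falls back to FIFO, which means fast forward has to be faked some other way or is simply unavailable.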

I have had the “great pleasure” of fighting with many different WSI implementations in the RetroArch Vulkan backend. I would like to summarise these, starting from “best” to “worst” for dramatic effect. I’ll also discuss some heinous workarounds I have had to employ to work around the worst implementations.

How it should work

Vulkan WSI is a fairly explicit model compared to its predecessors. You can request the number of images there should be in the swap chain, and you acquire and present these images directly. There is no magic “framebuffer 0”, or a magic SwapBuffers call which buffers an unknown amount for you.

There is no direct way to toggle between FIFO and MAILBOX/IMMEDIATE on a swap chain. Instead, we need to create a new swap chain. According to the specification, there can only be one non-retired swap chain active at a time. We have the option of passing in our “oldSwapchain” when creating a new swap chain. My understanding of this is that we can “pass over” ownership from one swap chain to the next. Passing in oldSwapchain will effectively retire the swap chain as well. If we just change the present mode for the new swap chain, this should give us a seamless transition over to the new present mode. After we have created the new swap chain, we can delete the old one and query the new (or the same!) swap chain images.
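
The hand-over order described above can be captured in a toy Python model. This is not the Vulkan API: the Surface and Swapchain classes and switch_present_mode are hypothetical names, and raising an exception for a second non-retired swap chain is only how this model flags the rule, not a claim about what a real driver returns.

```python
class Swapchain:
    def __init__(self, present_mode):
        self.present_mode = present_mode
        self.retired = False
        self.destroyed = False


class Surface:
    """Toy model: a surface allows only one non-retired swap chain."""

    def __init__(self):
        self.active = None

    def create_swapchain(self, present_mode, old_swapchain=None):
        if self.active is not None and self.active is not old_swapchain:
            raise RuntimeError("a non-retired swap chain already exists")
        if old_swapchain is not None:
            old_swapchain.retired = True  # Passing it in retires it.
        self.active = Swapchain(present_mode)
        return self.active


def switch_present_mode(surface, current, new_mode):
    """Create the replacement first, then destroy the retired chain."""
    replacement = surface.create_swapchain(new_mode, old_swapchain=current)
    current.destroyed = True  # Stand-in for vkDestroySwapchainKHR.
    return replacement
```

The point of the ordering is that the old swap chain is still alive when the new one is created, so the implementation has a chance to hand resources over instead of tearing everything down.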

vkAcquireNextImageKHR will give us new images to render into, and it should be unblocked on V-Blank when new images are flipped onto the display. vkAcquireNextImageKHR is an asynchronous acquire operation, but for RetroArch I want to sync the frame begin to V-Blank, so I just wait on the VkFence to signal anyways to force a synchronous acquire. Now, it turns out all the implementations I’ve tried so far seem to implement a synchronous vkAcquireNextImageKHR anyways, so it doesn’t really matter.

vkQueuePresentKHR should queue a flip as we expect, but it shouldn’t block.

One of my gripes with WSI is that there is no way to tell if you’re going to get exclusive or borderless windowed modes. It seems to be driver magic that controls this unfortunately. I’m not even sure if this is an OS concept or not at this point.

The test matrix gets rather large. There are two OSes I test here:

Linux
Windows 10 (7 seems to behave very differently according to users, but I don’t have a setup for that)

Android should be on this list, but I haven’t tested that.

Surface types:

Wayland (on GNOME3)
X11/XCB (over Xorg and XWayland, on GNOME3)
Win32
KHR_display

Driver stacks:

Mesa – RADV / Anvil (AMD/Intel), the WSI code is shared and behaves the same
AMDVLK (AMD Linux open source)
Nvidia (closed source)
AMD (closed source)

I’m using the latest public drivers as of writing. Now, time for some “good, bad and ugly” tiering for dramatic effect.

The good

These are implementations I consider fully functioning without any serious flaws, at least for RetroArch.

Mesa – Wayland – Linux

Wayland on Mesa is so far the shining implementation of WSI in my book.

Can toggle seamlessly between present modes without any hiccup, missed frames or anything silly.
Frame pacing is excellent. About 2% frame time standard deviation (about 300 microseconds) in my measurements.
Supports MAILBOX for tear-free fast forward.
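
For reference, a pacing figure like the one above can be computed from per-frame timestamps. The sketch below is one way to do it in Python and is not necessarily how the numbers in this post were measured; the function name is made up for illustration. At 60 Hz (an assumption, the refresh rate is not stated) a frame lasts about 16.7 ms, and 2% of that is roughly 330 microseconds, which lines up with the quoted value.

```python
import statistics


def frame_time_jitter(timestamps):
    """Return (mean, stddev, stddev / mean) of frame times.

    `timestamps` are frame-start times in seconds, for example sampled
    right after each acquire returns. Needs at least three timestamps.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.mean(deltas)
    stddev = statistics.stdev(deltas)  # Sample standard deviation.
    return mean, stddev, stddev / mean
```

The third value is the relative jitter, so 0.02 corresponds to the 2% figure.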

If I have to point out one flaw it’s the oddly reported minImageCount. It is 4 on Mesa Wayland, because of the MAILBOX implementation, which requires 4. However, even if I get 4 images in FIFO mode, it seems to only use 3 of them for some reason (no round-robin for you). I think this is a flawed implementation and minImageCount should be 2. It is perfectly valid for a WSI implementation to return more images than you request (Android seems to do this). It’s called “minImageCount” after all. But I think this highlights a Vulkan WSI flaw. minImageCount does not depend on presentMode!
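
Since the request has to respect the reported limits, an application ends up clamping its desired image count roughly like this. It is a small Python sketch with a hypothetical function name; the arguments mirror the minImageCount and maxImageCount fields of VkSurfaceCapabilitiesKHR, where a maxImageCount of 0 means there is no upper limit.

```python
def choose_image_count(desired, min_image_count, max_image_count):
    """Clamp a requested swap chain image count to the surface limits.

    A max_image_count of 0 means the surface reports no upper limit.
    The implementation may still create more images than requested.
    """
    count = max(desired, min_image_count)
    if max_image_count != 0:
        count = min(count, max_image_count)
    return count
```

With the Mesa Wayland numbers from above, choose_image_count(2, 4, 0) yields 4, so a double-buffer request is bumped to four images regardless of the present mode.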

I haven’t tested if it’s possible to get true double-buffer with page-flip on Mesa Wayland, but it should be fairly trivial to patch that.

Mesa – X11/XCB on Xorg – Linux

Xorg vsync has always been a serious pain point for me. It randomly works, or it doesn’t, depending on the phase of the moon. With DRI3 being used in Mesa’s implementation, it actually seems to work really well in both windowed and fullscreen. I would put it roughly on par with Wayland.

AMDVLK – Wayland – Linux (pending update to WSA and PAL)

I’ve submitted a lot of issues recently on their bugtracker to fix Wayland-related issues, and with the latest development branch I tried recently, AMDVLK has now reached parity with Mesa’s Wayland implementation in quality. It’s all using the same low-level user space and kernel code thanks to the AMDGPU efforts, so this is to be expected, great stuff.

The bad

These are flawed, but useable.

Mesa – VK_KHR_display – Linux

The main issue is that fast-forward does not exist, i.e., there is no MAILBOX or IMMEDIATE. DRM does not support true MAILBOX as it stands. IMMEDIATE is only conditionally supported (it is not implemented, but could be added very easily), and only AMD seems to support it. I ranted about this topic here: http://themaister.net/blog/2018/07/02/improving-vk_khr_display-in-mesa-or-lets-make-drm-better/

Nvidia – VK_KHR_display – Linux

Nvidia was the first vendor to support VK_KHR_display, which I applaud. VK_KHR_display is great for focused emulation boxes (although the only reason I assume we got any VK_KHR_display implementation on Linux was Steam VR). We never got this “direct to display” path working on their GL driver, but it works on Vulkan, yay.

The flaw with this implementation is that toggling fast forward works, but it causes a strange mode change, and there’s basically a short pause for reprogramming the DRM CRTC (or whatever it is doing). This is unacceptable, but at least the non-fast forward experience is flawless in my book. At least Nvidia’s implementation can support MAILBOX or IMMEDIATE, so it is a bit better than Mesa’s KHR_display implementation.

Nvidia – XCB on Xorg – Linux

This one is rather disappointing. Windowed mode is a stuttering, tearing mess (but so is GL it seems). Full screen seems to start off tearing, but after a second or so it seems to figure out that it should move into a flip-like mode. Toggling presentation modes leads to a frame or two of corrupted garbage on screen, but it doesn’t trigger a mode change at least. Fully functional, but a bit rough around the edges.

AMDVLK – XCB on Xorg – Linux

Struggles with frame pacing and tearing, but might have been fixed now. I haven’t tested it that much, but I much prefer RADV on Xorg to this. After recent updates, AMDVLK’s Wayland backend is much better. The main reason to use this setup is for Radeon Graphics Profiler.

The ugly

Broken and buggy for my use cases.

Mesa – XCB on Wayland (XWayland) – Linux

Pretty awful, and a stuttering mess. Poking at the source, there’s likely some kind of fallback path with fallback blits to deal with this, but stay far away from this. Also, in fullscreen, vkAcquireNextImageKHR seems to block indefinitely on XWayland for some reason. At least, there is no reason why you need to subject yourself to this backend for WSI.

AMD – Windows 10

Windows WSI implementations seem to be their own kind of hell, but red beats green here. Windowed mode is just fine. Toggling presentation modes works just fine without issue.

It’s fullscreen where we get a lot of problems. Toggling presentation modes throws the application out to desktop for 3 seconds, then you get the mode change. This is just broken. I found that it’s the deleting of the oldSwapchain which triggers the issue. Not even getting a few present operations through in the new swap chain will save you. There is no escape. Basically, the conclusion I came to is that we can never ever create a new swap chain on Windows, or we are screwed. So now, I had to come up with ways to work around this. Fortunately, Vulkan is flexible enough with the threading model that we can do some nasty tricks.

AMD seems to support vkAcquireNextImageKHR with a timeout of 0, but some other vendor did not …

Nvidia – Windows 10

This is nightmare fuel for RetroArch. It behaves like the AMD driver, except with some added fun:

vkAcquireNextImageKHR with timeout == 0 does not seem to be implemented. It will just block. 🙁
It’s impossible to create a swap chain in certain cases. maxImageExtent will be 0 when you minimize a window or alt+tab out of fullscreen, which leads to some rather interesting workarounds, because I need to allow a state where a swap chain does not exist yet, and avoid rendering to any swap chain related image while this goes on. No other vendor seems to have this behavior, but it is allowed by the specification, unfortunately.
Using oldSwapchain has been reported to break the driver, causing black screens. To work around this, I ifdef out oldSwapchain on Windows, and just destroy the swap chain before creating a new one. This breaks any hope of toggling presentation modes until this is fixed 🙁
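
The “swap chain may not exist” state from the list above can be sketched as a frame function that tolerates a missing swap chain. This is a simplified Python model with made-up names (swapchain_creatable, run_frame, the state dict); it covers only deferred creation, and dropping a swap chain once it goes out of date is left to the caller.

```python
def swapchain_creatable(max_image_extent):
    """True if the surface currently allows a non-empty swap chain."""
    width, height = max_image_extent
    return width > 0 and height > 0


def run_frame(state, max_image_extent, create_swapchain, render):
    """One frame of a loop that tolerates having no swap chain at all.

    `state` is a dict with a "swapchain" entry (None when absent).
    Returns True if the frame was rendered to a swap chain image.
    """
    if state["swapchain"] is None:
        if not swapchain_creatable(max_image_extent):
            return False  # Minimized: run the emulator core, draw nothing.
        state["swapchain"] = create_swapchain(max_image_extent)
    render(state["swapchain"])
    return True
```

The emulated game keeps running either way; only the code touching swap chain images is skipped while the extent is zero.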

Windows commonalities

Frame pacing in FIFO is great.
Windowed mode works just fine.
Fullscreen seems to be “exclusive” only. Alt-tabbing out of Vulkan is a rather sluggish operation, unlike borderless windowed which is designed to fix this.
Changing present modes in fullscreen is completely broken, and needs some serious workarounds.

The nasty workaround – MAILBOX emulation

To emulate fast forward, I ended up with a nasty hack for Windows. If fast forward is enabled in fullscreen mode, I spawn a thread which will do vkAcquireNextImageKHR in the background on-demand, and I’ll deal with the case where we haven’t acquired quite yet, by nooping out any access to the swap chain for that frame. The initial workaround for this was to just use timeout == 0, and avoid a dedicated thread, but … Nvidia’s implementation threw a wrench into that plan.

By doing it like this I can stay in FIFO present mode while faking a really terrible implementation of MAILBOX. Its performance isn’t that great, but at least I don’t get a 3 second delay just to trigger fast forward which should be instant in any sensible WSI implementation.
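
The shape of this workaround can be modelled in a few lines of Python. This is a sketch of the idea only, not the RetroArch code: BackgroundAcquirer and fast_forward_frame are hypothetical names, and blocking_acquire stands in for a vkAcquireNextImageKHR call that blocks until the display frees an image.

```python
import threading


class BackgroundAcquirer:
    """Runs a blocking acquire on a worker thread, one request at a time."""

    def __init__(self, blocking_acquire):
        # blocking_acquire stands in for vkAcquireNextImageKHR plus a fence
        # wait; it returns an image index once the display frees one.
        self._acquire = blocking_acquire
        self._cond = threading.Condition()
        self._requested = False
        self._image = None
        self._running = True
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        while True:
            with self._cond:
                while self._running and not self._requested:
                    self._cond.wait()
                if not self._running:
                    return
            image = self._acquire()  # Blocks without holding the lock.
            with self._cond:
                self._image = image
                self._requested = False

    def request(self):
        """Start an acquire unless one is in flight or already done."""
        with self._cond:
            if self._image is None and not self._requested:
                self._requested = True
                self._cond.notify()

    def try_take(self):
        """Return the acquired image index, or None without blocking."""
        with self._cond:
            image, self._image = self._image, None
            return image

    def stop(self):
        with self._cond:
            self._running = False
            self._cond.notify()
        self._thread.join()


def fast_forward_frame(acquirer, run_core, render_and_present):
    """One unlocked frame: always run the core, present only if possible."""
    run_core()
    acquirer.request()
    image = acquirer.try_take()
    if image is None:
        return False  # No image yet: skip all swap chain access.
    render_and_present(image)
    return True
```

The emulator core runs every iteration, while presentation only happens on frames where the worker has an image ready, which is what makes FIFO look like a crude MAILBOX. Note that stop() waits for an acquire that is already in flight to return.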

The Windows Vulkan experience in RetroArch should not be so terrible now, but know that it is only through several weeks of banging my head against the wall.

I hope some IHVs take this into consideration and make sure that toggling presentation modes works properly. Someone out there cares at least. No vendor I have seen so far deals with oldSwapchain in any way. There is a reason it’s there!


The state of Window System Integration (WSI) in Vulkan for retro emulators