Improving VK_KHR_display in Mesa – or, let’s make DRM better!

VK_KHR_display was recently added to Mesa, and I was very excited. Finally, we could get direct-to-display, lowest possible latency in Vulkan for programs like RetroArch (VK_KHR_display is not just for VR!). We have had support for KMS/GBM for a long time, and it works really well when you want to get the most direct access to the display at the cost of convenience. Since we have full control of how page flips happen, we avoid many of the pitfalls with compositors. When compositors work well, you should get optimal conditions in full-screen, but it’s very hard to guarantee. We have no control over whether X or Wayland chooses to go into a direct-to-display mode without compositors when we fullscreen the window surface. VK_KHR_display or the EGL equivalent is also the only realistic way to get good display performance on more embedded Linux systems, which are fairly popular for emulation boxes. There’s no need for X or Wayland getting in the way when you have a dedicated, 10-foot UI setup.

We can also control how much total buffering we want. Sometimes we want 2 buffers where CPU emulation + GPU rendering needs to complete in 16.66 ms (great for retro console emulation), and sometimes we want 3 buffers where CPU, GPU and display can overlap (usually the case when running GPU-intensive emulation). Controlling this is almost impossible to do reliably with compositors. We synchronize vkAcquireNextImageKHR using a VkFence instead of a VkSemaphore. Most implementations do not actually support async AcquireNextImageKHR, and certainly not Mesa’s implementation of VK_KHR_display.

In EGL, we made use of GBM to allocate DRM buffers directly and pumped through our own page flips using the super low-level DRM API. In Vulkan however, we cannot go that low-level as we go through VK_KHR_display. There are tradeoffs. Nvidia for example supports VK_KHR_display on Linux, but not GBM. VK_KHR_display is a cleaner abstraction than raw DRM.

I had some issues with Mesa’s implementation of VK_KHR_display however, so I tried to fix them.

Lack of MAILBOX or IMMEDIATE present modes

This is the first glaring omission. In emulators, a key feature is being able to fast-forward. Usually, we also want to fast-forward completely seamlessly. Are you playing an old console RPG and want to make random encounters less dull? Just hit fast forward and blast through. Unfortunately, the Mesa implementation of VK_KHR_display does not support the present modes which facilitate this use case. In GL, we trivially support this by calling glXSwapInterval with 0 or 1 and it should “just work”.

MAILBOX is preferred because it is tear-free, but IMMEDIATE is a good fallback too.

So, I tried patching^H^H^Hhacking this up in Mesa.

MAILBOX

Basically, the FIFO implementation revolves around using drmModePageFlip. This queues up a framebuffer to be displayed on the next vblank, provided that rendering to that buffer has completed. Apparently, the kernel tracks images on the DRM side, and whatever semaphores you pass in to vkQueuePresentKHR don’t seem to matter at all.

Now, one glaring problem with drmModePageFlip is that once you have queued up a flip, you cannot queue up another one until it has completed on the next vblank. The page flip itself can be polled through the DRM FD. The page flip event will say when the page flip happened, and which image was flipped in.

After the page flip, the VK_KHR_display implementation has a thread which checks if there are any queued up frames which can be set up with drmModePageFlip, and that keeps the FIFO queue going. If there are multiple frames queued up, FIFO selects the first image queued by the application for the next page flip. So, I implemented MAILBOX using a really basic idea: when queuing up a new page flip, pick the latest image queued by the application. Other frames which were queued up are transitioned back into the IDLE state, because they would never end up being displayed anyway. This worked wonderfully for my use case. Hundreds of FPS without tearing achieved.
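To make the difference between the two selection rules concrete, here is a rough model in plain C. This is not the actual Mesa code: the state names other than IDLE, the queue_serial counter and both function names are made up for this sketch, and no real DRM calls are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image states for this sketch; only IDLE is named in the text. */
enum image_state { IMAGE_IDLE, IMAGE_QUEUED, IMAGE_FLIPPING };

struct image {
    enum image_state state;
    uint64_t queue_serial; /* assumed counter, grows with every present */
};

/* FIFO rule: index of the oldest queued image, or -1 if nothing is queued. */
static int pick_fifo(const struct image *images, size_t count)
{
    assert(images != NULL || count == 0);
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (images[i].state != IMAGE_QUEUED)
            continue;
        if (best < 0 || images[i].queue_serial < images[best].queue_serial)
            best = (int)i;
    }
    return best;
}

/* MAILBOX hack: index of the newest queued image, or -1 if nothing is queued.
 * Every other queued image goes back to IDLE because it will never be shown. */
static int pick_mailbox(struct image *images, size_t count)
{
    assert(images != NULL || count == 0);
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (images[i].state != IMAGE_QUEUED)
            continue;
        if (best < 0 || images[i].queue_serial > images[best].queue_serial)
            best = (int)i;
    }
    for (size_t i = 0; i < count; i++) {
        if (images[i].state == IMAGE_QUEUED && (int)i != best)
            images[i].state = IMAGE_IDLE;
    }
    if (best >= 0)
        images[best].state = IMAGE_FLIPPING; /* this one goes to the page flip */
    return best;
}
```

In this model, the thread that runs after a page flip event would call pick_mailbox instead of pick_fifo and hand the returned image to drmModePageFlip.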

I also implemented a better AcquireNextImageKHR. If there are multiple IDLE state images, I picked the image which was presented earliest. This way, you get a round-robin-like model, whereas the old model just picked the first IDLE frame. This might work just fine for FIFO, but not for MAILBOX.
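The acquire rule can be sketched the same way. Again, this is only an illustration and not the Mesa implementation: the present_serial field (a counter recording when an image was last presented) and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct swap_image {
    int idle;                /* non-zero when the application may acquire it */
    uint64_t present_serial; /* assumed counter: when it was last presented */
};

/* Old behaviour: the first IDLE image by index, or -1 if none is IDLE. */
static int acquire_first_idle(const struct swap_image *images, size_t count)
{
    assert(images != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (images[i].idle)
            return (int)i;
    }
    return -1;
}

/* Round-robin-like behaviour: the IDLE image that was presented earliest,
 * or -1 if none is IDLE. */
static int acquire_oldest_idle(const struct swap_image *images, size_t count)
{
    assert(images != NULL || count == 0);
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (!images[i].idle)
            continue;
        if (best < 0 || images[i].present_serial < images[best].present_serial)
            best = (int)i;
    }
    return best;
}
```

With MAILBOX, several images can become IDLE at once, so picking by index tends to hand back the same image over and over, while picking by age cycles through all of them.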

Another thing I did was to force at least 4 images for MAILBOX. Unlike the Wayland implementation, I didn’t force a 4-deep swapchain in minImageCount. It’s perfectly fine for a swapchain to return more images than what is being requested. This is probably an API wart. It is a bit awkward to not have different minImageCount queries per present mode …

IMMEDIATE

For this mode, tearing is allowed, and I found a flag in the DRM API for exactly that. You can pass the DRM_MODE_PAGE_FLIP_ASYNC flag to drmModePageFlip, and the page flip will just happen when it happens. Unfortunately, this only worked as expected on AMD. Apparently, Intel cannot support this flip mode.

Lack of seamless transition between present modes

One annoying aspect of Vulkan WSI is that you cannot just change the present interval on the fly. Once you have a swapchain, you cannot just call glXSwapInterval or similar to get vsync or no vsync. What you need to do in this case is to create a new VkSwapchainKHR, setting oldSwapchain to hopefully reuse the images in the old swapchain, and then delete the old swapchain. Assuming a decent implementation, you should be getting a seamless transition over to the new present mode.

Unfortunately, this did not work well. First of all, the VK_KHR_display implementation did not bother with oldSwapchain. When deleting the old swapchain after creating a new one, the screen would black out for a while, usually 3-4 seconds, before starting up again, kinda like doing a full mode change. I tracked this down to drmModeRmFB, which deletes the old framebuffer references. Apparently, if you delete a framebuffer which is being displayed in DRM, the screen blacks out. This kind of makes sense, as there is nothing to display anymore. However, for toggling vsync state this is just unacceptable.

The patching of this got a bit hairier.

I ended up with a scheme which can “steal” images from oldSwapchain assuming the formats and dimensions match up. I tried to pilfer the displayed image especially, because its framebuffer cannot be allowed to die or we get a black screen. (Note to self, there might be a ton of edge cases here with synchronization …) To make sure I don’t get stale page flip events, I block to make sure any pending flip has completed before continuing. The reason for this is to avoid a race condition where I end up freeing the oldSwapchain before the page flip handler. The page flip handler has a reference to the swapchain it came from. After being pilfered from, the old swapchain is considered dead, and I return VK_ERROR_OUT_OF_DATE_KHR on any following requests to acquire from it.
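The stealing scheme can be modelled roughly like this. It is a simplified sketch under several assumptions: the struct layout, the MAX_IMAGES limit and all function names are invented for the example, fb_id only stands in for a DRM framebuffer handle, and the step of blocking until any pending page flip has completed is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IMAGES 8

struct fb_image {
    uint32_t fb_id; /* stand-in for a DRM framebuffer handle, 0 = taken */
    bool displayed; /* currently being scanned out */
};

struct model_swapchain {
    uint32_t width, height, format;
    struct fb_image images[MAX_IMAGES];
    size_t image_count;
    bool dead; /* set once its images have been pilfered */
};

enum acquire_result { ACQUIRE_OK, ACQUIRE_OUT_OF_DATE };

/* Move up to `wanted` images from old_chain into new_chain if format and
 * dimensions match. The displayed image is taken first so its framebuffer
 * is never removed. Returns the number of images taken. */
static size_t steal_images(struct model_swapchain *new_chain,
                           struct model_swapchain *old_chain, size_t wanted)
{
    assert(wanted <= MAX_IMAGES);
    if (old_chain->dead ||
        old_chain->width != new_chain->width ||
        old_chain->height != new_chain->height ||
        old_chain->format != new_chain->format)
        return 0;

    size_t taken = 0;
    /* Pass 0 takes the displayed image, pass 1 takes everything else. */
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < old_chain->image_count && taken < wanted; i++) {
            struct fb_image *img = &old_chain->images[i];
            if (img->fb_id == 0)
                continue; /* already moved to the new swapchain */
            if ((pass == 0) != img->displayed)
                continue;
            new_chain->images[taken++] = *img;
            img->fb_id = 0;
        }
    }
    new_chain->image_count = taken;
    old_chain->dead = true;
    return taken;
}

/* A pilfered swapchain refuses further acquires; in the real WSI code this
 * corresponds to returning VK_ERROR_OUT_OF_DATE_KHR. */
static enum acquire_result acquire_from(const struct model_swapchain *chain)
{
    return chain->dead ? ACQUIRE_OUT_OF_DATE : ACQUIRE_OK;
}
```

Any images left behind in the old swapchain are the only ones whose framebuffers get removed when it is destroyed, so the image on screen survives the transition.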

This combined with my crude MAILBOX implementation allowed me to get seamless transitions between tear-free fast-forward and butter smooth vsync gameplay with very low latency.

Doing MAILBOX properly in DRM

Unfortunately, the MAILBOX implementation I made is just a crude hack, and is not a valid implementation. Effectively, when you queue up a page flip in DRM, you are committing to which image will be displayed at the next vblank, long before that vblank actually occurs. This increases latency by quite a lot, but you also risk overshooting, where the GPU cannot complete the frame in time and nothing can be flipped on the next vblank, even if a newer frame has been completed in the meantime. This is just dumb.

True MAILBOX needs to be handled in the VBlank handler inside DRM somehow, and DRM has no API available which can support this present mode. The way I expect this to work is:

  • drmModePageFlip can be called multiple times, up to N times, if you pass down a new flag called, say, DRM_MODE_PAGE_FLIP_REPLACE. This would allow multiple pending page flips to be in flight, and the call would not return EBUSY if a pending flip is active.
  • In the vblank handler, the latest entry in the queue, whose rendering is also complete on the GPU, is selected. This frame is programmed into the display controller.
  • The earlier frames in the queue which were available to be presented, but were not selected for page flip, will be reported through a “discard” event, similar to the PAGE_FLIP event. This allows the WSI implementation to release the images to the application.
  • The current page flip callback will report the actual image which was selected for presentation.
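The list above can be simulated in user space to see how the pieces fit together. This is purely a model of the proposal, not kernel code, and nothing in it exists in DRM today: the queue, the MAX_PENDING limit (the N above), the render_done flag and both function names are hypothetical. It also assumes the GPU completes frames in queue order, so every entry ahead of the selected one is already complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8 /* the "N" from the list above */

struct pending_flip {
    uint32_t fb_id;
    bool render_done; /* GPU has finished rendering into this framebuffer */
};

struct flip_queue {
    struct pending_flip entries[MAX_PENDING]; /* oldest first */
    size_t count;
};

struct vblank_result {
    bool flipped;        /* false if no queued frame was ready */
    uint32_t flipped_fb; /* reported through the page flip event */
    uint32_t discarded[MAX_PENDING]; /* reported through "discard" events */
    size_t discard_count;
};

/* Queue another flip; fails (think EBUSY) only when N flips are pending. */
static bool queue_flip(struct flip_queue *q, uint32_t fb_id, bool render_done)
{
    assert(q != NULL);
    if (q->count == MAX_PENDING)
        return false;
    q->entries[q->count].fb_id = fb_id;
    q->entries[q->count].render_done = render_done;
    q->count++;
    return true;
}

/* At vblank: flip to the newest completed entry, discard everything queued
 * before it, and keep newer (still rendering) entries pending. */
static struct vblank_result on_vblank(struct flip_queue *q)
{
    struct vblank_result r = {0};
    int chosen = -1;
    for (size_t i = 0; i < q->count; i++) {
        if (q->entries[i].render_done)
            chosen = (int)i;
    }
    if (chosen < 0)
        return r; /* nothing ready, keep showing the current image */

    r.flipped = true;
    r.flipped_fb = q->entries[chosen].fb_id;
    for (size_t i = 0; i < (size_t)chosen; i++)
        r.discarded[r.discard_count++] = q->entries[i].fb_id;

    size_t kept = 0;
    for (size_t i = (size_t)chosen + 1; i < q->count; i++)
        q->entries[kept++] = q->entries[i];
    q->count = kept;
    return r;
}
```

A WSI implementation built on something like this would release the discarded images back to the application and treat flipped_fb as the image that actually reached the screen.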

Some further improvements to this scheme could be that discard events are returned as early as possible. If a swapchain image completes in the middle of the frame, DRM knows that it can “discard” earlier completed frames ahead of time. This way, we avoid stalling the GPU at a frame rate of DisplayRate * (SwapchainImages - 1), which can happen if we render all available swapchain images and then need to wait for the next vblank before any of them are discarded. For example, a 60 Hz display with a 4-image swapchain would cap out at 60 * (4 - 1) = 180 FPS.

This is way beyond me unfortunately, but it sounds like a very valuable thing to have. It seems like I will need to maintain my own hacky patch set on top of Mesa for now. 🙂