Experimenting with VK_GOOGLE_display_timing, taking control over the swap chain

Recently, I’ve been experimenting with the VK_GOOGLE_display_timing extension. Not many people seem to have tried it yet, and I’m more than averagely interested in pristine swap chain performance (The state of Window System Integration (WSI) in Vulkan for retro emulators, Improving VK_KHR_display in Mesa – or, let’s make DRM better!), so when I learned there was an experimental patch set for Mesa (X11/DRM) from Keith Packard, I had to try it out. My experience here reflects whatever I got running by rebasing this patch set, so YMMV.

Croteam’s presentation from GDC is a very good read for understanding how important this stuff is for normal gaming (Article, PDF).

The extension supports a few critical components we have been missing for years in graphics APIs:

  • When presenting with vkQueuePresentKHR, specify that an image cannot be presented on screen before some wall time (CLOCK_MONOTONIC on Linux/Android). This enables two critical features: SwapInterval > 1 (e.g. locked 30 FPS), and proper audio/video sync for video players, since we can schedule frames to be presented at very specific timestamps (subject to rounding to the next VBlank). VDPAU and similar presentation APIs have long supported things like this, and it’s critical for non-interactive applications like media players.
  • Feedback about what actually happened a few frames after the fact. We get an accurate timestamp when our image was actually flipped on screen, as well as reports about how early it could have been presented, and even how much GPU processing margin we had.
  • Query in nanoseconds how long the display refresh interval is. This is critical since monitors are never exactly 60 FPS, but rather something odd like 59.9524532452 or 59.9723453245. Normally, graphics APIs will just report 60 and call it a day, but I’d like to tune my frame loop against the correct refresh rate. Small differences like these matter, because if you naively assume 60 without any feedback, you will face dropped or duped frames when the rounding error eventually hits you. I’ve grown allergic to dropped or duped frames, so this is not acceptable for me.

Using this feature set, we can do a lot of interesting things and uncover some curious behavior. As you might expect from the extension name, GOOGLE_display_timing ships on a few Android devices as well, so I’ve tested two implementations.

Use case #1: Monitoring display latency

So, for the purposes of this discussion, the latency we will measure is the time it takes from input being sampled until the resulting frame buffer is scanned out to the display. We obviously cannot monitor the latency of the display itself without special equipment (and there’s nothing we can do about that), so the latency breaks down as:

  • Polling input <– Start timer
  • Running application logic
  • Building rendering commands
  • Execute frame on GPU
  • Wait for FIFO queue to flip image on-screen <– This is the lag which is very hard to quantify without this extension!

We’re going to call AcquireNextImageKHR, submit some work, and call QueuePresentKHR in a loop, using VkSemaphore to synchronize, and see how much input latency we get. The way we do this with display_timing is fairly simple. In vkQueuePresentKHR we pass in:

typedef struct VkPresentTimeGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
} VkPresentTimeGOOGLE;
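
For reference, this struct is chained into vkQueuePresentKHR through VkPresentTimesInfoGOOGLE. A minimal sketch, assuming the swapchain, queue, semaphore and image index are already set up (frame_counter and the other variable names here are mine, not part of the extension):

VkPresentTimeGOOGLE present_time = {
    .presentID = frame_counter,   // any unique ID we can poll for later
    .desiredPresentTime = 0,      // 0 = present as soon as possible
};

VkPresentTimesInfoGOOGLE times_info = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_TIMES_INFO_GOOGLE,
    .swapchainCount = 1,
    .pTimes = &present_time,
};

VkPresentInfoKHR present_info = {
    .sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
    .pNext = &times_info,         // hook the timing request into the present
    .waitSemaphoreCount = 1,
    .pWaitSemaphores = &render_complete_semaphore,
    .swapchainCount = 1,
    .pSwapchains = &swapchain,
    .pImageIndices = &image_index,
};
vkQueuePresentKHR(queue, &present_info);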

We pass in some unique ID we want to poll later, and desiredPresentTime. Since we only care about latency here, we pass in 0, which means just present ASAP (in FIFO fashion without tearing of course). Later, we can call:

vkGetPastPresentationTimingGOOGLE

to get a report on what actually happened. Here’s the data we get back:

typedef struct VkPastPresentationTimingGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
    uint64_t    actualPresentTime;
    uint64_t    earliestPresentTime;
    uint64_t    presentMargin;
} VkPastPresentationTimingGOOGLE;

This data tells us, in wall time, when the frame was actually flipped on screen (actualPresentTime), when it could potentially have been flipped on screen (earliestPresentTime), and presentMargin, i.e. how much earlier the GPU completed rendering compared to earliestPresentTime.

To estimate total latency, we compute actualPresentTime minus the CLOCK_MONOTONIC timestamp we sampled when polling input; actualPresentTime is defined against the CLOCK_MONOTONIC time base on Linux and Android. This is powerful stuff, so let’s see what happens.
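
A sketch of how that measurement could look; lookup_input_time_ns() is a hypothetical helper that returns the CLOCK_MONOTONIC timestamp we stored per presentID when sampling input:

// At input-polling time, sample CLOCK_MONOTONIC in nanoseconds and remember
// it together with the presentID we will use for this frame.
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
uint64_t input_ns = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;

// A few frames later, drain whatever timing reports are available
// (standard Vulkan two-call pattern).
uint32_t count = 0;
vkGetPastPresentationTimingGOOGLE(device, swapchain, &count, NULL);
if (count > 16) count = 16;
VkPastPresentationTimingGOOGLE timings[16];
vkGetPastPresentationTimingGOOGLE(device, swapchain, &count, timings);

for (uint32_t i = 0; i < count; i++) {
    uint64_t latency_ns = timings[i].actualPresentTime -
                          lookup_input_time_ns(timings[i].presentID);
    printf("presentID %u: ~%.2f ms input-to-flip latency\n",
           timings[i].presentID, latency_ns * 1e-6);
}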

X11, full-screen, 2 image FIFO

The interesting thing we observe here is about 17 ms of total latency. My monitor is 60 FPS, so that’s one frame of total latency, which means we have a true flip mode. Great! The reason we get one frame total is that X11’s implementation in Mesa uses a synchronous acquire: we cannot get a new image from AcquireNextImageKHR until something else has actually flipped on screen. 2-image FIFO on Xorg isn’t all that practical, however. We’re forced to eat a full drain of the GPU when calling AcquireNextImageKHR in this case (unless we do some heroics to break the bubble), which may or may not be a problem, but for GPU-intensive workloads this is bad and probably not recommended.

X11, windowed, 2 image FIFO

In windowed mode, we observe ~33 ms, or two frames. It’s clear that Xorg adds a frame of latency to the mix here. Likely, we’re seeing a blit-style compositor frame being added in the middle. It’s great that we can actually measure this stuff, because what happens after a present is usually pretty opaque and hard to reason about without good tools.

X11, 3 image FIFO

As we’d expect, the observed latency is simply one frame longer in both windowed and full-screen; full-screen is now at ~33 ms, or two frames.

X11, presentation timing latency

Another latency we need to consider is how long it takes to get back a past presentation timing. On the Mesa implementation, it is very tight, matching the overall latency, since we have a synchronous AcquireNextImage. For 2-image FIFO in full-screen, for example, we know this information just one frame after we submitted a present. Nice! For triple buffering, it takes two frames to get the information back, and so on.

X11, frame time jitter

Generally, the deltas between successive actualPresentTime values are rock stable, showing about +/- 1 microsecond of jitter. I think it’s probably using the DRM flip timestamps, which come straight from the kernel. In my experience those are about this accurate, and more than good enough.

Android 8.0, 3 image FIFO

Android is a little strange: we observe ~45 ms latency, about 2.7 frames. This suggests there is actually some kind of asynchronous acquire going on here. If I add a fence to vkAcquireNextImageKHR to force a synchronous acquire, I get ~33 ms latency, as we got with Xorg. Not sure why it’s not ~3 frames of latency … Maybe it depends on GPU rendering times somehow.

Android 8.0, 2 image FIFO

We now have ~28 ms, about 1.7 frames. If we try the fence mechanism to force a synchronous acquire, we drop down to a stuttering mess at ~33 ms latency. Apparently, getting one frame of latency is impossible, so it consistently misses a frame. I didn’t really expect this to work, but it was nice to have tested it. At this latency, I get a lot of stuttering despite a trivial GPU load. Triple buffering is probably a good idea on Android …

Android 8.0, presentation timing latency

This is rather disappointing: no matter what I do, it takes five frames for presentation timing results to trickle back to the application. 🙁 This will make it harder to adapt dynamically to frame drops later.

Android 8.0, frame time jitter

This one is also a bit worrying. The deltas between actualPresentTime hover around the right target, but show a jitter of about +/- 0.3 ms. This leads me to think the presentation timing is not tied directly to a kernel timestamp derived from an IRQ, but rather to an arbitrary user-space timestamp.

Use case #2: Adaptive low-latency tuning

Sometimes, we have little to no GPU workload, but we want to achieve sub-frame latencies. One use case here is retro emulation, where the GPU workload might be little more than a few blits, so we want to squeeze out as much latency as we possibly can.

To do this, we want to monitor the time it takes to make the GPU frame buffers ready for presentation, then try to lock our frame loop so we start a frame at estimatedFuturePresentTime – appCPUandGPUProcessingTime – safetyMargin. The best we can currently do is ugly sleeping with clock_nanosleep and TIMER_ABSTIME, but at least we now have a very good estimate of what that timestamp is.

E.g., if rendering some trivial stuff only takes, say, 1 ms in total for the CPU -> GPU pipeline, and I add in some safety margin, say 4 ms, then I should be able to sleep until 5 ms before the next frame needs to be scanned out. This seems to work just fine on Xorg in full-screen, which is pretty cool. What makes this so nice is that we can dynamically observe how much latency we need in order to reach the deadline in time. While we could have used GPU timestamps, that gets hairy because we would need to correlate GPU timestamps with CPU time, and the presentation engine might need some buffering before an image is actually ready to be presented, so using presentMargin is the correct way.
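
A sketch of what that sleep could look like, assuming estimated_present_ns is our guess at the next scanout time (derived from a past actualPresentTime plus a whole number of refreshDurations), and using the example numbers above:

// Wake up ~5 ms (1 ms processing + 4 ms safety margin) before the frame
// has to be ready for scanout. All timestamps are CLOCK_MONOTONIC nanoseconds.
uint64_t processing_ns = 1000000ull;  // measured CPU -> GPU pipeline cost
uint64_t safety_ns     = 4000000ull;  // extra slack for scheduling jitter
uint64_t wakeup_ns     = estimated_present_ns - processing_ns - safety_ns;

struct timespec target = {
    .tv_sec  = (time_t)(wakeup_ns / 1000000000ull),
    .tv_nsec = (long)(wakeup_ns % 1000000000ull),
};

// Absolute sleep against CLOCK_MONOTONIC; retry if interrupted by a signal.
while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL) == EINTR)
    ;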

As you’d expect, this doesn’t work on Android, because SurfaceFlinger apparently forces some buffering. presentMargin seems kinda broken as well on Android, so it’s not something I can rely on.

Use case #3: Adaptive locked 60 or 30 FPS

Variable FPS game loops are surprisingly complicated as discussed in the Croteam GDC talk. If we have a v-synced display at 60 FPS, the game should have a frame time locked to N * refreshDuration.

Instead of sampling timing deltas on the CPU, which is incredibly brittle for frame pacing, we can take a better approach, where we try to lock our frame time to N * refreshDuration + driftCompensation. Based on observations over time, we can see whether we should drop our fixed rendering rate or increase it. This allows for butter-smooth rendering devoid of jitter, while we can still adapt to how fast the GPU can render.

The driftCompensation term is something I added so we can deal with a dropped frame here and there and combat natural drift over time, slowly syncing back up with real wall time. For example, if we let the frame time jitter by up to a few %, we can catch up quickly after any lost frame, and overall animation speed should remain constant over time. This might not be suitable for all types of content (fixed frame-rate, pixel-artsy 2D stuff), but I think techniques like these can work well.
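
As a rough sketch of what such a term could look like (this is my own reading of the idea, not necessarily how it is implemented in Granite), with observed_present_ns being a reported actualPresentTime and expected_present_ns being where our content clock thinks that frame should have landed:

// Hedged sketch: nudge the content frame time by a clamped fraction of the
// observed drift so we slowly re-sync with wall time after a dropped frame.
// All values are CLOCK_MONOTONIC nanoseconds; names are placeholders.
int64_t frame_ns = (int64_t)(swap_interval * refresh_duration);

// Positive error: our content clock has fallen behind what the display
// actually showed (e.g. after a dropped frame); negative: we run ahead.
int64_t error_ns = (int64_t)observed_present_ns - (int64_t)expected_present_ns;

// Allow only a few % of adjustment per frame so animation speed stays
// visually constant while the drift is worked off over several frames.
int64_t max_step = frame_ns / 50;  // ~2%
int64_t drift_compensation = error_ns;
if (drift_compensation >  max_step) drift_compensation =  max_step;
if (drift_compensation < -max_step) drift_compensation = -max_step;

int64_t content_frame_time_ns = frame_ns + drift_compensation;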

For adaptive refresh rate, we could for example have heuristics like:

  • If we observe at least N frame drops over the last M frames, increase the swap interval, since it’s better to have steady low FPS than a jumpy, jittery mess.
  • If we observe that earliestPresentTime is consistently lower than actualPresentTime and presentMargin is decent, it’s a sign we should bump up the frame rate.

The way to implement a custom swap interval with this extension is fairly simple: we just specify desiredPresentTime with a cadence of two or more refreshDurations instead of one.
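
As a sketch, locking to half rate (e.g. 30 FPS on a 60 Hz panel) could look roughly like this, where base_present_ns / base_present_id come from a previously observed report and the variable names are mine:

// refresh_duration comes from vkGetRefreshCycleDurationGOOGLE,
// base_present_ns is an actualPresentTime observed a few frames ago.
uint32_t swap_interval     = 2;
uint64_t frames_since_base = present_id - base_present_id;

VkPresentTimeGOOGLE present_time = {
    .presentID = present_id,
    .desiredPresentTime = base_present_ns +
        frames_since_base * swap_interval * refresh_duration,
};
// Chain into VkPresentTimesInfoGOOGLE / vkQueuePresentKHR as shown earlier.
// (See the note on rounding below: in practice we shave a little off this
// target so the driver doesn't round us into the next vblank.)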

Getting the refresh interval

We can observe the refresh interval by calling

vkGetRefreshCycleDurationGOOGLE
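
For reference, the query is a single call per swapchain (error handling omitted):

VkRefreshCycleDurationGOOGLE refresh;
vkGetRefreshCycleDurationGOOGLE(device, swapchain, &refresh);
// refresh.refreshDuration is the refresh interval in nanoseconds,
// e.g. ~16.68 ms for a "60 Hz" panel that really runs at 59.94 Hz.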

Normally, I would expect this to just work, but on Xorg it seems like this is learned over time by looking at presentation timestamps, so we basically need to wait some frames before we can observe the true refresh cycle duration. Android doesn’t seem to have this problem.

An immediate question is of course how all this would work on a variable refresh rate display …

Important note on rounding issues

Just naively computing targetPresentTime = someOlderObservedPresentationTime + frameDelta * refreshDuration will give you trouble. The spec says that the display controller cannot present before desiredPresentTime, so due to rounding issues we might effectively say, “don’t present until frame 105.00000001”, and the driver will say, “oh, I’ll wait an extra frame until frame 106, since frame 105 is before 105.00000001!” This is a bit icky.

The solution seems to be subtracting a few ms from desiredPresentTime just to be safe, but I don’t actually know what drivers expect here. Using absolute nanoseconds like this gets tricky very quickly.
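
In code, the workaround I’m describing amounts to something like this (the 2 ms slack is an arbitrary choice on my part, not a value from the spec):

// Back off a little so rounding never pushes the target past the vblank
// we actually want. The exact slack is a judgment call; a couple of
// milliseconds (or a fraction of refresh_duration) seems safe.
uint64_t slack_ns = 2000000ull; // 2 ms
uint64_t target   = base_present_ns + frame_delta * refresh_duration;
present_time.desiredPresentTime = target - slack_ns;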

Discussion

For a future multi-vendor extension, some things should probably be reconsidered.

Absolute vs relative time?

The current model of absolute timings in nanoseconds is simple in theory, but it does get a bit tricky, since we need to compensate for drift, estimate future timings with rounding in mind, and so on. Being able to specify relative timings between presents would simplify the cases where we just want to fiddle with the present interval without really caring exactly when the image comes on screen, but at the same time, relative timings would complicate cases where we want to present at an exact time (e.g. a video player).

Time vs frame counters?

On fixed refresh rate displays, it seems a bit weird to use time rather than frame counters (media stream counters on X11). Frame counters are much easier to reason about, and the current X11 implementation translates time back and forth to these counters anyway.

The argument against frame counters would be variable refresh rate displays I suppose.

Just support all the things?

The best extension would probably support all variants, i.e.:

  • Let you use relative or absolute timing
  • Let you specify time in frame counters or absolute time
  • Let you query variable refresh rate vs fixed refresh rate

Implementation issues

I had a few issues on Android 8.0 which I haven’t been able to figure out. The Mesa implementation I had working was basically flawless in my book, sans the X11 issue of not knowing the correct refresh rate ahead of time.

presentMargin is really weird

I don’t think presentMargin is working as intended. Many times when you drop a frame, you can observe that presentMargin would be negative if interpreted as int64_t, but the member is uint64_t, so you get a bizarre overflow instead. The numbers I get back don’t really make sense anyway, so I doubt it’s tied to a GPU timestamp or anything like that.
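
To even make the logs readable, I end up reinterpreting the bits, something like this (timing being one of the VkPastPresentationTimingGOOGLE entries polled earlier):

// presentMargin is uint64_t, but on this implementation it sometimes looks
// like a negative int64_t that wrapped around; reinterpret it for logging.
int64_t margin = (int64_t)timing.presentMargin;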

earliestPresentTime is never lower than actualPresentTime

This makes it impossible to use adaptive locked frame rates, since we cannot ever safely bump the frame rate.

Randomly dropping frames with almost no GPU load

SurfaceFlinger seems to have a strange tendency to just drop frames randomly, even when I target a very low frame rate and the GPU time is around 1 ms per frame. I wonder whether this is related to rounding errors, but it is a bit disturbing. Touching the screen helps, which might be a clue pointing to power management somehow, but it can still drop frames randomly even if I touch the screen all the time.

Here’s an APK if anyone wants to test. It’s based on this test app: https://github.com/Themaister/Granite/blob/master/tests/display_timing.cpp

http://themaister.net/tmp/display-timing-test.apk

The screen flashes red when a frame drops; otherwise it should just be a quad scrolling smoothly across the screen. It logs output with:

adb logcat -c && adb logcat -s Granite

and I often get output like:

https://gist.github.com/Themaister/873021cba88acb48e04c668eae3ab4e9

Conclusion

I think this is a really cool extension, and I hope to see it more widely available. My implementation of this can be found in Granite for reference:

https://github.com/Themaister/Granite/blob/master/vulkan/wsi_timing.cpp