A tour of Granite’s Vulkan backend – Part 6

Pipelines – what is your pain tolerance?

A lot of thought goes into pipelines. Eager or lazy creation, dynamic or static render state. Forget about one size fits all. How close will you approach the volcano? Make sure there is no lava under your feet when you’re done.

My pain tolerance is kinda low, I’d rather watch it on TV. Granite is a bit similar, it prefers to be cooled off magma instead.

The ideal case

Vulkan is designed to let you forget about filthy, filthy render state management and work exclusively with pristine VkPipeline objects. These objects encode every possible choice you can make when flipping the fixed-function bits and bobs on the GPU.

Getting to a point when you only think in terms of VkPipelines, and all pipelines are compiled up front in load-time is a holy grail of modern graphics API implementation. Gone are the stutters, the hitches, the sad 100 ms glitches which throw you off guard when you peek around the wall.

To get there, you must sacrifice all notions of flexibility, no last minute decisions, everything must be planned out in detail ahead of time. There is a lot of state which is pulled together to form a VkPipeline, an all-star cast of colorful characters and a plot with a lot of depth.

… ahem, that got a bit weird.

Shader modules

Obviously, the core part of a pipeline is the shader modules, the Vulkan::Program in Granite. From the program we automatically know the VkPipelineLayout because of reflection, so no problems there.

Render pass

We also need to know the render pass (and subpass index!) in order to create a pipeline. This one can be really counter-intuitive. The shader compiler often needs to know which render target formats are in use in order to generate final ISA. This is where we start running into problems. There is no obvious reason to combine a render pass and shader modules together. In my mental model these two should not know about each other, but drivers would really like that to be the case. For example, if I were to render a scene it would look something like:

  • Start rendering to some attachments (VkRenderPass is known here)
  • Set up the default rendering state appropriate for the pass. There are different “default” states for depth-only, opaque, lighting, and transparency rendering. Part of the render state vector is determined here.
  • Ask the renderer to render some list of visible objects which survived culling. Shader modules are known at this level, and some render state might be per-material, like two-sided rendering, etc.

There are a few ways to make this work, but somewhere you must have higher-level knowledge which shader modules are used in which render passes. If an application has a baking step during build, that might be a nice place to do it, but not all graphics API use cases work this way. Emulation comes to mind where you cannot know what an application will do until you execute it. User scripting could be a nightmare as well …

Render passes also have a lot of combinatorial explosion. If we just change from MSAA 2x to MSAA 4x, that means new render passes, and new pipelines which are compatible with those render passes. Clearly we see that something trivial like changing a setting in the options menu of most games will trickle down into a completely different set of pipelines for all materials. This kind of coupling isn’t what I call clean, but sometimes sanity must be sacrificed for performance. I’d prefer to keep my sanity.

Fixed-function vertex bindings

This consists of attributes, bindings, strides and input rates. This one is usually not a problem if you control the asset pipeline. You can decide on a “standard” vertex buffer layout and forget about it. There is some slight annoyance here if we want to support glTF or other scene transmission formats unless we’re prepared to rewrite all vertex buffers to match the standard layout.

Shader compilers like to know about this information since some ISAs need to fetch vertices in software, and therefore need to be able to compute correct offsets based on VertexIndex/InstanceIndex.

10 – Fixed-function render state

When rendering triangles in Vulkan, there is still a ton of state to deal with. Vulkan takes all the gunk you’d set in glEnable/glDisable and various other functions and bundles it together into one massive struct. I wrote up a sample which demonstrates how render state is set, saved and restored.

I have to admit I kinda like the old-school way of setting state individually. Isolating render state to a command buffer avoid almost all the horrifying issues with state management in OpenGL. In GL, the state is global, and leaked between modules and render passes. This is really scary, and you’re basically forced to make a custom state tracker on top of GL to keep yourself sane. There was also no good way of “saving” just the state you cared about and restoring it without writing a lot of custom code. I like the idea of setting some “standard” state which clears out any possible leakage of state. Overall, Granite’s model is maximum convenience.

A concept I’ve seen in other projects is the idea of creating big structures on the user side which mimics a pipeline, but I don’t think this is very useful unless it’s basically a full VkGraphicsPiplineCreateInfo with all the bells and whistles. If we don’t, we still don’t have the information we need to create a pipeline anyways, like render pass information for example, and we’re back to hashing with lazy creation.

Even just render state tends to be split in two halves for me. Some state tends to be “global” in nature and some tends to be “local”. This is state which is set by the higher level renderer which thinks in terms of:

  • Opaque pass vs transparent pass (alpha blending)
  • Depth-only? (depth write enable, depth bias?, equal test?)
  • Lighting pass? (additive blending?)
  • Stencil? (for deferred)

This state is saved and restored as necessary, then we have the objects which are rendered in a render pass which typically think in terms of:

  • Two sided mesh? (face culling)
  • Primitive restart?
  • Topology?
  • Shader program?
  • Vertex attributes?

I don’t like to couple these parts of the renderer together, so a tightly packed blob of state in Vulkan::CommandBuffer does the job for me. At the end of the day, the only real cost of this flexibility is some extra hashing cost. It doesn’t light up in the profile for me.

Overall, I like the “immediate” nature of the CommandBuffer interface. There’s always a hybrid solution if that is ever needed where I would set the state I’m interested in, then pull out a persistent VkPipeline handle which can be used later and bypasses any hashing of state when bound.

Avoiding stutters

The real problem with lazily creating pipelines is vkCreateGraphicsPipelines in my opinion. Doing this at the last minute is almost a guaranteed hitch, and it should be avoided at all cost. Avoiding last minute pipeline compilation is the real reason we should know all state combinations up front, not because we get to bind VkPipelines directly and avoid some small hashing cost.

My strategy for dealing with this problem is pre-warming the hashmaps with previously seen data. Granite integrates the Fossilize project to solve the problem of serializing all information needed to create pipelines in a GPU and driver independent way. In theory, I would be able to ship a Fossilize database as part of an application and use that to pre-warm all historically observed pipelines and their dependent objects at Vulkan::Device creation time.

To my knowledge, this is basically how all GL and D3D11 drivers behave. Cache all the things.


Granite’s render state management is old-school, but I like it. Pre-warming the various hashmaps in Vulkan::Device is the strategy I used to avoid any pipeline compilation stutters.

There are many alternatives for any graphics API abstraction. There are things I like in legacy APIs, and things I hate. I wanted to keep the parts I liked, and avoid the parts I disliked.

… that’s all folks!

I think this is the end of this series for now. I’ve gone over the Vulkan backend in broad strokes, and I hope it was interesting and useful.

A tour of Granite’s Vulkan backend – Part 5

Render passes and synchronization

This is part 5 in the tour of Granite‘s Vulkan backend. We’re going to get knee-deep in the aspects of Vulkan which are the most difficult to learn in my opinion, and mastering these topics of Vulkan is the real hurdle towards mastery. This level of understanding is something high-level APIs will prevent you from reaching.

This post isn’t intended to be a tutorial on Vulkan synchronization, so I’ll assume some basic level of knowledge.

Render passes

Render passes is a new fundamental part of Vulkan which does not exist in any of the legacy APIs. In older APIs you can freewheel how you render to frame buffers, but that approach was always terrible on tile-based GPUs, and these days with hybrid tilers, it’s probably terrible on desktop as well. Vulkan forces you to think about rendering all you need in one go to a frame buffer and then proceed to the next.

In Granite, I wanted to make sure most of the flexibility and explicitness of Vulkan render passes could be expressed with minimal boilerplate. Most projects don’t seem to pay attention to this part except that it’s something you just have to do, and very few see the benefits they bring. That is probably a reasonable stance for 2019 if you do not care about mobile performance. If you need to target mobile though, it is worth the extra work. As of writing, the feature is quite mobile-centric, but desktop GPUs seem to be inching towards tile-based architectures, so it will be interesting to see if this view on render passes will shift in the future. Even D3D12 recently got render passes too, albeit in a simplified form.

In the most basic form, render passes in Vulkan can be rather daunting to set up, and it’s one of the many battles you have to fight to get hello triangle on screen. If we take a render pass with just one sub-pass (the case we care about 99% of the time), we need to specify up front:

  • How many attachments?
  • Which formats are used?
  • How many MSAA samples?
  • initialLayout and finalLayout
  • Which image layouts to use while rendering?
  • Do we load from memory or clear the attachment on render pass begin?
  • Do we bother storing the attachments to memory?

Most of this information is boilerplate we can automate, but things like load/clear/store we cannot deduce in the backend before it is too late. Knowing this kind of information up-front can be very beneficial for bandwidth consumption, at least on mobile.

The ugly framebuffer objects

An ugly aspect of Vulkan is the use of VkFramebuffer. I want an API where I just say “start a render pass where we render to these attachments”. Creating “FBOs” up front was really ugly in GL, and I think it’s a bad abstraction to have API users carry around ownership of objects which represent little to no useful work. FBOs are empty husks which might as well just be an array of image views.

We could just create VkFramebuffers every render pass we begin and defer the deletion of it right away, but creating these objects have some cost. There’s a handle allocation in the driver at minimum, and probably a little more on certain drivers. Here I just reuse the temporary hashmap allocator which I introduced in the descriptor set model post. VkFramebuffer objects can be reused over multiple frames, but if they aren’t used for a while, they are just deleted since VkFramebuffer objects are immutable.

Automating VkRenderPass creation

This topic is actually quite complicated when we start diving into the deep end of Vulkan render passes, but we can start with the trivial cases. The API in Granite looks something like:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 1;
rp.color_attachments[0] = &rt->get_view();
rp.store_attachments = 1 << 0;
rp.clear_attachments = 1 << 0;

rp.clear_color[0].float32[0] = 1.0f;
rp.clear_color[0].float32[1] = 0.0f;
rp.clear_color[0].float32[2] = 1.0f;
rp.clear_color[0].float32[3] = 0.0f;


This is an immediate way of starting a render pass, no reason to create frame buffers up front and all that. VkRenderPass can be created lazily on-demand like everything else.

Formats / MSAA sample counts

Render passes need to know formats and sample counts, and since we pass the concrete attachments directly in begin_render_pass(), we have the information we need right here.

Image layouts and VK_SUBPASS_EXTERNAL dependencies

There are three kinds of attachments in Granite:

  • User-created. These attachments are render targets which are created with Device::create_image() and the backend does not own the resource or knows anything about how long the resource will live. Common case for user-created render targets.
  • WSI images. These images are special since they came from VkSwapchainKHR or some equivalent mechanism. We know that these images are only used for rendering and they are only consumed by the presentation engine, or some other mechanism.
  • Transient images. Images with transient usage flags only live inside render passes. They cannot be sampled from, their memory does not necessarily exist except in page tables which point to /dev/null. We don’t care what happens to these images once the render pass is done.

To deduce image layouts for a render pass we have a few different paths.

wsi images

I don’t care about preserving WSI images over multiple frames, and I don’t care about sampling from WSI images or any such weird use case after rendering, so the flow of image layouts is:

  • initialLayout = UNDEFINED (discard)
  • VkAttachmentReference -> COLOR_ATTACHMENT_OPTIMAL or whatever is required for the subpass
  • finalLayout = PRESENT_SRC_KHR or whatever layout we need when using external WSI. For something like libretro, this will be SHADER_READ_ONLY_OPTIMAL since the image will be handed off to some other render pass which we don’t know or care about. For headless PNG/video dumping, it might be TRANSFER_SRC_OPTIMAL.

When initialLayout != the layout used in the first subpass, vkCmdBeginRenderPass will actually need to perform a memory barrier implicitly to make this work. The big question is when this memory barrier takes place, and the answer is “as soon as possible” (TOP_OF_PIPE_BIT) if we don’t specify it anywhere. For this case, Granite will add a subpass dependency which waits for VK_SUBPASS_EXTERNAL in the COLOR_ATTACHMENT_OUTPUT stage. This latches onto the WSI acquire semaphore, more on that later.

Final layout != last layout is used, so we get a transition at the end of the render pass, but we don’t need to care about external subpass dependencies here. The automatically generated one is perfect, and we’re going to use the WSI release semaphore to properly synchronize this image anyways.

When we see a WSI image in a render pass, we can trivially mark this command buffer as “touching WSI”. This will affect command buffer submission later. This is indeed the kind of tracking which I have been arguing against in earlier posts, but it’s so trivial that it’s a no-brainer to me in this case.

Transient images

For transient images, we automate it just like WSI images, except that finalLayout will match last used layout in the render pass. This way we avoid some useless image layout transition at the end of the render pass. Next time we use the image, it’s going to be discarded anyways.

Because we deal with transitions automatically, users can freely pull images from Vulkan::Device with get_transient_attachment(), render to it, and forget about it. This is super convenient for things like MSAA rendering where the multi-sampled attachment just needs to exist temporarily for purposes of being resolved into the single-sampled attachment we care about. Having to care about synchronization for resources we don’t own is weird I think.

other images

For any other image, we need to avoid any implicit layout transition, so we simply force initialLayout to match the first use in the render pass, and finalLayout will match the last use. In our small sample, it’s all going to be COLOR_ATTACHMENT_OPTIMAL. It’s up to the API user to know what layouts a render pass will expect, but it’s straight forward to map a render pass to expected layout. Color attachments are COLOR_ATTACHMENT_OPTIMAL, depth-stencil is DEPTH_STENCIL_OPTIMAL or DEPTH_STENCIL_READ_ONLY_OPTIMAL based on the read/write mode for depth, input attachments are SHADER_READ_ONLY, etc. It’s possible to use an attachment for multiple purposes in a subpass, and Granite supports that as well. Some examples:

  • Color attachment + input attachment: Feedback loop ala GL_ARB_texture_barrier (super useful for certain emulators) -> GENERAL
  • RW Depth attachment + input attachment (some weird decal algorithm?) -> GENERAL
  • Read-Only depth + input attachment (deferred shading use case) -> DEPTH_STENCIL_READ_ONLY_OPTIMAL

All of this is analyzed when a newly observed VkRenderPass is created, and subpass dependencies are set up automatically. Anything which happens outside the render pass is the user’s responsibility.

08 – Bare-bones “deferred rendering” sample

I made a cut-down sample to show how the API expresses multi-pass in the context of deferred rendering with transient gbuffer + depth. The meat of it is:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 3;
rp.color_attachments[0] = &device.get_swapchain_view();

rp.color_attachments[1] = &device.get_transient_attachment(
rp.color_attachments[2] = &device.get_transient_attachment(

rp.depth_stencil = &device.get_transient_attachment(

rp.store_attachments = 1 << 0;
rp.clear_attachments = (1 << 0) | (1 << 1) | (1 << 2);

Vulkan::RenderPassInfo::Subpass subpasses[2];
rp.num_subpasses = 2;
rp.subpasses = subpasses;

rp.clear_depth_stencil.depth = 1.0f;

subpasses[0].num_color_attachments = 2;
subpasses[0].color_attachments[0] = 1;
subpasses[0].color_attachments[1] = 2;

subpasses[0].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadWrite;

subpasses[1].num_color_attachments = 1;
subpasses[1].color_attachments[0] = 0;
subpasses[1].num_input_attachments = 3;
subpasses[1].input_attachments[0] = 1;
subpasses[1].input_attachments[1] = 2;
subpasses[1].input_attachments[2] = 3;  // Depth attachment
subpasses[1].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadOnly;

// Do work
// Do work

See code comments in sample for more detail. To write this kind of sample in raw Vulkan would be almost a full day’s project.


Unlike many aspects of Granite which are reasonably high-level, synchronization in Granite is almost 100% explicit. The general philosophy of Granite is that excessive tracking of resource use is a no-no, unless it is trivial to do so (e.g. WSI images). Synchronization is a case where you need a lot of tracking to do a good job, and it is impossible to do a perfect job since you end up relying on heuristics, at least if you are to implement automatic synchronization on top of Vulkan. GL and D3D11 drivers have an advantage here since they can tap into GPU-specific synchronization mechanisms which might be better suited for implicit synchronization. A good example here is the i915 driver stack in the Linux DRM stack which can do automatic synchronization in kernel space somehow. I’m sure that simplifies the Mesa GL driver a lot, but I don’t know the details.

Let’s go through a thought experiment where we look at the big problems we run into if we are to implement a fully automatic barrier system. (I have tried :p)

Problem #1 – We cannot rewind time

When touching a resource, we must ask ourselves: “When and where was this resource accessed last?” We have three potential solutions to resolve a hazard:

  • Pipeline barrier (was used just now)
  • Event (was used some time ago in this queue)
  • Semaphore (was used in a different queue)

Ideally, we need to inject a barrier at the exact point where a resource was last used, but we cannot inject new commands in the middle of a command buffer which has already been recorded.

There is no winning this fight, either we eagerly inject barriers after every command in the hope that some future command will need to synchronize against it (VkEvent is nice here), or we inject barriers too late, stalling the GPU needlessly.

Eagerly injecting barriers is pure insanity if we take into the account that the resource might be used on a different queue in the future. That means signalling a heavy semaphore after every render pass or command. We could simply ignore supporting multiple queues, but that’s a huge compromise to make.

Problem #2 – Redundant tracking of read-only resources

A problem I found while trying to implement automatic barrier tracking was that static resources might be written in the future, so we need to keep track of them. This is a waste of CPU time, but it might be possible to explicitly mark these resources as “do not track, they’re truly static, pinky promise”, but I feel this is bolting on hacks.

Problem #3 – Multi-threading

The question of “where was this resource touched last” might not actually be possible to answer in a multi-threaded scenario. If we are recording command buffers in parallel, the backend has no idea what is going on until we serialize execution in vkQueueSubmit. A common solution I have seen for this problem is to resolve synchronization internally in command buffers as they are recorded, and on command buffer submission time, we can look at all used resources and resolve any cross-command buffer synchronization points right before submitting the command buffers in vkQueueSubmit. The complexity starts to shoot through the roof though. That’s a good sign we need to rethink.

I think this is the kind of solution you end up with when you have no choice but to port some old legacy API to Vulkan, and breaking the abstraction API is not an option.

Render graphs

A Vulkan backend which solves synchronization can only look back in time and deal with hazards at the last minute, but that is only because we do not have any context about what the application is doing. At a higher level, we can know what is going to happen in the future, and we can make automated decisions at that level, where we actually have context about what is going on. This is another reason why I do not want to have automatic synchronization in a Vulkan backend. Either we get a sub-optimal solution, or we try to close the gap with heuristics, but now run-time behavior is completely non-deterministic. We should aim for something better.

I believe the synchronization problem should be solved at a higher level, like a render graph. I wrote a blog some time ago about this topic: http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

Signalling fences

Granite’s way of signalling fences is very similar to plain Vulkan.

Vulkan::Fence fence;
device.submit(cmd, &fence);

// Somewhere else, potentially in a different thread.
// fence ref-count goes to 0, queued up for recycling.

There is a pool of VkFence objects which can be reused. Signalling a fence forces a vkQueueSubmit. Once the lifetime of a Vulkan::Fence ends it is recycled back to the frame context. Nothing out of the ordinary here.


Semaphores work very similar to fences and are requested in Device::submit() like fences. Like Vulkan, they have a restriction that they can only be waited on once. Semaphores can be waited on in other queues with Device::add_wait_semaphore() in a particular queue and pipeline stage. This matches pDstWaitStages. Semaphores are also recycled like fences.


Events can be signalled and later waited on in the same queue. Again, we have a pool of VkEvent objects, CommandBuffer::signal_event() requests a fresh event, signals it with vkCmdSetEvent and hands it to the user. VkEvents are recycled using the frame context. There is a similar CommandBuffer::wait_event() which maps 1:1 to vkCmdWaitEvents.


Granite has many different methods to inject pipeline barriers, the most common ones are:

cmd->barrier(srcStage, srcAccess, dstStage, dstAccess);

which maps to a vkCmdPipelineBarrier with a VkMemoryBarrier, and image barriers which act on all subresources:

cmd->image_barrier(image, oldLayout, newLayout, srcStage, srcAccess, dstStage, dstAccess);

There are cases where we want to batch barriers or otherwise use more complicated commands than this, so there are also 1:1 interfaces to vkCmdPipelineBarrier where the full structures are passed in, but these are only really used by the render graph since writing out full structures is super painful.

The automatic barriers in Granite

There are a few instances where I think having automatic barriers makes sense. These are cases where it’s convenient to do so, and there is no tracking required, so we can resolve all hazards right away and forget about it. Some of them we’ve seen already, like WSI images and transient images in render passes.

The other major case is static read-only resources. If you pass in initial_data to Device::create_buffer() or Device::create_image(), we generally have a desire to upload some data, and never touch it again.

The general gist of it is that we can upload data with a staging buffer over the transfer queue and inject semaphores which block all possible pipeline stages (based on bufferUsage/imageUsage flags). The downside is that we might end up creating too many submissions if we somehow want to upload a ridiculous amount of buffers or images in one go, but we can opt-out of this automatic behavior by simply not passing initial_data and do all the batching and synchronization work ourselves.

The end goal is that we should be able to call create_buffer or create_image and just use the static resource right away without having to think about synchronization at all.

09 – Rendering to image and reading it back to CPU on transfer queue

I wrote a sample which flexes most of the synchronization APIs. It renders a small 4×4 texture on the graphics queue, synchronizes that with the transfer queue with a semaphore and reads it back to a CACHED host buffer. We spawn threads which wait on a fence, maps the buffer and reads the results.


In these parts of the backend, the low-level explicit nature of Vulkan shines through. I think we have to be fairly low-level, or we inherit most of the problems with the older APIs.

… up next!

In the next installment, we’ll have a look at pipeline creation.

A tour of Granite’s Vulkan backend – Part 4

Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. It’s not just uniform buffers, it’s fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL 4.5’s GL_ARB_buffer_storage, which supports persistently mapped buffers like Vulkan, but relying on GL 4.5 (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite’s main codebase normally uses GLFW, so to get a less redundant sample working, I wrote this sample to use SDL2’s Vulkan support, which is very similar to GLFW’s support starting with SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, The WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us and owns a VkSwapchainKHR. The only thing it cannot on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2’s built in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially and automatically handling synchronization for WSI resources is straight-forward to the point that there is no point in exposing that to the user. (Synchronization in Granite is in general very explicit, but WSI is one of few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI code is must to abstract in Vulkan, and I’m happy with the flexibility and simplicity in use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, now we’re getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. This is the perfect allocator for when we want to allocate a lot of small blobs, and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset to 0 again, i.e. all memory in one go is just “winked away”.

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight and once exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious limit to use (very content and scene dependent). I opted for a recycled pool of smaller buffers instead since Granite’s descriptor binding model is very flexible as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory can be accessed over PCI-e at a far slower rate. For staging data like vertex, index and uniform buffers, it might be reasonable to assume that we should copy the CPU-side to GPU-side and let the GPU consume the streamed data in fast VRAM. Granite supports two modes where we let the GPU read data read-only from HOST_VISIBLE, and one where we automatically perform staging buffer copies over to GPU from the CPU buffer.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on persistently allocated buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.

// Pointers are now invalidated.

The allocation functions are just light wrappers which allocate, and bind the buffer at the appropriate offset. It’s perfectly possible to roll your own linear allocation system, e.g. you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.


I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.

A tour of Granite’s Vulkan backend – Part 3

Shaders and descriptor sets

This is part 3 of a blog series I’m writing on Granite‘s Vulkan backend. In this episode we are looking at how we deal with shaders and descriptor sets. At this point in our design process, there are many, many choices to make. Especially descriptor sets need to be carefully considered.

Hash all the things

A theme we start to see now is hashmaps and lazy creation of objects. One thing you run into with Vulkan’s pipeline-related types are how much work it is to be explicit all the time. The amount of information we need to provide is staggering. I believe it not healthy for mind and soul to work at low levels here except in special cases, and we should aggressively hide away detail where we can. There is naturally a clock cycles vs. sanity tradeoff to be made here.

You can argue that the lines between high-level GL/D3D11-style design and Granite’s model are quite blurred. The (mental) price to pay to be explicit is just not worth it in my opinion. I will try to explore the obvious alternatives here and provide more context why the design is the way it is.

04 – Shaders and pipeline layouts

The first step in creating a pipeline is of course, to create a VkShaderModule from our SPIR-V code. This is a no-brainer, but next we need a pipeline layout, which in turn requires VkDescriptorSetLayouts. The sample is here https://github.com/Themaister/Granite-MicroSamples/blob/master/04_shaders_and_programs.cpp.

Rather than manually declaring the pipeline layout like a caveman I think using reflection to automatically generate layouts is a good idea. There is no reason for users to copy information which exists in the shaders already. For the reflection, I use SPIRV-Cross. If we never need to compile SPIR-V in runtime (game engine scenario), there is no reason why we cannot shift the reflection step to off-line as well and just pass the side-band data along to remove a runtime dependency. I never got as far as building a nice off-line SPIR-V baking pipeline, so I just compile GLSL on the fly with shaderc. However, the interface in the Vulkan backend just consumes raw SPIR-V.

A common mistake beginners tend to do is to think that names are important in binding interfaces. This is a mistake carried over from the GL and D3D11 days. The only things we should care about are descriptor sets, bindings and location decorations as well as push constant use. This is the only semantic information we need to create binding interfaces, i.e. pipeline layouts and pipelines.

A pipeline layout in Vulkan needs to know all shader stages a-la GL programs, so we also need a step to combine shaders into a Vulkan::Program. Here we take the union of reflection information and request handles for Vulkan::DescriptorSetAllocator and Vulkan::PipelineLayout. This is hashed, but there is no performance concern here since we should do all of this work in load time when possible anyways. These handles are all owned internally in Vulkan::Device, and there is no reason to worry about object lifetime for these objects.

I don’t think there is a reason to deviate far from this design unless you have a very specific scheme in mind with descriptor set allocation. As I’ll explore later, using bindless descriptors extensions or explicit descriptor set allocation could motivate use of a “standard” pipeline layout, in which case reflection gets kind of meaningless anyways.

05 – The binding model – embracing laziness

I never really had a problem with the old-school way of binding resources to binding slots. It just isn’t the part of the old APIs I felt were lacking, so Granite is kind of old school here, but it does have full consideration for descriptor sets and I removed any impedance mismatch with Vulkan (i.e. no translation needed to bridge between Granite and Vulkan). E.g.:

cmd->set_storage_buffer(set, binding, *resource);
cmd->set_texture(set, binding, resource->get_view(), Vulkan::StockSampler::LinearClamp);

The old binding models in GL/D3D11 have flat binding spaces with no separation of per-frame, per-material, or per-draw bindings. In Granite I wanted to take full advantage of the descriptor set feature where we can assign some kind of “frequency” and relation between bindings. Here is an example to illustrate how it is used: https://github.com/Themaister/Granite-MicroSamples/blob/master/05_descriptor_sets_and_binding_model.cpp.

In draw time, we can use the current pipeline layout and pull in the binding points which are active and make sure we bind descriptor sets with the correct resources. This is actually hot code, so I spent time designing a nice system here which tries to be as optimal as possible, given these restrictions.

Because of mobile, we need some conservative limits. I use 4 descriptor sets and 16 (dense) binding points per set (minimum spec of Vulkan). This allows for fairly compact pipeline layout descriptions, and we can loop over bitsets to look at resources. This is also just fine for my use cases.

When it comes to allocation of descriptor sets themselves, I think I have a very different approach to most. A Vulkan::DescriptorSetAllocator is represented as:

  • The VkDescriptorSetLayout
  • A bunch of VkDescriptorPools which can only allocate VkDescriptorSets of this set layout. Pools are added on-demand.
  • A pool of unused VkDescriptorSets which are already allocated and can be freely updated.
  • A temporary hashmap which keeps track of which descriptor sets have been requested recently. This allows us to reuse descriptor sets directly. In the ideal case, we almost never actually need to call vkUpdateDescriptorSets. We end up with hash -> get VkDescriptorSet -> vkCmdBindDescriptorSets. When a descriptor set has not been used for a couple of frames (8), we assume that it is no longer relevant, and the set is recycled, and some other descriptor set can reuse it and just call vkUpdateDescriptorSet. We definitely do not want to keep track of when any buffer or image resources is destroyed, and recycle early. That’s tracking hell which slows everything down.

The temporary hashmap is a data structure I’m quite happy with. It’s used for a few other resources as well. See https://github.com/Themaister/Granite/blob/master/util/temporary_hashmap.hpp for the implementation.

On certain GPUs, allocating descriptor sets is, or at least used to be very costly. The descriptor pools might not be implemented as true pools (sigh …), so every vkAllocateDescriptorSets would mean a global heap allocation, absolutely horrible for performance. This is the reason I’m not a big fan of the “one large pool” design. In this model, we just allocate a massive VkDescriptorPool, and we just allocate from that, for any kind of descriptor set. This means recycling VkDescriptorSet handles over many frames is impractical. The intended use pattern is to call vkResetDescriptorPool and allocate new descriptor sets which are only valid for one frame at a time, just like command buffers. There is also the problem of knowing how to balance the descriptor load for these massive pools, what’s the ratio of image descriptors vs uniform buffer descriptors, etc. With per-descriptor set layout allocators, there is zero guess work involved.

Alternative design – Bindless

Bindless is all the rage right now. The only real complaint I have is that it’s only supported on desktop and requires an EXT extension. It also means writing shaders in a very specific way. I don’t really need it for my use cases, but bindless enables certain complex algorithms which benefit from accessing a huge set of resources dynamically.

Alternative design – persistent explicit VkDescriptorSets

An alternative is exposing descriptor sets directly and only allow users to bind descriptor sets rather than individual resources. The API user would need to build the sets manually. While this is an idea, I think there are too many hurdles to make it practical.

  • We need to know and declare the target imageLayout of textures up front. This is obvious 99% of the time (e.g. a group of material textures which are SHADER_READ_ONLY_OPTIMAL), but in certain cases, especially with depth textures, things can get rather ambiguous. This does seem to me like an API design fault. It is unclear why this information is needed.
  • Some resources are completely transient in nature and it does not make sense to place them in persistent descriptor sets. The perfect example here is uniform buffers. In later samples, we’ll look at the linear allocator system for transient data.
  • Some resources depend on the frame buffer, i.e. input attachments. Baking descriptor sets for these resources is not obvious, since we need to know the combination pipeline layout + frame buffer, which should have nothing to do with each other.
  • We need to know the descriptor set layout (and by extension, the shaders as well) up-front. This is problematic if resources are to be used in more than one shader. The common fix here is to settle on a “standard” pipeline layout so we can decouple shaders and resources. This usually means a lot of padding and redundant descriptor allocations instead. We have a limited amount of descriptor sets when targeting mobile (4). We do not have the luxury of splitting every individual “group” of resources into their own sets, some combinatorial effects are inevitable, making persistent descriptor sets less practical. On desktop, 8 sets is the norm, so that might be something to consider.
  • Hybrid solutions are possible, but complexity is increased for little obvious gain.


I’m happy with my design. It’s very easy to use, but there is a CPU prize I’m willing to pay and I honestly never saw it in the profiler. I think resource binding models are cases where shaving overhead away will shave your sanity away as well, at least if you want to be compatible with a wide range of hardware. It’s much easier if you only cater to high-end desktop where bindless can be deployed.

… up next!

Next up we will explore the linear allocators for uniform, vertex, index and staging data.

A tour of Granite’s Vulkan backend – Part 2

The life and death of objects

This is a part 2 in a series where I explore Granite‘s Vulkan backend. See part 1 for an introduction. In this blog entry we will dive into code, and we will start with the basics. Our focus in this entry will be to discuss object lifetimes and how Granite deals with the Vulkan rule that you cannot delete objects which are in use by the GPU.

Sample code structure

I will be referring to concrete code samples from here on out. I have started a small code repository which contains all the samples. See README.md for how to build, but you won’t need to run the samples to understand where I’m going with these samples. Stepping through the debugger can be rather helpful however.

Sample 01 – Create a Vulkan device

Before we can do anything, we must create a VkDevice. This aspect of Vulkan is quite dull and full of boilerplate, as is the setup code for any graphics API. There is not a lot to cover from an API design perspective, but there are a few things to mention. The sample code for this part is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/01_device_creation.cpp

The API for this is pretty straight forward. I decided to split up how we load the Vulkan loader library, since there are two main use cases here:

  • User wants Granite to load libvulkan.so/dll/dylib from standard locations and bootstrap from there.
  • User wants to load an already provided function pointer to vkGetInstanceProcAddr. This is actually the common case, since GLFW loads the Vulkan loader for you dynamically and you can just use the GLFW provided glfwGetInstanceProcAddr to bootstrap yourself. The volk loader has support for this.

To create the instance and device, we need to do the usual song and dance of creating a VkInstance and VkDevice:

  • Setup Vulkan debug callbacks
  • Identify and enable relevant extensions
  • Enable Vulkan validation layers in debug build
  • Find appropriate VkQueues to cover graphics, async compute, transfer

Vulkan::Context and Vulkan::Device

The Context owns the VkInstance and VkDevice, and Vulkan::Device borrows a VkDevice and manages the big objects which are created from a VkDevice. It is possible to have multiple Vulkan::Device on top of a VkDevice, but we end up sharing the VkQueues and the global heaps for that device, which is a very nice property of Vulkan, since it allows frontend/backend systems like e.g. RetroArch/libretro to share a VkDevice without having hidden global state leak between the API boundary, which is a huge problem with the legacy APIs like GL and D3D11.

Note that this sample, and all other samples in this chapter are completely headless. There is no WSI involved. Vulkan is really nice in that we don’t need to create window system contexts to do any GPU work.

02 – Creating objects

Creating new resources in a graphics API should be very easy, and here I spent a lot of time on convenience. Creating images and uploading data to them in raw Vulkan is a lot of work, since there are so many things you have to think about. Creating staging buffers, copy those, defer deletion of that staging buffer, maybe we copy on the transfer queue, or not? Emit some semaphores to transfer ownership to graphics queue, creating image views, and just so many things which is very painful to write. Just creating an image in a solid way is several hundred lines of code. Fortunately, this kind of code is very easy to wrap in an API. See sample: https://github.com/Themaister/Granite-MicroSamples/blob/master/02_object_creation.cpp, where we create a buffer and image. I think the API is about as simple as you can make it while keeping a reasonable amount of flexibility.

Memory management

When we allocate resources, we allocate it from Granite’s heap allocator for Vulkan. If I had done Granite today, I would just use AMD’s Vulkan Memory Allocator, but it did not exist at the time I designed my allocator, and I’m pretty happy with my design as it stands. Maybe if I need de-fragmentation in the future or some really complex memory management strategy, I’ll have to rethink and use a mature library.

To get a gist of the algorithms, Granite will allocate 64 MB chunks, which are split in 32 chunks. Those 32 chunks can then be subdivided into 32 smaller chunks, etc, all the way down to 256 bytes little chunks. I made a cute little algorithm to allocate effectively from these blocks with CTZ operations and friends. Classic buddy allocator, but you have 32 buddies.

There are also dedicated allocations. I use VK_KHR_dedicated_allocation to query if an image should be allocated with a separate vkAllocateMemory rather than being allocated from the heap. This is generally useful when allocating large frame buffers on certain architectures. Also, for allocations which exceed 64 MB, dedicated allocations are used.

Memory domains

A nice abstraction I made is that rather than dealing with memory types like DEVICE_LOCAL, HOST_VISIBLE, and the combination of all the possible types, I declare up-front where I like my buffers and images to reside. For buffers, there are 4 use cases:

  • Vulkan::BufferDomain::Device – Must reside on DEVICE_LOCAL_BIT memory. May or may not be host visible (discrete vs integrated GPUs).
  • Vulkan::BufferDomain::Host – Must be HOST_VISIBLE, prefer not CACHED. This for uploads to GPU.
  • Vulkan::BufferDomain::CachedHost – Must be HOST_VISIBLE and CACHED. Falls back to non-cached, but should never happen. Might not be COHERENT. Used for readbacks from GPU.
  • Vulkan::BufferDomain::LinkedDeviceHost – HOST_VISIBLE and DEVICE_LOCAL. This maps to AMD’s pinned PCI mapping, which is restricted to 256 MB. I don’t think I’ve ever actually used it, but it’s a niche option if I ever need it.

When uploading initial data to a buffer, and Device is used, we can take advantage of integrated GPUs which share memory with the CPU. In this case, we can avoid any staging buffer, and just memcpy data straight into the new DEVICE_LOCAL memory. Don’t just blindly use staging buffers when you don’t need it. Integrated GPUs will generally have DEVICE_LOCAL and HOST_VISIBLE memory types.

Mapping host memory

While not present in the sample, it makes sense to discuss how we map Vulkan memory to the CPU. A good rule of thumb in general is to keep host memory persistently mapped. vkMapMemory and vkUnmapMemory is quite expensive, especially on mobile, and we can only have one mapping of a VkDeviceMemory (64 MB with tons of suballocations!) active at any time. Rather than Map/Unmap all the time, we implement map/unmap in Vulkan::Device, by checking if we need to perform cache maintenance, with no extra CPU cost. On map() for example, we need to call vkInvalidateMappedRanges if the memory type is not COHERENT, and for unmap, we call vkFlushMappedRanges if the memory is not COHERENT. This is fairly common on mobile when doing readbacks from GPU, since we need CACHED, but we might not get COHERENT. Granite’s backend abstracts all of this away.

Physical and transient image memory

A very powerful feature of Vulkan is the support for TRANSIENT images. These images do not have to be backed by physical memory in Vulkan, and is very nice on tile-based mobile renderers.

In Granite I fully support transient images where I can pass in two different domains for images, Physical and Transient. Since Transient images are generally used for throw-away scenarios, there is a convenient method in Vulkan::Device::get_transient_attachment() to simply request a transient image with a format and resolution for rendering. Transient images are generally never created manually since they are so easy to manage internally.

Handle types

There are many ways to abstract handle types in general, but I went for my own “smart pointer” variant, the trusty intrusive ref-counted pointer. It can basically be thought of a std::shared_ptr, but simpler, and we can pool the allocations of handles very nicely. How we design these handle types are not really important for Vulkan though, but I figured this point would generate some questions, so I’m addressing it here. See https://github.com/Themaister/Granite/blob/master/util/intrusive.hpp for details.

03 – Deferring deletions of GPU resources

Now we’re getting into topics where there can be significant design differences between Vulkan backends. My design philosophy for a middle-level abstraction is convenient, deterministic and good enough at the cost of a theoretical optimal solution.

A common theme you’ll find in Granite is the use of RAII. Once lifetimes of objects end, we automatically clean up resources. This is nothing new to C++ programmers, but the big problem in Vulkan is we’re not managing just memory on CPU with new/delete. We actually need to carefully control when things are deleted, since the GPU might be using the resources we are freeing. The strategy here will be to defer any deletions. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/03_frame_contexts.cpp

The frame context

In order to handle object lifetimes in Granite, I have a concept of a frame context. The frame context is responsible for holding all resources which belong to a “frame” of work. Normally this corresponds to a frame of work between AcquireNextImage and QueuePresent, but it is not tightly coupled. The Device has an array of frame contexts, usually 2 of them to allow double-buffering between CPU and GPU, (and 3 on Android because TBDR GPUs are a bit more pipelined and tend to prefer a little more buffering). The frame context is basically a huge data structure which holds data like:

  • Which VkFences must be waited on to make sure that all GPU work associated with this queue is done. This is the gatekeeper which holds all our recycling and deletions back.
  • Command pools for each worker thread and queue types.
  • VkBuffers, VkImages, etc, to be deleted once the fences signal.
  • Memory allocations from heap allocator to be freed.
  • … and various other resources.

Basically, we have a central place to chuck any things which need to happen “later”, when the GPU is guaranteed to be done with this frame.

As a special consideration, the big fat “make it go slow” call Device::wait_idle() will automatically clean up everything in one go since it knows at this instant the GPU is not doing anything.

Command buffer lifetime compromise

To make the frame based cleanup work in practice, we need to simplify our notion of what command buffers can do. In Vulkan, we have the flexibility to record command buffers once and reuse them at will at any time. This creates some complications. First of all, it throws the idea of a per-frame command pool out of the window. We can never reset the command pool in that case, since there will be free-floating command buffers out there which might be used later. Command pools work their best in Vulkan when you don’t allow individual command buffers to be freed.

If we have reusable command buffers, we also have the problem of object lifetimes. We end up with a painful situation where GPU resources must be retained until all command buffers which reference them are discarded. This leads to a really difficult situation where you have two options – deep reference-counting per command buffer or just pray all of this works out and make sure objects are kept alive as long as necessary. The former option is very costly and bug-prone, and the latter is juggling with razor blades too much for my taste where a large, meaningless burden is placed on the user.

I generally don’t think reusable command buffers are a worthwhile idea, at least not for interactive applications where we’re not submitting a static workload to the GPU over and over. There just aren’t many reasonable use-cases where this gives you anything meaningful. The avenues where you can submit the same calls over and over are maybe restricted to post-processing, but recording a few draw calls which render a few full-screen quads (or compute dispatches for the cool kids) is not exactly where your draw call overhead is going to matter.

I find that beginners obsess over the idea of aggressive reuse a little too much. In the end I feel it is misguided, and there are many better places to spend your time. Recording command buffers itself in Vulkan is super efficient.

My ideal use for command buffers are where command buffers are light-weight handles which all source their memory from a common command pool linearly. No reuse, so we use ONE_TIME_SUBMIT_BIT and TRANSIENT_BIT on the pool.

In Granite, I greatly simplified the idea of command buffers into transient handles which are requested, recorded and submitted. They must be recorded and submitted in the same frame context you requested the command buffer. This way we remove the whole need for keeping track of objects per-command buffers. Instead, we just tie the resource destruction to a frame context, and that’s it. No need for complicated tracking, it’s very efficient, but we risk destroying the object a little later than is theoretically optimal. This could potentially increase memory pressure in certain situations, but I think the trade-off I made is good. If needed, I can always add explicit “delete this resource now, I know it’s safe” methods, but I haven’t found any need for this. This would only be important if we are truly memory bound.

A design decision I made was that I would never want to do internal ref-counts for resources like images and buffers, and the design would be forced to not rely on fine-grained tracking which you would typically find in legacy API implementations. Any ref-counted operations should be immediately visible to API users and never be hidden behind API implementations. In fact, command buffer arguments for binding resources are plain references or pointers, not ref-counted types.

The memory pressure of very large frames

The main flaw of this design is that there might be cases where there is one spurious frame context that has extreme use of creation and deletions of resources. A prime example here is loading screens or similar. Since Vulkan resources are not freed until the frame context itself is complete, we cannot recycle memory within a frame unless we explicitly iterate the frame context with Device::next_frame_context(). This tradeoff means that the Granite backend does not have to heuristically stall the GPU in order to reclaim memory at suitable times, which adds a lot of complexity and ruins the determinism of Granite’s behavior.

… up next!

In the next episode of Granite shenanigans we will look at the shader pipeline where we discuss VkShaderModule, VkDescriptorSetLayout, VkPipelineLayout and VkPipeline objects.

A tour of Granite’s Vulkan backend – Part 1


Since January 2017, I’ve been working on my little engine project, which I call Granite. It’s on Github here. Like many others, I felt I needed to write a little engine for myself to fully learn Vulkan and I needed a test bed to implement various graphics techniques. I’ve been steadily working on it since then and used it as the backbone for many side-projects, but I think for others its value right now is for teaching Vulkan concepts by example.

A while back I wrote a blog about my render graph implementation. The render graph sits on top of the Vulkan implementation, but in this series I would like to focus on the Vulkan layer itself.

The motivation for a useful mid-level abstraction

One thing I’ve noticed in the Twitter-sphere and various panel discussions over the last years is the idea of the mid-level abstraction. Where GL and D3D11 is too high-level and inflexible for our needs in non-trivial applications, Vulkan and D3D12 tend to overshoot in low-level complexity, with the goal of being as low-level and explicit as possible while staying GPU architecture/OS-portable. I think everyone agrees that having a good mid-level abstraction is important, but the problem we always have when designing these layers is where to make the right trade-offs. There will always be those who chase maximum possible performance, even if complexity when using the abstraction shoots through the roof.

For Granite I always wanted to promote convenience, while avoiding the worst penalties in performance. The good old 80/20 rule basically. There are many, many opportunities in Vulkan to not do redundant CPU work, but at what cost? Is it worth architecting yourself into a diamond – a super solid, but in the end, inflexible implementation? I’m noticing a lot of angst in general around this topic, especially among beginners. A general fear of not chasing every last possible performance optimization because it “might be really important later” is probably why we haven’t seen a standard, mid-level graphics API yet in wide use.

I feel that the benefits you gain by designing for maximum possible CPU performance are more theoretical design exercises than practical ones. Even naive, straight forward, single-threaded Vulkan blows GL/GLES out of the water in CPU overhead in my experience, simply because we can pick and choose the extra work we need to do, but legacy driver stacks have built up cruft over a decade or more to support all kinds of weird use cases and heuristics. Add multi-threading on top of that, and then you can think about micro-optimizing API overhead, if you actually need it. I suspect you won’t even need multi-threaded Vulkan. I believe the real challenge with the modern APIs right now is optimizing GPU performance, not CPU.

Metal is getting a lot of praise for its successful trade-off in performance and usability. While I don’t know the API itself in detail to make a judgement (I know all the horrors of Metal Shading Language though cough), I get the impression that the mid-level abstraction is the abstraction level we should be working in 99% of the time.

I think Granite is one such implementation. I am not trying to propose that Granite is the solution, but it is one of them. The design space is massive. There just cannot possibly be a one true graphics API for all users. Rather than suggest you go out and use it directly, I will try to explain how I designed a Vulkan interface which is quite convenient to use and runs well on both desktop and mobile (very few projects consider both), at least for my use cases. Ideally, you should be inspired to make the mid-level abstraction that is right for you and your project. I have gone through a couple of iterations to get where I am now with the design, and used it for various projects, so I think it’s a good starting point at least.

The 3D-accelerated emulation use case

How Granite got started was actually the Vulkan backend in Beetle PSX HW renderer. I wrote up a Vulkan backend, and emulators need very immediate and flexible ways of using graphics APIs. Information is generally known only in the last minute. Being able to implement such projects guided Granite’s initial design process quite a lot. This is also a case where legacy APIs are really painful since you really need the flexibility of modern APIs to do a good job with performance. There are a lot of state changes and draw calls on top of the CPU cost of emulation itself. Creating resources and modifying data on the GPU in weird ways is a common case in emulation, and many drivers simply don’t understand these usage patterns and we hit painful slow-paths everywhere. With Vulkan there is little to no magic, we just implement things how we want, and performance ends up far more predictable.

I think many forget that Vulkan is not just for big (AAA) game engines. We can successfully use it for all kinds of things. We just need the right abstractions and knowledge.

How the design and implementation will be explored

To start off, we will explore the design through commented code samples, which use only the Vulkan portion of Granite as a library. We will write concrete samples of code, and then go through how all of this works, and then discuss how things could be designed differently.

… up next!

I haven’t written up any samples yet, so it makes sense to stop here. Next time, we’ll start with some samples.

Experimenting with VK_GOOGLE_display_timing, taking control over the swap chain

Recently, I’ve been experimenting with the VK_GOOGLE_display_timing extension. Not many people seem to have tried it yet, and I am over average interested in pristine swap chain performance (The state of Window System Integration (WSI) in Vulkan for retro emulators, Improving VK_KHR_display in Mesa – or, let’s make DRM better!), so when I learned there was an experimental patch set for Mesa (X11/DRM) from Keith Packard, I had to try it out. My experience here will reflect whatever I got running by rebasing this patch set, so YMMV.

Croteam’s presentation from GDC is a very good read to understand how important this stuff is for normal gaming (Article, PDF)

The extension supports a few critical components we have been missing for years in graphics APIs:

  • When presenting with vkQueuePresentKHR, specify that an image cannot be presented on screen before some wall time (CLOCK_MONOTONIC on Linux/Android). This allows two critical features, SwapInterval > 1, e.g. locked 30 FPS, and proper audio/video sync for video players since we can schedule frames to be presented at very specific timestamps (subject to rounding to next VBlank). VDPAU and presentation APIs like that have long supported things like this, and it’s critical for non-interactive applications like media players.
  • Feedback about what actually happened a few frames after the fact. We get an accurate timestamp when our image was actually flipped on screen, as well as reports about how early it could have been presented, and even how much GPU processing margin we had.
  • Query in nanoseconds how long the display refresh interval is. This is critical since monitors are never exactly 60 FPS, but rather something odd like 59.9524532452 or 59.9723453245. Normally, graphics APIs will just report 60 and call it a day, but I’d like to tune my frame loop against the correct refresh rate. Small differences like these matter, because if you naively assume 60 without any feedback, you will face dropped or duped frames when the rounding error eventually hits you. I’ve grown allergic to dropped or duped frames, so this is not acceptable for me.

Using this feature set, we can do a lot of interesting things and discover some interesting behavior. As you might expect from the extension name, GOOGLE_display_timing ships on a few Android devices as well, so I’ve tested two implementations.

Use case #1: Monitoring display latency

So, for the purposes of this discussion, the latency we will measure is the time is takes for input to be sampled until we reach the frame buffer being scanned out to display. We obviously cannot monitor the latency of the display itself without special equipment (and there’s nothing we can do about that), so the latency would be:

  • Polling input <– Start timer
  • Running application logic
  • Building rendering commands
  • Execute frame on GPU
  • Wait for FIFO queue to flip image on-screen <– This is the lag which is very hard to quantify without this extension!

We’re going to call AcquireNextImageKHR, submit some work, and call QueuePresentKHR in a loop, using VkSemaphore to synchronize, and let’s see how much input latency we get. The way we do this with display_timing is fairly simple. In vkQueuePresentKHR we pass in:

typedef struct VkPresentTimeGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
} VkPresentTimeGOOGLE;

We pass in some unique ID we want to poll later, and desiredPresentTime. Since we only care about latency here, we pass in 0, which means just present ASAP (in FIFO fashion without tearing of course). Later, we can call:


to get a report on what actually happened, here’s the data you get:

typedef struct VkPastPresentationTimingGOOGLE {
    uint32_t    presentID;
    uint64_t    desiredPresentTime;
    uint64_t    actualPresentTime;
    uint64_t    earliestPresentTime;
    uint64_t    presentMargin;
} VkPastPresentationTimingGOOGLE;

This data will tell you in wall time, when the frame was actually flipped on screen (actualPresentTime), when it could have been flipped on-screen potentially (earliestPresentTime), and the presentMargin meaning, how much earlier did the GPU complete rendering compared to earliestPresentTime.

To estimate total latency we’ll compute actualPresentTime – CLOCK_MONOTONIC when polling input. actualPresentTime is defined to use CLOCK_MONOTONIC base on Linux and Android. This is powerful stuff, so let’s see what happens.

X11, full-screen, 2 image FIFO

The interesting thing we observe here is about 17 ms total latency. My monitor is 60 FPS, so that’s 1 frame of total latency, which means we have true flip mode. Great! The reason we get one frame total is that X11’s implementation in Mesa is a synchronous acquire. We cannot get a new image from AcquireNextImageKHR until something else has actually flipped on screen. 2-image FIFO on Xorg isn’t all that practical however. We’re forced to eat a full drain of the GPU when calling AcquireNextImageKHR in this case (unless we do some heroics to break the bubble), which may or may not be a problem, but for GPU-intensive workloads, this is bad, and probably not recommended.

X11, windowed, 2 image FIFO

In Windowed mode, we observe ~33ms, or 2 frames. It’s clear that Xorg adds a frame of latency to the mix here. Likely, we’re seeing a blit-style compositor frame being added in the middle. It’s great that we can actually measure this stuff, because what happens after a present is usually pretty opaque and hard to reason about without good tools.

X11, 3 image FIFO

As we expect, the observed latency is simply 1 frame longer for both windowed and full-screen. ~33 ms latency, or two frames.

X11, presentation timing latency

Another latency we need to consider is how long time it takes to get back a past presentation timing. On the Mesa implementation, it is very tight, matching the overall latency, since we have a synchronous AcquireNextImage. For 2-image FIFO full-screen for example, we know this information just one frame after we submitted a present, nice! For triple buffer it takes 2 frames to get the information back, etc.

X11, frame time jitter

Generally, the deltas between actualPresentTime is rock stable, showing +/- 1 microsecond or so. I think it’s probably using the DRM flip timestamps which come straight from the kernel. In my experience it’s about this accurate, and more than good enough.

Android 8.0, 3 image FIFO

Android is a little strange, we observe ~45 ms latency. About 2.7 frames. This suggests there is actually some kind of asynchronous acquire going on here. If I add a fence to vkAcquireNextImage to force synchronous acquire, I get ~33 ms latency, as we got with Xorg. Not sure why it’s not ~3 frames of latency … Maybe it depends on GPU rendering times somehow.

Android 8.0, 2 image FIFO

We now have ~28 ms, about 1.7 frames. If we try the fence mechanism to get sync acquire, we drop down to a stuttering mess, with ~33 ms latency. Apparently, getting one frame latency is impossible, so it consistently misses a frame. I didn’t really expect this to work, but nice to have tested it. At this latency, I get a lot of stuttering despite trivial GPU load. Triple buffering is probably a good idea on Android …

Android 8.0, presentation timing latency

This is rather disappointing, no matter what I do, it takes 5 frames for presentation timing results to trickle back to the application. 🙁 This will make it harder to adapt dynamically to frame drops later.

Android 8.0, frame time jitter

This one is also a bit worrying. The deltas between actualPresentTime hover around the right target, but show a jitter of about 0.3 ms +/-. This leads me to think the presentation timing is not tied to a kernel timestamp derived from an IRQ directly, but rather an arbitrary user-space timestamp.

Use case #2: Adaptive low-latency tuning

Sometimes, we have little to no GPU workload, but we want to achieve sub-frame latencies. One use case here is retro emulation which might have GPU workloads close to just a few blits, so we want to squeeze the latency as much as we can if possible.

To do this we want to monitor the time it takes to make the GPU frame buffers ready for presentation, then try to lock our frame so we start it at estimatedFuturePresentTime – appCPUandGPUProcessingTime – safetyMargin. The best we can currently do is through ugly sleeping with clock_nanosleep with TIMER_ABSTIME, but at least we have a very good estimate what that timestamp is now.

E.g., rendering some trivial stuff which only takes, say, 1 ms in total for CPU -> GPU pipeline, I add in some safety margin, say 4 ms, then I should be able to sleep until 5 ms before next frame needs to be scanned out. Seems to work just fine on Xorg in fullscreen, which is pretty cool. What makes this so nice is that we can dynamically observe how much latency we need to be able to reach the deadline in time. While we could have used GPU timestamps, it gets hairy because we would need to correlate GPU timestamps with CPU time, and the presentation engine might need some buffering before an image is actually ready to be presented, so using presentMargin is the correct way.

As you’d expect, this doesn’t work on Android, because SurfaceFlinger apparently forces some buffering. presentMargin seems kinda broken as well on Android, so it’s not something I can rely on.

Use case #3: Adaptive locked 60 or 30 FPS

Variable FPS game loops are surprisingly complicated as discussed in the Croteam GDC talk. If we have a v-synced display at 60 FPS, the game should have a frame time locked to N * refreshDuration.

Instead of sampling timing deltas on CPU which is incredible brittle for frame pacing, we can take a better approach, where we try to lock our frame to N * refreshDuration + driftCompenstation. Based on observations over time we can see if we should drop our fixed rendering rate or increase it. This allows for butter smooth rendering void of jitter, but we can still adapt to how fast the GPU can render.

The driftCompentation term is something I added where we can deal with a dropped frame here and there and combat natural drift over time, so we can slowly sync up with real wall-time. For example, if we let the frame time jitter up to a few %, we can catch up quickly to any lost frame, and overall animation speed should remain constant over time. This might not be suitable for all types of content (fixed frame-rate pixel-artsy 2D stuff), but techniques like these can work well I think.

For adaptive refresh rate, we could for example have heuristics like:

  • If we observe at least N frame drops the last M frames, increase swap interval since it’s better to have steady low FPS than a jumpy, jittery mess.
  • If we observe that earliestPresentTime is consistently lower than actualPresentTime and presentMargin is decent, it’s a sign we should bump up the frame rate.

The way to implement a custom swap interval with this extension is fairly simple as we just specify desiredPresentationTime to have a cadence of two or more refreshDurations instead of one refreshDuration.

Getting the refresh interval

We can observe the refresh interval by calling


Normally, I would expect this to just work, but on Xorg, it seems like this is learned over time by looking at presentation timestamps, so we basically need to wait some frames before we can observe the true refresh cycle duration, Android doesn’t seem to have this problem.

An immediate question is of course how all this would work on a variable refresh rate display …

Important note on rounding issues

Just naively doing targetPresentTime = someOlderObservedPresentationTime + frameDelta * refreshDuration calculation will give you troubles. The spec says that the display controller cannot present before desiredPresentationTime, so due to rounding issues we might effectively say, “don’t present until frame 105.00000001, and the driver will say, “oh, I’ll wait a frame extra till frame 106, since 105 is before 105.0000001!”. This is a bit icky.

The solution seems to be just subtracting a few ms off the desiredPresentationTime just to be safe, but I actually don’t know what drivers actually expect here. Using absolute nanoseconds like this gets tricky very quickly.


For a future multi-vendor extension, some things should probably be reconsidered.

Absolute vs relative time?

The current model of absolute timings in nanoseconds is simple in theory, but it does get a bit tricky since we need to compensate for drift, estimate future timings with rounding considerations, and issues like these. Being able to specify relative timings between presents would simplify the cases where we just want to fiddle with present interval and not really caring exactly when the image comes on screen, but at the same time, relative timings would complicate cases where we want to present exactly at some time (e.g. a video player).

Time vs frame counters?

On fixed refresh rate displays, it seems a bit weird to use time rather than frame counters (media stream counters on X11). It is much easier to reason about, and the current X11 implementation translates time back and forth to these counters anyways.

The argument against frame counters would be variable refresh rate displays I suppose.

Just support all the things?

The best extension would probably support all variants, i.e.:

  • Let you use relative or absolute timing
  • Let you specify time in frame counters or absolute time
  • Let you query variable refresh rate vs fixed refresh rate

Implementation issues

I had a few issues on Android 8.0, which I haven’t been able to figure out. The Mesa implementation I had working was basically flawless in my book, sans the X11 issue of not knowing correct refresh rate ahead of time.

presentMargin is really weird

I don’t think presentMargin is working as intended, and many times when you drop a frame, you can observe that presentMargin becomes “negative” in int64_t, but the member is uint64_t, so it’s a bizarre overflow instead. The numbers I get back don’t really make sense anyways, so I doubt it’s tied to a GPU timestamp or anything like that.

earliestPresentationTime is never lower than actualPresentTime

This makes it impossible to use adaptive locked frame rates, since we cannot ever safely bump the frame rate.

Randomly dropping frames with almost no GPU load

SurfaceFlinger seems to have a very strange tendency to just drop frames randomly even if I have very low target refresh rate, and GPU time is like 1 ms per frame. I wonder if this is related to rounding errors or not, but it is a bit disturbing. Touching the screen helps, which might be a clue related to power management somehow, but it can still drop frames randomly even if I touch the screen all the time.

Here’s an APK if anyone wants to test. It’s based on this test app: https://github.com/Themaister/Granite/blob/master/tests/display_timing.cpp


The screen flashes red when a frame drops, it should be a smooth scrolling quad moving across the screen. It logs output with:

adb logcat -c && adb logcat -s Granite

and I often get output like:



I think this is a really cool extension, and I hope to see it more widely available. My implementation of this can be found in Granite for reference:


The state of Window System Integration (WSI) in Vulkan for retro emulators

In the world of graphics programming, interacting with the windowing system is not exactly the most riveting subject, but it is of critical importance to the usability of applications. Tearing, skipped frames, judder, etc, are all issues which stem from poor window system code. Over the years, it has been an ongoing frustration that good windowing system integration is something we just cannot rely on. For whatever reason, implementations and operating systems always find a way to screw things up.

For emulation, perfection is the only thing good enough. It is immediately obvious when tearing or skipped frames are observed in retro emulation. These games work on a fixed time step, and we must obey, or the result is borderline unplayable. For “normal” games, it seems like this isn’t as much of a concern. Games are written around the assumption of variable frame rates, users can disable V-Sync (especially for fast-paced, reaction based games), etc, and variable refresh rate display standards were introduced to get the best of both worlds. The problems I’m going to explore in this post can often be glossed over in the common case.

Requirements for RetroArch:

  • Perfect, tear-free 1:1 frame mapping if game frame rate and monitor frame rate are close enough. Essentially VK_PRESENT_MODE_FIFO_KHR when it’s working as intended.
  • Ability to toggle between locked and unlocked (fast forward) frame rates, seamlessly. This is a very emulation specific requirement, but extremely important. Unlocked could be either MAILBOX or IMMEDIATE.
  • Consistent frame pacing in FIFO mode. If there is too much variation in frame times, this translates into variable input latency for fixed FPS content, which is not ideal. It also hurts rate control for audio, although the frame pacing can get rather bad before this becomes a real issue. There’s no reason why we should have more than a millisecond in jitter for low-GPU load scenarios.
  • Control over latency. The GPU load of retro emulation is usually quite insignificant, but we need full control over the swap  chain, when swaps happen, and when we can begin a frame and poll inputs. Absurd amounts of effort has gone into aiming to reduce input latency by various developers, and all that effort consists of working around false assumption of GPU drivers which have been optimising for “normal game” FPS rather than latency and predictability. In Vulkan, with more explicit control over the swap chain, we should be able to control this far better than we ever could in legacy APIs. However, discussing latency (by like, measuring stuff with a high-speed camera) is outside of the scope of this post. There are more fundamental issues I would like to cover instead.
  • In (exclusive) full screen modes, we should have a flipping model, and if I ask for double buffer, I should get just that.
  • Borderless windowed full screen mode is also important for casual play when minimum latency isn’t the highest priority.
  • Windowed mode should be tear-free, without stutter, but is allowed to have a bit more latency than fullscreen because of compositors.

I have had the “great pleasure” of fighting with many different WSI implementations in the RetroArch Vulkan backend. I would like to summarise these, starting from “best” to “worst” for dramatic effect. I’ll also discuss some heinous workarounds I have had to employ to work around the worst implementations.

How it should work

Vulkan WSI is a fairly explicit model compared to its predecessors. You can request the number of images there should be in the swap chain, and you acquire and present these images directly. There is no magic “framebuffer 0”, or a magic SwapBuffers call which buffers an unknown amount for you.

There is no direct way to toggle between FIFO and MAILBOX/IMMEDIATE on a swap chain. Instead, we need to create a new swap chain. According to the specification, there can only be one non-retired swap chain active at a time. We have the option of passing in our “oldSwapchain” when creating a new swap chain. My understanding of this is that we can “pass over” ownership from one swap chain to the next. Passing in oldSwapchain will effectively retire the swap chain as well. If we just change the present mode for the new swap chain, this should give us a seamless transition over to the new present mode. After we have created the new swap chain, we can delete the old one and query the new (or the same!) swap chain images.

vkAcquireNextImageKHR will give us new images to render into, and it should be unblocked on V-Blank when new images are flipped onto the display. vkAcquireNextImageKHR is an asynchronous acquire operation, but for RetroArch I want to sync the frame begin to V-Blank, so I just wait on the VkFence to signal anyways to force a synchronous acquire. Now, it turns out all the implementation I’ve tried so far seem to implement a synchronous vkAcquireNextImageKHR anyways, so it doesn’t really matter.

vkQueuePresentKHR should queue a flip as we expect, but it shouldn’t block.

One of my gripes with WSI is that there is no way to tell if you’re going to get exclusive or borderless windowed modes. It seems to be driver magic that controls this unfortunately. I’m not even sure if this is an OS concept or not at this point.

The test matrix gets rather large, there are two OSes I test here:

  • Linux
  • Windows 10 (7 seems to behave very differently according to users, but I don’t have a setup for that)

Android should be on this list, but I haven’t tested that.

Surface types:

  • Wayland (on GNOME3)
  • X11/XCB (over Xorg and XWayland, on GNOME3)
  • Win32
  • KHR_display

Driver stacks:

  • Mesa – RADV / Anvil (AMD/Intel), the WSI code is shared and behave the same
  • AMDVLK (AMD Linux open source)
  • Nvidia (closed source)
  • AMD (closed source)

I’m using the latest public drivers as of writing. Now, time for some “good, bad and ugly” tiering for dramatic effect.

The good

These are implementations I consider fully functioning without any serious flaws, at least for RetroArch.

Mesa – Wayland – Linux

Wayland on Mesa is so far the shining implementation of WSI in my book.

  • Can toggle seamlessly between present modes without any hickup, missed frames or anything silly.
  • Frame pacing is excellent. About 2% frame time standard deviation (about 300 microseconds) in my measurements.
  • Supports MAILBOX for tear-free fast forward.

If I have to point out one flaw it’s the oddly reported minImageCount. It is 4 on Mesa Wayland, because of the MAILBOX implementation, which requires 4. However, even if I get 4 images in FIFO mode, it seems to only use 3 of them for some reason (no round-robin for you). I think this is a flawed implementation and minImageCount should be 2. It is perfectly valid for a WSI implementation to return more images than you request (Android seems to do this). It’s called “minImageCount” after all. But I think this highlights a Vulkan WSI flaw. minImageCount does not depend on presentMode!

I haven’t tested if it’s possible to get true double-buffer with page-flip on Mesa Wayland, but it should be fairly trivial to patch that.

Mesa – X11/XCB on Xorg – Linux

Xorg vsync has always been a serious pain point for me. It randomly works, or it doesn’t, depending on the phase of the moon. With DRI3 being used in Mesa’s implementation, it actually seems to work really well in both windowed and fullscreen. I would put it roughly on par with Wayland.

AMDVLK – Wayland – Linux (pending update to WSA and PAL)

I’ve submitted a lot of issues recently on their bugtracker to fix Wayland-related issues, and with the latest development branch I tried recently, AMDVLK now reached parity with Mesa’s Wayland implementation in quality. It’s all using the same low-level user space and kernel code thanks to the AMDGPU efforts, so this is to be expected, great stuff.

The bad

These are flawed, but useable.

Mesa – VK_KHR_display – Linux

The main issue is that fast-forward does not exist, i.e., no MAILBOX or IMMEDIATE, because DRM does not support true MAILBOX as it stands, IMMEDIATE is conditionally supported (not implemented, but could be very easily added), and only AMD seems to support it. I ranted about this topic here: http://themaister.net/blog/2018/07/02/improving-vk_khr_display-in-mesa-or-lets-make-drm-better/

Nvidia – VK_KHR_display – Linux

Nvidia was the first vendor to support VK_KHR_display, which I applaud. VK_KHR_display is great for focused emulation boxes (although the only reason I assume we got any VK_KHR_display implementation on Linux was Steam VR). We never got this “direct to display” path working on their GL driver, but it works on Vulkan, yay.

The flaw with this implementation is that toggling fast forward works, but it causes a strange mode change, and there’s basically a short pause for reprogramming the DRM Crtc (or what it is doing). This is unacceptable, but at least the non-fast forward experience is flawless in my book. At least Nvidia’s implementation can support MAILBOX or IMMEDIATE, so it is a bit better than Mesa’s KHR_display implementation.

Nvidia – XCB on Xorg – Linux

This one is rather disappointing. Windowed mode is a stuttering, tearing mess (but so is GL it seems). Full screen seems to start off tearing, but after a second or so it seems to figure out that it should move into a flip-like mode. Toggling presentation modes leads to a frame or two of corrupted garbage on screen, but it doesn’t trigger a mode change at least. Fully functional, but a bit rough around the edges.

AMDVLK – XCB on Xorg – Linux

Struggles with frame pacing and tearing, but might have been fixed now. I haven’t tested it that much, but I much prefer RADV on Xorg to this. After recent updates, AMDVLK’s Wayland backend is much better. The main reason to use this setup is for Radeon Graphics Profiler.

The ugly

Broken and buggy for my use cases.

Mesa – XCB on Wayland (XWayland) – Linux

Pretty awful, and a stuttering mess. Poking at the source, there’s likely some kind of fallback path with fallback blits to deal with this, but stay far away from this. Also, in fullscreen, vkAcquireNextImageKHR seems to block indefinitely on XWayland for some reason. At least, there is no reason why you need to subject yourself to this backend for WSI.

AMD – Windows 10

Windows WSI implementations seem to be their own kind of hell, but red beats green here. Windowed mode is just fine. Toggling presentation modes works just fine without issue.

It’s fullscreen where we get a lot of problems. Toggling presentation modes throws the application out to desktop for 3 seconds, then you get the mode change. This is just broken. I found that it’s the deleting of the oldSwapchain which triggers the issue. Not even getting a few present operations through in the new swap chain will save you. There is no escape. Basically, the conclusion I came to is that we can never ever create a new swap chain on Windows, or we are screwed. So now, I had to come up with ways to workaround this. Fortunately, Vulkan is flexible enough with the threading model that we can do some nasty tricks.

AMD seems to support vkAcquireNextImageKHR with a timeout of 0, but some other vendor did not …

Nvidia – Windows 10

This is nightmare fuel for RetroArch. It behaves like the AMD driver, except some added fun:

  • vkAcquireNextImageKHR with timeout == 0 does not seem to be implemented. It will just block. 🙁
  • It’s impossible to create a swap chain in certain cases. maxImageExtent will be 0 when you minimize a window or alt+tab out of fullscreen, which leads to some rather interesting workarounds. Because I need to allow a state where a swap chain does not exist yet, and avoid rendering to any swap chain related image while this goes on. No other vendor seems to have this behavior, but it is allowed by the specification, unfortunately.
  • Using oldSwapchain has been reported to break the driver, causing black screens. To work around this, I ifdef out oldSwapchain on Windows, and just destroy the swap chain before creating a new one. This breaks any hope of toggling presentation modes until this is fixed 🙁

Windows commonalities

  • Frame pacing in FIFO is great.
  • Windowed mode works just fine.
  • Fullscreen seems to be “exclusive” only. Alt-tabbing out of Vulkan is a rather sluggish operation, unlike borderless windowed which is designed to fix this.
  • Changing present modes in fullscreen is completely broken, and needs some serious workarounds.

The nasty workaround – MAILBOX emulation

To emulate fast forward, I ended up with a nasty hack for Windows. If fast forward is enabled in fullscreen mode, I spawn a thread which will do vkAcquireNextImageKHR in the background on-demand, and I’ll deal with the case where we haven’t acquired quite yet, by nooping out any access to the swap chain for that frame. The initial workaround for this was to just use timeout == 0, and avoid a dedicated thread, but … Nvidia’s implementation threw a wrench into that plan.

By doing it like this I can stay in FIFO present mode while faking a really terrible implementation of MAILBOX. Its performance isn’t that great, but at least I don’t get a 3 second delay just to trigger fast forward which should be instant in any sensible WSI implementation.

The Windows Vulkan experience in RetroArch should be not so terrible now, but know that it is only through several weeks of banging my head against the wall.

I hope some IHVs take this into consideration and make sure that toggling presentation modes works properly. Someone out there cares at least. No vendor I have seen so far deals with oldSwapchain in any way. There is a reason it’s there!

Render graphs and Vulkan — a deep dive

Modern graphics APIs such as Vulkan and D3D12 bring new challenges to engine developers. While the CPU overhead has dramatically been reduced by these APIs, it’s clear that it is difficult to bridge the gap in terms on GPU performance when we are hitting the “good” paths of the driver, and we are GPU bound. OpenGL and D3D11 drivers (clearly) go to extreme lengths in order to improve GPU performance using all sorts of trickery. The cost we pay for this as developers is unpredictable performance and higher CPU overhead. Writing graphics backends has become more interesting again, as we are still figuring out how to build great rendering backends for these APIs which balance flexibility, performance and ease of use.

Last week I released my side-project, Granite, which is my take on a Vulkan rendering engine. While there are plenty of such projects out in the wild, all with their own merits, I would like to discuss my render graph implementation in particular.

The render graph implementation is inspired by Yuriy O’Donnells GDC 2017 presentation: “FrameGraph: Extensible Rendering Architecture in Frostbite.” While this talk focuses on D3D12, I’ve implemented my own for Vulkan.

(Note: render graphs and frame graphs mean the same thing here. Also, if I mention Vulkan, it probably also applies to D3D12 as well … maybe)

The problem

Render graphs fundamentally solve a very annoying problem in modern APIs. How do we deal with manual synchronization? Let’s go over the obvious alternatives.

Just-in-time synchronization

The most straight forward approach is basically doing synchronization at the last minute. Whenever we start rendering to a texture, bind a resource or similar, we need to ask ourselves, “does this resource have pending work which needs to be synchronized?” If so, we need to somehow at the very last minute deal with it. This kind of tracking clearly becomes very painful because we might read a resource 1000+ times, while we only write to it once. Multithreading becomes very painful, what if two threads discover a barrier is needed? One thread needs to “win”, and now we have a lot of useless cross-thread synchronization hassles to deal with.

It’s also not just execution itself we need to track, we also have the problem of image layouts and memory access in Vulkan. Using a resource in a particular way will require a specific image layout (or just GENERAL, but you might lose framebuffer compression!).

Essentially, if what we want is just-in-time automatic sync, we basically want OpenGL/D3D11 again. Drivers have already been optimized to death for this, so why do we want to reimplement it in a half-assed way?

Fully explicit synchronization

On the other side of the spectrum, the API abstraction we choose completely removes automatic synchronization, and the application needs to deal with every synchronization point manually. If you make a mistake, prepare for some “interesting” debugging sessions.

For simpler applications, this is fine, but once you start going down this route you quickly realize what a mess it turns into. Typically your rendering pipeline will be compartmentalized into blocks — maybe you have the forward/deferred/whatever-is-cool-now renderer in one module, some post-processing passes scattered around in other modules, maybe you drag in some feedbacks for reprojection steps, you add a new technique here and there and you realize you have to redo your synchronization strategy — again, and things turn sour.

Why does this happen?

Let’s write some pseudo-code for a dead-simple post-processing pass and think about it.

// When was the last time I read from this image? Probably last frame later in the post-chain ...
// We want to avoid write-after-read hazards.
// We're going to write the whole image,
// so we might as well transition from UNDEFINED to "discard" the previous content ...
// Ideally I would keep careful track of VkEvents from earlier frames, but that got so messy ...
// Where was this render target allocated from?
BeginRenderPass(RT = BloomThresholdBuffer)

// This image was probably written to in the previous pass, but who knows anymore.


These kinds of problems are typically solved with a big fat pipeline barrier. Pipeline barriers let you reason locally about global synchronization issues, but they’re not always the optimal way to do it.

// To be safe, wait for all fragment execution to complete, this takes care of write-after-read and syncing the HDR render pass ...
// Assuming they are never used in async compute ... hm, this will probably work fine for now.

PipelineBarrier(FRAGMENT -> FRAGMENT,
    RT srcAccess: 0 (write-after-read)
    HDR dstAccess: SHADER_READ_BIT)


So we transitioned the HDR image, because we assumed it was the previous pass, but maybe in the future you add a different pass in between which also transitions … So now you still need to keep track of image layouts, bleh, but not the end of the world.

If you’re only dealing with FRAGMENT -> FRAGMENT workloads, this is probably not so bad, there isn’t all that much overlap which happens between render passes anyways. When you start throwing compute into the mix is when you start pulling your hair out, because you just can’t slap pipeline barriers like this all over the place, you need some non-local knowledge about your frame in order to achieve optimal execution overlap. Plus, you might even need semaphores because you’re doing async compute now in a different queue.

Render graph implementation

I’ll be mostly referring to these files: render_graph.hpp and render_graph.cpp.

Note: This is a huge brain dump. Try to follow along in the code while reading this, I’ll go through things in order.

Note #2: I use the terms “flush” and “invalidate” in the implementation. This is not Vulkan spec lingo. Vulkan uses the terms “make available” and “make visible” respectively. Flush refers to cache flushing, invalidate refers to cache invalidation.

The basic idea is that we have a “global” render graph. All components in the system which need to render stuff need to register with this render graph. We specify which passes we have, which resources go in, which resources are written and so on. This could be done once on application startup, once every frame, or however often you need. The main idea is that we form global knowledge of the entire frame and we can optimize accordingly at a higher level. Modules can reason locally about their inputs and outputs while allowing us to see the bigger picture, which solves a major issue we face when the backend API does not schedule automatically and deal with dependencies for us. The render graph can take care of barriers, layout transitions, semaphores, scheduling, etc.

Outputs from a render pass need some dimensions, fairly straight forward.


struct AttachmentInfo
	SizeClass size_class = SizeClass::SwapchainRelative;
	float size_x = 1.0f;
	float size_y = 1.0f;
	VkFormat format = VK_FORMAT_UNDEFINED;
	std::string size_relative_name;
	unsigned samples = 1;
	unsigned levels = 1;
	unsigned layers = 1;
	bool persistent = true;


struct BufferInfo
	VkDeviceSize size = 0;
	VkBufferUsageFlags usage = 0;
	bool persistent = true;

These resources are then added to render passes.

// A deferred renderer setup

AttachmentInfo emissive, albedo, normal, pbr, depth; // Default is swapchain sized.
emissive.format = VK_FORMAT_B10G11R11_UFLOAT_PACK32;
albedo.format = VK_FORMAT_R8G8B8A8_SRGB;
normal.format = VK_FORMAT_A2B10G10R10_UNORM_PACK32;
pbr.format = VK_FORMAT_R8G8_UNORM;
depth.format = device.get_default_depth_stencil_format();

auto &gbuffer = graph.add_pass("gbuffer", VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT);
gbuffer.add_color_output("emissive", emissive);
gbuffer.add_color_output("albedo", albedo);
gbuffer.add_color_output("normal", normal);
gbuffer.add_color_output("pbr", pbr);
gbuffer.set_depth_stencil_output("depth", depth);

auto &lighting = graph.add_pass("lighting", VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT);
lighting.add_color_output("HDR", emissive, "emissive");

lighting.add_texture_input("shadow-main"); // Some external dependencies

Here we see three ways which a resource can be used in a render pass.

  • Write-only, the resource is fully written to. For render targets, loadOp = CLEAR or DONT_CARE.
  • Read-write, preserves some input, and writes on top, for render targets, loadOp = LOAD.
  • Read-only, duh.

The story is similar for compute, here’s an adaptive luminance update pass, done in async compute

auto &adapt_pass = graph.add_pass("adapt-luminance", VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);
adapt_pass.add_storage_output("average-luminance-updated", buffer_info, "average-luminance");

The luminance buffer gets a RMW here for example.

We also need some callbacks which can be called every frame to actually do some work, for gbuffer …

gbuffer.set_build_render_pass([this, type](Vulkan::CommandBuffer &cmd) {
	render_main_pass(cmd, cam.get_projection(), cam.get_view());

gbuffer.set_get_clear_depth_stencil([](VkClearDepthStencilValue *value) -> bool {
	if (value)
		value->depth = 1.0f;
		value->stencil = 0;
	return true; // CLEAR or DONT_CARE?

gbuffer.set_get_clear_color([](unsigned render_target_index, VkClearColorValue *value) -> bool {
	if (value)
		value->float32[0] = 0.0f;
		value->float32[1] = 0.0f;
		value->float32[2] = 0.0f;
		value->float32[3] = 0.0f;
	return true; // CLEAR or DONT_CARE?

The render graph is responsible for allocating the resources and driving these callbacks, and finally submitting this to the GPU in the proper order. To terminate this graph, we promote a particular resource as the “backbuffer”.

// This is pretty handy for ad-hoc debugging 😛
const char *backbuffer_source = getenv("GRANITE_SURFACE");
graph.set_backbuffer_source(backbuffer_source ? backbuffer_source : "tonemapped");

Now let’s get into the actual implementation.

Time to bake!

Once we’ve set up the structures, we need to bake the render graph. This goes through a bunch of steps, each completing one piece of the puzzle …


Pretty straight forward, a quick sanity check to ensure that the data in the RenderPass structures makes sense.

One interesting thing here, is that we can check if color input dimensions match color outputs. If they differ, we don’t do straight loadOp = LOAD, but we can do a scaled blit instead on start of the render pass. This is super convenient for things like game rendering at lower-res -> UI at native res. The loadOp in this case becomes DONT_CARE.

Traverse dependency graph

We have an acyclic graph (I hope … :D) of render passes now, which we need to flatten down into an array of render passes. The list we create will be a valid submission order if we were to submit every pass one after the other. This submission order might not be the most optimal, but we’ll get close later.

The algorithm here is straight forward. We traverse the tree bottom-up. Using recursion, push the pass index of all the passes which write to backbuffer, then, for all those passes, push the writes for the resources in those passes … and so on until we reach the top leaves. This way, we ensure that if a pass A depends on pass B, pass B will always be found later than A in the list. Now, reverse the list, and prune duplicates.

We also register if a pass is a good “merge candidate” with another pass. For example, the lighting pass uses input attachments from gbuffer pass, and it shares some color/depth attachments … On tile-based architectures we can actually merge those passes without going to main memory using Vulkan’s multipass feature, so we keep this in mind for the reordering pass which comes after.

Render pass reordering

This is the first interesting step of the process. Ideally, we want a submission order which has optimal overlap between passes. If pass A writes some data, and pass B reads it, we want the maximum number of passes between A and B in order to minimize the number of “hard barriers”. This becomes our optimization metric.

The algorithm implemented is probably very inoptimal in terms of CPU time, but it gets the job done. It looks through the list of passes not yet scheduled in, and tries to figure out the best one based on three criteria:

  • Do we have a merge candidate as determined by the dependency graph traveral step earlier? (Score: infinite)
  • What is the latest pass in the list of already scheduled passes which we need to wait for? (Score: number of passes which can overlap in-between)
  • Does scheduling this pass break the dependency chain? (If so, skip this pass).

Reading the code is probably more instructive, see RenderGraph::reorder_passes().

Another sneaky consideration which should be included is when the lighting pass depends on some resources, while the G-buffer pass doesn’t. This can break subpass merging, because we go through this scheduling process:

  • Schedule in G-buffer pass, it has no dependencies
  • Try to schedule in lighting pass, but whoops, we haven’t scheduled the shadow passes which we depend on yet … Oh well 🙂

The dirty solution to this was to lift dependencies from merge candidates to the first pass in the merge chain. Thus, the G-buffer pass will be scheduled after shadow passes, and it’s all good. A more clever scheduling algorithm might help here, but I’d like to keep it as simple as possible.

Logical-to-physical resource assignment

When we build our graph, we might have some read-modify-writes. For lighting pass, emissive goes in, HDR result goes out, but clearly, it’s really the same resource, we just have this abstraction to figure out the dependencies in a sensible way, give some descriptive names to resources, and avoid cycles. If we had multiple passes, all doing emissive -> emissive for example, we have no idea which pass comes first, they all depend on each other (?), and I’d rather not deal with potential cycles.

What we do now is assign a physical resource index to all resources, and alias resources which do read-modify-write. If we cannot alias for some reason, it’s a sign we have a very wonky submission order which tries to do reads concurrently with writes. The implementation just throws its hands in the air in that case. I don’t think this will happen with an acyclic graph, but I cannot prove it.

Logical-to-physical render pass assignment

Next, we try to merge adjacent render passes together. This is particularly important on tile-based renderers. We try to merge passes together if:

  • They are both graphics passes
  • They share some color/depth/input attachments
  • Not more than one unique depth/stencil attachment exists
  • Their dependencies can be implemented with BY_REGION_BIT, i.e. no “texture” dependency, which allows sampling for arbitrary locations.

Transient or physical image storage

Similar story as subpass merging, tile-based renderers can avoid allocating physical memory for the attachment if you never actually write to it (with storeOp = STORE)! This can save a lot of memory for deferred especially, but also for depth buffers if they are not used later in post for example.

A resource can be transient if:

  • It is used in a single physical render pass (i.e. it never needs to storeOp = STORE)
  • It is invalidated at the start of the render pass (no loadOp = LOAD needed)

Build RenderPassInfo structures

Now, we have a clear view of all the passes, their dependencies and so on. It is time to make some render pass info structures.

This part of the implementation is very tied into how Granite’s Vulkan backend does things, but it closely mirrors the Vulkan API, so it shouldn’t be too weird. VkRenderPasses are generated on demand in the Vulkan backend, so we don’t do that here, but we could potentially bake that right now.

The actual image views will be assigned later (every frame actually), but subpass infos, number of color attachments, inputs, resolve attachments for MSAA, and so on can be done up front at least. We also build a list of which physical resource indices should be pulled in as attachments as well.

We also figure out which attachments need loadOp = CLEAR or DONT_CARE now by calling some callbacks. For attachments which have an input, just use loadOp = LOAD (or use scaled blits!). For storeOp we just say STORE always. Granite recognizes transient attachments internally, and forces storeOp = DONT_CARE for those attachments anyways.

Build barriers

It is time to start looking at barriers. For each pass, each resource goes through three stages:

  • Transition to the appropriate layout, caches need to be invalidated
  • Resource is used (read and/or writes happen)
  • The resource ends up in a new layout, with potential writes which need to be flushed later

For each pass we build a list of “invalidates” and “flushes”.

Inputs to a pass are placed in the invalidate bucket, outputs are placed in the flush bucket. Read-modify-write resources will get an entry in both buckets.

For example, if we want to read a texture in a pass we might add this invalidate barrier:

  • stages = FRAGMENT (or well, VERTEX, but I’d have to add extra stage flags to resource inputs)
  • access = SHADER_READ

For color outputs, we might say:


This tells the system that “hey, there are some pending writes in this stage, with this memory access which needs to be flushed with srcAccessMask. If you want to use this resource, sync with these things!”

We can also figure out a particular scenario here with render passes. If a resource is used as both input attachment and read-only depth attachment, we can set the layout to DEPTH_STENCIL_READ_ONLY_OPTIMAL. If color attachment is used also as an input attachment we can use GENERAL (programmable blending yo!), and similar for read-write depth/stencil with input attachment.

Build physical render pass barriers

Now, we have a complete view of each pass’ barriers, but what happens when we start to merge passes together? Multipass will likely perform some barriers internally as part of the render pass execution (think deferred shading), so we can omit some barriers here. These barriers will be resolved internally with VkSubpassDependency when we build the VkRenderPass later, so we can forget about all barriers which need to happen between subpasses.

What we are interested in is building invalidation barriers for the first pass a resource is used. For flush barriers we care about the last use of a resource.

Now, there are two cases we need to cover here to ensure that every pass can deal with synchronization before and after the pass executes.

Only invalidation barrier, no flush barrier

This is the case for read-only resources. We still need to guard ourselves against write-after-read hazards later. For example, what if the next pass starts to write to this resource? Clearly, we need to let other passes know that this pass needs to complete before we can start scribbling on a resource. The way this is implemented is by injecting a fake flush barrier with access = 0. access = 0 basically means: “there are no pending writes to be seen here!” This way we can have multiple passes back to back which all just read a resource. If the image layout stays the same and srcAccessMask is 0, we don’t need barriers.

Only flush barrier, no invalidation barrier

This is typically the case for passes which are “write only”. This lets us know that before the pass begins we can discard the resource by transitioning from UNDEFINED. We still need an invalidation barrier however, because we need a layout transition to happen before we start the render pass and caches need to be invalidated, so we just inject an invalidate barrier here with same layout and access as the flush barrier.

Ignore barriers for transients/swapchain

You might notice that barriers for transients are just “dropped” for some reason. Granite internally uses external subpass dependencies to perform layout transitions on transient attachments, although this might be kind of redundant now with the render graph. The swapchain is similar. Granite internally uses subpass dependencies to transition the swapchain image to finalLayout = PRESENT_SRC_KHR when it is used in a render pass.

Render target aliasing

The final step in our baking process is to figure out if we can temporally alias resources in the graph. For example, we might have two or more resources which exist at completely different times in a frame. Consider a separable blur:

  • Render a frame (Buffer #0)
  • Blur horiz (Buffer #1)
  • Blur vert (Should ping-pong back to buffer #0)

When we specify this in the render graph we have 3 distinct resources, but clearly, the vertical blur render target can alias with the initial render target. I suggest looking at Frostbite’s presentation here on their results with aliasing, it’s quite massive.

We could technically alias actual VkDeviceMemory here, but this implementation just tries to reuse VkImages and VkImageViews directly. I’m not sure if there is much to be gained by trying to suballocate directly from the dead corpses of other images and hope that it will work out. Something to look at if you’re really starved for memory I guess. The merit of aliasing image memory might be questionable, as VK_*_dedicated_allocation is a thing, so some implementation might prefer that you don’t alias. Some numbers and IHV guidance on this is clearly needed.

The algorithm is fairly straight forward. For each resource we figure out the first and last physical render pass where a resource is used. If we find another resource with the same dimensions/format, and their pass range does not overlap, presto, we can alias! We inject some information where we can transition “ownership” between resources.

For example, if we have three resources:

  • Alias #0 is used in pass #1 and #2
  • Alias #1 is used in pass #5 and #7
  • Alias #2 is used in pass #8 and #11

At the end of pass #2, the barriers associated with Alias #0 are copied over to Alias #1, and the layout is forced to UNDEFINED. When we start pass #5, we will magically wait for pass #2 to complete before we transition the image to its new layout. Alias #1 hands over to alias #2 after pass #7 and so on. Pass #11 hands over control back to alias #0 in the next frame in a “ring”-like fashion.

Some caveats apply here. Some images might have “history” or “feedback” where each image actually has two instances of itself, one for current frame, and one for previous frame. These images should never alias with anything else. Also, transient images do not alias. Granite’s internal transient image allocator takes care of this aliasing internally, but again, with the render graph in place, that is kind of redundant now …

Another consideration is that adding aliasing might increase the number of barriers needed and reduce GPU throughput. Maybe the aliasing code needs to take extra barrier cost into consideration? Urk … At least if you know your VRAM size while baking, you have a pretty good idea if aliasing is actually worth it based on all the resources in the graph. Optimizing the dependency graph for maximum overlap also greatly reduces the oppurtunities for aliasing, so if we want to take memory into consideration, this algorithm could easily get far more involved …

Preparing resources for async compute

For async compute, resources might be accessed by both a graphics and a compute queue. If their queue families differ (ohai AMD), we have to decide if we want EXCLUSIVE or CONCURRENT queue access to these resources. For buffers, using CONCURRENT seems like an obvious choice, but it’s a bit more complicated with images. In the name of not making this horribly complicated, I went with CONCURRENT, but only for the resources which are truly needed in both compute and graphics passes. Dealing with EXCLUSIVE will be brutal, because now we have to consider read-after-read barriers as well and ping-pong ownership between two queue families 😀 (Oh dear)


A lot of stuff to consider to go through, but now we have all the data structures in place to start pumping out frames.

The runtime

While baking is a very involved process, executing this is reasonably simple, we just need to track the state of all resources we know about in the graph.

Each resource stores:

  • The last VkEvent. If we need to ask ourselves, “what do I need to wait for before I touch this resource”, this is it. I opted for VkEvent because it can express execution overlap, while pipeline barriers cannot.
  • The last VkSemaphore for graphics queue. If the resource is used in async compute, we use semaphores instead of VkEvents. Semaphores cannot be waited on multiple times, so we have a semaphore which can be waited on once in the graphics queue if needed.
  • The last VkSemaphore for compute queue. Same story, but for waiting in the compute queue once.
  • Flush stages (VkPipelineStageFlags), this contains the stages which we need to wait for (srcStageMask) if we need to wait for the resource.
  • Flush access (VkAccessFlags), this contains the srcAccessMask of memory we need to flush before we can use the resource.
  • Per-stage invalidation flags (VkAccessFlag for each pipeline stage). These bitmasks keep track of in which pipeline stages and access flags it is safe to use the resource. If we figure out that we have an invalidation barrier, but all the relevant stages and access bits are already good to go, we can drop the barrier altogether. This is great for cases where we read the same resource over and over, all in SHADER_READ_ONLY_OPTIMAL layout.
  • The current layout of the resource. This is currently stored inside the image handles themselves, but this might be a bit wonky if I add multithreading later …

For each frame, we assign resources. At the very least we have to replace the swapchain image, but some images might have been assigned as “not persistent”, in which case we allocate a fresh resource every frame. This is useful for scenarios where we trade more memory usage (more copies in flight on the GPU) for removal of all cross-frame barriers. This is probably a terrible idea for large render targets, but small compute buffers of a few kB each? Duh. If we can kick off GPU work earlier, that’s probably a good thing.

If we allocate a new resource, all barrier state is cleared to its initial state.

Now, we get into pushing render passes out. The current implementation loops through all the passes and deal with barriers as they come up. If you interleave this loop hard enough, I’m sure you’ll see some multithreading potential here 🙂

Check conditional execution

Some render passes do not need to be run this frame, and might only need to run if something happened (think shadow maps). Each pass has a callback which can determine this. If a pass is not executed, it does not need invalidation/flush barriers. We still need to hand over aliasing barriers, so just do that and go to next pass.

Handle discard barriers

If a pass has discard barriers, just set the current layout of the image to UNDEFINED. When we actually do the layout transition, we will have oldLayout = UNDEFINED.

Handle invalidate barriers

This part comes down to figuring out if we need to invalidate some caches, and potentially flush some caches as well. There are some things we have to check here:

  • Are there pending flushes?
  • Does the invalidate barrier need a different image layout than the current one?
  • Are there some caches which have not been flushed yet?

If the answer to either question is yes, we need some kind of barrier. We implement this barrier in one of three ways:

  • vkCmdWaitEvents – If the resource has a pending VkEvent, along with appropriate VkBufferMemoryBarrier/VkImageMemoryBarrier.
  • vkQueueSubmit w/ semaphore wait. Granite takes care of adding semaphores at submit time. We push in a wait semaphore along with dstWaitStageMask which matches our invalidate barrier. If we also need a layout transition, we can add a vkCmdPipelineBarrier with srcStageMask = dstStageMask to latch onto the dstWaitStageMask … and keep the pipeline going. We generally do not need to deal with srcAccessMask if we waited on a semaphore, so usually this will just be forced to 0.
  • vkCmdPipelineBarrier(srcStage = TOP_OF_PIPE_BIT). This is used if the resource hasn’t been used before, and we just need to transition away from UNDEFINED layout.

The barriers are batched up as appropriate and submitted. Buffers are much simpler as they do not have layouts.

After invalidation we mark the appropriate stages as properly invalidated. If we changed the layout or flushed memory access as part of this step, we clear everything to 0 before this step.

Execute render passes

This is the easy part, just call begin/nextsubpass/end and fire off some callbacks to push the real graphics work. For compute, just drop the begin/end.

For graphics we might do some scaled blits at the beginning and some automatic mipmap generation at the end.

Handle flush barriers

This part is simpler. If there is at least one resource which is only used in a single queue, we signal an VkEvent here and assign it to all relevant resources. If we have at least one resource which is used cross-queue, we also signal two semaphores here (one for graphics, one for compute later …)

We also update the current layout, and mark flush stages/flush access for later use.

Alias handoff

If the resource is aliased, we now copy the barrier state of a resource over to its next alias, and force the layout to UNDEFINED.


The command buffer for each pass is now submitted to Granite. Granite tries to batch up command buffers until it needs to wait for a semaphore or signal one.

Scale to swapchain

After all the passes are done, we can inject a final blit to swapchain if the backbuffer resource dimensions do not match the actual swapchain. Otherwise, we alias those resources anyways, so no need for useless blitting passes.


Hopefully this was interesting. The word count of this post is close to 5K at this point, and the render graph is a 3 ksloc behemoth (sigh). I’m sure there are bugs (actually I found two in async compute while writing this), but I’m quite happy how this turned out.

Future goals might be trying to see if this can be made into a reusable, standalone library and getting some actual numbers.