A tour of Granite’s Vulkan backend – Part 5

Render passes and synchronization

This is part 5 in the tour of Granite's Vulkan backend. We're going to get knee-deep in the aspects of Vulkan which are, in my opinion, the most difficult to learn, and mastering these topics is the real hurdle on the path to Vulkan mastery. This level of understanding is something high-level APIs will prevent you from reaching.

This post isn’t intended to be a tutorial on Vulkan synchronization, so I’ll assume some basic level of knowledge.

Render passes

Render passes are a fundamental part of Vulkan which has no equivalent in the legacy APIs. In older APIs you can freewheel how you render to frame buffers, but that approach was always terrible on tile-based GPUs, and these days, with hybrid tilers, it's probably terrible on desktop as well. Vulkan forces you to think about rendering everything you need to a frame buffer in one go, then proceeding to the next.

In Granite, I wanted to make sure most of the flexibility and explicitness of Vulkan render passes could be expressed with minimal boilerplate. Most projects don't seem to pay attention to render passes except as something you just have to do, and very few see the benefits they bring. That is probably a reasonable stance for 2019 if you do not care about mobile performance. If you need to target mobile though, it is worth the extra work. As of writing, the feature is quite mobile-centric, but desktop GPUs seem to be inching towards tile-based architectures, so it will be interesting to see if this view on render passes will shift in the future. Even D3D12 recently got render passes, albeit in a simplified form.

In the most basic form, render passes in Vulkan can be rather daunting to set up, and it’s one of the many battles you have to fight to get hello triangle on screen. If we take a render pass with just one sub-pass (the case we care about 99% of the time), we need to specify up front:

  • How many attachments?
  • Which formats are used?
  • How many MSAA samples?
  • initialLayout and finalLayout
  • Which image layouts to use while rendering?
  • Do we load from memory or clear the attachment on render pass begin?
  • Do we bother storing the attachments to memory?

Most of this information is boilerplate we can automate, but things like load/clear/store decisions cannot be deduced by the backend until it is too late. Knowing this kind of information up front can be very beneficial for bandwidth consumption, at least on mobile.
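
To make this concrete, here is roughly what a single color attachment demands up front in raw Vulkan; every field answers one of the bullet points above (the values are just an example):

VkAttachmentDescription color = {};
color.format = VK_FORMAT_B8G8R8A8_UNORM;
color.samples = VK_SAMPLE_COUNT_1_BIT;
color.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;              // clear on render pass begin
color.storeOp = VK_ATTACHMENT_STORE_OP_STORE;            // bother storing to memory
color.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
color.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
color.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;         // discard whatever was there
color.finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;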

The ugly framebuffer objects

An ugly aspect of Vulkan is the use of VkFramebuffer. I want an API where I just say “start a render pass where we render to these attachments”. Creating “FBOs” up front was really ugly in GL, and I think it’s a bad abstraction to have API users carry around ownership of objects which represent little to no useful work. FBOs are empty husks which might as well just be an array of image views.

We could just create a VkFramebuffer for every render pass we begin and defer its deletion right away, but creating these objects has some cost. There's a handle allocation in the driver at minimum, and probably a little more on certain drivers. Here I just reuse the temporary hashmap allocator which I introduced in the descriptor set model post. Since VkFramebuffer objects are immutable, they can be reused over multiple frames, and if one isn't used for a while, it is simply deleted.

Automating VkRenderPass creation

This topic is actually quite complicated when we start diving into the deep end of Vulkan render passes, but we can start with the trivial cases. The API in Granite looks something like:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 1;
rp.color_attachments[0] = &rt->get_view();
rp.store_attachments = 1 << 0;
rp.clear_attachments = 1 << 0;

rp.clear_color[0].float32[0] = 1.0f;
rp.clear_color[0].float32[1] = 0.0f;
rp.clear_color[0].float32[2] = 1.0f;
rp.clear_color[0].float32[3] = 0.0f;

cmd->begin_render_pass(rp);

This is an immediate way of starting a render pass, no reason to create frame buffers up front and all that. VkRenderPass can be created lazily on-demand like everything else.

Formats / MSAA sample counts

Render passes need to know formats and sample counts, and since we pass the concrete attachments directly in begin_render_pass(), we have the information we need right here.

Image layouts and VK_SUBPASS_EXTERNAL dependencies

There are three kinds of attachments in Granite:

  • User-created. These attachments are render targets created with Device::create_image(); the backend does not own the resource or know anything about how long it will live. This is the common case for render targets.
  • WSI images. These images are special since they come from VkSwapchainKHR or some equivalent mechanism. We know that these images are only used for rendering and are only consumed by the presentation engine, or some other mechanism.
  • Transient images. Images with transient usage flags only live inside render passes. They cannot be sampled from, and their memory does not necessarily exist except in page tables which point to /dev/null. We don't care what happens to these images once the render pass is done.

To deduce image layouts for a render pass we have a few different paths.

WSI images

I don’t care about preserving WSI images over multiple frames, and I don’t care about sampling from WSI images or any such weird use case after rendering, so the flow of image layouts is:

  • initialLayout = UNDEFINED (discard)
  • VkAttachmentReference -> COLOR_ATTACHMENT_OPTIMAL or whatever is required for the subpass
  • finalLayout = PRESENT_SRC_KHR or whatever layout we need when using external WSI. For something like libretro, this will be SHADER_READ_ONLY_OPTIMAL since the image will be handed off to some other render pass which we don’t know or care about. For headless PNG/video dumping, it might be TRANSFER_SRC_OPTIMAL.

When initialLayout != the layout used in the first subpass, vkCmdBeginRenderPass will actually need to perform a memory barrier implicitly to make this work. The big question is when this memory barrier takes place, and the answer is “as soon as possible” (TOP_OF_PIPE_BIT) if we don’t specify it anywhere. For this case, Granite will add a subpass dependency which waits for VK_SUBPASS_EXTERNAL in the COLOR_ATTACHMENT_OUTPUT stage. This latches onto the WSI acquire semaphore, more on that later.
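
The generated dependency looks roughly like this; a sketch of what the backend fills in, with the exact masks depending on the render pass:

VkSubpassDependency dep = {};
dep.srcSubpass = VK_SUBPASS_EXTERNAL;
dep.dstSubpass = 0;
dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT; // where the acquire semaphore waits
dep.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.srcAccessMask = 0;
dep.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;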

The final layout differs from the last layout used in the render pass, so we get a transition at the end of the render pass, but we don't need to care about external subpass dependencies here. The automatically generated one is perfect, and we're going to use the WSI release semaphore to properly synchronize this image anyways.

When we see a WSI image in a render pass, we can trivially mark this command buffer as “touching WSI”. This will affect command buffer submission later. This is indeed the kind of tracking which I have been arguing against in earlier posts, but it’s so trivial that it’s a no-brainer to me in this case.

Transient images

For transient images, we automate it just like WSI images, except that finalLayout will match last used layout in the render pass. This way we avoid some useless image layout transition at the end of the render pass. Next time we use the image, it’s going to be discarded anyways.

Because we deal with transitions automatically, users can freely pull images from Vulkan::Device with get_transient_attachment(), render to it, and forget about it. This is super convenient for things like MSAA rendering where the multi-sampled attachment just needs to exist temporarily for purposes of being resolved into the single-sampled attachment we care about. Having to care about synchronization for resources we don’t own is weird I think.

Other images

For any other image, we need to avoid any implicit layout transition, so we simply force initialLayout to match the first use in the render pass, and finalLayout to match the last use. In our small sample, it's all going to be COLOR_ATTACHMENT_OPTIMAL. It's up to the API user to know what layouts a render pass will expect, but it's straightforward to map a render pass to expected layouts. Color attachments are COLOR_ATTACHMENT_OPTIMAL, depth-stencil is DEPTH_STENCIL_OPTIMAL or DEPTH_STENCIL_READ_ONLY_OPTIMAL based on the read/write mode for depth, input attachments are SHADER_READ_ONLY, etc. It's possible to use an attachment for multiple purposes in a subpass, and Granite supports that as well. Some examples:

  • Color attachment + input attachment: Feedback loop ala GL_ARB_texture_barrier (super useful for certain emulators) -> GENERAL
  • RW Depth attachment + input attachment (some weird decal algorithm?) -> GENERAL
  • Read-Only depth + input attachment (deferred shading use case) -> DEPTH_STENCIL_READ_ONLY_OPTIMAL

All of this is analyzed when a newly observed VkRenderPass is created, and subpass dependencies are set up automatically. Anything which happens outside the render pass is the user’s responsibility.

08 – Bare-bones “deferred rendering” sample

I made a cut-down sample to show how the API expresses multi-pass in the context of deferred rendering with transient gbuffer + depth. The meat of it is:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 3;
rp.color_attachments[0] = &device.get_swapchain_view();

rp.color_attachments[1] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		0);
rp.color_attachments[2] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		1);

rp.depth_stencil = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		device.get_default_depth_format());

rp.store_attachments = 1 << 0;
rp.clear_attachments = (1 << 0) | (1 << 1) | (1 << 2);
rp.op_flags = Vulkan::RENDER_PASS_OP_CLEAR_DEPTH_STENCIL_BIT;

Vulkan::RenderPassInfo::Subpass subpasses[2];
rp.num_subpasses = 2;
rp.subpasses = subpasses;

rp.clear_depth_stencil.depth = 1.0f;

subpasses[0].num_color_attachments = 2;
subpasses[0].color_attachments[0] = 1;
subpasses[0].color_attachments[1] = 2;

subpasses[0].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadWrite;

subpasses[1].num_color_attachments = 1;
subpasses[1].color_attachments[0] = 0;
subpasses[1].num_input_attachments = 3;
subpasses[1].input_attachments[0] = 1;
subpasses[1].input_attachments[1] = 2;
subpasses[1].input_attachments[2] = 3;  // Depth attachment
subpasses[1].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadOnly;

cmd->begin_render_pass(rp);
// Do work
cmd->next_subpass();
// Do work
cmd->end_render_pass();

See code comments in sample for more detail. To write this kind of sample in raw Vulkan would be almost a full day’s project.

Synchronization

Unlike many aspects of Granite which are reasonably high-level, synchronization in Granite is almost 100% explicit. The general philosophy of Granite is that excessive tracking of resource use is a no-no, unless it is trivial to do so (e.g. WSI images). Synchronization is a case where you need a lot of tracking to do a good job, and it is impossible to do a perfect job since you end up relying on heuristics, at least if you are to implement automatic synchronization on top of Vulkan. GL and D3D11 drivers have an advantage here since they can tap into GPU-specific synchronization mechanisms which might be better suited for implicit synchronization. A good example here is the i915 driver stack in the Linux DRM stack which can do automatic synchronization in kernel space somehow. I’m sure that simplifies the Mesa GL driver a lot, but I don’t know the details.

Let’s go through a thought experiment where we look at the big problems we run into if we are to implement a fully automatic barrier system. (I have tried :p)

Problem #1 – We cannot rewind time

When touching a resource, we must ask ourselves: “When and where was this resource accessed last?” We have three potential solutions to resolve a hazard:

  • Pipeline barrier (was used just now)
  • Event (was used some time ago in this queue)
  • Semaphore (was used in a different queue)

Ideally, we need to inject a barrier at the exact point where a resource was last used, but we cannot inject new commands in the middle of a command buffer which has already been recorded.

There is no winning this fight: either we eagerly inject barriers after every command in the hope that some future command will need to synchronize against it (VkEvent is nice here), or we inject barriers too late, stalling the GPU needlessly.

Eagerly injecting barriers is pure insanity if we take into account that the resource might be used on a different queue in the future. That would mean signalling a heavyweight semaphore after every render pass or command. We could simply ignore supporting multiple queues, but that's a huge compromise to make.

Problem #2 – Redundant tracking of read-only resources

A problem I found while trying to implement automatic barrier tracking was that static resources might be written in the future, so we need to keep track of them anyway. This is a waste of CPU time. It might be possible to explicitly mark these resources as "do not track, they're truly static, pinky promise", but that feels like bolting on hacks.

Problem #3 – Multi-threading

The question of “where was this resource touched last” might not actually be possible to answer in a multi-threaded scenario. If we are recording command buffers in parallel, the backend has no idea what is going on until we serialize execution in vkQueueSubmit. A common solution I have seen for this problem is to resolve synchronization internally in command buffers as they are recorded, and on command buffer submission time, we can look at all used resources and resolve any cross-command buffer synchronization points right before submitting the command buffers in vkQueueSubmit. The complexity starts to shoot through the roof though. That’s a good sign we need to rethink.

I think this is the kind of solution you end up with when you have no choice but to port some old legacy API to Vulkan, and breaking the API abstraction is not an option.

Render graphs

A Vulkan backend which solves synchronization can only look back in time and deal with hazards at the last minute, but that is only because we do not have any context about what the application is doing. At a higher level, we can know what is going to happen in the future, and we can make automated decisions at that level, where we actually have context about what is going on. This is another reason why I do not want to have automatic synchronization in a Vulkan backend. Either we get a sub-optimal solution, or we try to close the gap with heuristics, but now run-time behavior is completely non-deterministic. We should aim for something better.

I believe the synchronization problem should be solved at a higher level, like a render graph. I wrote a blog some time ago about this topic: http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

Signalling fences

Granite’s way of signalling fences is very similar to plain Vulkan.

Vulkan::Fence fence;
device.submit(cmd, &fence);

// Somewhere else, potentially in a different thread.
fence->wait();
// fence ref-count goes to 0, queued up for recycling.

There is a pool of VkFence objects which can be reused. Signalling a fence forces a vkQueueSubmit. Once the lifetime of a Vulkan::Fence ends it is recycled back to the frame context. Nothing out of the ordinary here.

Semaphores

Semaphores work very similarly to fences and are requested in Device::submit() like fences. As in Vulkan, they carry the restriction that they can only be waited on once. Semaphores can be waited on in other queues with Device::add_wait_semaphore(), specifying a particular queue and pipeline stage. This matches pWaitDstStageMask in VkSubmitInfo. Semaphores are also recycled like fences.
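
A sketch of cross-queue use might look like the following. The exact signatures and the queue type enum are assumptions based on how the samples use the API:

// Signal a semaphore when the graphics submission completes.
Vulkan::Semaphore sem;
device.submit(cmd, nullptr, 1, &sem);

// Make the transfer queue wait for it at the TRANSFER stage.
device.add_wait_semaphore(Vulkan::CommandBuffer::Type::Transfer,
                          sem, VK_PIPELINE_STAGE_TRANSFER_BIT);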

Events

Events can be signalled and later waited on in the same queue. Again, we have a pool of VkEvent objects, CommandBuffer::signal_event() requests a fresh event, signals it with vkCmdSetEvent and hands it to the user. VkEvents are recycled using the frame context. There is a similar CommandBuffer::wait_event() which maps 1:1 to vkCmdWaitEvents.
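
Since these map 1:1 to the Vulkan calls, the raw equivalent is the familiar split barrier:

vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_TRANSFER_BIT);

// ... record unrelated work which does not touch the resource ...

VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
vkCmdWaitEvents(cmd, 1, &event,
                VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                1, &barrier, 0, nullptr, 0, nullptr);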

Barriers

Granite has many different methods to inject pipeline barriers, the most common ones are:

cmd->barrier(srcStage, srcAccess, dstStage, dstAccess);

which maps to a vkCmdPipelineBarrier with a VkMemoryBarrier, and image barriers which act on all subresources:

cmd->image_barrier(image, oldLayout, newLayout, srcStage, srcAccess, dstStage, dstAccess);

There are cases where we want to batch barriers or otherwise use more complicated commands than this, so there are also 1:1 interfaces to vkCmdPipelineBarrier where the full structures are passed in, but these are only really used by the render graph since writing out full structures is super painful.

The automatic barriers in Granite

There are a few instances where I think having automatic barriers makes sense. These are cases where it’s convenient to do so, and there is no tracking required, so we can resolve all hazards right away and forget about it. Some of them we’ve seen already, like WSI images and transient images in render passes.

The other major case is static read-only resources. If you pass in initial_data to Device::create_buffer() or Device::create_image(), we generally have a desire to upload some data, and never touch it again.

The general gist of it is that we can upload data with a staging buffer over the transfer queue and inject semaphores which block all possible pipeline stages (based on the bufferUsage/imageUsage flags). The downside is that we might end up creating too many submissions if we somehow want to upload a ridiculous number of buffers or images in one go, but we can opt out of this automatic behavior by simply not passing initial_data and doing all the batching and synchronization work ourselves.

The end goal is that we should be able to call create_buffer or create_image and just use the static resource right away without having to think about synchronization at all.
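
In practice, that looks something like the following sketch (field names follow the samples; vertex_data is a hypothetical array):

Vulkan::BufferCreateInfo info = {};
info.size = sizeof(vertex_data);
info.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
info.domain = Vulkan::BufferDomain::Device;

// Staging upload, semaphores and queue ownership transfer all happen behind the scenes.
auto vbo = device.create_buffer(info, vertex_data);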

09 – Rendering to image and reading it back to CPU on transfer queue

I wrote a sample which flexes most of the synchronization APIs. It renders a small 4×4 texture on the graphics queue, synchronizes that with the transfer queue using a semaphore, and reads it back to a CACHED host buffer. We spawn threads which wait on a fence, map the buffer and read the results.

Conclusion

In these parts of the backend, the low-level explicit nature of Vulkan shines through. I think we have to be fairly low-level, or we inherit most of the problems with the older APIs.

… up next!

In the next installment, we’ll have a look at pipeline creation.

A tour of Granite’s Vulkan backend – Part 4

Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. And it's not just uniform buffers; it's fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates as well.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL 4.5’s GL_ARB_buffer_storage, which supports persistently mapped buffers like Vulkan, but relying on GL 4.5 (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite's main codebase normally uses GLFW, so to make the sample a little less redundant, I wrote it to use SDL2's Vulkan support, which is very similar to GLFW's and has been available since SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, the WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us, and owns a VkSwapchainKHR. The only thing it cannot do on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object, so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2's built-in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially and automatically handling synchronization for WSI resources is straight-forward to the point that there is no point in exposing that to the user. (Synchronization in Granite is in general very explicit, but WSI is one of few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
device.submit(cmd);
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI code is a must to abstract in Vulkan, and I'm happy with the flexibility and simplicity of use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, we're getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. This is the perfect allocator for when we want to allocate a lot of small blobs, and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset to 0 again, i.e. all memory in one go is just “winked away”.
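
The core of such an allocator fits in a handful of lines; a minimal sketch:

struct LinearAllocator
{
	uint8_t *base = nullptr;
	size_t capacity = 0;
	size_t offset = 0;

	void *allocate(size_t size, size_t alignment)
	{
		size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
		if (aligned + size > capacity)
			return nullptr; // exhausted, caller grabs a fresh buffer
		offset = aligned + size;
		return base + aligned;
	}

	// "Wink away" every allocation in one go.
	void reset() { offset = 0; }
};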

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight and once exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious limit to use (very content and scene dependent). I opted for a recycled pool of smaller buffers instead since Granite’s descriptor binding model is very flexible as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory is accessed over PCI-e at a far slower rate. For streamed data like vertex, index and uniform buffers, it might seem reasonable that we should copy from the CPU side to the GPU side and let the GPU consume the data in fast VRAM. Granite supports both modes: one where the GPU reads directly from HOST_VISIBLE memory, and one where we automatically perform staging buffer copies from the CPU-side buffer over to the GPU.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on persistently mapped buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.
cmd->draw_indexed();

// Pointers are now invalidated.
device.submit(cmd);

The allocation functions are just light wrappers which allocate and bind the buffer at the appropriate offset. It's perfectly possible to roll your own linear allocation system, e.g. if you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.
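
Conceptually, each wrapper boils down to a few lines; a sketch with illustrative names, not the actual implementation:

void *CommandBuffer::allocate_vertex_data(unsigned binding, VkDeviceSize size, VkDeviceSize stride)
{
	// Sub-allocate from the recycled pool of linear allocators.
	auto alloc = request_scratch_memory(size);
	set_vertex_binding(binding, *alloc.buffer, alloc.offset, stride);
	return alloc.host; // persistently mapped, caller fills in the data
}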

Conclusion

I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.

A tour of Granite’s Vulkan backend – Part 3

Shaders and descriptor sets

This is part 3 of a blog series I'm writing on Granite's Vulkan backend. In this episode we are looking at how we deal with shaders and descriptor sets. At this point in our design process, there are many, many choices to make. Descriptor sets especially need to be carefully considered.

Hash all the things

A theme we start to see now is hashmaps and lazy creation of objects. One thing you run into with Vulkan's pipeline-related types is how much work it is to be explicit all the time. The amount of information we need to provide is staggering. I believe it is not healthy for mind and soul to work at this low level except in special cases, and we should aggressively hide away detail where we can. There is naturally a clock cycles vs. sanity trade-off to be made here.

You can argue that the lines between high-level GL/D3D11-style design and Granite’s model are quite blurred. The (mental) price to pay to be explicit is just not worth it in my opinion. I will try to explore the obvious alternatives here and provide more context why the design is the way it is.

04 – Shaders and pipeline layouts

The first step in creating a pipeline is of course, to create a VkShaderModule from our SPIR-V code. This is a no-brainer, but next we need a pipeline layout, which in turn requires VkDescriptorSetLayouts. The sample is here https://github.com/Themaister/Granite-MicroSamples/blob/master/04_shaders_and_programs.cpp.

Rather than manually declaring the pipeline layout like a caveman, I think using reflection to automatically generate layouts is a good idea. There is no reason for users to copy information which already exists in the shaders. For the reflection, I use SPIRV-Cross. If we never need to compile SPIR-V at runtime (the game engine scenario), there is no reason why we cannot shift the reflection step off-line as well and just pass the side-band data along, removing a runtime dependency. I never got as far as building a nice off-line SPIR-V baking pipeline, so I just compile GLSL on the fly with shaderc. However, the interface in the Vulkan backend just consumes raw SPIR-V.
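
To give a feel for the reflection step, here is a minimal sketch with SPIRV-Cross which extracts set/binding decorations for uniform buffers (the layout structure being filled in is hypothetical):

spirv_cross::Compiler compiler(spirv_words, word_count);
auto resources = compiler.get_shader_resources();

for (auto &ubo : resources.uniform_buffers)
{
	uint32_t set = compiler.get_decoration(ubo.id, spv::DecorationDescriptorSet);
	uint32_t binding = compiler.get_decoration(ubo.id, spv::DecorationBinding);
	layout.sets[set].uniform_buffer_mask |= 1u << binding;
}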

A common mistake beginners tend to make is to think that names are important in binding interfaces. This is a habit carried over from the GL and D3D11 days. The only things we should care about are descriptor set, binding and location decorations, as well as push constant use. This is the only semantic information we need to create binding interfaces, i.e. pipeline layouts and pipelines.

A pipeline layout in Vulkan needs to know all shader stages a-la GL programs, so we also need a step to combine shaders into a Vulkan::Program. Here we take the union of reflection information and request handles for Vulkan::DescriptorSetAllocator and Vulkan::PipelineLayout. This is hashed, but there is no performance concern here since we should do all of this work in load time when possible anyways. These handles are all owned internally in Vulkan::Device, and there is no reason to worry about object lifetime for these objects.

I don’t think there is a reason to deviate far from this design unless you have a very specific scheme in mind with descriptor set allocation. As I’ll explore later, using bindless descriptors extensions or explicit descriptor set allocation could motivate use of a “standard” pipeline layout, in which case reflection gets kind of meaningless anyways.

05 – The binding model – embracing laziness

I never really had a problem with the old-school way of binding resources to binding slots. It just isn't the part of the old APIs I felt was lacking, so Granite is kind of old school here, but it does have full consideration for descriptor sets, and I removed any impedance mismatch with Vulkan (i.e. no translation is needed to bridge between Granite and Vulkan). E.g.:

cmd->set_storage_buffer(set, binding, *resource);
cmd->set_texture(set, binding, resource->get_view(), Vulkan::StockSampler::LinearClamp);

The old binding models in GL/D3D11 have flat binding spaces with no separation of per-frame, per-material, or per-draw bindings. In Granite I wanted to take full advantage of the descriptor set feature where we can assign some kind of “frequency” and relation between bindings. Here is an example to illustrate how it is used: https://github.com/Themaister/Granite-MicroSamples/blob/master/05_descriptor_sets_and_binding_model.cpp.

At draw time, we can use the current pipeline layout, pull in the binding points which are active, and make sure we bind descriptor sets with the correct resources. This is actually hot code, so I spent time designing a nice system here which tries to be as optimal as possible given these restrictions.

Because of mobile, we need some conservative limits. I use 4 descriptor sets and 16 (dense) binding points per set (minimum spec of Vulkan). This allows for fairly compact pipeline layout descriptions, and we can loop over bitsets to look at resources. This is also just fine for my use cases.
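
Walking a bitset of active bindings is a classic CTZ loop; a sketch (the mask field name is made up):

uint32_t mask = layout.sets[set].uniform_buffer_mask;
while (mask)
{
	uint32_t binding = uint32_t(__builtin_ctz(mask)); // lowest active binding
	mask &= mask - 1u; // clear it and move on
	// ... write the descriptor for (set, binding) ...
}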

When it comes to allocation of descriptor sets themselves, I think I have a very different approach to most. A Vulkan::DescriptorSetAllocator is represented as:

  • The VkDescriptorSetLayout
  • A bunch of VkDescriptorPools which can only allocate VkDescriptorSets of this set layout. Pools are added on-demand.
  • A pool of unused VkDescriptorSets which are already allocated and can be freely updated.
  • A temporary hashmap which keeps track of which descriptor sets have been requested recently. This allows us to reuse descriptor sets directly. In the ideal case, we almost never actually need to call vkUpdateDescriptorSets. We end up with hash -> get VkDescriptorSet -> vkCmdBindDescriptorSets. When a descriptor set has not been used for a couple of frames (8), we assume that it is no longer relevant, the set is recycled, and some other request can reuse it with a fresh vkUpdateDescriptorSets. We definitely do not want to keep track of when any buffer or image resource is destroyed and recycle early. That's tracking hell which slows everything down.

The temporary hashmap is a data structure I’m quite happy with. It’s used for a few other resources as well. See https://github.com/Themaister/Granite/blob/master/util/temporary_hashmap.hpp for the implementation.
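
Putting the pieces together, the request flow at bind time looks roughly like this (illustrative pseudocode, not the actual code):

VkDescriptorSet request_descriptor_set(Util::Hash hash)
{
	// Hot path: a set with these exact bindings was requested recently.
	if (VkDescriptorSet set = temporary_hashmap.find(hash))
		return set;

	// Otherwise, grab a recycled (or freshly allocated) set and rewrite it.
	VkDescriptorSet set = allocate_or_recycle_set();
	write_descriptors(set); // the only place vkUpdateDescriptorSets happens
	temporary_hashmap.insert(hash, set);
	return set;
}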

On certain GPUs, allocating descriptor sets is, or at least used to be, very costly. The descriptor pools might not be implemented as true pools (sigh …), so every vkAllocateDescriptorSets would mean a global heap allocation, which is absolutely horrible for performance. This is the reason I'm not a big fan of the "one large pool" design. In this model, we allocate one massive VkDescriptorPool and allocate any kind of descriptor set from it. This means recycling VkDescriptorSet handles over many frames is impractical. The intended use pattern is to call vkResetDescriptorPool and allocate new descriptor sets which are only valid for one frame at a time, just like command buffers. There is also the problem of knowing how to balance the descriptor load for these massive pools: what's the ratio of image descriptors vs. uniform buffer descriptors, etc.? With per-descriptor-set-layout allocators, there is zero guesswork involved.

Alternative design – Bindless

Bindless is all the rage right now. The only real complaint I have is that it’s only supported on desktop and requires an EXT extension. It also means writing shaders in a very specific way. I don’t really need it for my use cases, but bindless enables certain complex algorithms which benefit from accessing a huge set of resources dynamically.

Alternative design – persistent explicit VkDescriptorSets

An alternative is exposing descriptor sets directly and only allow users to bind descriptor sets rather than individual resources. The API user would need to build the sets manually. While this is an idea, I think there are too many hurdles to make it practical.

  • We need to know and declare the target imageLayout of textures up front. This is obvious 99% of the time (e.g. a group of material textures which are SHADER_READ_ONLY_OPTIMAL), but in certain cases, especially with depth textures, things can get rather ambiguous. This does seem to me like an API design fault. It is unclear why this information is needed.
  • Some resources are completely transient in nature and it does not make sense to place them in persistent descriptor sets. The perfect example here is uniform buffers. In later samples, we’ll look at the linear allocator system for transient data.
  • Some resources depend on the frame buffer, i.e. input attachments. Baking descriptor sets for these resources is not obvious, since we need to know the combination pipeline layout + frame buffer, which should have nothing to do with each other.
  • We need to know the descriptor set layout (and by extension, the shaders as well) up-front. This is problematic if resources are to be used in more than one shader. The common fix here is to settle on a “standard” pipeline layout so we can decouple shaders and resources. This usually means a lot of padding and redundant descriptor allocations instead. We have a limited amount of descriptor sets when targeting mobile (4). We do not have the luxury of splitting every individual “group” of resources into their own sets, some combinatorial effects are inevitable, making persistent descriptor sets less practical. On desktop, 8 sets is the norm, so that might be something to consider.
  • Hybrid solutions are possible, but complexity is increased for little obvious gain.

Conclusion

I'm happy with my design. It's very easy to use, and there is a CPU price I'm willing to pay; I honestly never saw it in the profiler. I think resource binding models are a case where shaving overhead away will shave your sanity away as well, at least if you want to be compatible with a wide range of hardware. It's much easier if you only cater to high-end desktop where bindless can be deployed.

… up next!

Next up we will explore the linear allocators for uniform, vertex, index and staging data.

A tour of Granite’s Vulkan backend – Part 2

The life and death of objects

This is part 2 in a series where I explore Granite's Vulkan backend. See part 1 for an introduction. In this blog entry we will dive into code, and we will start with the basics. Our focus in this entry will be to discuss object lifetimes and how Granite deals with the Vulkan rule that you cannot delete objects which are in use by the GPU.

Sample code structure

I will be referring to concrete code samples from here on out. I have started a small code repository which contains all the samples. See README.md for how to build, but you won't need to run the samples to understand where I'm going with them. Stepping through the debugger can be rather helpful, however.

Sample 01 – Create a Vulkan device

Before we can do anything, we must create a VkDevice. This aspect of Vulkan is quite dull and full of boilerplate, as is the setup code for any graphics API. There is not a lot to cover from an API design perspective, but there are a few things to mention. The sample code for this part is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/01_device_creation.cpp

The API for this is pretty straightforward. I decided to split up how we load the Vulkan loader library, since there are two main use cases here:

  • User wants Granite to load libvulkan.so/dll/dylib from standard locations and bootstrap from there.
  • User wants to load an already provided function pointer to vkGetInstanceProcAddr. This is actually the common case, since GLFW loads the Vulkan loader for you dynamically and you can just use the GLFW-provided glfwGetInstanceProcAddress to bootstrap yourself. The volk loader has support for this.

To create the instance and device, we need to do the usual song and dance of creating a VkInstance and VkDevice:

  • Setup Vulkan debug callbacks
  • Identify and enable relevant extensions
  • Enable Vulkan validation layers in debug build
  • Find appropriate VkQueues to cover graphics, async compute, transfer

Vulkan::Context and Vulkan::Device

The Context owns the VkInstance and VkDevice, and Vulkan::Device borrows a VkDevice and manages the big objects which are created from it. It is possible to have multiple Vulkan::Device instances on top of a VkDevice, in which case we share the VkQueues and the global heaps for that device. This is a very nice property of Vulkan, since it allows frontend/backend systems like e.g. RetroArch/libretro to share a VkDevice without having hidden global state leak across the API boundary, which is a huge problem with legacy APIs like GL and D3D11.

Note that this sample, and all other samples in this chapter are completely headless. There is no WSI involved. Vulkan is really nice in that we don’t need to create window system contexts to do any GPU work.

02 – Creating objects

Creating new resources in a graphics API should be very easy, and here I spent a lot of time on convenience. Creating images and uploading data to them in raw Vulkan is a lot of work, since there are so many things you have to think about: creating staging buffers, copying through them, deferring deletion of the staging buffer, deciding whether to copy on the transfer queue or not, emitting semaphores to transfer ownership to the graphics queue, creating image views, and so on. All of this is very painful to write. Just creating an image in a solid way is several hundred lines of code. Fortunately, this kind of code is very easy to wrap in an API. See the sample: https://github.com/Themaister/Granite-MicroSamples/blob/master/02_object_creation.cpp, where we create a buffer and an image. I think the API is about as simple as you can make it while keeping a reasonable amount of flexibility.
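
For contrast, the whole dance collapses to a few lines in Granite (a sketch based on the sample; helper names may differ slightly):

auto info = Vulkan::ImageCreateInfo::immutable_2d_image(width, height, VK_FORMAT_R8G8B8A8_SRGB);
Vulkan::ImageInitialData initial = {};
initial.data = pixels;

// Staging buffer, transfer queue copy, ownership transfer and
// image view creation are all handled internally.
auto image = device.create_image(info, &initial);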

Memory management

When we allocate resources, we allocate it from Granite’s heap allocator for Vulkan. If I had done Granite today, I would just use AMD’s Vulkan Memory Allocator, but it did not exist at the time I designed my allocator, and I’m pretty happy with my design as it stands. Maybe if I need de-fragmentation in the future or some really complex memory management strategy, I’ll have to rethink and use a mature library.

To get a gist of the algorithms: Granite allocates 64 MB chunks, which are split into 32 sub-chunks. Each of those can then be subdivided into 32 smaller chunks, and so on, all the way down to 256-byte chunks. I made a cute little algorithm to allocate effectively from these blocks with CTZ operations and friends. It's a classic buddy allocator, except you have 32 buddies.
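
The CTZ trick is to keep a 32-bit occupancy mask per block, so finding the lowest free sub-chunk is a single instruction; a sketch:

// One bit per sub-chunk; 0 = free, 1 = taken.
uint32_t free_mask = ~occupied;
if (free_mask != 0)
{
	unsigned index = unsigned(__builtin_ctz(free_mask)); // lowest free sub-chunk
	occupied |= 1u << index;
	// ... sub-allocate from sub-chunk 'index', recursing into smaller chunks as needed ...
}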

There are also dedicated allocations. I use VK_KHR_dedicated_allocation to query if an image should be allocated with a separate vkAllocateMemory rather than being allocated from the heap. This is generally useful when allocating large frame buffers on certain architectures. Also, for allocations which exceed 64 MB, dedicated allocations are used.

Memory domains

A nice abstraction I made is that rather than dealing with memory types like DEVICE_LOCAL, HOST_VISIBLE, and the combination of all the possible types, I declare up-front where I like my buffers and images to reside. For buffers, there are 4 use cases:

  • Vulkan::BufferDomain::Device – Must reside on DEVICE_LOCAL_BIT memory. May or may not be host visible (discrete vs integrated GPUs).
  • Vulkan::BufferDomain::Host – Must be HOST_VISIBLE, preferably not CACHED. This is for uploads to the GPU.
  • Vulkan::BufferDomain::CachedHost – Must be HOST_VISIBLE and CACHED. Falls back to non-cached, though that should never happen. Might not be COHERENT. Used for readbacks from the GPU.
  • Vulkan::BufferDomain::LinkedDeviceHost – HOST_VISIBLE and DEVICE_LOCAL. This maps to AMD’s pinned PCI mapping, which is restricted to 256 MB. I don’t think I’ve ever actually used it, but it’s a niche option if I ever need it.

When uploading initial data to a buffer with the Device domain, we can take advantage of integrated GPUs which share memory with the CPU. In this case, we can avoid any staging buffer and just memcpy data straight into the new DEVICE_LOCAL memory. Don't just blindly use staging buffers when you don't need them. Integrated GPUs will generally have memory types which are both DEVICE_LOCAL and HOST_VISIBLE.

Mapping host memory

While not present in the sample, it makes sense to discuss how we map Vulkan memory to the CPU. A good general rule of thumb is to keep host memory persistently mapped. vkMapMemory and vkUnmapMemory are quite expensive, especially on mobile, and we can only have one active mapping of a VkDeviceMemory (64 MB with tons of suballocations!) at any time. Rather than mapping and unmapping all the time, we implement map/unmap in Vulkan::Device by just checking if we need to perform cache maintenance, with no extra CPU cost. On map(), for example, we need to call vkInvalidateMappedMemoryRanges if the memory type is not COHERENT, and on unmap() we call vkFlushMappedMemoryRanges if the memory is not COHERENT. This is fairly common on mobile when doing readbacks from the GPU, since we need CACHED, but we might not get COHERENT. Granite's backend abstracts all of this away.
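
Under the hood, the cache maintenance boils down to the standard Vulkan calls; a sketch, with is_coherent standing in for the memory type check:

VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
range.memory = memory;
range.offset = offset;
range.size = size;

// map() for readbacks: make GPU writes visible to the CPU.
if (!is_coherent)
	vkInvalidateMappedMemoryRanges(device, 1, &range);

// unmap() after CPU writes: make them visible to the GPU.
if (!is_coherent)
	vkFlushMappedMemoryRanges(device, 1, &range);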

Physical and transient image memory

A very powerful feature of Vulkan is the support for TRANSIENT images. These images do not have to be backed by physical memory, which is very nice for tile-based mobile renderers.

In Granite I fully support transient images, and I can pass in two different domains for images, Physical and Transient. Since transient images are generally used for throw-away scenarios, there is a convenient method, Vulkan::Device::get_transient_attachment(), to simply request a transient image with a format and resolution for rendering. Transient images are generally never created manually since they are so easy to manage internally.

Handle types

There are many ways to abstract handle types in general, but I went for my own "smart pointer" variant, the trusty intrusive ref-counted pointer. It can basically be thought of as a std::shared_ptr, but simpler, and we can pool the allocations of handles very nicely. How we design these handle types is not really important for Vulkan, but I figured this point would generate some questions, so I'm addressing it here. See https://github.com/Themaister/Granite/blob/master/util/intrusive.hpp for details.

03 – Deferring deletions of GPU resources

Now we're getting into topics where there can be significant design differences between Vulkan backends. My design philosophy for a middle-level abstraction is to be convenient, deterministic and good enough, at the cost of the theoretically optimal solution.

A common theme you’ll find in Granite is the use of RAII. Once lifetimes of objects end, we automatically clean up resources. This is nothing new to C++ programmers, but the big problem in Vulkan is we’re not managing just memory on CPU with new/delete. We actually need to carefully control when things are deleted, since the GPU might be using the resources we are freeing. The strategy here will be to defer any deletions. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/03_frame_contexts.cpp

The frame context

In order to handle object lifetimes in Granite, I have a concept of a frame context. The frame context is responsible for holding all resources which belong to a "frame" of work. Normally this corresponds to a frame of work between AcquireNextImage and QueuePresent, but it is not tightly coupled. The Device has an array of frame contexts, usually 2 of them to allow double-buffering between CPU and GPU (and 3 on Android, because TBDR GPUs are a bit more pipelined and tend to prefer a little more buffering). The frame context is basically a huge data structure which holds data like:

  • Which VkFences must be waited on to make sure that all GPU work associated with this frame context is done. This is the gatekeeper which holds all our recycling and deletions back.
  • Command pools for each worker thread and queue types.
  • VkBuffers, VkImages, etc, to be deleted once the fences signal.
  • Memory allocations from heap allocator to be freed.
  • … and various other resources.

Basically, we have a central place to chuck anything which needs to happen "later", when the GPU is guaranteed to be done with this frame.
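
A simplified sketch of the mechanism (illustrative, not the actual code):

void Device::destroy_buffer_deferred(VkBuffer buffer)
{
	// No tracking; just remember to delete it when this frame's fences signal.
	frame().destroyed_buffers.push_back(buffer);
}

void Device::begin_frame_context()
{
	auto &f = frame();
	vkWaitForFences(device, uint32_t(f.fences.size()), f.fences.data(), VK_TRUE, UINT64_MAX);
	for (auto buffer : f.destroyed_buffers)
		vkDestroyBuffer(device, buffer, nullptr);
	f.destroyed_buffers.clear();
	// ... reset command pools, recycle allocators, etc.
}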

As a special consideration, the big fat “make it go slow” call Device::wait_idle() will automatically clean up everything in one go since it knows at this instant the GPU is not doing anything.

Command buffer lifetime compromise

To make the frame based cleanup work in practice, we need to simplify our notion of what command buffers can do. In Vulkan, we have the flexibility to record command buffers once and reuse them at will at any time. This creates some complications. First of all, it throws the idea of a per-frame command pool out of the window. We can never reset the command pool in that case, since there will be free-floating command buffers out there which might be used later. Command pools work their best in Vulkan when you don’t allow individual command buffers to be freed.

If we have reusable command buffers, we also have the problem of object lifetimes. We end up with a painful situation where GPU resources must be retained until all command buffers which reference them are discarded. This leads to a really difficult situation where you have two options – deep reference-counting per command buffer or just pray all of this works out and make sure objects are kept alive as long as necessary. The former option is very costly and bug-prone, and the latter is juggling with razor blades too much for my taste where a large, meaningless burden is placed on the user.

I generally don’t think reusable command buffers are a worthwhile idea, at least not for interactive applications where we’re not submitting a static workload to the GPU over and over. There just aren’t many reasonable use-cases where this gives you anything meaningful. The avenues where you can submit the same calls over and over are maybe restricted to post-processing, but recording a few draw calls which render a few full-screen quads (or compute dispatches for the cool kids) is not exactly where your draw call overhead is going to matter.

I find that beginners obsess over the idea of aggressive reuse a little too much. In the end I feel it is misguided, and there are many better places to spend your time. Recording command buffers itself in Vulkan is super efficient.

My ideal use of command buffers is as light-weight handles which all source their memory linearly from a common command pool. No reuse, so we use ONE_TIME_SUBMIT_BIT and TRANSIENT_BIT on the pool.

In Granite, I greatly simplified the idea of command buffers into transient handles which are requested, recorded and submitted. They must be recorded and submitted in the same frame context you requested the command buffer. This way we remove the whole need for keeping track of objects per-command buffers. Instead, we just tie the resource destruction to a frame context, and that’s it. No need for complicated tracking, it’s very efficient, but we risk destroying the object a little later than is theoretically optimal. This could potentially increase memory pressure in certain situations, but I think the trade-off I made is good. If needed, I can always add explicit “delete this resource now, I know it’s safe” methods, but I haven’t found any need for this. This would only be important if we are truly memory bound.

A design decision I made was that I would never want to do internal ref-counts for resources like images and buffers, and the design would be forced to not rely on fine-grained tracking which you would typically find in legacy API implementations. Any ref-counted operations should be immediately visible to API users and never be hidden behind API implementations. In fact, command buffer arguments for binding resources are plain references or pointers, not ref-counted types.

The memory pressure of very large frames

The main flaw of this design is that a single spurious frame context might see extreme amounts of resource creation and deletion. A prime example here is loading screens or similar. Since Vulkan resources are not freed until the frame context itself is complete, we cannot recycle memory within a frame unless we explicitly iterate the frame context with Device::next_frame_context(). This trade-off means that the Granite backend does not have to heuristically stall the GPU in order to reclaim memory at suitable times, which would add a lot of complexity and ruin the determinism of Granite's behavior.

… up next!

In the next episode of Granite shenanigans we will look at the shader pipeline where we discuss VkShaderModule, VkDescriptorSetLayout, VkPipelineLayout and VkPipeline objects.

A tour of Granite’s Vulkan backend – Part 1

Introduction

Since January 2017, I’ve been working on my little engine project, which I call Granite. It’s on Github here. Like many others, I felt I needed to write a little engine for myself to fully learn Vulkan and I needed a test bed to implement various graphics techniques. I’ve been steadily working on it since then and used it as the backbone for many side-projects, but I think for others its value right now is for teaching Vulkan concepts by example.

A while back I wrote a blog about my render graph implementation. The render graph sits on top of the Vulkan implementation, but in this series I would like to focus on the Vulkan layer itself.

The motivation for a useful mid-level abstraction

One thing I've noticed in the Twitter-sphere and various panel discussions over the last years is the idea of the mid-level abstraction. Where GL and D3D11 are too high-level and inflexible for our needs in non-trivial applications, Vulkan and D3D12 tend to overshoot in low-level complexity, with the goal of being as low-level and explicit as possible while staying GPU-architecture- and OS-portable. I think everyone agrees that having a good mid-level abstraction is important, but the problem we always face when designing these layers is where to make the right trade-offs. There will always be those who chase maximum possible performance, even if complexity shoots through the roof when using the abstraction.

For Granite I always wanted to promote convenience, while avoiding the worst penalties in performance. The good old 80/20 rule basically. There are many, many opportunities in Vulkan to not do redundant CPU work, but at what cost? Is it worth architecting yourself into a diamond – a super solid, but in the end, inflexible implementation? I’m noticing a lot of angst in general around this topic, especially among beginners. A general fear of not chasing every last possible performance optimization because it “might be really important later” is probably why we haven’t seen a standard, mid-level graphics API yet in wide use.

I feel that the benefits you gain by designing for maximum possible CPU performance are more theoretical design exercises than practical ones. Even naive, straightforward, single-threaded Vulkan blows GL/GLES out of the water in CPU overhead in my experience, simply because we can pick and choose the extra work we need to do, while legacy driver stacks have built up cruft over a decade or more to support all kinds of weird use cases and heuristics. Add multi-threading on top of that, and then you can think about micro-optimizing API overhead, if you actually need it. I suspect you won't even need multi-threaded Vulkan. I believe the real challenge with the modern APIs right now is optimizing GPU performance, not CPU.

Metal is getting a lot of praise for its successful trade-off in performance and usability. While I don’t know the API itself in detail to make a judgement (I know all the horrors of Metal Shading Language though cough), I get the impression that the mid-level abstraction is the abstraction level we should be working in 99% of the time.

I think Granite is one such implementation. I am not trying to propose that Granite is the solution, but it is one of them. The design space is massive. There just cannot possibly be a one true graphics API for all users. Rather than suggest you go out and use it directly, I will try to explain how I designed a Vulkan interface which is quite convenient to use and runs well on both desktop and mobile (very few projects consider both), at least for my use cases. Ideally, you should be inspired to make the mid-level abstraction that is right for you and your project. I have gone through a couple of iterations to get where I am now with the design, and used it for various projects, so I think it’s a good starting point at least.

The 3D-accelerated emulation use case

Granite actually got started as the Vulkan backend in Beetle PSX HW renderer. I wrote up a Vulkan backend, and emulators need very immediate and flexible ways of using graphics APIs. Information is generally known only at the last minute. Being able to implement such projects guided Granite's initial design process quite a lot. This is also a case where legacy APIs are really painful, since you really need the flexibility of modern APIs to do a good job with performance. There are a lot of state changes and draw calls on top of the CPU cost of emulation itself. Creating resources and modifying data on the GPU in weird ways is a common case in emulation, and many drivers simply don't understand these usage patterns, so we hit painful slow-paths everywhere. With Vulkan there is little to no magic; we just implement things how we want, and performance ends up far more predictable.

I think many forget that Vulkan is not just for big (AAA) game engines. We can successfully use it for all kinds of things. We just need the right abstractions and knowledge.

How the design and implementation will be explored

To start off, we will explore the design through commented code samples, which use only the Vulkan portion of Granite as a library. We will write concrete samples of code, and then go through how all of this works, and then discuss how things could be designed differently.

… up next!

I haven’t written up any samples yet, so it makes sense to stop here. Next time, we’ll start with some samples.