Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. It’s not just uniform buffers, it’s fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL 4.5’s GL_ARB_buffer_storage, which supports persistently mapped buffers like Vulkan, but relying on GL 4.5 (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite’s main codebase normally uses GLFW, so to get a less redundant sample working, I wrote this sample to use SDL2’s Vulkan support, which is very similar to GLFW’s support starting with SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, The WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us and owns a VkSwapchainKHR. The only thing it cannot on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2’s built in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially and automatically handling synchronization for WSI resources is straight-forward to the point that there is no point in exposing that to the user. (Synchronization in Granite is in general very explicit, but WSI is one of few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
device.submit(cmd);
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI code is must to abstract in Vulkan, and I’m happy with the flexibility and simplicity in use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, now we’re getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. This is the perfect allocator for when we want to allocate a lot of small blobs, and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset to 0 again, i.e. all memory in one go is just “winked away”.

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight and once exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious limit to use (very content and scene dependent). I opted for a recycled pool of smaller buffers instead since Granite’s descriptor binding model is very flexible as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory can be accessed over PCI-e at a far slower rate. For staging data like vertex, index and uniform buffers, it might be reasonable to assume that we should copy the CPU-side to GPU-side and let the GPU consume the streamed data in fast VRAM. Granite supports two modes where we let the GPU read data read-only from HOST_VISIBLE, and one where we automatically perform staging buffer copies over to GPU from the CPU buffer.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on persistently allocated buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.
cmd->draw_indexed();

// Pointers are now invalidated.
device.submit(cmd);

The allocation functions are just light wrappers which allocate, and bind the buffer at the appropriate offset. It’s perfectly possible to roll your own linear allocation system, e.g. you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.

Conclusion

I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.