Yet another blog explaining Vulkan synchronization

After playing Fire Emblem: Three Houses for an ungodly 160 hours over the past weeks, I guess it’s time to put on my professor hat on the internet instead.

One topic I’ve been meaning to write about for a long time is synchronization in Vulkan. It’s a large hurdle to overcome when learning the API, and rather than mechanically explaining how it works, my goal here is to instill a mental model in the reader. Despite its reputation for maddening complexity, it is actually understandable and quite logical once you get over the initial hurdles.

Where appropriate, I will use terms which match the Vulkan specification.

The Vulkan queue

For this part of the discussion we will only consider a single VkQueue. There is a lot to consider with just a single queue, and dealing with multiple queues is a small extension on top of single-queue synchronization, which is covered at the end when discussing semaphores.

The Vulkan queue is simply an abstraction where command buffers are submitted and the GPU churns through commands. Let’s get some common beginner mistakes out of the way first.

Command buffer misconceptions

Many developers seem to think that command buffer boundaries are somehow special in Vulkan. It is very important to clarify that for purposes of synchronization, everything submitted to a queue is simply a linear stream of commands. Any synchronization applies globally to the VkQueue; there is no such thing as synchronization which only applies inside a single command buffer.

Command overlap

The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient wording to make the spec easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order. Reordering may happen across command buffers and even across vkQueueSubmits. This makes sense, considering that Vulkan only sees a linear stream of commands once you submit; it is a pitfall to assume that splitting command buffers or submits adds some magic synchronization for you.

NOTE: Unlike Vulkan, I do believe D3D12 disables any overlap across queue submits, but don’t quote me on that. Might be something to consider if you’re coming from D3D-land.

NOTE: Frame buffer operations inside a render pass happen in API-order, of course. This is a special exception which the spec calls out.

Pipeline stages

Every command you submit to Vulkan goes through a set of stages. These stages are represented by the VK_PIPELINE_STAGE_* flags (see chapter 6.1.2 in the spec). When we synchronize work in Vulkan, we synchronize work happening in these pipeline stages as a whole, not individual commands of work.

Draw calls, copy commands and compute dispatches all go through pipeline stages one by one.

The mysterious TOP_OF_PIPE and BOTTOM_OF_PIPE stages

A common stumbling block is the TOP_OF_PIPE and BOTTOM_OF_PIPE stages. These are essentially “helper” stages, which do no actual work, but serve some important purposes. Every command will first execute the TOP_OF_PIPE stage. This is basically the command processor on the GPU parsing the command. BOTTOM_OF_PIPE is where commands retire after all work has been done. TOP_OF_PIPE and BOTTOM_OF_PIPE are useful in specific scenarios; keep them in mind for later, as they are a little tricky and beginners make many mistakes with them.

In-queue execution barriers

Before we tackle memory barriers, we must fully understand execution barriers, as they are a subset of memory barriers. The primary mechanism in Vulkan to introduce execution barriers is the pipeline barrier.

To introduce the simplest form of an execution dependency we use a pipeline barrier:

void vkCmdPipelineBarrier(
    VkCommandBuffer                             commandBuffer,
    VkPipelineStageFlags                        srcStageMask,
    VkPipelineStageFlags                        dstStageMask,
    VkDependencyFlags                           dependencyFlags,
    uint32_t                                    memoryBarrierCount,
    const VkMemoryBarrier*                      pMemoryBarriers,
    uint32_t                                    bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier*                pBufferMemoryBarriers,
    uint32_t                                    imageMemoryBarrierCount,
    const VkImageMemoryBarrier*                 pImageMemoryBarriers);

If we ignore the memory barriers and flags here, we’re essentially left with two arguments, srcStageMask and dstStageMask. This represents the heart of the Vulkan synchronization model. We’re splitting the command stream in two with a barrier, where we consider “everything before” the barrier, and “everything after” the barrier, and these two halves are synchronized in some way.

Section 6.1 lays this out in rather obtuse language, but it boils down to:

srcStageMask

This represents what we are waiting for. Vulkan does not let you add fine-grained dependencies between individual commands. Instead you get to look at all work which happens in certain pipeline stages. For example, if we were to submit this series of commands starting off a fresh VkDevice:

  • vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
  • vkCmdCopyBuffer (VK_PIPELINE_STAGE_TRANSFER_BIT)
  • vkCmdDispatch (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)
  • vkCmdPipelineBarrier (srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT)

We would be referring to the two vkCmdDispatch commands, as they perform their work in the COMPUTE stage. Even if we split these 4 commands into 4 different vkQueueSubmits, we would still consider the same commands for synchronization. Essentially, the work we are waiting for is all commands which have ever been submitted to the queue, including any previous commands in the command buffer we’re recording. srcStageMask then restricts the scope of what we are waiting for: only work happening in the COMPUTE_SHADER_BIT stage is relevant in this example. srcStageMask is a bit-mask as the name suggests, so it’s perfectly fine to wait for both COMPUTE and TRANSFER work.

There are also flags to refer to “all commands”, ALL_COMMANDS_BIT, which basically drains the entire queue for work. ALL_GRAPHICS_BIT is the same, but only for render passes.

NOTE: Here we find a potential use case for TOP_OF_PIPE. A srcStageMask of TOP_OF_PIPE is basically saying “wait for nothing”, or to be more precise, we’re waiting for the GPU to parse all commands, which is a complete no-op; we had to parse all commands before getting to the pipeline barrier command to begin with. When we get to memory barriers, this will be very useful.

dstStageMask

This represents the second half of the barrier. Any work submitted after this barrier will need to wait for the work represented by srcStageMask before it can execute, but only work in the specified stages is affected. For example, if dstStageMask is FRAGMENT_SHADER_BIT, vertex shading for future commands can begin executing early; we only need to wait once the FRAGMENT_SHADER_BIT stage is reached.

NOTE: As an analog to srcStageMask with TOP_OF_PIPE, using BOTTOM_OF_PIPE for dstStageMask can be kind of useful. It translates to “block the last stage of execution in the pipeline”, which effectively means “no work after this barrier is going to wait for us”. This might seem meaningless, but it will be useful when we discuss semaphores and memory barriers later.

A crude example

Let’s assume we record and submit some commands on a fresh VkDevice:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdDispatch
  4. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = COMPUTE)
  5. vkCmdDispatch
  6. vkCmdDispatch
  7. vkCmdDispatch

With this barrier, the “before” set is commands {1, 2, 3}. The “after” set is {5, 6, 7}. A possible execution order here could be:

  • #3
  • #2
  • #1
  • #7
  • #6
  • #5

{1, 2, 3} can execute out-of-order, and so can {5, 6, 7}, but these two sets of commands cannot interleave execution. In spec lingo, {1, 2, 3} happens-before {5, 6, 7}.
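
As a rough sketch of what step 4 looks like in code (assuming cmd is the VkCommandBuffer being recorded; the dispatch sizes are arbitrary), this is an execution-only barrier, so all the memory barrier counts are zero:

vkCmdDispatch(cmd, 64, 1, 1);
vkCmdDispatch(cmd, 64, 1, 1);
vkCmdDispatch(cmd, 64, 1, 1);

// Execution-only barrier: compute-to-compute, no memory barriers attached.
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // srcStageMask: wait for all prior compute work
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,   // dstStageMask: block all later compute work
    0,                                      // dependencyFlags
    0, NULL, 0, NULL, 0, NULL);             // no global/buffer/image memory barriers

vkCmdDispatch(cmd, 64, 1, 1);
vkCmdDispatch(cmd, 64, 1, 1);
vkCmdDispatch(cmd, 64, 1, 1);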

https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples has some examples of how these stages are used in practice.

Events aka. split barriers

Vulkan provides a way to get overlapping work in-between barriers. The idea of VkEvent is to get some unrelated commands in-between the “before” and “after” set of commands, e.g.:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdSetEvent(event, srcStageMask = COMPUTE)
  4. vkCmdDispatch
  5. vkCmdWaitEvent(event, dstStageMask = COMPUTE)
  6. vkCmdDispatch
  7. vkCmdDispatch

The “before” set is now {1, 2}, and the “after” set is {6, 7}. Command 4 is not affected by any synchronization and can fill in the parallelism “bubble” we get while the GPU drains the work from {1, 2}. For advanced compute, this is a very important thing to know about, but not all GPUs and drivers can take advantage of this feature.
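
A sketch of how this looks in code (assuming event is a VkEvent created earlier with vkCreateEvent, and cmd is the command buffer being recorded):

vkCmdDispatch(cmd, 64, 1, 1);                  // 1
vkCmdDispatch(cmd, 64, 1, 1);                  // 2

// 3: signal the event once all compute work submitted so far is done.
vkCmdSetEvent(cmd, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

vkCmdDispatch(cmd, 64, 1, 1);                  // 4: free to overlap with 1 and 2

// 5: block later compute work until the event is signaled.
vkCmdWaitEvents(cmd, 1, &event,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,      // srcStageMask, matches the vkCmdSetEvent stage
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,      // dstStageMask
    0, NULL, 0, NULL, 0, NULL);                // execution-only in this sketch

vkCmdDispatch(cmd, 64, 1, 1);                  // 6
vkCmdDispatch(cmd, 64, 1, 1);                  // 7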

Execution dependency chain

This is a subtle – but very important – point which I don’t think is well enough understood. The general gist of it is that when we use dstStageMask to block stages, the dependencies in srcStageMask are carried forward into the blocked stages. Waiting for dstStageMask later will also wait for any dependencies dstStageMask had. It is easier to show an example here:

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
  4. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
  5. vkCmdDispatch
  6. vkCmdDispatch

In this example we actually get a dependency between {1, 2} and {5, 6}. This is because we created a chain of dependencies: COMPUTE -> TRANSFER -> COMPUTE. When we wait for TRANSFER in command 4, we must also wait for anything which is currently blocking TRANSFER. This might seem confusing, but it makes sense if we consider a slightly modified example.

  1. vkCmdDispatch
  2. vkCmdDispatch
  3. vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER)
  4. vkCmdMagicDummyTransferOperation
  5. vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE)
  6. vkCmdDispatch
  7. vkCmdDispatch

In this scenario, it’s clear that {4} must wait for {1, 2}, and {6, 7} must wait for {4}. So we have created a chain where {1, 2} -> {4} -> {6, 7}, and as {4} is a no-op, {1, 2} -> {6, 7} is achieved. That’s essentially the chain.

This has some uses when you want to “link up” barriers for whatever reason. I kinda wish Vulkan had some special “scoreboard” pipeline stages just for this use case …

Pipeline stages and render passes

COMPUTE and TRANSFER work is very simple when it comes to pipeline stages. The only stages they execute are:

  • TOP_OF_PIPE
  • DRAW_INDIRECT (for indirect compute only)
  • COMPUTE / TRANSFER
  • BOTTOM_OF_PIPE

Render passes are a bit more intricate, and it’s very easy to confuse which pipeline stages do what.

In render passes there are two “families” of pipeline stages, those which concern themselves with geometry processing, and the fragment family, which does rasterization / frame buffer operations.

Aside from TOP_OF_PIPE/BOTTOM_OF_PIPE, we have

Geometry

  • DRAW_INDIRECT – Parses indirect buffers
  • VERTEX_INPUT – Consumes fixed function VBOs and IBOs
  • VERTEX_SHADER – Actual vertex shader
  • TESSELLATION_CONTROL_SHADER
  • TESSELLATION_EVALUATION_SHADER
  • GEOMETRY_SHADER

Fragment

  • EARLY_FRAGMENT_TESTS
  • FRAGMENT_SHADER
  • LATE_FRAGMENT_TESTS
  • COLOR_ATTACHMENT_OUTPUT

For the most part, it’s the fragment stages which are a bit confusing. Each of them has its own use cases.

EARLY_FRAGMENT_TESTS

This is the stage where early depth/stencil tests happen. This stage isn’t all that useful or meaningful except in some very obscure scenarios with frame buffer self-dependencies (aka, GL_ARB_texture_barrier). This is also where a render pass performs its loadOp of a depth/stencil attachment.

LATE_FRAGMENT_TESTS

This is where late depth-stencil tests take place, and also where depth-stencil attachments are stored with storeOp when a render pass is done.

Helpful tip on fragment test stages

It’s somewhat confusing to have two stages which basically do the same thing. When you’re waiting for a depth map to have been rendered in an earlier render pass, you should use srcStageMask = LATE_FRAGMENT_TESTS_BIT, as that will wait for the storeOp to finish its work.

When blocking a render pass with dstStageMask, just use a mask of EARLY_FRAGMENT_TESTS | LATE_FRAGMENT_TESTS.

NOTE: dstStageMask = EARLY_FRAGMENT_TESTS alone might work since that will block loadOp, but there might be shenanigans with memory barriers if you are 100% pedantic about any memory access happening in LATE_FRAGMENT_TESTS. If you’re blocking an early stage, it never hurts to block a later stage as well.
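
As a minimal execution-only sketch (a real barrier would also carry the appropriate memory/image barrier, covered below), waiting for a depth-only pass before another render pass touches that depth attachment could look like:

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT,     // waits for depth writes and storeOp
    VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT |
        VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT, // blocks loadOp and all depth/stencil tests
    0, 0, NULL, 0, NULL, 0, NULL);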

COLOR_ATTACHMENT_OUTPUT

This one is where loadOp, storeOp, MSAA resolves and the framebuffer blend stage take place; basically anything which touches a color attachment in a render pass in some way. If you’re waiting for a render pass which uses color to be complete, use srcStageMask = COLOR_ATTACHMENT_OUTPUT, and similarly use dstStageMask = COLOR_ATTACHMENT_OUTPUT when blocking render passes from executing.

Memory barriers

Now that we have the basics for execution barriers, we can kick it up a notch and consider memory barriers.

Execution order and memory order are two different things. GPUs are notorious for having multiple, incoherent caches which all need to be carefully managed to avoid glitched out rendering. This means that just synchronizing execution alone is not enough to ensure that different units on the GPU can transfer data between themselves.

If you are familiar with how C++11 introduced memory order and atomics, that is a good start, but to my knowledge the C++11 memory model does not consider that memory access can be incoherent. All CPU memory is assumed to be coherent, but memory ordering is weak on basically anything non-x86. Vulkan expands on this concept.

The two concepts in the Vulkan specification we need to understand are memory being available and memory being visible. This is an abstraction over the fact that GPUs have incoherent caches. To explain this, I will describe a mental model of a hypothetical GPU design which should make sense if you are familiar with how caches work.

NOTE: There is a formal Vulkan memory model now which covers all of this in extreme detail. I admit I have not studied it enough to make references to it here, but developers really don’t need to know that level of detail to use Vulkan correctly.

The L2 cache / main memory

We will let the last level of the cache hierarchy represent the “master” memory controller which all caches are connected to. This L2 cache is connected to all the L1 caches and to external DDR memory. The GPU DDR memory is connected to the CPU memory controller in some way (PCI-e or UMA).

When our L2 cache contains the most up-to-date data there is, we can say that memory is available, because L1 caches connected to L2 can pull in the most up-to-date data there is.

Incoherent L1 caches

Vulkan specifies a bunch of flags in the VK_ACCESS_ series of enums. These flags represent memory access which can be performed. Each pipeline stage can perform certain memory access, and thus we take the combination of pipeline stage + access mask and we get potentially a very large number of incoherent caches on the system. Each GPU core has its own set of L1 caches as well.

Of course, real GPUs will only have a fraction of the possible caches here, but as long as we are explicit about this in the API, any GPU driver can simplify this as needed.

Under section 6.1.3, table 4 in the Vulkan spec you can see a list of all possible access masks which can be used with a pipeline stage. These access masks either read from a cache, or write to an L1 cache in our mental model.

We say that memory is visible to a particular stage + access combination if memory has been made available and we then make that memory visible to the relevant stage + access mask.

Once a shader stage writes to memory, the L2 cache no longer has the most up-to-date data there is, so that memory is no longer considered available. If other caches try to read from L2, they will see stale data. Whatever wrote that data must make those writes available before the data can be made visible again.

Cache flush and invalidate

To be clear, we can say that “making memory available” is all about flushing caches, and “making memory visible” is invalidating caches. This should make it more obvious what is going on. I will use the spec terminology however.

VkMemoryBarrier

If we revisit vkCmdPipelineBarrier, we can pass in a list of global memory barriers.

typedef struct VkMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
} VkMemoryBarrier;

A global memory barrier deals with access to any resource, and it’s the simplest form of a memory barrier. This means that in vkCmdPipelineBarrier, we are specifying 4 things to happen in order:

  • Wait for srcStageMask to complete
  • Make all writes performed in possible combinations of srcStageMask + srcAccessMask available
  • Make available memory visible to possible combinations of dstStageMask + dstAccessMask.
  • Unblock work in dstStageMask.

A common misconception I see is that _READ flags are passed into srcAccessMask, but this is redundant. It does not make sense to make reads available, i.e. you don’t flush caches when you’re done reading data.
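
A sketch of the common compute-write-then-compute-read case with a global memory barrier (cmd is the command buffer being recorded):

VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;  // make compute shader writes available
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;   // make them visible to compute shader reads

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,            // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,            // dstStageMask
    0,
    1, &barrier,                                     // one global memory barrier
    0, NULL, 0, NULL);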

Memory access and TOP_OF_PIPE/BOTTOM_OF_PIPE

Never use an access mask other than 0 with these stages. These stages do not perform memory accesses, so any srcAccessMask or dstAccessMask combined with either stage would be meaningless, and the spec disallows it. TOP_OF_PIPE and BOTTOM_OF_PIPE are purely there for the sake of execution barriers, not memory barriers.

Split memory barriers

A very important point here is that it’s perfectly possible to split up the “make available” and “make visible” operations.  This is similar to the execution dependency chain discussed earlier.

We can do something silly like:

  • vkCmdDispatch – writes to an SSBO, VK_ACCESS_SHADER_WRITE_BIT
  • vkCmdPipelineBarrier(srcStageMask = COMPUTE, dstStageMask = TRANSFER, srcAccessMask = SHADER_WRITE_BIT, dstAccessMask = 0)
  • vkCmdPipelineBarrier(srcStageMask = TRANSFER, dstStageMask = COMPUTE, srcAccessMask = 0, dstAccessMask = SHADER_READ_BIT)
  • vkCmdDispatch – read from the same SSBO, VK_ACCESS_SHADER_READ_BIT

While StageMask cannot be 0, AccessMask can be 0.

VkBufferMemoryBarrier

This one is not very interesting; we’re just restricting memory availability and visibility to a specific buffer. No GPU I know of actually cares, so I think it makes more sense to just use VkMemoryBarrier rather than bothering with buffer barriers.

VkImageMemoryBarrier

Unlike VkBufferMemoryBarrier, this one is critical. You have to change image layouts at some point and this is done as part of an image memory barrier.

typedef struct VkImageMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkImageLayout oldLayout;
    VkImageLayout newLayout;
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    VkImage image;
    VkImageSubresourceRange subresourceRange;
} VkImageMemoryBarrier;

The interesting bits are oldLayout and newLayout. The layout transition happens in-between the make available and make visible stages of a memory barrier. The layout transition itself is considered a read/write operation, and the rules are basically that memory for the image must be available before the layout transition takes place. After a layout transition, that memory is automatically made available (but not visible!). Basically, think of the layout transition as some kind of in-place data munging which happens in L2 cache somehow.

A practical TOP_OF_PIPE example

Now we can actually make a practical example with TOP_OF_PIPE. If we just allocated an image and want to start using it, what we want to do is to just perform a layout transition, but we don’t need to wait for anything in order to do this transition. This is where TOP_OF_PIPE is useful. Let’s say that we’re allocating a fresh image, and we’re going to use it in a compute shader as a storage image. The pipeline barrier looks like:

  • srcStageMask = TOP_OF_PIPE – Wait for nothing
  • dstStageMask = COMPUTE – Unblock compute after the layout transition is done
  • srcAccessMask = 0 – This is key, there are no pending writes to flush out. This is the only way to use TOP_OF_PIPE in a memory barrier. It’s important to note that freshly allocated memory in Vulkan is always considered available and visible to all stages and access types. You cannot have stale caches when the memory was never accessed … What about recycled/aliased memory you ask? Excellent question, we’ll cover that too later.
  • oldLayout = UNDEFINED – Input is garbage
  • newLayout = GENERAL – Storage image compatible layout
  • dstAccessMask = SHADER_READ | SHADER_WRITE
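
Put together as code, this might look like the following (assuming image is the freshly allocated VkImage with a single mip level and array layer, and cmd is the command buffer being recorded):

VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
barrier.srcAccessMask = 0;                           // nothing to make available
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;       // input is garbage
barrier.newLayout = VK_IMAGE_LAYOUT_GENERAL;         // storage image compatible layout
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,       // wait for nothing
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,    // unblock compute once the transition is done
    0, 0, NULL, 0, NULL, 1, &barrier);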

A practical BOTTOM_OF_PIPE example

My favourite example here is swapchain images. We have to transition them into VK_IMAGE_LAYOUT_PRESENT_SRC_KHR before passing the image over to the presentation engine.

After transitioning into this PRESENT layout, we’re not going to touch the image again until we reacquire the image, so dstStageMask = BOTTOM_OF_PIPE is appropriate.

  • srcStageMask = COLOR_ATTACHMENT_OUTPUT (assuming we rendered to swapchain in a render pass)
  • srcAccessMask = COLOR_ATTACHMENT_WRITE
  • oldLayout = COLOR_ATTACHMENT_OPTIMAL
  • newLayout = PRESENT_SRC_KHR
  • dstStageMask = BOTTOM_OF_PIPE
  • dstAccessMask = 0

Having dstStageMask = BOTTOM_OF_PIPE and access mask being 0 is perfectly fine. We don’t care about making this memory visible to any stage beyond this point. We will use semaphores to synchronize with the presentation engine anyways.
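
Expressed as a code sketch (swapchain_image being the acquired VkImage):

VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;  // make the rendering available
barrier.dstAccessMask = 0;                                     // nothing needs visibility here
barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = swapchain_image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // wait for rendering to the swapchain
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,           // nothing later in this queue waits on us
    0, 0, NULL, 0, NULL, 1, &barrier);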

Implicit memory ordering – semaphores and fences

Semaphores and fences are quite similar things in Vulkan, but serve a different purpose. Semaphores facilitate GPU <-> GPU synchronization across Vulkan queues, and fences facilitate GPU -> CPU synchronization.

These objects are signaled as part of a vkQueueSubmit. However, one very important thing to note about semaphores and fences is how they interact with memory. To signal a semaphore or fence, all previously submitted commands to the queue must complete. If this were a regular pipeline barrier, we would have srcStageMask = ALL_COMMANDS_BIT. However, we also get a full memory barrier, in the sense that all pending writes are made available. Essentially, srcAccessMask = MEMORY_WRITE_BIT.

Implicit memory guarantees on vkQueueSubmit

While signaling a fence or semaphore works like a full cache flush, submitting commands to the Vulkan queue makes all memory accesses performed by the host visible to all stages and access masks. Basically, submitting a batch issues a cache invalidation on host-visible memory. A common mistake is to think that you need to do this invalidation manually when the CPU is writing into staging buffers or similar, with something like:

  • srcStageMask = HOST
  • srcAccessMask = HOST_WRITE_BIT
  • dstStageMask = TRANSFER
  • dstAccessMask = TRANSFER_READ

If the write happened before vkQueueSubmit, this is automatically done for you.

NOTE: This kind of barrier is necessary if you are using vkCmdWaitEvents where you wait for host to signal the event with vkSetEvent. In that case, you might be writing the necessary host data after vkQueueSubmit was called, which means you need a pipeline barrier like this. This is not exactly a common use case, but it’s important to understand when these API constructs are useful.

Implicit memory guarantees when waiting for a semaphore

While signalling a semaphore makes all memory available, waiting for a semaphore makes memory visible. This basically means you do not need a memory barrier if you synchronize with semaphores, since a signal/wait pair of semaphores works like a full memory barrier. Let’s see an example where queue 1 writes to an SSBO in a compute shader, and queue 2 consumes that buffer as a UBO in a fragment shader. We’re going to assume the buffer was created with VK_SHARING_MODE_CONCURRENT.

NOTE: If you need to transfer ownership to a different queue family, you need memory barriers, one in each queue to release/acquire ownership.

Queue 1

  • vkCmdDispatch
  • vkQueueSubmit(signal = my_semaphore)

There is no pipeline barrier needed here. Signalling the semaphore waits for all commands, and all writes in the dispatch are made available to the device before the semaphore is actually signaled.

Queue 2

  • vkCmdBeginRenderPass
  • vkCmdDraw
  • vkCmdEndRenderPass
  • vkQueueSubmit(wait = my_semaphore, pWaitDstStageMask = FRAGMENT_SHADER)

When we wait for the semaphore, we specify which stages should wait for this semaphore, in this case the FRAGMENT_SHADER stage. All relevant memory access is automatically made visible, so we can safely access UNIFORM_READ_BIT in FRAGMENT_SHADER stage, without having extra barriers. The semaphores take care of this automatically, nice!
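
In raw Vulkan, the queue 2 submission might look roughly like this (my_semaphore and queue2 are the objects from the example above, and cmd is the recorded command buffer):

VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &my_semaphore;   // signaled by queue 1
submit.pWaitDstStageMask = &wait_stage;   // only fragment work has to wait
submit.commandBufferCount = 1;
submit.pCommandBuffers = &cmd;

vkQueueSubmit(queue2, 1, &submit, VK_NULL_HANDLE);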

Execution dependency chain with semaphore

Just like pipeline barriers have execution dependency chains, we can create execution dependency chains with semaphores as well. pWaitDstStageMask in vkQueueSubmit blocks certain stages from executing until the semaphore is signaled.

If we create a pipeline barrier with srcStageMask targeting one of the stages in the wait stage mask, we also wait for the semaphore to be signaled. This is extremely useful for doing image layout transitions on swapchain images. We need to wait for the image to be acquired, and only then can we perform a layout transition. The best way to do this is to use pWaitDstStageMask = COLOR_ATTACHMENT_OUTPUT_BIT, and then use srcStageMask = COLOR_ATTACHMENT_OUTPUT_BIT in a pipeline barrier which transitions the swapchain image after the semaphore is signaled.
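
A rough sketch of that swapchain transition, recorded in a command buffer which is submitted with pWaitDstStageMask = COLOR_ATTACHMENT_OUTPUT_BIT for the acquire semaphore:

VkImageMemoryBarrier barrier = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
barrier.srcAccessMask = 0;                       // the semaphore wait handles memory for us
barrier.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;   // we do not care about old contents
barrier.newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = swapchain_image;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // chains with the semaphore wait
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    0, 0, NULL, 0, NULL, 1, &barrier);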

Host memory reads

While signalling a fence makes all memory available, it does not make that memory available to the CPU, only within the device. This is where the dstStageMask = PIPELINE_STAGE_HOST and dstAccessMask = ACCESS_HOST_READ_BIT flags come in. If you intend to read back data on the CPU, you must issue a pipeline barrier which makes the memory available to the HOST as well. In our mental model, we can think of this as flushing the GPU L2 cache out to GPU main memory, so that the CPU can access it over some bus interface.
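
A sketch of such a readback barrier, assuming a vkCmdCopyBuffer into a HOST_VISIBLE buffer was recorded just before it:

VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;  // the copy into the readback buffer
barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;       // make it visible to the CPU

vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_HOST_BIT,
    0, 1, &barrier, 0, NULL, 0, NULL);

// Submit with a fence; once the fence signals, the host can read the data
// (via vkInvalidateMappedMemoryRanges first if the memory is not HOST_COHERENT).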

Safely recycling memory and aliasing memory

We earlier had an example of creating a fresh VkImage, transitioning it from UNDEFINED and waiting for TOP_OF_PIPE. As explained, we did not need to specify any srcAccessMask since we knew that memory was guaranteed to be available. The reason is the implicit guarantee we get from signalling a fence. In order to recycle memory, we must have observed that the GPU was done using the image with a fence. In order to signal that fence, any pending writes to that memory must have been made available, so even recycled memory can be safely reused without a memory barrier. This point is kind of subtle, but it really helps your sanity not having to inject memory barriers everywhere.

However, what if we consider we want to alias memory inside a command buffer? The rule here is that in order to safely alias, all memory access from the active alias must be made available before a new alias can take its place. Here’s an example for a case where we have two VkImages which are used in two render passes, and they alias memory. When one image alias is written to, all other images immediately become “undefined”. There are some exceptions in the specification for when multiple aliases can be valid at the same time, but for now we assume that is not the case.

  • vkCmdPipelineBarrier(image = image1, oldLayout = UNDEFINED, newLayout = COLOR_ATTACHMENT_OPTIMAL, srcStageMask = COLOR_ATTACHMENT_OUTPUT, srcAccessMask = COLOR_ATTACHMENT_WRITE, dstStageMask = COLOR_ATTACHMENT_OUTPUT, dstAccessMask = COLOR_ATTACHMENT_WRITE|READ)

image1 will contain garbage here so we need to transition away from UNDEFINED. We need to make any pending writes to COLOR_ATTACHMENT_WRITE available before the layout transition takes place, assuming that we’re running these commands every frame. The following render pass will wait for this transition to take place using dstStageMask/dstAccessMask.

  • vkCmdBeginRenderPass/EndRenderPass
  • vkCmdPipelineBarrier(image = image2,  …)
  • vkCmdBeginRenderPass/EndRenderPass

image1 was written to, so image2 was invalidated. Similar to the pipeline barrier for image1, we need to transition away from UNDEFINED. We need to make sure any write to image1 is made available before we can perform the transition. Next frame, image1 needs to take ownership again, and so on.

External subpass dependencies

Render passes in Vulkan have a concept of EXTERNAL subpass dependencies. This is arguably the most misunderstood aspect of Vulkan sync. I’d like to dedicate a section to this, because too many developers are lured into using it in cases where it’s not terribly useful and very likely to cause bugs.

The main purpose of external subpass dependencies is to deal with initialLayout and finalLayout of an attachment reference. If initialLayout != layout used in the first subpass, the render pass is forced to perform a layout transition.

If you don’t specify anything else, that layout transition will wait for nothing before it performs the transition. Or rather, the driver will inject a dummy subpass dependency for you with srcStageMask = TOP_OF_PIPE_BIT. This is not what you want, since it’s almost certainly going to be a race condition. Instead, you can set up a subpass dependency with the appropriate srcStageMask and srcAccessMask. The external subpass dependency is basically just a vkCmdPipelineBarrier injected for you by the driver. The whole premise here is that it’s theoretically better to do it this way because the driver has more information, but this is questionable, at least on current hardware and drivers.

There is a very similar external subpass dependency setup for finalLayout. If finalLayout differs from the layout used in the last subpass, the driver will transition into the final layout automatically. Here you get to change dstStageMask/dstAccessMask. If you do nothing here, you get BOTTOM_OF_PIPE/0, which can actually be just fine. A prime use case here is swapchain images which have finalLayout = PRESENT_SRC_KHR.

Essentially, you can ignore external subpass dependencies. Their added complexity gives very little gain. Render pass compatibility rules also imply that if you change even minor things like which stages to wait for, you need to create new pipelines! This is dumb, and will hopefully be fixed at some point in the spec.

However, while the usefulness of external subpass dependencies is questionable, they have some convenient use cases I’d like to go over:

Automatically transitioning TRANSIENT_ATTACHMENT images

If you’re on mobile, you should be using transient images where possible. When using these attachments in a render pass, it makes sense to always have them as initialLayout = UNDEFINED. Since we know that these images can only ever be used in COLOR_ATTACHMENT_OUTPUT or EARLY/LATE_FRAGMENT_TEST stages depending on their image format, the external subpass dependency writes itself, and we can just use transient attachments without having to think too hard about how to synchronize them. This is what I do in my Granite engine, and it’s quite useful. Of course, we could just inject a pipeline barrier for this exact same purpose, but that’s more boilerplate.

Automatically transitioning swapchain images

Typically, swapchain images are always just used once per frame, and we can deal with all synchronization using external subpass dependencies. We want initialLayout = UNDEFINED, and finalLayout = PRESENT_SRC_KHR.

srcStageMask is COLOR_ATTACHMENT_OUTPUT which lets us link up with the swapchain acquire semaphore. For this case, we will need an external subpass dependency. For the finalLayout transition after the render pass, we are fine with BOTTOM_OF_PIPE being used. We’re going to use semaphores here anyways.

I also do this in Granite.
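
For reference, a swapchain-style external dependency for the initialLayout transition could be declared roughly like this and passed in VkRenderPassCreateInfo::pDependencies (a sketch, not Granite's exact code):

VkSubpassDependency dep = {};
dep.srcSubpass = VK_SUBPASS_EXTERNAL;
dep.dstSubpass = 0;
dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;  // links up with the acquire semaphore
dep.srcAccessMask = 0;                                             // the semaphore wait handles memory
dep.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;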

Conclusion

I hope this was useful. The post got a bit more mechanical than I hoped for, but it should be a more distilled version of the specification.

A tour of Granite’s Vulkan backend – Part 6

Pipelines – what is your pain tolerance?

A lot of thought goes into pipelines. Eager or lazy creation, dynamic or static render state. Forget about one size fits all. How close will you approach the volcano? Make sure there is no lava under your feet when you’re done.

My pain tolerance is kinda low, I’d rather watch it on TV. Granite is a bit similar, it prefers to be cooled off magma instead.

The ideal case

Vulkan is designed to let you forget about filthy, filthy render state management and work exclusively with pristine VkPipeline objects. These objects encode every possible choice you can make when flipping the fixed-function bits and bobs on the GPU.

Getting to a point when you only think in terms of VkPipelines, and all pipelines are compiled up front in load-time is a holy grail of modern graphics API implementation. Gone are the stutters, the hitches, the sad 100 ms glitches which throw you off guard when you peek around the wall.

To get there, you must sacrifice all notions of flexibility, no last minute decisions, everything must be planned out in detail ahead of time. There is a lot of state which is pulled together to form a VkPipeline, an all-star cast of colorful characters and a plot with a lot of depth.

… ahem, that got a bit weird.

Shader modules

Obviously, the core part of a pipeline is the shader modules, the Vulkan::Program in Granite. From the program we automatically know the VkPipelineLayout because of reflection, so no problems there.

Render pass

We also need to know the render pass (and subpass index!) in order to create a pipeline. This one can be really counter-intuitive. The shader compiler often needs to know which render target formats are in use in order to generate final ISA. This is where we start running into problems. There is no obvious reason to combine a render pass and shader modules together. In my mental model these two should not know about each other, but drivers would really like that to be the case. For example, if I were to render a scene it would look something like:

  • Start rendering to some attachments (VkRenderPass is known here)
  • Set up the default rendering state appropriate for the pass. There are different “default” states for depth-only, opaque, lighting, and transparency rendering. Part of the render state vector is determined here.
  • Ask the renderer to render some list of visible objects which survived culling. Shader modules are known at this level, and some render state might be per-material, like two-sided rendering, etc.

There are a few ways to make this work, but somewhere you must have higher-level knowledge which shader modules are used in which render passes. If an application has a baking step during build, that might be a nice place to do it, but not all graphics API use cases work this way. Emulation comes to mind where you cannot know what an application will do until you execute it. User scripting could be a nightmare as well …

Render passes also have a lot of combinatorial explosion. If we just change from MSAA 2x to MSAA 4x, that means new render passes, and new pipelines which are compatible with those render passes. Clearly we see that something trivial like changing a setting in the options menu of most games will trickle down into a completely different set of pipelines for all materials. This kind of coupling isn’t what I call clean, but sometimes sanity must be sacrificed for performance. I’d prefer to keep my sanity.

Fixed-function vertex bindings

This consists of attributes, bindings, strides and input rates. This one is usually not a problem if you control the asset pipeline. You can decide on a “standard” vertex buffer layout and forget about it. There is some slight annoyance here if we want to support glTF or other scene transmission formats unless we’re prepared to rewrite all vertex buffers to match the standard layout.

Shader compilers like to know about this information since some ISAs need to fetch vertices in software, and therefore need to be able to compute correct offsets based on VertexIndex/InstanceIndex.

10 – Fixed-function render state

When rendering triangles in Vulkan, there is still a ton of state to deal with. Vulkan takes all the gunk you’d set in glEnable/glDisable and various other functions and bundles it together into one massive struct. I wrote up a sample which demonstrates how render state is set, saved and restored.

I have to admit I kinda like the old-school way of setting state individually. Isolating render state to a command buffer avoids almost all the horrifying issues with state management in OpenGL. In GL, the state is global, and leaks between modules and render passes. This is really scary, and you’re basically forced to write a custom state tracker on top of GL to keep yourself sane. There was also no good way of “saving” just the state you cared about and restoring it without writing a lot of custom code. I like the idea of setting some “standard” state which clears out any possible leakage of state. Overall, Granite’s model is maximum convenience.

A concept I’ve seen in other projects is the idea of creating big structures on the user side which mimic a pipeline, but I don’t think this is very useful unless it’s basically a full VkGraphicsPipelineCreateInfo with all the bells and whistles. If it isn’t, we still don’t have all the information we need to create a pipeline anyway, like render pass information for example, and we’re back to hashing with lazy creation.

Even just render state tends to be split in two halves for me. Some state tends to be “global” in nature and some tends to be “local”. The global state is set by the higher-level renderer, which thinks in terms of:

  • Opaque pass vs transparent pass (alpha blending)
  • Depth-only? (depth write enable, depth bias?, equal test?)
  • Lighting pass? (additive blending?)
  • Stencil? (for deferred)

This state is saved and restored as necessary. Then we have the objects which are rendered in a render pass, which typically think in terms of:

  • Two sided mesh? (face culling)
  • Primitive restart?
  • Topology?
  • Shader program?
  • Vertex attributes?

I don’t like to couple these parts of the renderer together, so a tightly packed blob of state in Vulkan::CommandBuffer does the job for me. At the end of the day, the only real cost of this flexibility is some extra hashing cost. It doesn’t light up in the profile for me.
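
To make the “packed blob of state plus hashing” idea concrete, here is an illustrative sketch; it is not Granite’s actual code, and PackedRenderState, pipeline_cache and build_pipeline() are hypothetical names:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vulkan/vulkan.h>

// Pack the dynamic state into a small POD blob. Zero-initialize it so padding
// bits hash deterministically.
struct PackedRenderState {
    uint32_t depth_write : 1;
    uint32_t depth_test : 1;
    uint32_t blend_enable : 1;
    uint32_t cull_mode : 2;
    uint32_t topology : 4;
    // ... more packed state
};

static uint64_t fnv1a(const void *data, size_t size, uint64_t h)
{
    const uint8_t *p = static_cast<const uint8_t *>(data);
    for (size_t i = 0; i < size; i++)
        h = (h ^ p[i]) * 0x100000001b3ull;
    return h;
}

// Hypothetical helper: fills in a VkGraphicsPipelineCreateInfo and calls vkCreateGraphicsPipelines.
VkPipeline build_pipeline(const PackedRenderState &state);

static std::unordered_map<uint64_t, VkPipeline> pipeline_cache;

VkPipeline get_pipeline(const PackedRenderState &state, uint64_t program_hash, uint64_t render_pass_hash)
{
    uint64_t h = fnv1a(&state, sizeof(state), 0xcbf29ce484222325ull);
    h = fnv1a(&program_hash, sizeof(program_hash), h);
    h = fnv1a(&render_pass_hash, sizeof(render_pass_hash), h);

    auto itr = pipeline_cache.find(h);
    if (itr != pipeline_cache.end())
        return itr->second;                       // common case: one hashmap lookup

    VkPipeline pipeline = build_pipeline(state);  // slow path: pipeline compilation
    pipeline_cache.emplace(h, pipeline);
    return pipeline;
}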

Overall, I like the “immediate” nature of the CommandBuffer interface. There’s always a hybrid solution if that is ever needed where I would set the state I’m interested in, then pull out a persistent VkPipeline handle which can be used later and bypasses any hashing of state when bound.

Avoiding stutters

The real problem with lazily creating pipelines is vkCreateGraphicsPipelines in my opinion. Doing this at the last minute is almost a guaranteed hitch, and it should be avoided at all cost. Avoiding last minute pipeline compilation is the real reason we should know all state combinations up front, not because we get to bind VkPipelines directly and avoid some small hashing cost.

My strategy for dealing with this problem is pre-warming the hashmaps with previously seen data. Granite integrates the Fossilize project to solve the problem of serializing all information needed to create pipelines in a GPU and driver independent way. In theory, I would be able to ship a Fossilize database as part of an application and use that to pre-warm all historically observed pipelines and their dependent objects at Vulkan::Device creation time.

To my knowledge, this is basically how all GL and D3D11 drivers behave. Cache all the things.

Conclusion

Granite’s render state management is old-school, but I like it. Pre-warming the various hashmaps in Vulkan::Device is the strategy I used to avoid any pipeline compilation stutters.

There are many alternatives for any graphics API abstraction. There are things I like in legacy APIs, and things I hate. I wanted to keep the parts I liked, and avoid the parts I disliked.

… that’s all folks!

I think this is the end of this series for now. I’ve gone over the Vulkan backend in broad strokes, and I hope it was interesting and useful.

A tour of Granite’s Vulkan backend – Part 5

Render passes and synchronization

This is part 5 in the tour of Granite‘s Vulkan backend. We’re going to get knee-deep in the aspects of Vulkan which are the most difficult to learn in my opinion, and getting a grip on these topics is the real hurdle towards mastery. This level of understanding is something high-level APIs will prevent you from reaching.

This post isn’t intended to be a tutorial on Vulkan synchronization, so I’ll assume some basic level of knowledge.

Render passes

Render passes are a new fundamental part of Vulkan which does not exist in any of the legacy APIs. In older APIs you can freewheel how you render to frame buffers, but that approach was always terrible on tile-based GPUs, and these days with hybrid tilers, it’s probably terrible on desktop as well. Vulkan forces you to think about rendering everything you need to a frame buffer in one go and then moving on to the next.

In Granite, I wanted to make sure most of the flexibility and explicitness of Vulkan render passes could be expressed with minimal boilerplate. Most projects don’t seem to pay attention to this part except that it’s something you just have to do, and very few see the benefits they bring. That is probably a reasonable stance for 2019 if you do not care about mobile performance. If you need to target mobile though, it is worth the extra work. As of writing, the feature is quite mobile-centric, but desktop GPUs seem to be inching towards tile-based architectures, so it will be interesting to see if this view on render passes will shift in the future. Even D3D12 recently got render passes too, albeit in a simplified form.

In the most basic form, render passes in Vulkan can be rather daunting to set up, and it’s one of the many battles you have to fight to get hello triangle on screen. If we take a render pass with just one sub-pass (the case we care about 99% of the time), we need to specify up front:

  • How many attachments?
  • Which formats are used?
  • How many MSAA samples?
  • initialLayout and finalLayout
  • Which image layouts to use while rendering?
  • Do we load from memory or clear the attachment on render pass begin?
  • Do we bother storing the attachments to memory?

Most of this information is boilerplate we can automate, but things like load/clear/store we cannot deduce in the backend before it is too late. Knowing this kind of information up-front can be very beneficial for bandwidth consumption, at least on mobile.

The ugly framebuffer objects

An ugly aspect of Vulkan is the use of VkFramebuffer. I want an API where I just say “start a render pass where we render to these attachments”. Creating “FBOs” up front was really ugly in GL, and I think it’s a bad abstraction to have API users carry around ownership of objects which represent little to no useful work. FBOs are empty husks which might as well just be an array of image views.

We could just create a VkFramebuffer every time we begin a render pass and defer its deletion right away, but creating these objects has some cost. There’s a handle allocation in the driver at minimum, and probably a little more on certain drivers. Here I just reuse the temporary hashmap allocator which I introduced in the descriptor set model post. VkFramebuffer objects can be reused over multiple frames, but if they aren’t used for a while, they are just deleted, since VkFramebuffer objects are immutable.

Automating VkRenderPass creation

This topic is actually quite complicated when we start diving into the deep end of Vulkan render passes, but we can start with the trivial cases. The API in Granite looks something like:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 1;
rp.color_attachments[0] = &rt->get_view();
rp.store_attachments = 1 << 0;
rp.clear_attachments = 1 << 0;

rp.clear_color[0].float32[0] = 1.0f;
rp.clear_color[0].float32[1] = 0.0f;
rp.clear_color[0].float32[2] = 1.0f;
rp.clear_color[0].float32[3] = 0.0f;

cmd->begin_render_pass(rp);

This is an immediate way of starting a render pass, no reason to create frame buffers up front and all that. VkRenderPass can be created lazily on-demand like everything else.

Formats / MSAA sample counts

Render passes need to know formats and sample counts, and since we pass the concrete attachments directly in begin_render_pass(), we have the information we need right here.

Image layouts and VK_SUBPASS_EXTERNAL dependencies

There are three kinds of attachments in Granite:

  • User-created. These attachments are render targets which are created with Device::create_image() and the backend does not own the resource or knows anything about how long the resource will live. Common case for user-created render targets.
  • WSI images. These images are special since they came from VkSwapchainKHR or some equivalent mechanism. We know that these images are only used for rendering and they are only consumed by the presentation engine, or some other mechanism.
  • Transient images. Images with transient usage flags only live inside render passes. They cannot be sampled from, their memory does not necessarily exist except in page tables which point to /dev/null. We don’t care what happens to these images once the render pass is done.

To deduce image layouts for a render pass we have a few different paths.

WSI images

I don’t care about preserving WSI images over multiple frames, and I don’t care about sampling from WSI images or any such weird use case after rendering, so the flow of image layouts is:

  • initialLayout = UNDEFINED (discard)
  • VkAttachmentReference -> COLOR_ATTACHMENT_OPTIMAL or whatever is required for the subpass
  • finalLayout = PRESENT_SRC_KHR or whatever layout we need when using external WSI. For something like libretro, this will be SHADER_READ_ONLY_OPTIMAL since the image will be handed off to some other render pass which we don’t know or care about. For headless PNG/video dumping, it might be TRANSFER_SRC_OPTIMAL.

When initialLayout != the layout used in the first subpass, vkCmdBeginRenderPass will actually need to perform a memory barrier implicitly to make this work. The big question is when this memory barrier takes place, and the answer is “as soon as possible” (TOP_OF_PIPE_BIT) if we don’t specify it anywhere. For this case, Granite will add a subpass dependency which waits for VK_SUBPASS_EXTERNAL in the COLOR_ATTACHMENT_OUTPUT stage. This latches onto the WSI acquire semaphore, more on that later.

The finalLayout differs from the last layout used in the render pass, so we get a transition at the end of the render pass, but we don’t need to care about external subpass dependencies here. The automatically generated one is perfect, and we’re going to use the WSI release semaphore to properly synchronize this image anyways.

When we see a WSI image in a render pass, we can trivially mark this command buffer as “touching WSI”. This will affect command buffer submission later. This is indeed the kind of tracking which I have been arguing against in earlier posts, but it’s so trivial that it’s a no-brainer to me in this case.

Transient images

For transient images, we automate it just like WSI images, except that finalLayout will match last used layout in the render pass. This way we avoid some useless image layout transition at the end of the render pass. Next time we use the image, it’s going to be discarded anyways.

Because we deal with transitions automatically, users can freely pull images from Vulkan::Device with get_transient_attachment(), render to it, and forget about it. This is super convenient for things like MSAA rendering where the multi-sampled attachment just needs to exist temporarily for purposes of being resolved into the single-sampled attachment we care about. Having to care about synchronization for resources we don’t own is weird I think.

Other images

For any other image, we need to avoid any implicit layout transition, so we simply force initialLayout to match the first use in the render pass, and finalLayout will match the last use. In our small sample, it’s all going to be COLOR_ATTACHMENT_OPTIMAL. It’s up to the API user to know what layouts a render pass will expect, but it’s straightforward to map a render pass to its expected layouts. Color attachments are COLOR_ATTACHMENT_OPTIMAL, depth-stencil is DEPTH_STENCIL_OPTIMAL or DEPTH_STENCIL_READ_ONLY_OPTIMAL based on the read/write mode for depth, input attachments are SHADER_READ_ONLY_OPTIMAL, etc. It’s possible to use an attachment for multiple purposes in a subpass, and Granite supports that as well. Some examples:

  • Color attachment + input attachment: Feedback loop ala GL_ARB_texture_barrier (super useful for certain emulators) -> GENERAL
  • RW Depth attachment + input attachment (some weird decal algorithm?) -> GENERAL
  • Read-Only depth + input attachment (deferred shading use case) -> DEPTH_STENCIL_READ_ONLY_OPTIMAL

All of this is analyzed when a newly observed VkRenderPass is created, and subpass dependencies are set up automatically. Anything which happens outside the render pass is the user’s responsibility.

08 – Bare-bones “deferred rendering” sample

I made a cut-down sample to show how the API expresses multi-pass in the context of deferred rendering with transient gbuffer + depth. The meat of it is:

Vulkan::RenderPassInfo rp;
rp.num_color_attachments = 3;
rp.color_attachments[0] = &device.get_swapchain_view();

rp.color_attachments[1] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		0);
rp.color_attachments[2] = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		VK_FORMAT_R8G8B8A8_UNORM,
		1);

rp.depth_stencil = &device.get_transient_attachment(
		device.get_swapchain_view().get_image().get_width(),
		device.get_swapchain_view().get_image().get_height(),
		device.get_default_depth_format());

rp.store_attachments = 1 << 0;
rp.clear_attachments = (1 << 0) | (1 << 1) | (1 << 2);
rp.op_flags = Vulkan::RENDER_PASS_OP_CLEAR_DEPTH_STENCIL_BIT;

Vulkan::RenderPassInfo::Subpass subpasses[2];
rp.num_subpasses = 2;
rp.subpasses = subpasses;

rp.clear_depth_stencil.depth = 1.0f;

subpasses[0].num_color_attachments = 2;
subpasses[0].color_attachments[0] = 1;
subpasses[0].color_attachments[1] = 2;

subpasses[0].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadWrite;

subpasses[1].num_color_attachments = 1;
subpasses[1].color_attachments[0] = 0;
subpasses[1].num_input_attachments = 3;
subpasses[1].input_attachments[0] = 1;
subpasses[1].input_attachments[1] = 2;
subpasses[1].input_attachments[2] = 3;  // Depth attachment
subpasses[1].depth_stencil_mode = Vulkan::RenderPassInfo::DepthStencil::ReadOnly;

cmd->begin_render_pass(rp);
// Do work
cmd->next_subpass();
// Do work
cmd->end_render_pass();

See code comments in sample for more detail. To write this kind of sample in raw Vulkan would be almost a full day’s project.

Synchronization

Unlike many aspects of Granite which are reasonably high-level, synchronization in Granite is almost 100% explicit. The general philosophy of Granite is that excessive tracking of resource use is a no-no, unless it is trivial to do so (e.g. WSI images). Synchronization is a case where you need a lot of tracking to do a good job, and it is impossible to do a perfect job since you end up relying on heuristics, at least if you are to implement automatic synchronization on top of Vulkan. GL and D3D11 drivers have an advantage here since they can tap into GPU-specific synchronization mechanisms which might be better suited for implicit synchronization. A good example here is the i915 driver stack in the Linux DRM stack which can do automatic synchronization in kernel space somehow. I’m sure that simplifies the Mesa GL driver a lot, but I don’t know the details.

Let’s go through a thought experiment where we look at the big problems we run into if we are to implement a fully automatic barrier system. (I have tried :p)

Problem #1 – We cannot rewind time

When touching a resource, we must ask ourselves: “When and where was this resource accessed last?” We have three potential solutions to resolve a hazard:

  • Pipeline barrier (was used just now)
  • Event (was used some time ago in this queue)
  • Semaphore (was used in a different queue)

Ideally, we need to inject a barrier at the exact point where a resource was last used, but we cannot inject new commands in the middle of a command buffer which has already been recorded.

There is no winning this fight, either we eagerly inject barriers after every command in the hope that some future command will need to synchronize against it (VkEvent is nice here), or we inject barriers too late, stalling the GPU needlessly.

Eagerly injecting barriers is pure insanity if we take into the account that the resource might be used on a different queue in the future. That means signalling a heavy semaphore after every render pass or command. We could simply ignore supporting multiple queues, but that’s a huge compromise to make.

Problem #2 – Redundant tracking of read-only resources

A problem I found while trying to implement automatic barrier tracking was that static resources might be written in the future, so we need to keep track of them. This is a waste of CPU time, but it might be possible to explicitly mark these resources as “do not track, they’re truly static, pinky promise”, but I feel this is bolting on hacks.

Problem #3 – Multi-threading

The question of “where was this resource touched last” might not actually be possible to answer in a multi-threaded scenario. If we are recording command buffers in parallel, the backend has no idea what is going on until we serialize execution in vkQueueSubmit. A common solution I have seen for this problem is to resolve synchronization internally in command buffers as they are recorded, and on command buffer submission time, we can look at all used resources and resolve any cross-command buffer synchronization points right before submitting the command buffers in vkQueueSubmit. The complexity starts to shoot through the roof though. That’s a good sign we need to rethink.

I think this is the kind of solution you end up with when you have no choice but to port some old legacy API to Vulkan, and breaking the abstraction API is not an option.

Render graphs

A Vulkan backend which solves synchronization can only look back in time and deal with hazards at the last minute, but that is only because we do not have any context about what the application is doing. At a higher level, we can know what is going to happen in the future, and we can make automated decisions at that level, where we actually have context about what is going on. This is another reason why I do not want to have automatic synchronization in a Vulkan backend. Either we get a sub-optimal solution, or we try to close the gap with heuristics, but now run-time behavior is completely non-deterministic. We should aim for something better.

I believe the synchronization problem should be solved at a higher level, like a render graph. I wrote a blog some time ago about this topic: http://themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

Signalling fences

Granite’s way of signalling fences is very similar to plain Vulkan.

Vulkan::Fence fence;
device.submit(cmd, &fence);

// Somewhere else, potentially in a different thread.
fence->wait();
// fence ref-count goes to 0, queued up for recycling.

There is a pool of VkFence objects which can be reused. Signalling a fence forces a vkQueueSubmit. Once the lifetime of a Vulkan::Fence ends it is recycled back to the frame context. Nothing out of the ordinary here.

Semaphores

Semaphores work very similarly to fences and are requested in Device::submit() like fences. Like Vulkan semaphores, they have a restriction that they can only be waited on once. Semaphores can be waited on in other queues with Device::add_wait_semaphore() in a particular queue and pipeline stage. This matches pWaitDstStageMask. Semaphores are also recycled like fences.

Events

Events can be signalled and later waited on in the same queue. Again, we have a pool of VkEvent objects. CommandBuffer::signal_event() requests a fresh event, signals it with vkCmdSetEvent and hands it to the user. VkEvents are recycled using the frame context. There is a similar CommandBuffer::wait_event() which maps 1:1 to vkCmdWaitEvents.

Barriers

Granite has many different methods to inject pipeline barriers, the most common ones are:

cmd->barrier(srcStage, srcAccess, dstStage, dstAccess);

which maps to a vkCmdPipelineBarrier with a VkMemoryBarrier, and image barriers which act on all subresources:

cmd->image_barrier(image, oldLayout, newLayout, srcStage, srcAccess, dstStage, dstAccess);

There are cases where we want to batch barriers or otherwise use more complicated commands than this, so there are also 1:1 interfaces to vkCmdPipelineBarrier where the full structures are passed in, but these are only really used by the render graph since writing out full structures is super painful.
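For example, a typical hand-off where a compute shader has written a storage image which a fragment shader will then sample might look like this (purely illustrative usage of the helpers above):

// Compute shader wrote a storage image, fragment shader will sample it.
cmd->image_barrier(image, VK_IMAGE_LAYOUT_GENERAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
                   VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_ACCESS_SHADER_WRITE_BIT,
                   VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, VK_ACCESS_SHADER_READ_BIT);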

The automatic barriers in Granite

There are a few instances where I think having automatic barriers makes sense. These are cases where it’s convenient to do so, and there is no tracking required, so we can resolve all hazards right away and forget about it. Some of them we’ve seen already, like WSI images and transient images in render passes.

The other major case is static read-only resources. If you pass in initial_data to Device::create_buffer() or Device::create_image(), we generally have a desire to upload some data, and never touch it again.

The general gist of it is that we can upload data with a staging buffer over the transfer queue and inject semaphores which block all possible pipeline stages (based on bufferUsage/imageUsage flags). The downside is that we might end up creating too many submissions if we somehow want to upload a ridiculous amount of buffers or images in one go, but we can opt out of this automatic behavior by simply not passing initial_data and doing all the batching and synchronization work ourselves.

The end goal is that we should be able to call create_buffer or create_image and just use the static resource right away without having to think about synchronization at all.
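As an illustration, uploading a static vertex buffer could look roughly like this. The BufferCreateInfo field names are written from memory and may not match Granite exactly; treat this as a sketch of the intended usage rather than exact API.

// Hypothetical sketch: 'vertices' is some CPU-side array of vertex data.
Vulkan::BufferCreateInfo info = {};
info.domain = Vulkan::BufferDomain::Device;
info.size = sizeof(vertices);
info.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
auto vbo = device.create_buffer(info, vertices); // initial_data triggers the automatic upload path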

09 – Rendering to image and reading it back to CPU on transfer queue

I wrote a sample which flexes most of the synchronization APIs. It renders a small 4×4 texture on the graphics queue, synchronizes that with the transfer queue with a semaphore and reads it back to a CACHED host buffer. We spawn threads which wait on a fence, map the buffer and read back the results.

Conclusion

In these parts of the backend, the low-level explicit nature of Vulkan shines through. I think we have to be fairly low-level, or we inherit most of the problems with the older APIs.

… up next!

In the next installment, we’ll have a look at pipeline creation.

A tour of Granite’s Vulkan backend – Part 4

Optimizing for scratch data allocation

Allocating memory from a heap is fine and all, but very often in an engine we need to allocate throwaway data. Uniform buffers are the perfect use case here. With transient command buffers, certain data is also going to be transient. It’s very common to just want to send some constant data to a draw call and forget about it.

In Vulkan, there is a perfect descriptor type for this use case, VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC. And it’s not just uniform buffers; it’s fairly common that we want to allocate scratch vertex buffers, index buffers and staging data for texture updates as well.

Being able to implement allocators like these with no API overhead is a huge deal with Vulkan for me. In legacy APIs there are extremely painful limitations where “fire and forget” buffer allocations are very hard to implement well. Buffers generally cannot be mapped when submitting draw calls, so we need to fight really hard and think about copy-on-write behavior, discard behavior, API overhead to call map/unmap all the time (which breaks threaded driver optimizations) or batch up allocations and memcpy data around a couple of extra times. It’s too painful and a lot of CPU performance can go down the drain if we don’t hit all the fast paths.

The only proper solution in legacy APIs I can think of is GL 4.5’s GL_ARB_buffer_storage, which supports persistently mapped buffers like Vulkan, but relying on GL 4.5 (or GLES 3.2 + extensions) just does not seem reasonable to me, since targeting GL should be considered a compatibility option for old GPUs which do not have Vulkan drivers. This feature was a cornerstone of the “AZDO” (approaching zero driver overhead) buzz back in the GL days. D3D11 is still going to be the “compat” option on Windows for a long time, and forget about relying on latest and greatest GLES on Android.

This is the perfect occasion to present a “hello triangle” sample which uses most of these features, but we first need WSI, so let’s start there.

06 – Pushing pixels with SDL2

Granite’s main codebase normally uses GLFW, so to get a less redundant sample working, I wrote this sample to use SDL2’s Vulkan support, which is very similar to GLFW’s support starting with SDL 2.0.8.

Implementing WSI is similar to instance and device creation where there is a lot of boilerplate to churn through, with little room for design considerations. Granite’s WSI implementation has two main paths:

On-screen / VK_KHR_surface

In this mode, the WSI class creates and owns the Vulkan::Context and Vulkan::Device automatically for us and owns a VkSwapchainKHR. The only thing it cannot do on its own is create the VkSurfaceKHR, which is platform dependent. Fortunately, the surface is the only platform-dependent object, so we can supply an interface implementation to create this surface when Vulkan::WSI needs it. The sample implements an SDL2Platform class which uses SDL2’s built-in wrappers for surface creation, nifty!

Off-screen / externally owned swapchains

Granite is also used in cases where we don’t necessarily own a swapchain which is displayed on screen. We might want to supply already created images in lieu of VkSwapchainKHR and provide our own image indices as well as acquire/release semaphores. After completing a frame, we can pass along the fake swapchain image to our consumer. The prime case for this is the libretro API implementation in Granite.

Pumping the main loop

Vulkan’s Acquire/Present model maps directly to a “begin” and “end” model in Granite. We call Vulkan::WSI::begin_frame() to acquire a new image index, advance the frame context and deal with any in-between frame work. We might have to deal with out-of-date swapchains here and various janitorial work which we never had to consider in old APIs.

Semaphores for WSI images are dealt with automatically. WSI images are treated specially, and handling synchronization for WSI resources automatically is straightforward enough that there is no reason to expose it to the user. (Synchronization in Granite is in general very explicit, but WSI is one of the few exceptions.) The main loop looks something like:

wsi.begin_frame(); // <-- acquire image if necessary, advances frame context
auto cmd = device.request_command_buffer();
// do work and render to swapchain
device.submit(cmd);
wsi.end_frame(); // <-- flushes frame, queues up presents if swapchain was rendered to this frame

Overall, WSI is a must to abstract in Vulkan, and I’m happy with the flexibility and ease of use I ended up with.

07 – Hello triangle (quad?) with scratch allocated VBO, IBO and UBO

Now that we can get stuff on-screen, we’re getting to the actual meat of this post. https://github.com/Themaister/Granite-MicroSamples/blob/master/07_linear_allocators.cpp augments the WSI sample with a nice little quad. The VBO, IBO and UBOs are allocated directly on the command buffer.

Linear allocator – allocating memory at the speed of light

This allocator has many names – chain allocator, bump allocator, scratch allocator, stack allocator, etc. It is the perfect allocator when we want to allocate a lot of small blobs and just wink it all away at some point in the future. Allocation happens by incrementing an offset, and freeing happens by setting the offset back to 0, i.e. all the memory is “winked away” in one go.
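A minimal, self-contained sketch of the core idea (Granite’s real implementation additionally deals with Vulkan buffer objects, device alignment requirements and a pool of these):

#include <cstddef>
#include <cstdint>
#include <vector>

struct LinearAllocator
{
    std::vector<uint8_t> memory;
    size_t offset = 0;

    explicit LinearAllocator(size_t size) : memory(size) {}

    // Returns nullptr when exhausted; the caller would then grab a new buffer from a pool.
    void *allocate(size_t size, size_t alignment)
    {
        size_t aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + size > memory.size())
            return nullptr;
        offset = aligned + size;
        return memory.data() + aligned;
    }

    // "Wink away" every allocation in one go.
    void reset()
    {
        offset = 0;
    }
};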

Buffer pools of linear allocators

Some engine implementations have a strategy where there is only one huge linear allocator in flight, and once it is exhausted, it is considered OOM and a GPU stall is inevitable. This strategy is nice from an “explicit descriptor set” design standpoint if we use the UNIFORM_DYNAMIC descriptor type, since we can use a fixed descriptor set for uniform data, as offsets into the UBO are encoded when binding the descriptor set. I find this concept a bit too limiting, since there is no obvious size to pick (it is very content and scene dependent). I opted for a recycled pool of smaller buffers instead, since Granite’s descriptor binding model is very flexible, as we saw in the previous post in this series. If I had to deal with explicit descriptor sets, uniform data would be kind of nightmarish to deal with.

Vulkan::CommandBuffer can request a suitable chunk of data from Vulkan::Device, and once exhausted or on submission, the buffers are recycled back again. We can only reuse the buffer once the frame is complete on the GPU, so we also use the frame context to recycle linear allocators back into the “ready for allocation” pool at the right time.

To DMA queue or not to DMA queue …

Discrete GPUs have a property where accessing memory in VRAM is very fast, while host memory can be accessed over PCI-e at a far slower rate. For staging data like vertex, index and uniform buffers, it might be reasonable to assume that we should copy the CPU-side data over to the GPU side and let the GPU consume the streamed data in fast VRAM. Granite supports two modes: one where we let the GPU read data read-only straight from HOST_VISIBLE memory, and one where we automatically perform staging buffer copies from the CPU-side buffer over to the GPU.

In practice however, I don’t see any gain from doing the staging copy. The extra overhead of submitting a command buffer on the DMA queue which copies data over, and adding the extra synchronization overhead with semaphores and friends just does not seem worthwhile. Discrete GPUs can cache read-only data sourced from PCI-e just fine.

Super-convenient API

Since we have a very free-flowing descriptor binding model, we can have an API like this:

auto cmd = device.request_command_buffer();
MyUBO *ubo = cmd->allocate_typed_constant_data<MyUBO>(set, binding, count);
// Fill in data on the persistently mapped buffer.
ubo->data1 = 1;
ubo->data2 = 2;

void *vert_data = cmd->allocate_vertex_data(binding, size, stride);
// Fill in data.
void *index_data = cmd->allocate_index_data(size, VK_INDEX_TYPE_UINT16);
// Fill in data.
cmd->draw_indexed();

// Pointers are now invalidated.
device.submit(cmd);

The allocation functions are just light wrappers which allocate memory and bind the buffer at the appropriate offset. It’s perfectly possible to roll your own linear allocation system as well, e.g. if you want to reuse a throwaway allocation in multiple command buffers in the same frame, or something like that.

Conclusion

I think spending time on making temporary allocations as convenient as possible will pay dividends like nothing else. The productivity boost of knowing you can allocate data on the command buffer for near-zero overhead simplifies a lot of code around the callsite, and there is little to no cost of implementing this. Linear allocators are trivial to implement.

… up next!

On the next episode of “this all seems so high-level, where’s my low-level goodness”, we will look at render passes and synchronization in Granite, which is where the low-level aspects of Granite will be exposed.

A tour of Granite’s Vulkan backend – Part 3

Shaders and descriptor sets

This is part 3 of a blog series I’m writing on Granite‘s Vulkan backend. In this episode we are looking at how we deal with shaders and descriptor sets. At this point in our design process, there are many, many choices to make. Especially descriptor sets need to be carefully considered.

Hash all the things

A theme we start to see now is hashmaps and lazy creation of objects. One thing you run into with Vulkan’s pipeline-related types is how much work it is to be explicit all the time. The amount of information we need to provide is staggering. I believe it is not healthy for mind and soul to work at low levels here except in special cases, and we should aggressively hide away detail where we can. There is naturally a clock cycles vs. sanity tradeoff to be made here.

You can argue that the lines between high-level GL/D3D11-style design and Granite’s model are quite blurred. The (mental) price to pay to be explicit is just not worth it in my opinion. I will try to explore the obvious alternatives here and provide more context why the design is the way it is.

04 – Shaders and pipeline layouts

The first step in creating a pipeline is, of course, to create a VkShaderModule from our SPIR-V code. This is a no-brainer, but next we need a pipeline layout, which in turn requires VkDescriptorSetLayouts. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/04_shaders_and_programs.cpp

Rather than manually declaring the pipeline layout like a caveman, I think using reflection to automatically generate layouts is a good idea. There is no reason for users to copy information which already exists in the shaders. For the reflection, I use SPIRV-Cross. If we never need to compile SPIR-V at runtime (the game engine scenario), there is no reason why we cannot shift the reflection step off-line as well and just pass the side-band data along to remove a runtime dependency. I never got as far as building a nice off-line SPIR-V baking pipeline, so I just compile GLSL on the fly with shaderc. However, the interface in the Vulkan backend just consumes raw SPIR-V.
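As a rough illustration of what the reflection step looks like with SPIRV-Cross (not Granite’s exact code), pulling out set/binding decorations is only a few lines:

#include <spirv_cross.hpp>
#include <cstdint>
#include <vector>

// Extract descriptor set / binding decorations from a SPIR-V module.
void reflect_bindings(const std::vector<uint32_t> &spirv)
{
    spirv_cross::Compiler compiler(spirv);
    spirv_cross::ShaderResources resources = compiler.get_shader_resources();

    for (auto &ubo : resources.uniform_buffers)
    {
        uint32_t set = compiler.get_decoration(ubo.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(ubo.id, spv::DecorationBinding);
        // Mark (set, binding) as a uniform buffer in the layout we are building.
        (void)set; (void)binding;
    }

    for (auto &tex : resources.sampled_images)
    {
        uint32_t set = compiler.get_decoration(tex.id, spv::DecorationDescriptorSet);
        uint32_t binding = compiler.get_decoration(tex.id, spv::DecorationBinding);
        // Mark (set, binding) as a combined image sampler.
        (void)set; (void)binding;
    }
}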

A common mistake beginners tend to make is thinking that names are important in binding interfaces. This is a habit carried over from the GL and D3D11 days. The only things we should care about are descriptor sets, bindings and location decorations, as well as push constant use. This is the only semantic information we need to create binding interfaces, i.e. pipeline layouts and pipelines.

A pipeline layout in Vulkan needs to know all shader stages a-la GL programs, so we also need a step to combine shaders into a Vulkan::Program. Here we take the union of the reflection information and request handles for Vulkan::DescriptorSetAllocator and Vulkan::PipelineLayout. These are hashed, but there is no performance concern here since we should do all of this work at load time when possible anyway. These handles are all owned internally in Vulkan::Device, and there is no reason to worry about object lifetimes for these objects.

I don’t think there is a reason to deviate far from this design unless you have a very specific scheme in mind with descriptor set allocation. As I’ll explore later, using bindless descriptors extensions or explicit descriptor set allocation could motivate use of a “standard” pipeline layout, in which case reflection gets kind of meaningless anyways.

05 – The binding model – embracing laziness

I never really had a problem with the old-school way of binding resources to binding slots. It just isn’t the part of the old APIs I felt was lacking, so Granite is kind of old school here, but it does have full consideration for descriptor sets, and I removed any impedance mismatch with Vulkan (i.e. no translation is needed to bridge between Granite and Vulkan). E.g.:

cmd->set_storage_buffer(set, binding, *resource);
cmd->set_texture(set, binding, resource->get_view(), Vulkan::StockSampler::LinearClamp);

The old binding models in GL/D3D11 have flat binding spaces with no separation of per-frame, per-material, or per-draw bindings. In Granite I wanted to take full advantage of the descriptor set feature where we can assign some kind of “frequency” and relation between bindings. Here is an example to illustrate how it is used: https://github.com/Themaister/Granite-MicroSamples/blob/master/05_descriptor_sets_and_binding_model.cpp.

At draw time, we can use the current pipeline layout to pull in the binding points which are active and make sure we bind descriptor sets with the correct resources. This is actually hot code, so I spent time designing a nice system here which tries to be as optimal as possible, given these restrictions.

Because of mobile, we need some conservative limits. I use 4 descriptor sets and 16 (dense) binding points per set (minimum spec of Vulkan). This allows for fairly compact pipeline layout descriptions, and we can loop over bitsets to look at resources. This is also just fine for my use cases.

When it comes to allocation of descriptor sets themselves, I think I have a very different approach to most. A Vulkan::DescriptorSetAllocator is represented as:

  • The VkDescriptorSetLayout
  • A bunch of VkDescriptorPools which can only allocate VkDescriptorSets of this set layout. Pools are added on-demand.
  • A pool of unused VkDescriptorSets which are already allocated and can be freely updated.
  • A temporary hashmap which keeps track of which descriptor sets have been requested recently. This allows us to reuse descriptor sets directly. In the ideal case, we almost never actually need to call vkUpdateDescriptorSets. We end up with hash -> get VkDescriptorSet -> vkCmdBindDescriptorSets. When a descriptor set has not been used for a couple of frames (8), we assume that it is no longer relevant; the set is recycled, and some other request can reuse it and just call vkUpdateDescriptorSets. We definitely do not want to keep track of when any buffer or image resource is destroyed and recycle early. That’s tracking hell which slows everything down.

The temporary hashmap is a data structure I’m quite happy with. It’s used for a few other resources as well. See https://github.com/Themaister/Granite/blob/master/util/temporary_hashmap.hpp for the implementation.
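To make the flow more concrete, here is a heavily simplified sketch of the “request by hash” idea using plain Vulkan calls. This is not Granite’s actual code: the per-frame aging (the 8-frame recycling), proper pool sizing derived from the set layout, and thread safety are all left out, and the fixed pool ratios below are made up.

#include <vulkan/vulkan.h>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct DescriptorSetAllocator
{
    VkDevice device = VK_NULL_HANDLE;
    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    std::vector<VkDescriptorPool> pools;
    std::vector<VkDescriptorSet> vacant;                 // allocated sets, free to be rewritten
    std::unordered_map<uint64_t, VkDescriptorSet> cache; // hash of bindings -> ready-to-bind set

    // Returns { set, need_update }. If need_update is true, the caller writes the
    // bindings with vkUpdateDescriptorSets before vkCmdBindDescriptorSets.
    std::pair<VkDescriptorSet, bool> request(uint64_t hash)
    {
        auto itr = cache.find(hash);
        if (itr != cache.end())
            return { itr->second, false }; // hot path: reuse directly, no update needed

        if (vacant.empty())
            add_pool();

        VkDescriptorSet set = vacant.back();
        vacant.pop_back();
        cache[hash] = set;
        return { set, true };
    }

    void add_pool()
    {
        const uint32_t sets_per_pool = 16;
        const VkDescriptorPoolSize sizes[] = {
            { VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, 16 * sets_per_pool },
            { VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, 16 * sets_per_pool },
        };

        VkDescriptorPoolCreateInfo pool_info = { VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO };
        pool_info.maxSets = sets_per_pool;
        pool_info.poolSizeCount = 2;
        pool_info.pPoolSizes = sizes;
        VkDescriptorPool pool = VK_NULL_HANDLE;
        vkCreateDescriptorPool(device, &pool_info, nullptr, &pool);
        pools.push_back(pool);

        // Bulk-allocate all sets for this pool up front; they all share the same layout.
        std::vector<VkDescriptorSetLayout> layouts(sets_per_pool, layout);
        VkDescriptorSetAllocateInfo alloc_info = { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO };
        alloc_info.descriptorPool = pool;
        alloc_info.descriptorSetCount = sets_per_pool;
        alloc_info.pSetLayouts = layouts.data();
        std::vector<VkDescriptorSet> sets(sets_per_pool);
        vkAllocateDescriptorSets(device, &alloc_info, sets.data());
        vacant.insert(vacant.end(), sets.begin(), sets.end());
    }
};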

On certain GPUs, allocating descriptor sets is, or at least used to be, very costly. The descriptor pools might not be implemented as true pools (sigh …), so every vkAllocateDescriptorSets would mean a global heap allocation, which is absolutely horrible for performance. This is the reason I’m not a big fan of the “one large pool” design. In that model, we allocate a massive VkDescriptorPool and allocate any kind of descriptor set from it. This means recycling VkDescriptorSet handles over many frames is impractical. The intended use pattern is to call vkResetDescriptorPool and allocate new descriptor sets which are only valid for one frame at a time, just like command buffers. There is also the problem of knowing how to balance the descriptor load for these massive pools – what’s the ratio of image descriptors vs. uniform buffer descriptors, etc.? With per-descriptor-set-layout allocators, there is zero guesswork involved.

Alternative design – Bindless

Bindless is all the rage right now. The only real complaint I have is that it’s only supported on desktop and requires an EXT extension. It also means writing shaders in a very specific way. I don’t really need it for my use cases, but bindless enables certain complex algorithms which benefit from accessing a huge set of resources dynamically.

Alternative design – persistent explicit VkDescriptorSets

An alternative is exposing descriptor sets directly and only allow users to bind descriptor sets rather than individual resources. The API user would need to build the sets manually. While this is an idea, I think there are too many hurdles to make it practical.

  • We need to know and declare the target imageLayout of textures up front. This is obvious 99% of the time (e.g. a group of material textures which are SHADER_READ_ONLY_OPTIMAL), but in certain cases, especially with depth textures, things can get rather ambiguous. This does seem to me like an API design fault. It is unclear why this information is needed.
  • Some resources are completely transient in nature and it does not make sense to place them in persistent descriptor sets. The perfect example here is uniform buffers. In later samples, we’ll look at the linear allocator system for transient data.
  • Some resources depend on the frame buffer, i.e. input attachments. Baking descriptor sets for these resources is not obvious, since we need to know the combination pipeline layout + frame buffer, which should have nothing to do with each other.
  • We need to know the descriptor set layout (and by extension, the shaders as well) up-front. This is problematic if resources are to be used in more than one shader. The common fix here is to settle on a “standard” pipeline layout so we can decouple shaders and resources. This usually means a lot of padding and redundant descriptor allocations instead. We have a limited amount of descriptor sets when targeting mobile (4). We do not have the luxury of splitting every individual “group” of resources into their own sets, some combinatorial effects are inevitable, making persistent descriptor sets less practical. On desktop, 8 sets is the norm, so that might be something to consider.
  • Hybrid solutions are possible, but complexity is increased for little obvious gain.

Conclusion

I’m happy with my design. It’s very easy to use, and while there is a CPU price I’m willing to pay for it, I honestly never saw it in the profiler. I think resource binding models are a case where shaving overhead away will shave your sanity away as well, at least if you want to be compatible with a wide range of hardware. It’s much easier if you only cater to high-end desktop where bindless can be deployed.

… up next!

Next up we will explore the linear allocators for uniform, vertex, index and staging data.

A tour of Granite’s Vulkan backend – Part 2

The life and death of objects

This is part 2 in a series where I explore Granite‘s Vulkan backend. See part 1 for an introduction. In this blog entry we will dive into code, and we will start with the basics. Our focus in this entry will be to discuss object lifetimes and how Granite deals with the Vulkan rule that you cannot delete objects which are in use by the GPU.

Sample code structure

I will be referring to concrete code samples from here on out. I have started a small code repository which contains all the samples. See README.md for how to build, but you won’t need to run the samples to understand where I’m going with these samples. Stepping through the debugger can be rather helpful however.

Sample 01 – Create a Vulkan device

Before we can do anything, we must create a VkDevice. This aspect of Vulkan is quite dull and full of boilerplate, as is the setup code for any graphics API. There is not a lot to cover from an API design perspective, but there are a few things to mention. The sample code for this part is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/01_device_creation.cpp

The API for this is pretty straight forward. I decided to split up how we load the Vulkan loader library, since there are two main use cases here:

  • User wants Granite to load libvulkan.so/dll/dylib from standard locations and bootstrap from there.
  • User wants to load an already provided function pointer to vkGetInstanceProcAddr. This is actually the common case, since GLFW loads the Vulkan loader for you dynamically and you can just use the GLFW provided glfwGetInstanceProcAddr to bootstrap yourself. The volk loader has support for this.

To create the instance and device, we need to do the usual song and dance of creating a VkInstance and VkDevice:

  • Setup Vulkan debug callbacks
  • Identify and enable relevant extensions
  • Enable Vulkan validation layers in debug build
  • Find appropriate VkQueues to cover graphics, async compute, transfer

Vulkan::Context and Vulkan::Device

The Context owns the VkInstance and VkDevice, and Vulkan::Device borrows a VkDevice and manages the big objects which are created from a VkDevice. It is possible to have multiple Vulkan::Device instances on top of a VkDevice, but we end up sharing the VkQueues and the global heaps for that device. This is a very nice property of Vulkan, since it allows frontend/backend systems like e.g. RetroArch/libretro to share a VkDevice without having hidden global state leak across the API boundary, which is a huge problem with legacy APIs like GL and D3D11.

Note that this sample, and all other samples in this chapter are completely headless. There is no WSI involved. Vulkan is really nice in that we don’t need to create window system contexts to do any GPU work.

02 – Creating objects

Creating new resources in a graphics API should be very easy, and here I spent a lot of time on convenience. Creating images and uploading data to them in raw Vulkan is a lot of work, since there are so many things you have to think about: creating staging buffers, copying into them, deferring deletion of the staging buffer, deciding whether to copy on the transfer queue or not, emitting semaphores to transfer ownership to the graphics queue, creating image views, and so on – all of which is very painful to write. Just creating an image in a solid way is several hundred lines of code. Fortunately, this kind of code is very easy to wrap in an API. See the sample: https://github.com/Themaister/Granite-MicroSamples/blob/master/02_object_creation.cpp, where we create a buffer and an image. I think the API is about as simple as you can make it while keeping a reasonable amount of flexibility.

Memory management

When we allocate resources, we allocate them from Granite’s heap allocator for Vulkan. If I were writing Granite today, I would just use AMD’s Vulkan Memory Allocator, but it did not exist at the time I designed my allocator, and I’m pretty happy with my design as it stands. Maybe if I need de-fragmentation in the future or some really complex memory management strategy, I’ll have to rethink and use a mature library.

To get a gist of the algorithm, Granite will allocate 64 MB chunks, which are split into 32 blocks. Those 32 blocks can then be subdivided into 32 smaller blocks, etc., all the way down to little 256 byte blocks. I made a cute little algorithm to allocate effectively from these blocks with CTZ operations and friends. It is a classic buddy allocator, except you have 32 buddies.
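The core trick is easy to illustrate with one level of the hierarchy: a 32-bit mask tracks which sub-blocks are free, and CTZ finds a free one in a single instruction. (A toy sketch, not Granite’s actual allocator, which layers these masks hierarchically and deals with merging.)

#include <cstdint>

struct BlockMask
{
    uint32_t free_mask = ~0u; // all 32 sub-blocks start out free

    // Returns the index of a free sub-block and marks it allocated, or -1 if full.
    int allocate()
    {
        if (!free_mask)
            return -1;
        int index = __builtin_ctz(free_mask); // GCC/Clang builtin: count trailing zeros
        free_mask &= ~(1u << index);
        return index;
    }

    void free(int index)
    {
        free_mask |= 1u << index;
    }
};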

There are also dedicated allocations. I use VK_KHR_dedicated_allocation to query if an image should be allocated with a separate vkAllocateMemory rather than being allocated from the heap. This is generally useful when allocating large frame buffers on certain architectures. Also, for allocations which exceed 64 MB, dedicated allocations are used.

Memory domains

A nice abstraction I made is that rather than dealing with memory types like DEVICE_LOCAL, HOST_VISIBLE, and the combination of all the possible types, I declare up-front where I like my buffers and images to reside. For buffers, there are 4 use cases:

  • Vulkan::BufferDomain::Device – Must reside on DEVICE_LOCAL_BIT memory. May or may not be host visible (discrete vs integrated GPUs).
  • Vulkan::BufferDomain::Host – Must be HOST_VISIBLE, prefer not CACHED. This is for uploads to the GPU.
  • Vulkan::BufferDomain::CachedHost – Must be HOST_VISIBLE and CACHED. Falls back to non-cached memory, but that should never happen. Might not be COHERENT. Used for readbacks from the GPU.
  • Vulkan::BufferDomain::LinkedDeviceHost – HOST_VISIBLE and DEVICE_LOCAL. This maps to AMD’s pinned PCI mapping, which is restricted to 256 MB. I don’t think I’ve ever actually used it, but it’s a niche option if I ever need it.

When uploading initial data to a buffer, and Device is used, we can take advantage of integrated GPUs which share memory with the CPU. In this case, we can avoid any staging buffer and just memcpy data straight into the new DEVICE_LOCAL memory. Don’t blindly use staging buffers when you don’t need them. Integrated GPUs will generally have memory types which are both DEVICE_LOCAL and HOST_VISIBLE.
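A sketch of how that decision could be made with the plain Vulkan API (Granite hides this behind BufferDomain::Device):

#include <vulkan/vulkan.h>

static int find_memory_type(VkPhysicalDevice gpu, uint32_t type_bits, VkMemoryPropertyFlags wanted)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);
    for (uint32_t i = 0; i < props.memoryTypeCount; i++)
        if ((type_bits & (1u << i)) && (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return int(i);
    return -1;
}

// Usage idea: prefer memory we can memcpy into directly, otherwise go through a staging copy.
// int type = find_memory_type(gpu, reqs.memoryTypeBits,
//                             VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
// if (type < 0)
//     type = find_memory_type(gpu, reqs.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT); // staging path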

Mapping host memory

While not present in the sample, it makes sense to discuss how we map Vulkan memory to the CPU. A good rule of thumb in general is to keep host memory persistently mapped. vkMapMemory and vkUnmapMemory are quite expensive, especially on mobile, and we can only have one mapping of a VkDeviceMemory (64 MB with tons of suballocations!) active at any time. Rather than mapping and unmapping all the time, we implement map/unmap in Vulkan::Device so that it only performs cache maintenance when needed, with no extra CPU cost. On map(), for example, we need to call vkInvalidateMappedMemoryRanges if the memory type is not COHERENT, and on unmap, we call vkFlushMappedMemoryRanges if the memory is not COHERENT. This is fairly common on mobile when doing readbacks from the GPU, since we need CACHED, but we might not get COHERENT. Granite’s backend abstracts all of this away.
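A sketch of what that cache maintenance looks like with raw Vulkan, assuming a persistently mapped VkDeviceMemory that our allocation is suballocated from. (In real code the range offset and size must also respect nonCoherentAtomSize.)

#include <vulkan/vulkan.h>
#include <cstdint>

struct HostAllocation
{
    VkDeviceMemory memory;
    VkDeviceSize offset, size;
    uint8_t *base_ptr; // persistent mapping of the whole VkDeviceMemory
    bool coherent;
};

// "map" for readback: no vkMapMemory, just invalidate CPU caches if needed.
void *map_for_readback(VkDevice device, const HostAllocation &alloc)
{
    if (!alloc.coherent)
    {
        VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
        range.memory = alloc.memory;
        range.offset = alloc.offset;
        range.size = alloc.size;
        vkInvalidateMappedMemoryRanges(device, 1, &range); // make GPU writes visible to the CPU
    }
    return alloc.base_ptr + alloc.offset;
}

// "unmap" after a CPU write: flush caches if needed, never vkUnmapMemory.
void unmap_after_write(VkDevice device, const HostAllocation &alloc)
{
    if (!alloc.coherent)
    {
        VkMappedMemoryRange range = { VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE };
        range.memory = alloc.memory;
        range.offset = alloc.offset;
        range.size = alloc.size;
        vkFlushMappedMemoryRanges(device, 1, &range); // make CPU writes visible to the GPU
    }
}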

Physical and transient image memory

A very powerful feature of Vulkan is the support for TRANSIENT images. These images do not have to be backed by physical memory in Vulkan, which is very nice on tile-based mobile renderers.

In Granite I fully support transient images where I can pass in two different domains for images, Physical and Transient. Since Transient images are generally used for throw-away scenarios, there is a convenient method in Vulkan::Device::get_transient_attachment() to simply request a transient image with a format and resolution for rendering. Transient images are generally never created manually since they are so easy to manage internally.

Handle types

There are many ways to abstract handle types in general, but I went for my own “smart pointer” variant, the trusty intrusive ref-counted pointer. It can basically be thought of as a std::shared_ptr, but simpler, and we can pool the allocations of handles very nicely. How we design these handle types is not really important for Vulkan though, but I figured this point would generate some questions, so I’m addressing it here. See https://github.com/Themaister/Granite/blob/master/util/intrusive.hpp for details.

03 – Deferring deletions of GPU resources

Now we’re getting into topics where there can be significant design differences between Vulkan backends. My design philosophy for a middle-level abstraction is to be convenient, deterministic and good enough, at the cost of a theoretically optimal solution.

A common theme you’ll find in Granite is the use of RAII. Once lifetimes of objects end, we automatically clean up resources. This is nothing new to C++ programmers, but the big problem in Vulkan is we’re not managing just memory on CPU with new/delete. We actually need to carefully control when things are deleted, since the GPU might be using the resources we are freeing. The strategy here will be to defer any deletions. The sample is here: https://github.com/Themaister/Granite-MicroSamples/blob/master/03_frame_contexts.cpp

The frame context

In order to handle object lifetimes in Granite, I have a concept of a frame context. The frame context is responsible for holding all resources which belong to a “frame” of work. Normally this corresponds to a frame of work between AcquireNextImage and QueuePresent, but it is not tightly coupled. The Device has an array of frame contexts, usually 2 of them to allow double-buffering between CPU and GPU (and 3 on Android, because TBDR GPUs are a bit more pipelined and tend to prefer a little more buffering). The frame context is basically a huge data structure which holds data like:

  • Which VkFences must be waited on to make sure that all GPU work associated with this frame is done. This is the gatekeeper which holds all our recycling and deletions back.
  • Command pools for each worker thread and queue types.
  • VkBuffers, VkImages, etc, to be deleted once the fences signal.
  • Memory allocations from heap allocator to be freed.
  • … and various other resources.

Basically, we have a central place to chuck anything which needs to happen “later”, when the GPU is guaranteed to be done with this frame.

As a special consideration, the big fat “make it go slow” call Device::wait_idle() will automatically clean up everything in one go since it knows at this instant the GPU is not doing anything.
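A bare-bones sketch of the idea is shown below. The names are illustrative, not Granite’s actual types, and the real implementation recycles fences and handles many more resource types than shown here.

#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

struct FrameContext
{
    std::vector<VkFence> wait_fences;        // fences for submissions made in this frame
    std::vector<VkBuffer> destroyed_buffers; // deferred deletions
    std::vector<VkImage> destroyed_images;
    std::vector<VkCommandPool> command_pools; // per-thread pools to reset

    // Called when we loop back around to this frame context.
    void begin(VkDevice device)
    {
        if (!wait_fences.empty())
        {
            vkWaitForFences(device, uint32_t(wait_fences.size()), wait_fences.data(), VK_TRUE, UINT64_MAX);
            vkResetFences(device, uint32_t(wait_fences.size()), wait_fences.data());
        }

        // The GPU is now done with this frame, so the deferred work is safe to perform.
        for (auto buffer : destroyed_buffers)
            vkDestroyBuffer(device, buffer, nullptr);
        for (auto image : destroyed_images)
            vkDestroyImage(device, image, nullptr);
        for (auto pool : command_pools)
            vkResetCommandPool(device, pool, 0);

        destroyed_buffers.clear();
        destroyed_images.clear();
        wait_fences.clear();
    }
};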

Command buffer lifetime compromise

To make the frame based cleanup work in practice, we need to simplify our notion of what command buffers can do. In Vulkan, we have the flexibility to record command buffers once and reuse them at will at any time. This creates some complications. First of all, it throws the idea of a per-frame command pool out of the window. We can never reset the command pool in that case, since there will be free-floating command buffers out there which might be used later. Command pools work their best in Vulkan when you don’t allow individual command buffers to be freed.

If we have reusable command buffers, we also have the problem of object lifetimes. We end up with a painful situation where GPU resources must be retained until all command buffers which reference them are discarded. This leads to a really difficult situation where you have two options – deep reference-counting per command buffer or just pray all of this works out and make sure objects are kept alive as long as necessary. The former option is very costly and bug-prone, and the latter is juggling with razor blades too much for my taste where a large, meaningless burden is placed on the user.

I generally don’t think reusable command buffers are a worthwhile idea, at least not for interactive applications where we’re not submitting a static workload to the GPU over and over. There just aren’t many reasonable use-cases where this gives you anything meaningful. The avenues where you can submit the same calls over and over are maybe restricted to post-processing, but recording a few draw calls which render a few full-screen quads (or compute dispatches for the cool kids) is not exactly where your draw call overhead is going to matter.

I find that beginners obsess over the idea of aggressive reuse a little too much. In the end I feel it is misguided, and there are many better places to spend your time. Recording command buffers itself in Vulkan is super efficient.

My ideal use of command buffers is one where they are light-weight handles which all source their memory linearly from a common command pool. No reuse, so we use ONE_TIME_SUBMIT_BIT and TRANSIENT_BIT on the pool.

In Granite, I greatly simplified the idea of command buffers into transient handles which are requested, recorded and submitted. They must be recorded and submitted in the same frame context you requested the command buffer. This way we remove the whole need for keeping track of objects per-command buffers. Instead, we just tie the resource destruction to a frame context, and that’s it. No need for complicated tracking, it’s very efficient, but we risk destroying the object a little later than is theoretically optimal. This could potentially increase memory pressure in certain situations, but I think the trade-off I made is good. If needed, I can always add explicit “delete this resource now, I know it’s safe” methods, but I haven’t found any need for this. This would only be important if we are truly memory bound.

A design decision I made was that I would never want to do internal ref-counts for resources like images and buffers, and the design would be forced to not rely on fine-grained tracking which you would typically find in legacy API implementations. Any ref-counted operations should be immediately visible to API users and never be hidden behind API implementations. In fact, command buffer arguments for binding resources are plain references or pointers, not ref-counted types.

The memory pressure of very large frames

The main flaw of this design is that there might be cases where one spurious frame context sees an extreme amount of resource creation and deletion. A prime example here is loading screens or similar. Since Vulkan resources are not freed until the frame context itself is complete, we cannot recycle memory within a frame unless we explicitly iterate the frame context with Device::next_frame_context(). This tradeoff means that the Granite backend does not have to heuristically stall the GPU in order to reclaim memory at suitable times, which would add a lot of complexity and ruin the determinism of Granite’s behavior.

… up next!

In the next episode of Granite shenanigans we will look at the shader pipeline where we discuss VkShaderModule, VkDescriptorSetLayout, VkPipelineLayout and VkPipeline objects.

A tour of Granite’s Vulkan backend – Part 1

Introduction

Since January 2017, I’ve been working on my little engine project, which I call Granite. It’s on Github here. Like many others, I felt I needed to write a little engine for myself to fully learn Vulkan and I needed a test bed to implement various graphics techniques. I’ve been steadily working on it since then and used it as the backbone for many side-projects, but I think for others its value right now is for teaching Vulkan concepts by example.

A while back I wrote a blog about my render graph implementation. The render graph sits on top of the Vulkan implementation, but in this series I would like to focus on the Vulkan layer itself.

The motivation for a useful mid-level abstraction

One thing I’ve noticed in the Twitter-sphere and various panel discussions over the last years is the idea of the mid-level abstraction. Where GL and D3D11 are too high-level and inflexible for our needs in non-trivial applications, Vulkan and D3D12 tend to overshoot in low-level complexity, with the goal of being as low-level and explicit as possible while staying GPU architecture/OS-portable. I think everyone agrees that having a good mid-level abstraction is important, but the problem we always have when designing these layers is where to make the right trade-offs. There will always be those who chase maximum possible performance, even if complexity when using the abstraction shoots through the roof.

For Granite I always wanted to promote convenience, while avoiding the worst penalties in performance. The good old 80/20 rule basically. There are many, many opportunities in Vulkan to not do redundant CPU work, but at what cost? Is it worth architecting yourself into a diamond – a super solid, but in the end, inflexible implementation? I’m noticing a lot of angst in general around this topic, especially among beginners. A general fear of not chasing every last possible performance optimization because it “might be really important later” is probably why we haven’t seen a standard, mid-level graphics API yet in wide use.

I feel that the benefits you gain by designing for maximum possible CPU performance are more theoretical design exercises than practical ones. Even naive, straight forward, single-threaded Vulkan blows GL/GLES out of the water in CPU overhead in my experience, simply because we can pick and choose the extra work we need to do, but legacy driver stacks have built up cruft over a decade or more to support all kinds of weird use cases and heuristics. Add multi-threading on top of that, and then you can think about micro-optimizing API overhead, if you actually need it. I suspect you won’t even need multi-threaded Vulkan. I believe the real challenge with the modern APIs right now is optimizing GPU performance, not CPU.

Metal is getting a lot of praise for its successful trade-off in performance and usability. While I don’t know the API itself in enough detail to make a judgement (I know all the horrors of Metal Shading Language though, cough), I get the impression that the mid-level abstraction is the abstraction level we should be working at 99% of the time.

I think Granite is one such implementation. I am not trying to propose that Granite is the solution, but it is one of them. The design space is massive. There just cannot possibly be a one true graphics API for all users. Rather than suggest you go out and use it directly, I will try to explain how I designed a Vulkan interface which is quite convenient to use and runs well on both desktop and mobile (very few projects consider both), at least for my use cases. Ideally, you should be inspired to make the mid-level abstraction that is right for you and your project. I have gone through a couple of iterations to get where I am now with the design, and used it for various projects, so I think it’s a good starting point at least.

The 3D-accelerated emulation use case

How Granite got started was actually the Vulkan backend in Beetle PSX HW renderer. I wrote up a Vulkan backend, and emulators need very immediate and flexible ways of using graphics APIs. Information is generally known only in the last minute. Being able to implement such projects guided Granite’s initial design process quite a lot. This is also a case where legacy APIs are really painful since you really need the flexibility of modern APIs to do a good job with performance. There are a lot of state changes and draw calls on top of the CPU cost of emulation itself. Creating resources and modifying data on the GPU in weird ways is a common case in emulation, and many drivers simply don’t understand these usage patterns and we hit painful slow-paths everywhere. With Vulkan there is little to no magic, we just implement things how we want, and performance ends up far more predictable.

I think many forget that Vulkan is not just for big (AAA) game engines. We can successfully use it for all kinds of things. We just need the right abstractions and knowledge.

How the design and implementation will be explored

To start off, we will explore the design through commented code samples, which use only the Vulkan portion of Granite as a library. We will write concrete samples of code, and then go through how all of this works, and then discuss how things could be designed differently.

… up next!

I haven’t written up any samples yet, so it makes sense to stop here. Next time, we’ll start with some samples.

Recreating the tone filter from NieR:Automata

Audio tech in games is rarely particularly interesting. Sadly, most of it seems to have been turned into a commodity over the last couple of decades. Once in a while, some games have very clever audio tech, and in this case it was NieR:Automata’s tone filter which caught my attention. The blog post explaining this tech is found here. If you haven’t played the game, it is highly recommended to read the post and watch the videos to understand what it’s doing. That saves me a lot of explanation. Their blog is very sparse on technical implementation details, but I wanted to try recreating it, as there was just enough high-level detail in there to get me started.

Outside graphics, I’ve done a fair bit of audio programming in the past. It’s been too long since I did any significant audio DSP programming.

In short, the goal of the filter is to attempt to turn normal high-fidelity soundtracks into something with an 8-bit feel on-demand. Being able to introduce dynamic aspects to the music with a pure filter is interesting.

The tone filter as explained

The goal of the filter is to extract musical notes, and emphasize them. By having a few notes playing with a classic waveform like square or saw waves, we can recreate a retro 8-bit feel.

The blog describes 48 filters, spanning 4 octaves. Each octave in the (western music) scale is divided into 12 tones. What I deduced from this is that the 48 filters should be 48 very sharp bandpass filters.

Whiteboard description from blog

A theoretically perfect bandpass filter will output a pure sine wave if the tone exists, and nothing if it doesn’t.

The distortion is tough to do well, and I spent a lot of time fiddling with this. We want to try making the sine wave become something like a square wave. I tried many variants but I ended up with something very simple like

https://www.desmos.com/calculator/qvymx5qf8t

If you’ve done HDR tone-mapping, this formula will look very familiar to you. Sometimes cross-domain knowledge comes in handy.

After distorting, there is the levelling stage, which I spent a lot of time fiddling with. Basically, if we run this system as is, we end up with a ton of noise in the signal with all the chromatic tones playing over each other. Needless to say, this sounded absolutely terrible.

There are little to no details on how this should be implemented, so I tried a very crude model which seems to work reasonably well. Basically, each of the 48 channels has a running power estimate which is computed right after the filter. We can compare that against the running power estimate of the unfiltered audio. This gives us an idea of how much of the audio energy is concentrated into each individual tone. If the energy is low enough, the gain falls off in a power-of-4 fashion to avoid leaking in audio from completely unrelated tones. Percussion sounds will generally have energy across almost the entire audio spectrum, and we need to filter that out as well as we can. If the ratio is too high, we just cap it. This is the fiddly part. There are a lot of magic constants to tweak to get it sounding pleasing.
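Here is a rough sketch of that gating model. Every constant is made up for illustration, and the details differ from the actual implementation.

struct BandGate
{
    float band_power = 0.0f;
    float total_power = 0.0f;

    // 'band' is the output of this band's resonator, 'input' the unfiltered mono sample.
    float gain(float band, float input)
    {
        const float decay = 0.9995f; // running power estimate (leaky integrator)
        band_power = decay * band_power + (1.0f - decay) * band * band;
        total_power = decay * total_power + (1.0f - decay) * input * input;

        float ratio = band_power / (total_power + 1e-12f);
        const float threshold = 0.1f; // how much of the total energy this tone must hold
        const float max_gain = 1.0f;

        if (ratio >= threshold)
            return max_gain; // tone is clearly present, cap it and let it through

        // Below threshold: fall off in a power-of-4 fashion to suppress broadband noise
        // (percussion etc.) leaking into unrelated tones.
        float x = ratio / threshold;
        return max_gain * x * x * x * x;
    }
};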

At the end we mix our mono output signal back into the original audio, and when it works well, it gives a nice harmonic edge. I believe it’s reasonably close to the original game now. Here’s an example from the NieR:Automata OST. I’m visualizing all the 48 bands, and the colors are:

  • Blue: Below threshold, severely muted.
  • Green: Over threshold, should be heard.
  • Red: Saturated, hitting max threshold.

The tones in one octave form one row, and the four octaves are stacked on top of each other. The top-left starts at A3 – 220 Hz. If you know some music theory, maybe you can figure out which key the tune is in? 🙂

Implementation

First we mix stereo down to mono. This is kind of trivial. Just take the average of left and right channels.

Ultra-sharp bandpass, resonance filters

I went through a few failed iterations to get here. My first attempts were to do all of this in the frequency domain with FFTs, but that plan failed very quickly. What I ended up with in the end was a simple biquad resonance filter. This filter has two zeroes and two poles in DSP parlance – in other words, it has both an FIR (finite impulse response) part and an IIR (infinite impulse response) part. In code, this would look something like:

y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]

In the Z-domain, this looks like

H(z) = (n0 + n1 * z^-1 + n2 * z^-2) / (1 + d0 * z^-1 + d1 * z^-2)

The zeroes and poles occur where the roots of the polynomials go to zero in the numerator and denominator respectively. Basically, I designed the filter by deliberately placing zeroes and poles in the Z-domain, factoring the expressions out and converting it back to a normal FIR and IIR form.

I placed zeroes at DC and the Nyquist frequency (w = pi). The poles were placed very close to the unit circle at w = +/- 2 * pi * freq / samplerate, with radius 0.9999. Then I evaluated the filter response at the resonance frequency and adjusted the FIR portion of the filter so that we got an estimated unit gain at the resonance frequency.

Basically, the frequency response at the resonance frequency will be very close to dividing by zero, so near-infinite response, but not quite. Numerical stability can easily throw off the filter if we’re not careful. This is one of the major issues with IIR filters in general. I initially tried an 8-pole filter but it was impossible to get this stable even in FP64, so I just gave up and tried a simple biquad instead which worked just fine.
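A distilled sketch of this design procedure in code (not the exact implementation, but following the approach described above):

#include <cmath>
#include <complex>

struct Biquad
{
    float n0, n1, n2; // FIR (numerator) coefficients
    float d0, d1;     // IIR (denominator) coefficients
};

Biquad design_resonator(double freq, double sample_rate)
{
    const double pi = 3.14159265358979323846;
    const double r = 0.9999;                      // pole radius, very close to the unit circle
    const double w = 2.0 * pi * freq / sample_rate;

    // Poles at r * e^{+/- jw}: denominator 1 - 2r*cos(w) z^-1 + r^2 z^-2.
    const double d0 = -2.0 * r * std::cos(w);
    const double d1 = r * r;

    // Zeroes at DC (z = 1) and Nyquist (z = -1): numerator proportional to 1 - z^-2.
    // Evaluate the response at the resonance frequency and normalize it to unit gain.
    const std::complex<double> zi = std::polar(1.0, -w); // z^-1 on the unit circle
    const std::complex<double> num = 1.0 - zi * zi;
    const std::complex<double> den = 1.0 + d0 * zi + d1 * zi * zi;
    const double gain = std::abs(num / den);

    Biquad bq;
    bq.n0 = float(1.0 / gain);
    bq.n1 = 0.0f;
    bq.n2 = float(-1.0 / gain);
    bq.d0 = float(d0);
    bq.d1 = float(d1);
    return bq;
}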

SIMD

Since we’re running 48 IIR filters in parallel, this was a perfect case for SIMD optimizations. I made everything into a struct-of-arrays (SoA) form and just vectorized the scalar IIR filter directly. Normally, small IIR filters are tricky to vectorize since there are inter-dependencies between samples, but not here, since the 48 bands are independent of each other.

I optimized the filter in NEON, SSE1 and AVX and got a very nice performance boost, more on that later.
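For illustration, processing 4 bands per SSE register over an SoA layout could look roughly like this. It is only a sketch (the real code also has NEON and AVX paths); the coefficient names follow the difference equation shown earlier.

#include <xmmintrin.h>

// Coefficients and state for all 48 bands, stored SoA so that 4 bands fit in one register.
struct ToneFilterBank
{
    float n0[48], n1[48], n2[48], d0[48], d1[48];
    float y1[48] = {}, y2[48] = {};  // per-band output history
    float x1 = 0.0f, x2 = 0.0f;      // input history is shared; the same mono sample feeds every band
};

// Push one mono input sample through all 48 resonators.
void step(ToneFilterBank &f, float input, float *out48)
{
    __m128 x = _mm_set1_ps(input);
    __m128 x1 = _mm_set1_ps(f.x1);
    __m128 x2 = _mm_set1_ps(f.x2);

    for (int i = 0; i < 48; i += 4)
    {
        __m128 y = _mm_mul_ps(_mm_loadu_ps(f.n0 + i), x);
        y = _mm_add_ps(y, _mm_mul_ps(_mm_loadu_ps(f.n1 + i), x1));
        y = _mm_add_ps(y, _mm_mul_ps(_mm_loadu_ps(f.n2 + i), x2));
        y = _mm_sub_ps(y, _mm_mul_ps(_mm_loadu_ps(f.d0 + i), _mm_loadu_ps(f.y1 + i)));
        y = _mm_sub_ps(y, _mm_mul_ps(_mm_loadu_ps(f.d1 + i), _mm_loadu_ps(f.y2 + i)));

        _mm_storeu_ps(f.y2 + i, _mm_loadu_ps(f.y1 + i));
        _mm_storeu_ps(f.y1 + i, y);
        _mm_storeu_ps(out48 + i, y);
    }

    f.x2 = f.x1;
    f.x1 = input;
}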

This would have been a great case for ISPC, but I considered it a too large dependency for something simple like this.

Distortion

The distortion function must be nicely SIMD-friendly and not too expensive. I landed on the classic x/(1+abs(x)) operator. The divide can be done fast with reciprocal estimations. We didn’t need high accuracy.
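As a scalar reference (the SIMD path would substitute a reciprocal estimate for the divide):

#include <cmath>

inline float distort(float x)
{
    return x / (1.0f + std::fabs(x));
}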

Slight low-pass

After we have mixed together the 48 distorted streams, we run a weak low-pass filter on top to remove some of the harshest harmonics. This is done with a trivial 1-pole IIR filter.
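Such a 1-pole filter is a couple of lines; the coefficient below is a placeholder and would be derived from the desired cutoff and the sample rate.

struct OnePoleLowpass
{
    float state = 0.0f;
    float coeff = 0.1f; // placeholder smoothing coefficient

    float process(float x)
    {
        state += coeff * (x - state);
        return state;
    }
};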

Performance

I tested performance on a Ryzen 7 1800x @ 3.8 GHz as well as a high-end phone (Galaxy S9 Exynos) to measure NEON performance. The benchmark pushes 20 million white noise samples through the filter and then times the result. The test doesn’t take that long, so this should be assumed to be absolute peak performance without any thermal / power consideration. The results below are given in samples processed per second. Normal audio clips are 44.1 kHz, so 0.441 M/s should correspond to 1x real-time performance. The C++ version is written without any intrinsics with -O3 -ffast-math. The SIMD versions are written with the standard intrinsics.

Chip                      | C++     | SSE     | AVX      | AArch64 NEON
Samsung Exynos 9810       | 1.8 M/s |         |          | 6.8 M/s
Ryzen 7 1800x @ 3.8 GHz   | 3.6 M/s | 7.1 M/s | 11.5 M/s |

Basically, we’re 100x realtime performance here, even on a mobile CPU, nice. I’m surprised how close the performance ended up when comparing SSE and NEON. I didn’t see any auto-vectorization activate in the C++ variant, so I wonder what is going on with just 2x scaling in SSE. I got similar results on MSVC and GCC for what it’s worth … NEON gets close to ideal 4x scaling though, nice.

This uses quite a bit of processing power, so we can’t run wild with effects like this right now. But I look forward to being able to take advantage of systems like this for even more precise operations in the future.


The original implementation probably does more work on more gimped CPU hardware (AMD Jaguar consoles), but 100x real-time is pretty fast in my book. 😉

Source

The implementation is out there, but don’t expect to be able to use it as-is. This is a hobby project after all.

https://github.com/Themaister/Granite/blob/master/audio/dsp/tone_filter.cpp

https://github.com/Themaister/Granite/blob/master/tests/tone_filter_bench.cpp

VST plugin

I implemented a simple VST plugin with builds for Windows and macOS, both 64-bit. Feel free to try it out. It’s ultra bare bones.

Windows 64-bit

macOS 64-bit

An unusual recompiler experiment – MIPS to LLVM IR – Part 4

This is the final part in my blog series on my adventure recompiling MIPS to LLVM IR. If you’re new to this series you can read:

  • Part 1 – Explains the goals, MIPS ELF format, etc.
  • Part 2 – Explains how to generate code using the LLVM APIs.
  • Part 3 – Explains how we recompile MIPS code to LLVM.

In this post, I’m going to test performance on some applications and get a feel for how the various different codegen options we have can affect performance.

Due to our lack of extensive syscall support, there is a limit to what we can test without going out of our way to port stuff, so I’ll be focusing on some tests which don’t require much beyond simple stdio.

STB PNG read + write

This test is based on the STB library’s PNG implementation. The test will load a PNG file from disk and compress it again.

#include "stb_image.h"
#include "stb_image_write.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
	for (int i = 0; i < 20; i++)
	{
		int x, y, chan;
		stbi_uc *data = stbi_load("/tmp/test.png", &x, &y, &chan, 4);
		if (!data)
			return 1;

		if (!stbi_write_png("/tmp/output.png", x, y, 4, data, 4 * x))
			return 2;
	}

	return 0;
}

Native performance (32-bit)

To make the comparison a bit more fair, we’ll compile this using 32-bit x86 targeting i486 with -O3.

Time: 20.6 s

For reference, this matters quite a lot: on x86-64, we get 15.38 s. I will use the i486 result as a baseline, since both i486 and MIPS I are ancient ISAs from around the same era of computing.

MIPS on-demand JIT (baseline)

To begin our benchmarking, we’re going to test fully on-line JIT-ing. This is what needs to happen at least the first time we’re running an application. The results here will be affected by the balance between optimizing at run-time and doing less work while JIT-ing.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf

In this first test, we will apply the following options:

  • Function calls will link directly to their targets. This increases JIT workload significantly, since we need to JIT all possible call paths to be able to link code directly. However, runtime should be faster once we have JIT-ed.
  • No IR optimizations are enabled.

Time: 71.43 s

The up-front cost of JIT-ing is quite high. But overall, 3.5x slower isn’t terrible. Let’s see if we can do better.

On-demand JIT with optimizations

The JIT-er can perform some in-place optimizations. We’ll see if it helps here.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --optimize

Time: 75.43 s

It seems like the optimization passes made it a bit slower.

On-demand JIT with thunked calls

Rather than aggressively JIT-ing all possible call paths, we can try just JIT-ing functions we are actually calling. All direct calls are translated into indirect calls, and every call requires a lookup. This should reduce the JIT overhead a lot, but potentially have worse runtime performance. Without --optimize, this should be the most efficient option if we want to avoid JIT overhead.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls

Time: 70.0 s

Interesting. This might be the sweet spot for on-demand JIT.

On-demand JIT with thunked all the things

We can also use thunked load-store operations, rather than emit IR code to translate addresses for every memory operation. This should reduce code bloat, and might help when we’re doing on-demand JIT.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls --thunk-load-store

Time: 90.4 s

Ouch.

Assuming well behaved calls and returns?

Unfortunately, my assumption that GCC would generate expected code for returns was wrong, or my implementation was buggy. I couldn’t get it to work for non-trivial test cases, so I can’t test performance here.

Ahead-of-time recompiled IR

Now we’re starting to get into interesting territory which I haven’t seen much of in the past.

We need to run the application here, dump IR code to disk, and recompile into a dynamic library. For this case, we should be able to generate pretty good code and avoid any run-time recompilation. This is the ideal scenario if we can deduce all known call-paths.

Let’s start with the optimal case. No thunking.

~/git/jitter/cmake-build-release/mipsvm --dump-llvm /tmp/llvm stb-test.elf

This dumps out a whopping 68 MB of LLVM IR. Time to turn this ball of mud into a dynamic library.

#!/bin/bash

OUTPUT="$1"
LLDIR="$2"

echo "== Linking LLVM IR =="
llvm-link -o __llvm_linked.bc "$LLDIR"/*.ll
echo "== Optimizing offline LLVM IR =="
opt -O3 -o __llvm_opt.bc __llvm_linked.bc -disable-inlining
#cp __llvm_linked.bc __llvm_opt.bc
echo "== Compiling static library to object file with LLC =="
llc -relocation-model=pic -filetype obj -o __linked.o __llvm_opt.bc -O3
echo "== Linking shared library =="
gcc -o "$OUTPUT" -shared __linked.o

rm -f __llvm_linked.bc
rm -f __llvm_opt.bc
rm -f __linked.o

This operation takes 53.4 seconds and generates a 3.4 MB binary. The original binary is 792 kB due to the statically linked glibc.

This should yield us the absolute best performance we can hope for. So let’s try it.

~/git/jitter/cmake-build-release/mipsvm test --static-lib ~/git/jitter/test_linked.so --static-symbols /tmp/llvm/addr.bin

Time: 46.2 s

That’s a pretty great improvement compared to the ~20 seconds for the native 32-bit/i486 binary. It starts up basically instantly since there is no recompilation necessary. We only need to recompile if we find new code we haven’t looked at yet.

From here, we can get a better idea of what runtime cost we have by removing optimizations, and adding thunking. Let’s see if opt -O3 helps at all by just going straight to llc.

Building the native binary just takes 24 seconds now.

Time: 57.0 s

opt -O3 is clearly doing something well. Let’s add back the optimization and use thunked calls. For thunked calls, the IR dump is just 21 MB. We can see here that we were JIT-ing out a lot of useless code we never had to actually run.

The binary is 984 kB now.

~/git/jitter/cmake-build-release/mipsvm stb-test.elf --static-lib ~/git/jitter/test_unlinked.so --static-symbols /tmp/llvm-nolink/addr.bin

Time: 60.0 s

The win from linking calls directly is nothing to sneeze at: 46.2 s vs. 60.0 s. Let’s thunk the load-store calls as well and see where we end up.

Time: 92.7 s

Yup. Clearly, we can get a 2x speedup just by inlining the load-store code and calling functions directly rather than relying on thunking. In the best case we’re not that far away from a 2x differential against native code!

Best of both worlds codegen?

If we’re dumping code with thunking to disk to improve JIT overhead, we can imagine that we can optimize the thunked calls to direct code off-line if we write our own LLVM optimization pass. Just an idea …

We need to go one level deeper

Let’s try something silly. We will recompile a cross-compiled cross-compiler. What? Well, I’ve built SPIRV-Cross for MIPS big-endian this time around. This was actually a useful exercise, because I can now verify that SPIRV-Cross works for both MIPS and big-endian at the same time 😛 Nice. SPIRV-Cross uses C++11, a fair bit of STL and exceptions. Can we host libstdc++ properly? Let’s see. With a statically linked libstdc++, the binary is 3.2 MB.

Let’s dump some LLVM …

~/git/jitter/cmake-build-release/mipsvm spirv-cross --dump-llvm /tmp/llvm-spirv -- spirv-cross /tmp/test.spv

I tried running this with a test shader in the SPIRV-Cross repository. It makes use of FP64, so we can see if we support doubles. Here we also see that we can pass arguments through to argv. We end up with 168 MB of LLVM IR, which sure is intense. Let’s recompile it. This process takes over 2 minutes and creates an 8.6 MB binary.

~/git/jitter/cmake-build-release/mipsvm spirv-cross --static-lib ~/git/jitter/test_spirv.so --static-symbols /tmp/llvm-spirv/addr.bin -- spirv-cross /tmp/test.spv

Now it runs almost instantly and correctly.

Conclusion

This has been a fun little side project. The overhead of JIT-ing is rather high, as we would expect, but the peak runtime performance is surprisingly good. For ahead-of-time recompiled code we’re in the ballpark of 2-3x slower than natively compiled code. I haven’t tested a lot of code out there, but STB’s PNG implementation, SPIRV-Cross, glibc and libstdc++ should represent reasonably complex and varied code.

Release

I’ve released this project on GitHub under an MIT license. Please read the disclaimer.

An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Strap in, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top level code fundamentally needs to be able to translate the program counter (short-hand, PC) to an executable function pointer. We can choose a hash map (large address space) or flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.
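
The skeleton of that loop could look something like this (a sketch with made-up names, not the actual implementation):

#include <cstdint>
#include <unordered_map>

struct VMState;                            // register bank, page table, ...
using JITFunction = void (*)(VMState *);   // signature of a recompiled entry point

// Hypothetical: analyze + JIT an LLVM module for code starting at pc.
JITFunction recompile_function_at(uint32_t pc);

std::unordered_map<uint32_t, JITFunction> block_cache;

JITFunction resolve(uint32_t pc)
{
    auto itr = block_cache.find(pc);
    if (itr != block_cache.end())
        return itr->second;           // we have already recompiled this PC

    JITFunction func = recompile_function_at(pc);
    block_cache.emplace(pc, func);    // remember it for future lookups
    return func;
}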

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since a previously compiled function might become invalid if the underlying code changes. I think the solution is to keep a JIT block cache which maps a hash of the code to a function pointer, and to analyze code blocks we don’t yet have a function pointer for. Any i-cache invalidation clears out the relevant function pointers, which triggers re-hashing in some form. Most likely the code for our particular function did not change, so we can probably reuse the code blocks we already generated.

For our purposes, we will not deal with self-modifying code here. A real emulator would have to, but self-modifying code becomes rarer and rarer the more modern the hardware we’re targeting.

Recompiling a function

So, given a PC to execute, we’ll do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks do in LLVM.

Basic block

Basic blocks are represented as a starting PC and an end, where the execution flow is linear. The end of a basic block occurs where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.
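
In code, I imagine the result of this analysis as something like the following (an illustrative sketch; the field names are made up):

#include <cstdint>
#include <vector>

enum class Terminator
{
    ConditionalBranch,   // BEQ/BNE/BLEZ/... -> two successor blocks
    DirectBranch,        // J resolved to another basic block
    TailCall,            // J (or static branch) deduced to be a tail call
    IndirectBranch,      // JR -> return or tail call, ends the graph
    Exit                 // illegal instruction, etc.
};

struct BasicBlock
{
    uint32_t start_pc;                // first instruction of the block
    uint32_t end_pc;                  // PC just past the last instruction of the block
    Terminator terminator;
    std::vector<uint32_t> successors; // target PCs for statically known successors
};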

Branch delay slots

A very important part of MIPS is the branch delay slot. It is a unique design aspect of the architecture, considered a design flaw today because it hard-codes the behavior of one very specific micro-architecture into the ISA. Exposing micro-architecture details like this should be considered bad taste. Whenever a branch is taken, conditional or not, the next instruction is always executed. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}
00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is: what happens if you have multiple branches back to back, i.e. a branch in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in the MIPS ISA, because it is nonsensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if an exception or hardware interrupt hits in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is itself a branch, we know it cannot legally be taken, and branch instructions have no side effects (except jal/jalr, but those always take the branch and are therefore illegal here), so we simply skip it.
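
A minimal sketch of that ordering for a conditional branch might look like this (a made-up emitter interface, purely to illustrate the order; as the IR later in this post shows, the condition is read from the register state before the delay slot runs, and the actual jump happens after it):

#include <cstdint>

namespace llvm { class Value; }

// Hypothetical emitter interface, purely to illustrate the ordering.
struct Recompiler
{
    llvm::Value *emit_branch_condition(uint32_t branch_pc); // reads current register state
    void emit_instruction(uint32_t pc);                     // emits one non-branch instruction
    void emit_conditional_jump(llvm::Value *cond,
                               uint32_t taken_pc, uint32_t not_taken_pc);
};

void recompile_conditional_branch(Recompiler &recomp, uint32_t branch_pc, uint32_t taken_pc)
{
    // Evaluate the condition from the register state as it is at the branch ...
    llvm::Value *cond = recomp.emit_branch_condition(branch_pc);
    // ... then emit the delay slot instruction, which always executes ...
    recomp.emit_instruction(branch_pc + 4);
    // ... and only then do we actually branch to one of the two successor blocks.
    recomp.emit_conditional_jump(cond, taken_pc, branch_pc + 8);
}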

Load delay slots

MIPS I also has a delay slot for loads: we cannot use the target register in the instruction immediately following a load. However, for recompilation purposes we can ignore this. While clever code might attempt to abuse the fact that the target register of a load hasn’t been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyway, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot, because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are

  • BEQ
  • BNE
  • BLEZ
  • BLTZ
  • BGEZ
  • BGTZ
  • BC1T (floating point compare)
  • BC1F (floating point compare)

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

  • Branch to basic block
  • Tail call to an unrelated function

If we mistakenly treat a J as a basic block where it should have been a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}
00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with was: if the J target is also referred to by conditional branches elsewhere, we can assume J is a branch to a basic block, à la if/else or switch blocks. If not, we assume it’s a tail call.
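
Expressed as code, the heuristic could be as simple as this (an illustrative sketch, not the actual implementation):

#include <cstdint>
#include <unordered_set>

// PCs which are targeted by conditional branches elsewhere in the function
// being analyzed (filled in during basic block discovery). Illustrative only.
bool j_is_branch_to_basic_block(uint32_t j_target,
                                const std::unordered_set<uint32_t> &conditional_branch_targets)
{
    // If some conditional branch already targets this PC, it looks like part of
    // an if/else or switch construct -> treat J as a plain branch to a block.
    // Otherwise, assume it is a tail call to an unrelated function.
    return conditional_branch_targets.count(j_target) != 0;
}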

There are other static branches we can find in MIPS. The conditional branches become static branches if the $0 register is used. This seems to be mostly useful for position-independent code, since we can branch to an address relative to our PC rather than to a fixed address as with J. We should try to detect these “static” branches as well. There is no need to analyze code which can never be executed.

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR
uint32_t target_pc = registers[instr.rs]; // JR jumps to the address held in the rs register
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    return_prediction_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

  • Returning from a function, almost always using “jr $ra”.
  • Jump tables
  • Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.
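
A host-side hook for this could be as simple as the following sketch (the function name is made up to match the style of the other __recompiler hooks, it is not the actual API):

#include <csignal>
#include <cstdint>

// Made-up host hook for illegal instructions; illustrative only.
extern "C" void __recompiler_illegal_instruction(uint32_t pc)
{
    (void)pc;              // a real implementation would record the guest PC somewhere
    std::raise(SIGILL);    // raise SIGILL on the current thread, as described above
}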

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}
 00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is fairly ISA-agnostic, and it’s not that hard to swap MIPS out for something else once the basic structure is in place. The recompiler which sets all of this up is completely MIPS-agnostic; it only asks “given a start PC, where does the basic block end, and what kind of basic block is it?”

Register allocation and branches

When working with registers, we ideally want the MIPS registers to be reflected in our native hardware registers. Obviously a 1:1 mapping is not possible: MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not really have to care about register allocation when using LLVM. We just need to avoid emitting CreateStore/CreateLoad as much as possible, and LLVM should take care of the rest. Within a basic block this is very easy, since control flow is linear and we always know which SSA value a register refers to. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven’t used a register yet, load it from memory; if we modify a register, just replace the SSA value and remember that we eventually have to write it back to the register bank in memory.

The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we would have a really hard time with register dirty tracking. If we have, say, a basic block C which can be entered from basic blocks A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every possible register any predecessor of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb.

Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block, or before a function call, all dirty registers are flushed to memory. On entry of a basic block, we have to load all the registers we need from memory. Ideally LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register-starved architecture to begin with, and moving data to and from L1 cache is very common.
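
To give an idea of the shape of such a class, here is a rough sketch of what a RegisterTracker could look like, assuming an IRBuilder-based emitter (this is my guess, not the actual implementation):

#include <cstdint>
#include <llvm/IR/IRBuilder.h>

class RegisterTracker
{
public:
    RegisterTracker(llvm::IRBuilder<> &builder_, llvm::Value *register_bank_)
        : builder(builder_), register_bank(register_bank_) {}

    // Read a MIPS register as an SSA value, loading it from the in-memory
    // register bank the first time it is needed in this basic block.
    llvm::Value *read(unsigned reg)
    {
        if (!ssa_values[reg])
            ssa_values[reg] = builder.CreateLoad(builder.getInt32Ty(), register_pointer(reg));
        return ssa_values[reg];
    }

    // Writing just replaces the SSA value and marks the register dirty.
    void write(unsigned reg, llvm::Value *value)
    {
        ssa_values[reg] = value;
        dirty |= 1ull << reg;
    }

    // Before a call or at the end of the block, flush dirty registers back to memory.
    void flush()
    {
        for (unsigned reg = 0; reg < 64; reg++)
            if (dirty & (1ull << reg))
                builder.CreateStore(ssa_values[reg], register_pointer(reg));
        dirty = 0;
    }

private:
    llvm::Value *register_pointer(unsigned reg);  // GEP into the register bank (omitted)
    llvm::IRBuilder<> &builder;
    llvm::Value *register_bank;
    llvm::Value *ssa_values[64] = {};
    uint64_t dirty = 0;
};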

Call instructions

MIPS has several ways of “calling” functions. These instructions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

  • JAL
  • JALR
  • BLTZAL
  • BGEZAL
  • J (deduced tail call)
  • BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPU’s branch predictor. The return instruction is one of the best prediction mechanisms we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch, unless we were the subject of a stack smashing attack or something. The CPU keeps an internal return stack of where it expects returns to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we simply set a reasonable upper limit. In the worst case, where the return stack for some reason grows too large, we can use the nuclear option in our arsenal: longjmp! The top-level code uses setjmp, and if at any point we’ve reached a hopeless situation, longjmp unwinds the entire stack at once and we can re-enter with our new PC. However, this is kinda terrible for performance, since all return instructions will now fail to optimize to a simple return, and we might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.
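
A minimal sketch of that top-level escape hatch (names and details are assumptions, not the actual implementation):

#include <csetjmp>
#include <cstdint>

struct VMState;
using JITFunction = void (*)(VMState *);
JITFunction resolve(uint32_t pc);   // hypothetical: PC -> recompiled function

static jmp_buf top_level;
static uint32_t entry_pc;

void run(VMState *state, uint32_t pc)
{
    entry_pc = pc;
    setjmp(top_level);          // longjmp() from anywhere lands back here ...
    resolve(entry_pc)(state);   // ... and we re-enter at whatever PC was requested
}

// Called from deep inside recompiled code when the situation is hopeless,
// e.g. the native call stack or prediction stack has grown far too large.
void escape_to_top_level(uint32_t new_pc)
{
    entry_pc = new_pc;
    longjmp(top_level, 1);
}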

To thunk or not to thunk

While indirect calls must perform a lookup at runtime to determine what we are actually calling, direct call instructions can call another LLVM function directly. In this case we avoid any runtime lookups, but we risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR, so that all call instructions go through an indirection. LLVM supports lazy JIT-ing, which could alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline later anyways 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}
004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

It doesn’t get much simpler than this as a starting test. main calls foo through a tail call; let’s see what the LLVM IR looks like, completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple of hidden registers), 64 for the FP registers (32 + a couple extra), and finally the page table. We inline the page table in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo():

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call our externally defined function in the VM to check if the return stack prediction worked. Either we tail call, or we simply return. $ra in this case will be 0, so we just end execution here.

The registers dumped at the end read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.
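
As an aside, the %0 state type at the top of the module corresponds roughly to a host-side struct like this (a sketch of the layout as I read the IR; the field names are made up):

#include <cstdint>

struct MIPSState
{
    uint32_t scalar_registers[64]; // 32 GPRs plus a few hidden registers
    uint32_t float_registers[64];  // 32 FP registers plus control/extra registers
    void *page_table[1024 * 1024]; // 2^20 pages of 4 KiB each = a 4 GiB address space
};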

Loads and stores

Let’s have a look at what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}
004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to the stack, but it lets us see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

The good news is that this is very straightforward code, so the CPU should churn through most of it like butter, unless the page table reads miss in L1. It will be interesting to benchmark this code against natively compiled C code later.
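
Written out in C++, the store sequence above boils down to something like this (illustrative only, not actual VM code):

#include <cstdint>

// The top bits of the guest address select a 4 KiB page from the page table,
// and the low bits index a 32-bit word within that page.
inline void virtual_store32(void **page_table, uint32_t addr, uint32_t value)
{
    auto *page = static_cast<uint32_t *>(page_table[addr >> 12]);
    page[(addr >> 2) & 1023] = value;
}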

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}
00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop
define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here. What happens if we run the LLVM IR through opt first, though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq 

I suspect some of the issues are related to the lack of noalias attributes. LLVM might think that stores to virtual memory can alias with the register bank, and generate very conservative code. Something to have a look at later.

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move values to and from integer registers, and fiddle with the control register, which among other things controls rounding modes. I haven’t bothered emulating correct rounding modes for now, but the control register is also used for floating point conditional branches, so it needs to be emulated at least. Just like SSE, the actual data type of an FP register can vary depending on the instruction, so we will need a lot of bitcasts. Fortunately, this is a native construct in LLVM.
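
To illustrate, emitting something like add.s under this scheme could look roughly like this with LLVM’s C++ IRBuilder (a sketch of the idea, not the actual emitter):

#include <llvm/IR/IRBuilder.h>

// The FP register bank stays integer-typed: bitcast in, do the float math, bitcast back.
llvm::Value *emit_add_s(llvm::IRBuilder<> &builder, llvm::Value *fs_i32, llvm::Value *ft_i32)
{
    llvm::Value *fs = builder.CreateBitCast(fs_i32, builder.getFloatTy());
    llvm::Value *ft = builder.CreateBitCast(ft_i32, builder.getFloatTy());
    llvm::Value *fd = builder.CreateFAdd(fs, ft);
    // Store back as i32 so the register bank stays integer-typed.
    return builder.CreateBitCast(fd, builder.getInt32Ty());
}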

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}
004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things and always keeping the register bank as integers. Otherwise, the codegen to LLVM IR is reasonably straightforward. In the generated x86-64 we end up seeing the magic instructions we want, buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...

Syscalls

To end this post on a less intense note, let’s write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra
; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we simply create a hook into the VM host and handle them there. The syscall number goes in the $v0 register, and arguments follow as normal. To implement the write syscall we just need to copy data over from our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.
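
As a rough idea of what the brk() side could look like (a heavily simplified sketch; allocate_guest_pages is a made-up helper, and the real edge cases like shrinking the break and interactions with mmap’d ranges are messier):

#include <cstdint>

// Hypothetical helper: maps zero-initialized guest pages covering [begin, end).
bool allocate_guest_pages(uint32_t begin, uint32_t end);

static uint32_t current_brk;  // current end of the guest heap

uint32_t emulate_brk(uint32_t new_brk)
{
    // The raw Linux syscall returns the (possibly unchanged) program break.
    if (new_brk <= current_brk)
        return current_brk;
    if (!allocate_guest_pages(current_brk, new_brk))
        return current_brk;   // could not grow; glibc then falls back to mmap()
    current_brk = new_brk;
    return current_brk;
}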

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we take MIPS code and turn it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications against recompiled MIPS applications. Stay tuned!