YUV sampling in Vulkan – a niche and complicated feature – VK_KHR_sampler_ycbcr_conversion

Sometimes I like to implement Vulkan extensions in Granite just because. This time around we’re looking at YUV.

For anyone who has worked with multimedia before, YUV (or YCbCr as it’s also referred to) is a nightmare. The act of converting YUV to RGB is very simple as it’s just a color space transform with a 3×3 matrix, but YUV means many things. There is no end to how overloaded YUV can be, and making sure you know exactly which YUV flavor you’re dealing with can be quite tricky. From a system integration point of view, YUV is a serious pain.

When dealing with video content in Vulkan, you will have to consume YUV somehow.

YUV has some characteristics which all make sense for video compression, but complicate things.

  • Planar: Each color component is often packed in different 2D images.
  • Luma/Chroma: Y refers to luminance, UV (or CbCr) refers to chrominance (color). This is critical for video compression.
  • Downsampled chroma: Human eyes have higher resolution for luminance than for color, so spending less bandwidth on color is an easy way to save space.

There is a lot of variance here between different YUV formats, as there is no obvious standard for these kinds of things.

  • How many planes? 2 or 3 planes are common: (Y, U, V) vs. (Y, packed UV). Packed single-plane formats (YUYV and variations on that) can be used in some other, more obscure scenarios.
  • For packed representations, which component comes first?
  • How many bits per component? 8 is the most common, but 10-bit YUV content can be found in the wild.
  • How much is chroma downsampled? 2x horizontally and vertically is by far the most common, often referred to as YUV420p (the naming convention in YUV makes no sense, don’t try to find any).
  • Where is the texel center for the chroma samples? Common values are co-sited at every other luma sample, or in the mid-point between groups of 2×2 luma samples.
  • What is the exact color space conversion matrix from YUV to RGB?
  • How is chroma reconstructed to full resolution?

Dealing with YUV without fancy extensions

Rendering a YUV frame in RGB is not necessarily a difficult task, but depending on how many formats you need to deal with, shader variants can quickly get out of hand. You’ll typically do something like this:

layout(binding = 0) uniform sampler2D TexLuma;
layout(binding = 1) uniform sampler2D TexCb;
layout(binding = 2) uniform sampler2D TexCr;

layout(location = 0) out vec3 FragColor;
layout(location = 0) in vec2 TexCoord;

const mat3 yuv_to_rgb_matrix = mat3(....);

void main()
{
    float Luma = textureLod(TexLuma, TexCoord, 0.0).x;
    float Cb = textureLod(TexCb, TexCoord, 0.0).x; // For mid-point chroma
    float Cr = textureLod(TexCr, TexCoord, 0.0).x;
    vec3 yuv = vec3(Luma, Cb, Cr);
    // Possibly expand range here if using TV YUV range and not PC YUV range.
    yuv = rescale_yuv(yuv);
    FragColor = yuv_to_rgb_matrix * yuv;
}

This is fine, but there is some motivation to do better here. Since watching video is one of the most common operations a GPU does, it would be nice if we could do this more efficiently, especially on low-power mobile devices with batteries attached to them. It would also be really nice if we could sample a video texture in our shader and have the GPU just “deal with it”.

The planar texture formats

VK_KHR_sampler_ycbcr_conversion adds a lot of new texture formats to Vulkan. For YUV420p, we’re going to look at this format, VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM.

In planar formats, we see that 3 color components are spread out across 3 planes, with a “420” here to mean that the second and third components are half resolution. “GBR” is a bit strange, but G refers to Y, and B and R refer to Cb and Cr respectively. Green is the strongest contributor to the luma Y channel after all, so I guess it kinda makes sense like that …

Planar texture formats mean that the texture unit of the GPU is able to sample all three planes in one operation, putting a lot less stress on the GPU texturing unit. If we think of rendering video, this is a large part of the work done on the GPU. For lower-end mobile devices rendering 4K video without melting, this is actually really important.

The image aspects

In order to be able to refer to each plane separately when copying data in and out of the texture, we can use VK_IMAGE_ASPECT_PLANE_{0,1,2}_BIT. When copying to and from planes 1 and 2 in YUV420p, the resolutions of those planes are halved. VK_IMAGE_ASPECT_COLOR_BIT refers to the “GBR” image as a whole, and it’s only useful when sampling the image.
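
For example, uploading YUV420p data from a staging buffer could look roughly like this (a sketch; cmd, staging_buffer, image and the sizes/offsets are hypothetical names):

// Upload each plane separately by addressing it with its plane aspect.
// Planes 1 and 2 (Cb and Cr) use half the luma resolution for YUV420p.
VkBufferImageCopy regions[3] = {};

regions[0].bufferOffset = 0;
regions[0].imageSubresource = { VK_IMAGE_ASPECT_PLANE_0_BIT, 0, 0, 1 };
regions[0].imageExtent = { width, height, 1 };

regions[1].bufferOffset = luma_size;
regions[1].imageSubresource = { VK_IMAGE_ASPECT_PLANE_1_BIT, 0, 0, 1 };
regions[1].imageExtent = { width / 2, height / 2, 1 };

regions[2].bufferOffset = luma_size + chroma_size;
regions[2].imageSubresource = { VK_IMAGE_ASPECT_PLANE_2_BIT, 0, 0, 1 };
regions[2].imageExtent = { width / 2, height / 2, 1 };

vkCmdCopyBufferToImage(cmd, staging_buffer, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 3, regions);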

Disjoint image allocation

Since we have 3 image planes, it is not unreasonable to want to combine three separate images into one planar image. E.g., we can allocate three R8_UNORM images and treat them as a planar image later. This is supported, but we have to create and allocate the images in a particular way.

First, we create the R8_UNORM images with VK_IMAGE_CREATE_ALIAS_BIT. This means that we intend to alias the image meaningfully with another image, even with OPTIMAL tiling, and image layouts are shared across aliases! We’ll use this to alias with a plane inside the planar texture later.

Be aware of alignment requirements. I’ve found that planar textures can need larger alignment than the single-component textures do, so either using standalone allocations per plane, or bumping alignment to something like 64k, works around that.

When we create the planar texture, we specify DISJOINT_BIT and ALIAS_BIT. For disjoint, it means we need to query allocation requirements and bind memory separately for each plane. To do this, we need to use vkGetImageMemoryRequirements2 and vkBindImageMemory2. Here, we just bind the same memory we used for our separate textures.
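
A rough sketch of the disjoint query/bind sequence, assuming the per-plane memory was allocated earlier as plane_memory[3]:

VkBindImagePlaneMemoryInfo plane_infos[3];
VkBindImageMemoryInfo bind_infos[3];

for (unsigned i = 0; i < 3; i++)
{
    // Query size/alignment for this plane; our standalone allocations must satisfy this.
    VkImagePlaneMemoryRequirementsInfo plane =
        { VK_STRUCTURE_TYPE_IMAGE_PLANE_MEMORY_REQUIREMENTS_INFO };
    plane.planeAspect = static_cast<VkImageAspectFlagBits>(VK_IMAGE_ASPECT_PLANE_0_BIT << i);

    VkImageMemoryRequirementsInfo2 info =
        { VK_STRUCTURE_TYPE_IMAGE_MEMORY_REQUIREMENTS_INFO_2 };
    info.pNext = &plane;
    info.image = planar_image;

    VkMemoryRequirements2 reqs = { VK_STRUCTURE_TYPE_MEMORY_REQUIREMENTS_2 };
    vkGetImageMemoryRequirements2(device, &info, &reqs);

    // Bind the same memory we used for the standalone R8_UNORM images.
    plane_infos[i] = { VK_STRUCTURE_TYPE_BIND_IMAGE_PLANE_MEMORY_INFO };
    plane_infos[i].planeAspect = plane.planeAspect;

    bind_infos[i] = { VK_STRUCTURE_TYPE_BIND_IMAGE_MEMORY_INFO };
    bind_infos[i].pNext = &plane_infos[i];
    bind_infos[i].image = planar_image;
    bind_infos[i].memory = plane_memory[i];
    bind_infos[i].memoryOffset = 0;
}

vkBindImageMemory2(device, 3, bind_infos);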

Setting up a sampler conversion object

The vkCreateSamplerYcbcrConversion function creates an object which encodes exactly how we will convert the planar components into RGB values.

VkSamplerYcbcrConversionCreateInfo info = { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_CREATE_INFO };

// Which 3x3 YUV to RGB matrix is used?
// 601 is generally used for SD content.
// 709 for HD content.
// 2020 for UHD content.
// Can also use IDENTITY which lets you sample the raw YUV and
// do the conversion in shader code.
// At least you don't have to hit the texture unit 3 times.
info.ycbcrModel = VK_SAMPLER_YCBCR_MODEL_CONVERSION_YCBCR_709;

// TV (NARROW) or PC (FULL) range for YUV?
// Usually, JPEG uses full range and broadcast content is narrow.
// If using narrow, the YUV components need to be
// rescaled before it can be converted.
info.ycbcrRange = VK_SAMPLER_YCBCR_RANGE_ITU_NARROW;

// Deal with order of components.
info.components = {
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
	VK_COMPONENT_SWIZZLE_IDENTITY,
};

// With NEAREST, chroma is duplicated to a 2x2 block for YUV420p.
// In fancy video players, you might even get bicubic/sinc
// interpolation filters for chroma because why not ...
info.chromaFilter = VK_FILTER_LINEAR;

// COSITED or MIDPOINT? I think normal YUV420p content is MIDPOINT,
// but not quite sure ...
info.xChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;
info.yChromaOffset = VK_CHROMA_LOCATION_MIDPOINT;

// Not sure what this is for.
info.forceExplicitReconstruction = VK_FALSE;

// For YUV420p.
info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;

VkSamplerYcbcrConversion conversion;
table->vkCreateSamplerYcbcrConversionKHR(device, &info, nullptr,
                                         &conversion);

Passing along to VkImageView and VkSampler

Both an image view and a sampler which are used with KHR_sampler_ycbcr_conversion must have this sampler conversion object passed in via pNext. This is because the relevant information is split up between the two objects: planar and swizzle information is likely part of the image view, while filtering and chroma siting is likely part of the sampler object.
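
Concretely, the chaining looks something like this (a sketch; planar_image is the disjoint image from before, and error handling is omitted):

VkSamplerYcbcrConversionInfo conv_info =
    { VK_STRUCTURE_TYPE_SAMPLER_YCBCR_CONVERSION_INFO };
conv_info.conversion = conversion;

// The image view gets the conversion through pNext ...
VkImageViewCreateInfo view_info = { VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO };
view_info.pNext = &conv_info;
view_info.image = planar_image;
view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;
view_info.format = VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM;
view_info.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
VkImageView view;
vkCreateImageView(device, &view_info, nullptr, &view);

// ... and so does the sampler.
VkSamplerCreateInfo sampler_info = { VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO };
sampler_info.pNext = &conv_info;
sampler_info.magFilter = VK_FILTER_LINEAR;
sampler_info.minFilter = VK_FILTER_LINEAR;
sampler_info.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
sampler_info.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
VkSampler sampler;
vkCreateSampler(device, &sampler_info, nullptr, &sampler);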

Immutable sampler

Some information is also relevant for the shader compiler, so for YCbCr sampling there are some restrictions. You have to use a COMBINED_IMAGE_SAMPLER, and the sampler must be immutable in the descriptor set layout. Since that immutable sampler contains the conversion object, the shader compiler can see how to complete the transform where hardware support stops. This might mean just performing the YUV to RGB conversion, or it might mean the shader implements almost everything on its own.
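
In plain Vulkan terms, the descriptor set layout would look roughly like this (a sketch, reusing the sampler created above):

VkDescriptorSetLayoutBinding binding = {};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount = 1;
binding.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;
// The sampler created with VkSamplerYcbcrConversionInfo in its pNext chain.
binding.pImmutableSamplers = &sampler;

VkDescriptorSetLayoutCreateInfo layout_info =
    { VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO };
layout_info.bindingCount = 1;
layout_info.pBindings = &binding;
VkDescriptorSetLayout set_layout;
vkCreateDescriptorSetLayout(device, &layout_info, nullptr, &set_layout);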

In Granite, I pass along immutable sampler information in a rather crude way where I look at the declared name of a combined image sampler in the shader.

Putting it all together I can write shader code like this in Granite and sample YUV420p content directly:

#version 450

// "LinearYUV420P" guides reflection to use an immutable sampler.
// Not ideal since I need to know which YUV conversion to use ...
// But it works for now.
layout(set = 0, binding = 0) uniform sampler2D uSamplerLinearYUV420P;
layout(location = 0) in vec2 vUV;
layout(location = 0) out vec4 FragColor;

void main()
{
    // It's just like sampling a regular texture :D
    FragColor = textureLod(uSamplerLinearYUV420P, vUV, 0.0);
}

Getting fancy with hardware video decoding

The natural way to extend this is to use hardware video decoding, export each plane as external memory, and import it in Vulkan. DISJOINT allocation comes in handy here, since we don’t have to rely on a video decoder exporting a full planar image as one large unit. I haven’t looked into this myself yet, but I suggest looking at MPV here; there seems to be some support for this kind of zero-copy flow with Vulkan.

Was all of this really all that useful?

Not really, but I have some ideas for adding a “video texture material” to Granite in the future, so it might come in handy later. There might also be some use for it in graphics programming where bandwidth is saved by encoding post-process render targets to YUV. That way, sampling the YUV post targets shouldn’t require a lot of manual hackery.

Emulating a fake retro GPU in Vulkan compute

RetroWarp – a fake retro GPU

Lately, I’ve been fiddling with a side project which ended up being quite interesting.

The goal of this side project was to prototype a system which implements software rasterization in compute shaders, using modern GPU features like Vulkan 1.1 subgroups and async compute to improve performance. I then wanted to apply this to emulation of retro GPUs, in particular with a more low-level approach.

I believe compute shader rasterization has some key advantages in the domain of low-level emulation. Chasing full accuracy means not being able to make use of the key fixed-function aspects of the graphics pipeline on Vulkan GPUs, and most of the reasons to use fragment shaders go away. With compute, there is no fixed-function baggage to grapple with, but it does mean a lot of the things we take for granted must be implemented in software.

I didn’t aim to dive straight into a concrete retro GPU with this prototype; rather, I designed a straightforward rasterizer which supports the basic features found in old GPUs of that era. The idea is that this could be used as a starting point when going further and emulating a real legacy chip.

The repository is available on Github: https://github.com/Themaister/RetroWarp

The high level system

Rather than reiterate everything presented in the slide deck, I will link to it directly; still, it’s useful to discuss the system at a very high level.

Presentation slides

I presented this work at the Khronos Munich meetup in October 2019. You can find the presentation slides here.

Implementing Low Level GPU – Hans-Kristian – Munich 2019

Tile-based

Going tile-based is practically necessary for any compute shader implementation. I implemented 8×8 and 16×16 tile modes. Smaller tiles are more suitable for lower resolutions like 320×240 and 640×480, while 16×16 was useful for 720p and up.

If you are familiar with tile deferred shading and friends, you know where I’m going with this.

Coarse-then-fine binning

To be tile-based, we need to assign primitives to tiles. This is quite an intensive process when the tile size is small, as the time scales with resolution times the number of primitives. To optimize, I bin at a low resolution (e.g. 64×64 tiles) and then refine the binning at full tile resolution.

Bitmap instead of primitive list

A common way to bin is to build an array of primitives which affect a tile, and then the renderer can just loop through that array of indices on a per-tile basis. This is problematic in the worst case: when a lot of primitives end up filling the entire screen, there simply might not be enough memory available to store all these lists. We cannot allocate memory arbitrarily on the GPU, and we really want to do tile binning on the GPU and not the CPU.

Instead, each tile gets a fixed array of u32 bitmasks, where 1 bit is used per primitive, and bit-scan loops are used instead. To speed up the process when there are many gaps in the bitmap (there certainly are), there is a hierarchy: the coarse level marks a bit if any primitive in a group of 32 primitives is binned. If we find a bit set there, we go down the hierarchy and loop some more. A more concrete example (see the bit-scan sketch after the list):

  • Maximum of 16k primitives (arbitrary limit we choose)
  • Bitmap is u32[16k / 32] to contain all state
  • Coarse bitmap is u32[16k / (32 * 32)]
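A sketch of how this two-level bitmap is traversed (written as plain C++ here for clarity; in RetroWarp this is a compute shader loop, and the names are made up):

#include <cstdint>
#include <functional>

// Iterate over all primitives binned to one tile using the two-level bitmap.
// coarse[] has one bit per group of 32 primitives, fine[] has one bit per primitive.
void for_each_binned_primitive(const uint32_t *coarse, const uint32_t *fine,
                               unsigned max_primitives,
                               const std::function<void (unsigned)> &shade)
{
    unsigned coarse_words = max_primitives / (32 * 32);
    for (unsigned c = 0; c < coarse_words; c++)
    {
        uint32_t coarse_mask = coarse[c];
        while (coarse_mask)
        {
            unsigned group = 32 * c + __builtin_ctz(coarse_mask); // group of 32 primitives
            coarse_mask &= coarse_mask - 1u;

            uint32_t fine_mask = fine[group];
            while (fine_mask)
            {
                shade(32 * group + __builtin_ctz(fine_mask)); // primitive index
                fine_mask &= fine_mask - 1u;
            }
        }
    }
}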

If more than 16k primitives are used, we can just split this into multiple render passes. These old GPUs don’t exactly support indirect rendering, so that’s not really a problem.

Ubershader vs. split shader architecture

After binning, we could simply implement an ubershader of doom where we deal with any possible render state the GPU supports in one 5000+ line monster. This is very problematic for performance, particularly for register pressure/occupancy on the shader cores.

One of my key deviations from the norm here was to implement a split shader architecture. Rather than rely on ubershaders, it is possible for the depth/blending stage to consume pre-shaded tiles which contain the color/depth/coverage information necessary to run these stages.

To create color/depth/coverage information, we can generate indirect dispatches and use specialization constants to carve out the code paths we need to run instead. This keeps register pressure down. The key downside of this approach is that we need to allocate memory and bandwidth for the pre-shaded data.

Async compute + graphics queue compute

Depth/blending is the only stage which needs to happen in-order. We can happily do binning and shading and feed the results to the final shading stage. I run everything except for depth/blending in the async compute queue, and depth/blending can run in the graphics queue.

Performance uplifts

See presentation slides for more detailed results.

Subgroup optimizations gave a solid ~20% uplift on AMD/NV/Intel. Async compute gave further 10-20% uplift on AMD/NV. Overall, I’m quite happy with this.

Recreating the tone filter from NieR:Automata

Audio tech in games is rarely particularly interesting. Sadly, most of it seems to have been turned into a commodity over the last couple of decades. Once in a while, some games have very clever audio tech, and in this case it was NieR:Automata’s tone filter which caught my attention. The blog post explaining this tech is found here. If you haven’t played the game, it is highly recommended to read the post and watch the videos to understand what it’s doing; that saves me a lot of explanation. Their blog is very sparse on technical implementation details, but I wanted to try recreating it, as there was just enough high-level detail in there to get me started.

Outside graphics, I’ve done a fair bit of audio programming in the past, but it’s been too long since I did any significant audio DSP programming.

In short, the goal of the filter is to attempt to turn normal high-fidelity soundtracks into something with an 8-bit feel on-demand. Being able to introduce dynamic aspects to the music with a pure filter is interesting.

The tone filter as explained

The goal of the filter is to extract musical notes, and emphasize them. By having a few notes playing with a classic waveform like square or saw waves, we can recreate a retro 8-bit feel.

The blog describes 48 filters, spanning 4 octaves. Each octave in the (western music) scale is divided into 12 tones. What I deduced from this is that the 48 filters should be 48 very sharp bandpass filters.

Whiteboard description from blog

A theoretically perfect bandpass filter will output a pure sine wave if the tone exists, and nothing if it doesn’t.

The distortion is tough to do well, and I spent a lot of time fiddling with this. We want the sine wave to become something like a square wave. I tried many variants, but I ended up with something very simple, like

https://www.desmos.com/calculator/qvymx5qf8t

If you’ve done HDR tone-mapping, this formula will look very familiar to you. Sometimes cross-domain knowledge comes in handy.

After distorting, there is the levelling stage, which I spent a lot of time fiddling with. Basically, if we run this system as is, we end up with a ton of noise in the signal with all the chromatic tones playing over each other. Needless to say, this sounded absolutely terrible.

There are little to no details on how this should be implemented, so I tried a very crude model which seems to work reasonably well. Basically, each of the 48 channels has a running power estimate which is computed right after the filter. We can compare that against the running power estimate of the unfiltered audio. This gives us an idea of how much of the audio energy is concentrated in each individual tone. If the energy is low enough, the gain falls off in a power-of-4 fashion to avoid leaking in audio from completely unrelated tones. Percussion sounds will generally have energy across almost the entire audio spectrum, and we need to filter that out as well as we can. If the ratio is too high, we just cap it. This is the fiddly part. There are a lot of magic constants to tweak to get it sounding pleasing.
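
As a rough sketch of my interpretation (the constants are made up; the real values took a lot of tweaking):

// Gain for one of the 48 bands, based on how much of the total signal
// power is concentrated in that band.
float band_gain(float band_power, float total_power)
{
    const float threshold = 0.1f; // hypothetical tuning constant
    float ratio = band_power / (total_power + 1e-12f);

    if (ratio < threshold)
    {
        // Below threshold: fall off in a power-of-4 fashion so unrelated
        // tones and broadband percussion are muted hard.
        float x = ratio / threshold;
        return x * x * x * x;
    }
    else
    {
        // Above threshold: cap the contribution.
        return 1.0f;
    }
}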

At the end we mix our mono output signal back into the original audio, and when it works well, it gives a nice harmonic edge. I believe it’s reasonably close to the original game now. Here’s an example from the NieR:Automata OST. I’m visualizing all the 48 bands, and the colors are:

  • Blue: Below threshold, severely muted.
  • Green: Over threshold, should be heard.
  • Red: Saturated, hitting max threshold.

The tones in one octave form one row, and the four octaves are stacked on top of each other. The top-left starts at A3 – 220 Hz. If you know some music theory, maybe you can figure out which key the tune is in? 🙂

https://www.youtube.com/watch?v=yZvHErcuAUk

Implementation

First we mix stereo down to mono. This is kind of trivial. Just take the average of left and right channels.

Ultra-sharp bandpass, resonance filters

I went through a few failed iterations to get here. My first attempts were to do all of this in the frequency domain with FFTs, but that plan failed very quickly. What I ended up with in the end was a simple biquad resonance filter. This filter is characterized by having two zeroes and two poles in DSP parlance, or in other words, FIR (finite impulse response) and IIR (infinite impulse response). In code, this would look something like:

y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]

In the Z-domain, this looks like

H(z) = (n0 + n1 * z^-1 + n2 * z^-2) / (1 + d0 * z^-1 + d1 * z^-2)

The zeroes and poles occur where the roots of the polynomials go to zero in the numerator and denominator respectively. Basically, I designed the filter by deliberately placing zeroes and poles in the Z-domain, factoring the expressions out and converting it back to a normal FIR and IIR form.

I placed a zero at DC and the Nyquist frequency (w = pi). The poles were placed very close to the unit circle at w = +/- 2 * pi * freq / samplerate, and amplitude 0.9999. Then I evaluated the filter response at the resonance frequency and adjusted the FIR portion of the filter so that we got an estimated unit gain at the resonance frequency.
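
A sketch of that design process (my own reconstruction, not the exact Granite code):

#include <cmath>
#include <complex>

struct Biquad { float n0, n1, n2, d0, d1; };

// Zeroes at DC and Nyquist, poles near the unit circle at +/- w,
// then rescale the FIR part for ~unit gain at the resonance frequency.
Biquad design_resonator(double freq, double sample_rate, double pole_radius = 0.9999)
{
    const double pi = 3.14159265358979323846;
    const double w = 2.0 * pi * freq / sample_rate;

    // Numerator (1 - z^-2): zeroes at z = +1 (DC) and z = -1 (Nyquist).
    double n0 = 1.0, n1 = 0.0, n2 = -1.0;

    // Denominator 1 - 2 r cos(w) z^-1 + r^2 z^-2: conjugate poles at r * e^(+/- jw).
    double d0 = -2.0 * pole_radius * std::cos(w);
    double d1 = pole_radius * pole_radius;

    // Evaluate |H| at the resonance frequency and normalize the FIR portion.
    std::complex<double> z1 = std::polar(1.0, -w); // z^-1 on the unit circle
    std::complex<double> z2 = z1 * z1;             // z^-2
    double gain = std::abs((n0 + n1 * z1 + n2 * z2) / (1.0 + d0 * z1 + d1 * z2));

    return { float(n0 / gain), float(n1 / gain), float(n2 / gain),
             float(d0), float(d1) };
}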

Basically, the frequency response at the resonance frequency will be very close to dividing by zero, so near-infinite response, but not quite. Numerical stability can easily throw off the filter if we’re not careful. This is one of the major issues with IIR filters in general. I initially tried an 8-pole filter but it was impossible to get this stable even in FP64, so I just gave up and tried a simple biquad instead which worked just fine.

SIMD

Since we’re doing 48 IIR filters in parallel, this was a perfect case for SIMD optimizations. I made everything into a struct-of-arrays (SoA) form, and just vectorized the scalar IIR filter directly. Normally, small IIR filters are tricky to vectorize since there are inter-dependencies between samples, but not here.
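
A sketch of the idea with SSE, assuming an SoA layout with 4 of the 48 bands per vector (not the actual Granite data structures):

#include <xmmintrin.h>

struct ToneFilterSoA
{
    // Coefficients and output history per band, stored SoA for vectorization.
    alignas(16) float n0[48], n1[48], n2[48], d0[48], d1[48];
    alignas(16) float y1[48], y2[48]; // y[t - 1], y[t - 2] per band
};

// Run one input sample through all 48 resonators. The input history
// (x, x1, x2) is shared, since every band filters the same mono signal.
void process_sample(ToneFilterSoA &f, float x, float x1, float x2, float *out)
{
    __m128 vx = _mm_set1_ps(x);
    __m128 vx1 = _mm_set1_ps(x1);
    __m128 vx2 = _mm_set1_ps(x2);

    for (int i = 0; i < 48; i += 4)
    {
        __m128 acc = _mm_mul_ps(_mm_load_ps(&f.n0[i]), vx);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(&f.n1[i]), vx1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(&f.n2[i]), vx2));
        acc = _mm_sub_ps(acc, _mm_mul_ps(_mm_load_ps(&f.d0[i]), _mm_load_ps(&f.y1[i])));
        acc = _mm_sub_ps(acc, _mm_mul_ps(_mm_load_ps(&f.d1[i]), _mm_load_ps(&f.y2[i])));

        _mm_store_ps(&f.y2[i], _mm_load_ps(&f.y1[i]));
        _mm_store_ps(&f.y1[i], acc);
        _mm_store_ps(&out[i], acc);
    }
}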

I optimized the filter in NEON, SSE1 and AVX and got a very nice performance boost, more on that later.

This would have been a great case for ISPC, but I considered it a too large dependency for something simple like this.

Distortion

The distortion function must be nicely SIMD-friendly and not too expensive. I landed on the classic x / (1 + abs(x)) operator. The divide can be done fast with reciprocal estimates; we don’t need high accuracy.
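
With SSE that can look roughly like this (a sketch):

#include <xmmintrin.h>

// Soft-clipper x / (1 + |x|), using the fast reciprocal estimate.
static inline __m128 distort(__m128 x)
{
    __m128 abs_x = _mm_andnot_ps(_mm_set1_ps(-0.0f), x); // clear the sign bit
    __m128 denom = _mm_add_ps(_mm_set1_ps(1.0f), abs_x);
    return _mm_mul_ps(x, _mm_rcp_ps(denom)); // ~12-bit accuracy is plenty here
}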

Slight low-pass

After we have mixed together the 48 distorted streams, we run a weak low-pass filter on top to remove some of the harshest harmonics. This is done with a trivial 1-pole IIR filter.

Performance

I tested performance on a Ryzen 7 1800x @ 3.8 GHz as well as a high-end phone (Galaxy S9 Exynos) to measure NEON performance. The benchmark pushes 20 million white noise samples through the filter and then times the result. The test doesn’t take that long, so this should be assumed to be absolute peak performance without any thermal / power consideration. The results below are given in samples processed per second. Normal audio clips are 44.1 kHz, so 0.041 M/s should correspond to 1x real-time performance. The C++ version is written without any intrinsics with -O3 -ffast-math. The SIMD versions are written with the standard intrinsics.

Chip                    | C++     | SSE     | AVX      | AArch64 NEON
Samsung Exynos 9810     | 1.8 M/s | n/a     | n/a      | 6.8 M/s
Ryzen 7 1800x @ 3.8 GHz | 3.6 M/s | 7.1 M/s | 11.5 M/s | n/a

Basically, we’re at 100x real-time performance here, even on a mobile CPU. Nice. I’m surprised how close the performance ended up when comparing SSE and NEON. I didn’t see any auto-vectorization kick in for the C++ variant, so I wonder what is going on with just 2x scaling in SSE; I got similar results on MSVC and GCC for what it’s worth … NEON gets close to the ideal 4x scaling though, nice.

This uses quite a bit of processing power, so we can’t run wild with effects like this right now. But I look forward to being able to take advantage of systems like this for even more precise operations in the future.


The original implementation probably does more work on more gimped CPU hardware (AMD Jaguar consoles), but 100x real-time is pretty fast in my book. 😉

Source

The implementation is out there, but don’t expect to be able to use it as-is. This is a hobby project after all.

https://github.com/Themaister/Granite/blob/master/audio/dsp/tone_filter.cpp

https://github.com/Themaister/Granite/blob/master/tests/tone_filter_bench.cpp

VST plugin

I implemented a simple VST plugin with builds for Windows and macOS, both 64-bit. Feel free to try it out. It’s ultra bare bones.

EDIT: Old links were stale. Found the old code lying around in a private repo and made it public with the VST headers removed.

https://github.com/Themaister/ToneFilterVST

ToneFilterVST (Win64 build, untested)

An unusual recompiler experiment – MIPS to LLVM IR – Part 4

This is the final part in my blog series on my adventure recompiling MIPS to LLVM IR. If you’re new to this series you can read:

  • Part 1 – Explains the goals, MIPS ELF format, etc.
  • Part 2 – Explains how to generate code using the LLVM APIs.
  • Part 3 – Explains how we recompile MIPS code to LLVM.

In this post, I’m going to test performance on some applications and get a feel for how the various different codegen options we have can affect performance.

Due to our lack of extensive syscall support, there is a limit to what we can test without going out of our way to port stuff, so I’ll be focusing on some tests which don’t require much beyond simple stdio.

STB PNG read + write

This test is based on the STB library’s PNG implementation. The test will load a PNG file from disk and compress it again.

#include "stb_image.h"
#include "stb_image_write.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
	for (int i = 0; i < 20; i++)
	{
		int x, y, chan;
		stbi_uc *data = stbi_load("/tmp/test.png", &x, &y, &chan, 4);
		if (!data)
			return 1;

		if (!stbi_write_png("/tmp/output.png", x, y, 4, data, 4 * x))
			return 2;
	}

	return 0;
}

Native performance (32-bit)

To make the comparison a bit more fair, we’ll compile this using 32-bit x86 targeting i486 with -O3.

Time: 20.6 s

For reference, the target matters quite a lot: with x86-64, we get 15.38 s. I will use the i486 result as the baseline, since both i486 and MIPS I are ancient ISAs from around the same era of computing.

MIPS on-demand JIT (baseline)

To begin our benchmarking, we’re going to test fully on-line JIT-ing. This is what needs to happen at least the first time we’re running an application. The results here will be affected by a balance between optimization in run-time and having to do less work while JIT-ing.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf

In this first test, we will apply the following options:

  • Function calls will link directly to their targets. This increases JIT workload significantly, since we need to JIT all possible call paths to be able to link code directly. However, runtime should be faster once we have JIT-ed.
  • No IR optimizations are enabled.

Time: 71.43 s

The up-front cost of JIT-ing is quite high. But overall, 3.5x slower than native isn’t terrible. Let’s see if we can do better.

On-demand JIT with optimizations

The JIT-er can perform some in-place optimizations. We’ll see if it helps here.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --optimize

Time: 75.43 s

It seems like the optimization passes made it a bit slower.

On-demand JIT with thunked calls

Rather than aggressively JIT-ing all possible call paths, we can try just JIT-ing functions we are actually calling. All direct calls are translated into indirect calls, and every call requires a lookup. This should reduce the JIT overhead a lot, but potentially have worse runtime performance. Without --optimize, this should be the most efficient option if we want to avoid JIT overhead.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls

Time: 70.0 s

Interesting. This might be the sweet spot for on-demand JIT.

On-demand JIT with thunked all the things

We can also use thunked load-store operations, rather than emit IR code to translate addresses for every memory operation. This should reduce code bloat, and might help when we’re doing on-demand JIT.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls --thunk-load-store

Time: 90.4 s

Ouch.

Assuming well behaved calls and returns?

Unfortunately, my assumption that GCC would generate expected code for returns was wrong, or my implementation was buggy. I couldn’t get it to work for non-trivial test cases, so I can’t test performance here.

Ahead-of-time recompiled IR

Now we’re starting to get into interesting territory which I haven’t seen much of in the past.

We need to run the application here, dump IR code to disk, and recompile into a dynamic library. For this case, we should be able to generate pretty good code and avoid any run-time recompilation. This is the ideal scenario if we can deduce all known call-paths.

Let’s start with the optimal case. No thunking.

~/git/jitter/cmake-build-release/mipsvm --dump-llvm /tmp/llvm stb-test.elf

This dumps out a whopping 68 MB of LLVM IR. Time to turn this ball of mud into a dynamic library.

#!/bin/bash

OUTPUT="$1"
LLDIR="$2"

echo "== Linking LLVM IR =="
llvm-link -o __llvm_linked.bc "$LLDIR"/*.ll
echo "== Optimizing offline LLVM IR =="
opt -O3 -o __llvm_opt.bc __llvm_linked.bc -disable-inlining
#cp __llvm_linked.bc __llvm_opt.bc
echo "== Compiling static library to object file with LLC =="
llc -relocation-model=pic -filetype obj -o __linked.o __llvm_opt.bc -O3
echo "== Linking shared library =="
gcc -o "$OUTPUT" -shared __linked.o

rm -f __llvm_linked.bc
rm -f __llvm_opt.bc
rm -f __linked.o

This operation takes 53.4 seconds and generates a 3.4 MB binary. The original binary is 792 kB due to the statically linked glibc.

This should yield us the absolute best performance we can hope for. So let’s try it.

~/git/jitter/cmake-build-release/mipsvm test --static-lib ~/git/jitter/test_linked.so --static-symbols /tmp/llvm/addr.bin

Time: 46.2 s

That’s a pretty great improvement. Compared to 20 seconds for a native binary with 32-bit/i486. It starts up basically instantly since there is no recompilation necessary. We only need to recompile if we find new code we haven’t looked at yet.

From here, we can get a better idea of what runtime cost we have by removing optimizations, and adding thunking. Let’s see if opt -O3 helps at all by just going straight to llc.

Building the native binary just takes 24 seconds now.

Time: 57.0 s

opt -O3 is clearly doing something well. Let’s add back the optimization and use thunked calls. For thunked calls, the IR dump is just 21 MB. We can see here that we were JIT-ing out a lot of useless code we never had to actually run.

The binary is 984 kB now.

~/git/jitter/cmake-build-release/mipsvm stb-test.elf --static-lib ~/git/jitter/test_unlinked.so --static-symbols /tmp/llvm-nolink/addr.bin

Time: 60.0 s

The win from linking directly is nothing to sneeze at. 46.2s to 60.0 s. Let’s thunk the load-store calls and see where we get.

Time: 92.7 s

Yup. Clearly, we can get a 2x speedup by just inlining the load-store code and directly calling functions rather than rely on thunking. We’re not that far away from 2x differential from native code in the best case!

Best of both worlds codegen?

If we’re dumping code with thunking to disk to improve JIT overhead, we can imagine that we can optimize the thunked calls to direct code off-line if we write our own LLVM optimization pass. Just an idea …

We need to go one level deeper

Let’s try something silly. We will recompile a cross-compiled cross-compiler. What? Well, I’ve built SPIRV-Cross for MIPS big-endian this time around. This was actually a useful exercise, because I can now verify that SPIRV-Cross works for both MIPS and big-endian at the same time 😛 Nice. SPIRV-Cross uses C++11, a fair bit of STL and exceptions. Can we host libstdc++ properly? Let’s see. With a statically linked libstdc++, the binary is 3.2 MB.

Let’s dump some LLVM …

~/git/jitter/cmake-build-release/mipsvm spirv-cross --dump-llvm /tmp/llvm-spirv -- spirv-cross /tmp/test.spv

I tried running this with a test shader from the SPIRV-Cross repository. It makes use of FP64, so we can see if we support doubles. Here we also see that we can pass arguments through to the guest’s argv. We end up with 168 MB of LLVM IR, which sure is intense. Let’s recompile it. This process takes over 2 minutes and creates an 8.6 MB binary.

~/git/jitter/cmake-build-release/mipsvm spirv-cross --static-lib ~/git/jitter/test_spirv.so --static-symbols /tmp/llvm-spirv/addr.bin -- spirv-cross /tmp/test.spv

Now it runs almost instantly and correctly.

Conclusion

This has been a fun little side project. The overhead of JIT-ing is rather high as we would expect, but the peak runtime performance is surprisingly good. We’re in the 2-3x slower ballpark against natively compiled code for ahead-of-time compiled code. I haven’t tested a lot of code out there, but STB’s PNG implementation, SPIRV-Cross, glibc and libstdc++ should represent reasonably complex and varied code.

Release

I’ve released this project on GitHub under an MIT license. Please read the disclaimer.

An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Fasten your seatbelts, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top level code fundamentally needs to be able to translate the program counter (short-hand, PC) to an executable function pointer. We can choose a hash map (large address space) or flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.
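
Conceptually, something like this (a sketch; MIPSState, recompile_function_at and the hash map are placeholders for the real VM types):

#include <cstdint>
#include <unordered_map>

struct MIPSState;
using JITFunc = void (*)(MIPSState *);

// Recompiles a new LLVM module starting at 'pc' and returns its entry point.
JITFunc recompile_function_at(uint32_t pc);

static std::unordered_map<uint32_t, JITFunc> block_cache;

JITFunc mips_resolve_call_target(uint32_t pc)
{
    auto itr = block_cache.find(pc);
    if (itr != block_cache.end())
        return itr->second;

    JITFunc func = recompile_function_at(pc);
    block_cache.emplace(pc, func);
    return func;
}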

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to function pointer and do some analysis of code blocks we don’t have a function pointer for yet. Any i-cache invalidations will clear out the relevant function pointers which triggers hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.

For our purpose, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code should be rarer and rarer the more modern hardware we’re dealing with.

Recompiling a function

So, given a PC to execute, we’ll do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks do in LLVM.

Basic block

Basic blocks are represented as a starting PC and an end, where the execution flow is linear. The end of a basic block occurs where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.
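
In data structure terms, the analysis ends up producing something like this (the names are my own sketch, not the actual code):

#include <cstdint>

enum class Terminator
{
    ConditionalBranch, // two successors: taken and not taken
    DirectBranch,      // one successor, or a deduced tail call
    IndirectBranch,    // JR: return, jump table or tail call
    TailCall,
    Exit               // illegal instruction, etc.
};

struct BasicBlock
{
    uint32_t start_pc;
    uint32_t end_pc;            // includes the branch and its delay slot
    Terminator terminator;
    uint32_t taken_target;      // for direct and conditional branches
    uint32_t not_taken_target;  // fall-through for conditional branches
};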

Branch delay slots

A very important part of MIPS is the use of a branch delay slot. It is a very unique design aspect of the architecture, which is considered a design flaw today because it was hard-coded to help a very specific micro-architecture. Exposing micro-architecture details like this should be considered bad taste. Whenever a branch is taken, unconditionally or not, the next instruction is always executed. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}
00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is, what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in the MIPS ISA, because it is nonsensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if any exception or hardware interrupt happens in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is also a branch instruction, we know that it cannot legally be taken, and branch instructions cannot have side effects (except for jal/jalr, but those always take the branch, illegal!), so we just skip it.

Load delay slots

MIPS I also has a delay for loads. We cannot use the target register in the instruction following a load. However, for recompilation purpose we can ignore this. While clever code might attempt to abuse the fact that a target register for a load instruction hasn’t been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyways, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are

  • BEQ
  • BNE
  • BLEZ
  • BLTZ
  • BGEZ
  • BGTZ
  • BC1T (floating point compare)
  • BC1F (floating point compare)

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

  • Branch to basic block
  • Tail call to an unrelated function

If we mistakenly treat a J as a branch to a basic block when it should have been a tail call to a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}
00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with was: if the J target is also referred to as a basic block by conditional branches elsewhere, we can assume the J is a branch to a basic block, a la if/else or switch blocks. If not, we assume it’s a tail call.

There are other static branches we can find in MIPS. The conditional branches become static branches if the $0 register is used. This seems to be mostly useful with position-independent code, since we can branch to an address relative to our PC rather than to a fixed address as with J. We should try to detect these “static” branches as well. There is no need to analyze code which can never be executed.

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR
uint32_t target_pc = registers[instr.rs];
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    return_prediction_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

  • Returning from a function, almost always using “jr $ra”.
  • Jump tables
  • Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}
 00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is actually fairly ISA-agnostic, and it’s not that hard to change MIPS into something else once the basic structure is in place. The recompiler which sets all this up is completely MIPS-agnostic; it only asks “given a start PC, where does the basic block end, and what kind of basic block is it?”

Register allocation and branches

While we’re working on registers, we ideally want the MIPS registers to be reflected by our native hardware registers. It’s obvious a 1:1 mapping is not possible. MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not really have to care about register allocation when using LLVM. We just need to make sure we emit CreateStore/CreateLoad as little as possible, and LLVM should take care of the rest. Within a basic block, this is very easy, since we always know which SSA value a register refers to as the control flow is linear. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven’t used a register yet, load it from memory; if we modify a register, just replace the SSA value and remember that we eventually have to write it back to memory later, i.e. back to the register bank.

The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we will have a real hard time handling register dirty tracking. If we have, say, a basic block C which can be entered from basic blocks A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every possible register all predecessors of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb.

Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block or before a function call, all dirty registers are flushed to memory. On entry of a basic block, we have to load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register-starved architecture to begin with, and moving data to and from L1 cache is very common.

Call instructions

MIPS has several ways of “calling” functions. These functions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

  • JAL
  • JALR
  • BLTZAL
  • BGEZAL
  • J (deduced tail call)
  • BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPU’s branch predictor. The return instruction is one of the best prediction mechanisms we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch, unless we were the subject of a stack smash attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we just set a rational upper limit. In the worst case, where our return stack for some reason grows too large, we can use the nuclear option in our arsenal: longjmp! The top-level code uses setjmp, and if at any point we’ve reached a hopeless situation, longjmp unwinds the entire stack at once, and we can re-enter with our new PC. However, this is kinda terrible for performance, since all return instructions will now fail to optimize to a simple return, and we might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.
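
A sketch of how the VM-side helpers could look; the symbol names match what we’ll see in the generated IR later, but the bodies are my own reconstruction:

#include <cstdint>
#include <vector>

struct MIPSState;
using JITFunc = void (*)(MIPSState *);
JITFunc mips_resolve_call_target(uint32_t pc); // the PC -> function pointer lookup

static std::vector<uint32_t> return_stack;

// Called by JIT-ed code when executing a link instruction (JAL, JALR, ...).
extern "C" void __recompiler_predict_return(MIPSState *, uint32_t /*target*/, uint32_t return_pc)
{
    if (return_stack.size() < 1024) // the rational upper limit
        return_stack.push_back(return_pc);
}

// Called by JIT-ed code for JR. Returns null for a predicted return, otherwise
// a function pointer the generated code should tail-call.
extern "C" JITFunc __recompiler_jump_indirect(MIPSState *, uint32_t target_pc)
{
    if (!return_stack.empty() && return_stack.back() == target_pc)
    {
        return_stack.pop_back();
        return nullptr;
    }
    return mips_resolve_call_target(target_pc);
}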

To thunk or not to thunk

While indirect calls must have a lookup to determine what we are actually calling in runtime, it’s possible for direct call instructions to directly call another function in LLVM. In this case, we avoid any runtime lookups. We risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline anyways later 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}
004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

Doesn’t get much simpler to start with this test. main calls foo through a tail call, let’s see what the LLVM looks like completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state which we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple of other hidden registers), another 64 for the FP registers (32 + a couple extra), and finally there is the page table. We inline it in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo():

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call out to our externally defined function in the VM to check if return stack prediction worked. Either we tail call, or we simply return. $ra in this case will be 0, and we just end execution here.

The registers are dumped at the end to read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.

Loads and stores

Let’s have a look what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}
004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to stack, but we’ll see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

The good news is that this is very straightforward code, so the CPU should churn through most of it like butter, unless the page table reads miss in L1. It will be interesting to benchmark this code against natively compiled C code later.
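
For reference, here is roughly what that address translation boils down to in plain C++. The struct layout below is an assumption based on the %0 type used in the IR (two 64-entry register banks plus a 2^20-entry page table of 4 KiB pages), and the names are made up for illustration.

#include <stdint.h>

struct RegisterState
{
    uint32_t scalar_registers[64];
    uint32_t float_registers[64];
    uint8_t *pages[1u << 20]; // one pointer per 4 KiB page of the 32-bit address space
};

static inline uint32_t load32(RegisterState &state, uint32_t addr)
{
    // PageIndex = addr >> 12, PageOffsetMasked = (addr >> 2) & 1023
    uint32_t *page = reinterpret_cast<uint32_t *>(state.pages[addr >> 12]);
    return page[(addr >> 2) & 1023];
}

static inline void store32(RegisterState &state, uint32_t addr, uint32_t value)
{
    uint32_t *page = reinterpret_cast<uint32_t *>(state.pages[addr >> 12]);
    page[(addr >> 2) & 1023] = value;
}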

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}
00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop
define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, but this is unoptimized output after all. If we look at the x86-64 code, then as expected, it's pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here. What happens if we run the LLVM IR through opt first, though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq 

I suspect some of the issues are related to the lack of noalias attributes. LLVM might think that stores to virtual memory can alias with the register bank, and generate very conservative code. Something to have a look at later.
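
If it turns out to matter, a minimal sketch of the first thing I'd try is marking the register-state argument noalias when creating each recompiled function; whether this actually helps the aliasing picture here is untested.

// Hypothetical: mark parameter 0 (the register-state pointer) as noalias
// on the recompiled function we just created.
function->addParamAttr(0, llvm::Attribute::NoAlias);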

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.
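
As a sketch of the idea (the Instruction, Op and REG_RA names here are hypothetical, not the actual recompiler types): when this option is enabled and we are about to translate a jump, "jr $ra" simply becomes a return.

// Hypothetical codegen helper, assuming the surrounding recompiler structures.
void emit_jump(llvm::IRBuilder<> &builder, const Instruction &instr,
               bool assume_well_behaved_calls)
{
    if (assume_well_behaved_calls && instr.op == Op::JR && instr.rs == REG_RA)
    {
        // Statically translate "jr $ra" into a plain return. No
        // __recompiler_predict_return bookkeeping, no indirect-jump thunk.
        builder.CreateRetVoid();
        return;
    }

    // Otherwise, fall back to the __recompiler_jump_indirect path as before.
    // ...
}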

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move to and from integer registers, and fiddle with the control register. The control register controls rounding modes. I haven't bothered emulating correct rounding modes for now, but the control register is also used for floating point conditional branches, so it needs to be emulated at least. Just like SSE, the actual data type of an FP register can vary depending on the instruction, so we will need a lot of bitcasts; fortunately, bitcasting is a native construct in LLVM.

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}
004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things and keeping the register bank as integers at all times. Otherwise, the code-gen to LLVM IR is reasonably straightforward. In the generated x86-64 we end up seeing the magic instructions we want, buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...
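
As a sketch, emitting something like add.s while keeping the register bank as i32 might look like this; the load_fp_register/store_fp_register helpers are assumptions for illustration, not the real code.

// Hypothetical emission of add.s $fd, $fs, $ft with an integer register bank.
llvm::Value *s = builder.CreateBitCast(load_fp_register(builder, fs),
                                       llvm::Type::getFloatTy(context));
llvm::Value *t = builder.CreateBitCast(load_fp_register(builder, ft),
                                       llvm::Type::getFloatTy(context));
llvm::Value *d = builder.CreateFAdd(s, t, "add_s");
store_fp_register(builder, fd,
                  builder.CreateBitCast(d, llvm::Type::getInt32Ty(context)));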

Syscalls

To end this post on a less intense note, let's write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly, as compiled C code assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra
; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we simply create a hook into the VM host and handle them there. The syscall number goes in the $v0 register, and the arguments follow in $a0 and up as normal. To implement the write syscall we simply need to copy the data out of our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}
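
For reference, the hook itself can be a simple dispatch on $v0. This is a sketch with assumed names; only the syscall numbers come from the assembly above.

#include <cerrno>
#include <cstdlib>

void MIPS::syscall()
{
	switch (scalar_registers[REG_V0])
	{
	case 4001: // exit
		std::exit(scalar_registers[REG_A0]);

	case 4004: // write
		syscall_write();
		break;

	default:
		// Pretend the syscall failed, mirroring how syscall_write()
		// reports errors above.
		scalar_registers[REG_V0] = -1;
		scalar_registers[REG_A3] = ENOSYS;
		break;
	}
}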

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.
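
As an illustration of the brk() part, a minimal sketch could look like the following; the current_break member and the allocate_range method on the page allocator are assumptions, and the real implementation has more edge cases.

Address MIPS::syscall_brk(Address new_break)
{
	if (new_break == 0)
		return current_break; // brk(0) just reports the current break in this sketch

	if (new_break > current_break &&
	    !addr_space.allocate_range(current_break, new_break - current_break))
	{
		// Could not back the new range with pages; report the old break,
		// which glibc interprets as failure before falling back to mmap().
		return current_break;
	}

	current_break = new_break;
	return current_break;
}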

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we’re taking MIPS code and turning it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications vs recompiled MIPS applications, stay tuned!

An unusual recompiler experiment – MIPS to LLVM IR – Part 2

In part 1 we parsed a MIPS ELF file and set up a stack, we should now consider how to translate the MIPS machine code to LLVM IR. Before we get there, we need to learn how to use the LLVM APIs and how code-gen to LLVM IR works.

If you know LLVM well already, this post is not for you; wait for part 3 when we start considering the design of the recompiler. If you don't, this post should give you all the understanding needed to be able to generate some code in LLVM yourself and understand the implementation.

This turned out rather dry, but hey, it’s limited how exciting writing C++ to generate IR can be. 😀 At least it’s very educational.

Which LLVM version to target?

The LLVM JIT APIs are unfortunately rather unstable. The APIs appear to break between major versions. A common workaround to this seems to be to simply stick to a version and depend on that unless you’re ready to update. I got it working on the current LLVM 7, and also the upcoming LLVM 8, which seems to shuffle a lot of internal APIs around. For the sample here, we’re going to consider LLVM 7, which is the latest stable major version. The API changes as far as I know are only related to the JIT runtime, not codegen, so we can hide away the ugliness behind a clean interface at least.
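
As a sketch of the shape of that interface (matching how it is used later in this post, but not the exact class from the gists):

#include <llvm/IR/Module.h>
#include <memory>
#include <string>
#include <cstdint>

namespace JITTIR
{
class Jitter
{
public:
    // Expose a host function to JIT'ed code under the given symbol name.
    template <typename FuncT>
    void add_external_symbol(const std::string &name, FuncT *func);

    // Create a module tied to this Jitter's LLVMContext.
    std::unique_ptr<llvm::Module> create_module(const std::string &name);

    // Hand the module over; it is compiled to native code immediately.
    void add_module(std::unique_ptr<llvm::Module> module);

    // Look up the native address of a compiled or external symbol.
    uintptr_t get_symbol_address(const std::string &name);
};
}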

LLVM IR basics

To get us started, let’s make a trivial function, which adds two arguments together and returns it. In C, we would represent this as:

int LLVMAdd(int a, int b);

To do this, we need to create an LLVMContext, an LLVM module (sort of like a translation unit), and add a function to that module. We'll need to create some types as well.

// We'll need a ton of headers for later, might as well just add it now.
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/Core.h"
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"

void test()
{
  llvm::LLVMContext context;
  llvm::Module module("mymodule", context);

  llvm::Type *int32 = llvm::Type::getInt32Ty(context);
  llvm::Type *result_type = int32;
  llvm::Type *argument_types[] = { int32, int32 };
  bool vararg = false;

  llvm::FunctionType *function_type =
    llvm::FunctionType::get(result_type, argument_types, vararg);
  llvm::Function *function =
    llvm::Function::Create(function_type,
                           llvm::Function::ExternalLinkage,
                           "LLVMAdd", module);

  llvm::errs() << module << "\n";
}
; ModuleID = 'mymodule'
source_filename = "mymodule"

declare i32 @LLVMAdd(i32, i32)

For now, we have simply declared the function. If we do nothing more, this is equivalent to using “extern” in C. We can call it, but the linker needs to find the function elsewhere. Let’s expand it with some code.

Basic blocks

Basic blocks are the foundation of LLVM and a function consists of any number of basic blocks. Basic blocks represent linear control flow within a function. At the end of a basic block, the block must be terminated. Here, and only here can we do things like:

  • Branch to another basic block, conditionally or not
  • Return
  • And some other special cases

For our first test, we can get away with one basic block.

Static single assignment (SSA)

Another core principle of LLVM is the use of static single assignment. In short, it means that once you assign a value to a variable, it can never change. If a variable is modified multiple times, each assignment creates a new SSA value. You will see this when we start generating some code. This has various benefits for optimization and code-gen later in the pipeline, but that's LLVM's problem. Let's implement the add function.

llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
llvm::IRBuilder<> builder(bb);
llvm::Value *arg0 = &function->arg_begin()[0];
llvm::Value *arg1 = &function->arg_begin()[1];
llvm::Value *value = builder.CreateAdd(arg0, arg1, "added_value");
builder.CreateRet(value);

Here we add a basic block to our function. The first basic block is where the function starts executing. The IRBuilder is the class which lets you build actual code. Here we create an add instruction, and return it. This ends the basic block, and we’re done. Not too bad! If you print it using the existing line of code, you’ll now see:

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32) {
entry:
  %added_value = add i32 %0, %1
  ret i32 %added_value
}

Compiling IR to native code

We could just dump this directly into the JIT engine now, but let’s try compiling this as a dynamic library, using the offline tools. Copy this IR code into a file, e.g. test.ll. Let’s compile this to native code.

$ llc -relocation-model=pic -filetype obj -o test.o test.ll -O2
$ ld -shared -o test.so test.o
$ objdump -d test.so

0000000000001000 <LLVMAdd>:
    1000:       8d 04 37                lea    (%rdi,%rsi,1),%eax
    1003:       c3                      retq

LLVM can usually target other architectures with one binary, so try adding -mtriple=mips to llc, disassemble with the MIPS binutils, and you’ll get:

00000000 <LLVMAdd>:
   0:   03e00008        jr      ra
   4:   00851021        addu    v0,a0,a1

This code might confuse you: it's returning before the add? We'll get to that later; it's one of the weirder parts of MIPS, delay slots.

Branches

We’ll need some control flow. Let’s try implementing this function:

int foo(int a, int b, int c)
{
    if (a > 0)
        return b * c;
    else
        return b + c;
}
using namespace llvm;
// ...
// Remember to update this to take 3 arguments!
Type *argument_types[] = { int32, int32, int32 };
// ...
BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *cmp = builder.CreateICmpSGT(
    arg0,
    ConstantInt::get(Type::getInt32Ty(context), 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *mul = builder.CreateMul(arg1, arg2, "multiplied");
builder.CreateRet(mul);

builder.SetInsertPoint(false_path);
Value *add = builder.CreateAdd(arg1, arg2, "added");
builder.CreateRet(add);

Notice that the compare instruction is “signed greater than”. Normally, LLVM does not care about signedness, except for operations where this matters. To branch, we create a branch instruction, which ends the first basic block.

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %multiplied = mul i32 %1, %2
  ret i32 %multiplied

false:                                            ; preds = %entry
  %added = add i32 %1, %2
  ret i32 %added
}

And this becomes in x86-64 (sorry for the AT&T syntax):

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 06                   jle    a <LLVMAdd+0xa>
   4:   0f af f2                imul   %edx,%esi
   7:   89 f0                   mov    %esi,%eax
   9:   c3                      retq   
   a:   01 d6                   add    %edx,%esi
   c:   89 f0                   mov    %esi,%eax
   e:   c3                      retq 

Memory and pointers

We can’t always use pure SSA values, and we have to deal with good old memory. LLVM has an “interesting” approach to pointers, but let’s start easy. Loading and storing memory in LLVM is usually done in two stages:

  • Compute address with “get element pointer”
  • Load/store with CreateLoad/CreateStore

The backend is responsible for translating this into whatever memory addressing the CPU can deal with. Get element pointer translates very naturally to C-like pointer arithmetic: we don't need to compute byte offsets and all sorts of gunk, we index by array elements and struct members instead. Let's try an example with structs.

struct Foo
{
    int a;
    float b;
    int c;
    float d;
};

int LLVMAdd(Foo *foo)
{
    return foo[1].c += 10;
}
Type *int32 = Type::getInt32Ty(context);
Type *float32 = Type::getFloatTy(context);

Type *struct_types[] = {
    int32,
    float32,
    int32,
    float32
};

Type *struct_type = StructType::get(context, struct_types, false);
Type *p_struct_type = PointerType::get(struct_type, 0);

Type *result_type = int32;
Type *argument_types[] = { p_struct_type };

// ...
// ...

BasicBlock *bb = BasicBlock::Create(context, "entry", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *indices[] = {
    ConstantInt::get(int32, 1), // pointer -> array index
    ConstantInt::get(int32, 2), // struct -> member index
};
Value *ptr = builder.CreateInBoundsGEP(arg0, indices, "ptr");
Value *loaded = builder.CreateLoad(ptr, "loaded");
Value *added = builder.CreateAdd(
    loaded,
    ConstantInt::get(int32, 10), "added");
builder.CreateStore(added, ptr);
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd({ i32, float, i32, float }*) {
entry:
  %ptr = getelementptr inbounds { i32, float, i32, float }, { i32, float, i32, float }* %0, i32 1, i32 2
  %loaded = load i32, i32* %ptr
  %added = add i32 %loaded, 10
  store i32 %added, i32* %ptr
  ret i32 %added
}
0000000000000000 <LLVMAdd>:
   0:   8b 47 18                mov    0x18(%rdi),%eax
   3:   83 c0 0a                add    $0xa,%eax
   6:   89 47 18                mov    %eax,0x18(%rdi)
   9:   c3                      retq   

The mystical PHI node

I can’t really talk about SSA without mentioning PHI nodes. SSA values do not live in memory, so what happens when the control flow changes? Let’s look at an example function:

int foo(int a, int b, int c)
{
    int v;
    if (a > 0)
        v = b + c;
    else
        v = b - c;

    return v + a;
}

If we translate this directly to IR, we’ll see that v suddenly has two versions of it, but SSA says *single* static assignment. How is this resolved? A PHI node is used. The purpose of this node is to pull together multiple versions of a value into one, and you specify all basic blocks which can branch into your block. This is rather bizarre, since it’s kinda like an inverse goto.

Of course, we could translate this to memory allocated on the stack and just store to v rather than deal with this, and have LLVM optimize it for us. For completeness, let's try that first.

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *v = builder.CreateAlloca(int32, ConstantInt::get(int32, 1),
    "alloca");
Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
builder.CreateStore(bc_add, v);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
builder.CreateStore(bc_sub, v);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
Value *loaded = builder.CreateLoad(v, "loaded");
Value *added = builder.CreateAdd(loaded, arg0, "added");
builder.CreateRet(added);
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %alloca = alloca i32
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  store i32 %b_plus_c, i32* %alloca
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  store i32 %b_minus_c, i32* %alloca
  br label %merge

merge:                                            ; preds = %false, %true
  %loaded = load i32, i32* %alloca
  %added = add i32 %loaded, %0
  ret i32 %added
}

If we just give this to llc we see that it is sadly using stack space.

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 0d                   jle    11 <LLVMAdd+0x11>
   4:   01 d6                   add    %edx,%esi
   6:   89 74 24 fc             mov    %esi,-0x4(%rsp)
   a:   03 7c 24 fc             add    -0x4(%rsp),%edi
   e:   89 f8                   mov    %edi,%eax
  10:   c3                      retq   
  11:   29 d6                   sub    %edx,%esi
  13:   89 74 24 fc             mov    %esi,-0x4(%rsp)
  17:   03 7c 24 fc             add    -0x4(%rsp),%edi
  1b:   89 f8                   mov    %edi,%eax
  1d:   c3                      retq

What’s going on, we gave llc the -O3 option? Well, we didn’t optimize the IR first. This problem is extremely common when compiling C and C++ code of course, so we can optimize in the IR domain. Here we can do the classic “mem2reg” optimization, which replaces stack space with SSA values, which is much easier for the backend to deal with.

$ opt -O3 -o test.bc test.ll
$ llvm-dis test.bc -o optimized.ll
$ cat optimized.ll

; ModuleID = 'test.bc'
source_filename = "mymodule"

; Function Attrs: norecurse nounwind readnone
define i32 @LLVMAdd(i32, i32, i32) local_unnamed_addr #0 {
entry:
  %cmp = icmp sgt i32 %0, 0
  %3 = sub i32 0, %2
  %alloca.0.p = select i1 %cmp, i32 %2, i32 %3
  %alloca.0 = add i32 %1, %0
  %added = add i32 %alloca.0, %alloca.0.p
  ret i32 %added
}

attributes #0 = { norecurse nounwind readnone }

0000000000000000 <LLVMAdd>:
   0:   89 d1                   mov    %edx,%ecx
   2:   f7 d9                   neg    %ecx
   4:   85 ff                   test   %edi,%edi
   6:   0f 4f ca                cmovg  %edx,%ecx
   9:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   c:   01 c8                   add    %ecx,%eax
   e:   c3                      retq 

That’s pretty nifty, and it even replaced the branch with select, but it also rewrote the entire function because it found some other optimizations. Let’s not rely on opt and rewrite this without stack space using PHI.

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(bc_add, true_path);
phi->addIncoming(bc_sub, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %b_plus_c, %true ], [ %b_minus_c, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 07                   jle    b <LLVMAdd+0xb>
   4:   01 d6                   add    %edx,%esi
   6:   01 fe                   add    %edi,%esi
   8:   89 f0                   mov    %esi,%eax
   a:   c3                      retq   
   b:   29 d6                   sub    %edx,%esi
   d:   01 fe                   add    %edi,%esi
   f:   89 f0                   mov    %esi,%eax
  11:   c3                      retq

This is a very pure translation of our IR, nice.

Calling functions

This is quite straightforward, fortunately. Let's replace the two branches with actual function calls. You can probably tell by now what the C code is supposed to look like.

FunctionType *function_type = FunctionType::get(result_type, argument_types, vararg);
Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    module);
Function *true_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMTrue",
    module);

Function *false_function = Function::Create(
    function_type, 
    Function::ExternalLinkage,
    "LLVMFalse", module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *call_args[] = { arg0, arg1, arg2 };

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *true_call = builder.CreateCall(true_function, call_args);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *false_call = builder.CreateCall(false_function, call_args);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(true_call, true_path);
phi->addIncoming(false_call, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %3 = call i32 @LLVMTrue(i32 %0, i32 %1, i32 %2)
  br label %merge

false:                                            ; preds = %entry
  %4 = call i32 @LLVMFalse(i32 %0, i32 %1, i32 %2)
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %3, %true ], [ %4, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

declare i32 @LLVMTrue(i32, i32, i32)

declare i32 @LLVMFalse(i32, i32, i32)

Here we’re trying to call functions which haven’t been defined in the module. It is up to the linker to resolve this later. Let’s see what happens if we build this as a dynamic library.

0000000000001000 <.plt>:
    1000:       ff 35 02 30 00 00       pushq  0x3002(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1006:       ff 25 04 30 00 00       jmpq   *0x3004(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    100c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000001010 <LLVMFalse@plt>:
    1010:       ff 25 02 30 00 00       jmpq   *0x3002(%rip)        # 4018 <LLVMFalse>
    1016:       68 00 00 00 00          pushq  $0x0
    101b:       e9 e0 ff ff ff          jmpq   1000 <.plt>

0000000000001020 <LLVMTrue@plt>:
    1020:       ff 25 fa 2f 00 00       jmpq   *0x2ffa(%rip)        # 4020 <LLVMTrue>
    1026:       68 01 00 00 00          pushq  $0x1
    102b:       e9 d0 ff ff ff          jmpq   1000 <.plt>

Disassembly of section .text:

0000000000001030 <LLVMAdd>:
    1030:       53                      push   %rbx
    1031:       89 fb                   mov    %edi,%ebx
    1033:       85 ff                   test   %edi,%edi
    1035:       7e 0b                   jle    1042 <LLVMAdd+0x12>
    1037:       89 df                   mov    %ebx,%edi
    1039:       e8 e2 ff ff ff          callq  1020 <LLVMTrue@plt>
    103e:       01 d8                   add    %ebx,%eax
    1040:       5b                      pop    %rbx
    1041:       c3                      retq   
    1042:       89 df                   mov    %ebx,%edi
    1044:       e8 c7 ff ff ff          callq  1010 <LLVMFalse@plt>
    1049:       01 d8                   add    %ebx,%eax
    104b:       5b                      pop    %rbx
    104c:       c3                      retq  

The noise before the function is dynamic library gunk. If we had linked with --no-undefined, we would see:

ld: test.o: in function `LLVMAdd':
mymodule:(.text+0xa): undefined reference to `LLVMTrue'
ld: mymodule:(.text+0x15): undefined reference to `LLVMFalse'

This is basically how we will call into our emulator host code to do various things. Handle syscalls, deal with function calls to other emulated code, etc. The possibilities are endless now. When we JIT, we can pass function pointers of our own host code into the symbol resolver and the JIT can patch in straight function calls into the code. Nifty!

At this point, with some searching around, you should be able to figure out how to generate the IR you want. We’ve covered the basics I think.

JIT compilation

JIT-ing these llvm::Modules is in theory quite straightforward; there's just a lot of API noise to go through. We create:

  • llvm::LLVMContext
  • llvm::orc::ExecutionSession
  • llvm::orc::RTDyldObjectLinkingLayer
  • llvm::orc::IRCompileLayer<>
  • llvm::TargetMachine
  • llvm::orc::MangleAndInterner
  • llvm::DataLayout

There is too much code to deal with, just have a look at the gists I made instead for now 😀 This should be reusable if you want to play around with it.

For LLVM 7, you’ll want to define JITTER_LLVM_VERSION_LEGACY. For LLVM 8, don’t.

The API usage should be fairly simple.

int my_add(int a, int b) { return a + b; }

// ...

using namespace llvm;
JITTIR::Jitter jitter;

// Let JIT link against this symbol.
jitter.add_external_symbol("my_add", my_add);

// Allocate a module.
auto module = jitter.create_module("mymodule");
auto &context = module->getContext();

// Build the IR
Type *int32 = Type::getInt32Ty(context);
Type *result_type = int32;
Type *argument_types[] = { int32, int32 };
bool vararg = false;

FunctionType *function_type = FunctionType::get(
    result_type,
    argument_types,
    vararg);

Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    *module);

Function *my_add_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "my_add",
    *module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *args[] = { arg0, arg1 };
Value *added = builder.CreateCall(my_add_function, args, "added");
builder.CreateRet(added);

// Add the module to the JIT, it's immediately compiled.
jitter.add_module(std::move(module));

// Get function pointer to symbol and execute it.
auto *fn_ptr =
    reinterpret_cast<int (*)(int, int)>(jitter.get_symbol_address("LLVMAdd"));
printf("10 + 20 = %d\n", fn_ptr(10, 20));

Which should print 30.

(gdb) disas fn_ptr,fn_ptr+20
Dump of assembler code from 0x7fb9d7bcc000 to 0x7fb9d7bcc014:
   0x00007fb9d7bcc000:	push   %rax
   0x00007fb9d7bcc001:	movabs $0x55d0584c33ea,%rax
   0x00007fb9d7bcc00b:	callq  *%rax
   0x00007fb9d7bcc00d:	pop    %rcx
   0x00007fb9d7bcc00e:	retq   
   0x00007fb9d7bcc00f:	add    %al,(%rax) <- zero memory
   0x00007fb9d7bcc011:	add    %al,(%rax)
   0x00007fb9d7bcc013:	add    %al,(%rax)

This is just the bare minimum, simplest approach we can take to JIT compilation in LLVM, but it works. The code which calls external functions seems a bit strange, but it might be done this way so it's easy to patch in other call addresses later; I'm not sure.

Conclusion

Hopefully this was informative at least. In part 3, we will use these tools at our disposal and bring up a MIPS recompiler.

An unusual recompiler experiment – MIPS to LLVM IR – Part 1

While not graphics per se, recompilers in emulation are an interesting topic. They are notoriously complex and difficult to implement, but promise incredible performance improvements. While emulating weaker hardware is served well by interpreters, more powerful hardware needs recompilers to shine. Starting with the PlayStation 1 era of hardware, recompilers are common, and beyond that, recompilers are required to have any hope of reaching real-time performance.

This is a multi-part post. I’m not sure how many posts it will take, but there’s a lot of stuff to write about. In this round, I’ll introduce the experiment, we’ll parse a MIPS ELF file, and make it ready for execution in our emulated Linux environment.

What is the goal of a recompiler

The main purpose of the recompiler is to look at the foreign machine code an application is running, and convert it to equivalent machine code for the hardware you are running on.

Conversely, an interpreter looks at one instruction at a time and performs the action it requires, which wastes a lot of work on decoding instructions and branching dynamically to whatever code snippet needs to be executed.
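
To make that overhead concrete, an interpreter core is essentially a fetch/decode/dispatch loop like this heavily simplified sketch; CPUState, the register layout and the load32 helper are all made up for illustration.

#include <stdint.h>

struct CPUState
{
    uint32_t pc;
    uint32_t regs[32];
};

// Assumed memory helper reading a word from the emulated address space.
uint32_t load32(CPUState &state, uint32_t addr);

void interpret(CPUState &state)
{
    for (;;)
    {
        uint32_t instr = load32(state, state.pc); // fetch
        state.pc += 4;

        switch (instr >> 26) // decode the primary opcode field
        {
        case 0x08: // ADDI rt, rs, imm16
            state.regs[(instr >> 16) & 31] =
                state.regs[(instr >> 21) & 31] + int32_t(int16_t(instr & 0xffff));
            break;

        // ... one case per instruction, plus branch/load/store handling ...
        }
    }
}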

Isn’t this what Java and .NET runtimes do?

Basically, yes. Just replace “foreign machine code” with “byte code”.

The portability problem

Since a recompiler needs to target the raw machine code of the hardware you’re running on, this is a serious hazard for portability. Typical recompilers aiming to be portable need to write backends for all kinds of machines you want to run on, and deal with operating-system specific ABIs along the way. Today, the most relevant targets would be:

  • x86
  • x86-64/amd64
  • ARMv7 (older mobile devices)
  • AArch64 (newer mobile devices)

To target something exotic, like homebrew on consoles, you might have recompilers targeting PowerPC, which was very popular as well for two console generations.

This is too much, so I’ve been interested in the prospect of leveraging LLVM instead. If I can just target LLVM IR, LLVM could generate machine code in runtime (ORC JIT) for me, targeting any of these architectures. It is also a good excuse to learn some LLVM internals. We’ll be generating LLVM code with the C++ API, and pass that along to the JIT runtime to generate machine code on the fly.

Ahead-of-time recompilation and re-targeting?

If we target LLVM, we could in theory dump all LLVM IR code we have encountered to disk, optimize the code more aggressively (maybe some LTO), and build everything into a single dynamic library for any target we’d like. Once we have warmed up all the known code paths we should be able to avoid almost all run-time recompilation on subsequent runs. I wonder how practical it is, but it’s something I’d like to experiment with.

Patching speed critical sections with native C?

If we can dump LLVM IR to disk, it doesn't seem too far-fetched to replace functions at known addresses with our own native versions written in C or something.

Recompilation efficiency

A big advantage of having a dedicated recompiler is how quickly the code can be generated as it barely needs to qualify as a compiler to get the job done. LLVM is a complex beast which needs to target all sorts of use cases, and using it as a just-in-time compiler like this is going to create some performance issues. It will be interesting to see how slow it is in practice.

Why MIPS?

MIPS is found in lots of gaming console hardware from the 90s and early 00s.

  • PlayStation 1
  • Nintendo 64 (CPU and RSP)
  • PlayStation 2
  • PlayStation Portable

MIPS is also a very simple ISA to understand and get started with. The core, original MIPS I ISA is probably the simplest, practical ISA to learn, and it’s often part of the curriculum when studying for an electronics degree. As part of my undergrad digital design course, we hacked on a trivial MIPS CPU core in VHDL, which was very fun and educational.

What should we emulate?

I felt like doing something I could get results from quickly, so rather than emulating a full game console, I wanted to try pretending to be a MIPS Linux kernel, running fully fleshed-out MIPS ELF binaries compiled with GCC 8.2 backed by glibc and libstdc++. That way I could build my way up from running trivial test cases written in assembly without any run-time all the way to emulating complicated programs using STL and the modern C/C++ run-times. We can also get a much better picture of performance differences between a natively compiled C++ application compared to a recompiled one.

The MIPS we’re going to target is a 32-bit MIPS I with whatever extra instructions we need when running real applications. GCC can target MIPS I with -march=mips1, but there are a few instructions from MIPS II and up GCC will use anyways to run any glibc application, due to a couple extra fundamental features:

  • RDHWR – Reads a hardware register, used to query thread local storage (TLS). To run any non-trivial C program, we need this because of errno, which is defined by POSIX to be thread local.
  • LL/SC – Load linked and store conditional serve as the backbone for atomic operations. glibc needs them in a few places. We're not going to bother with threading, so we can trivially implement them as normal load/store (see the sketch after this list).
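
A sketch of what that trivial implementation amounts to; CPUState, load32 and store32 are hypothetical helpers here.

#include <stdint.h>

struct CPUState; // hypothetical emulator state
uint32_t load32(CPUState &state, uint32_t addr);                 // assumed helper
void store32(CPUState &state, uint32_t addr, uint32_t value);    // assumed helper

// LL behaves exactly like a normal 32-bit load.
uint32_t execute_ll(CPUState &state, uint32_t addr)
{
    return load32(state, addr);
}

// SC stores unconditionally and returns 1 (success) into rt, which is fine
// since we never emulate more than one thread.
uint32_t execute_sc(CPUState &state, uint32_t addr, uint32_t value)
{
    store32(state, addr, value);
    return 1;
}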

As for endianness, MIPS can support both little and big-endian. Little is the easiest to start with since it matches our target hardware, but we’ll want to support big-endian as well, as it is the default, and the only MIPS endianness I know of in the wild.

User-space Linux application – ELF parsing and syscalls

Our program will need to read an ELF file, set up a virtual 32-bit address space, and begin execution. We will need to implement the various Linux syscalls required to host a real application. Common Linux syscalls like:

  • open
  • read
  • write
  • exit
  • kill
  • llseek
  • mmap
  • munmap
  • brk

… and so on will be enough to get us over the printf stage. We do not have to concern ourselves with interrupts, CPU exceptions, memory-mapped I/O and other complicated things a game console emulator would have to deal with, fortunately.

Step #0 – Getting a MIPS cross compiler

The first step is getting some code to compile for MIPS. This took a surprisingly long time, as the PKGBUILDs for Arch Linux did not work properly due to some weird incompatibilities. To cut a long story short, I made some PKGBUILDs for Arch Linux which worked for me to create fully functional cross-compilers for both little and big-endian 32-bit MIPS. https://github.com/Themaister/MIPS-Toolchain-PKGBUILD/

To build a little-endian binary once you build the toolchain:

mipsel-linux-gnu-gcc -o test test.c -static -march=mips1 -O2
# CMake toolchain file for little endian.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR mipsel)
set(CMAKE_C_COMPILER mipsel-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER mipsel-linux-gnu-g++)
set(CMAKE_C_FLAGS "-mgp32 -march=mips1")
set(CMAKE_CXX_FLAGS "-mgp32 -march=mips1")
set(CMAKE_ASM_FLAGS "-mgp32 -march=mips1")
set(CMAKE_C_LINK_FLAGS "-static")
set(CMAKE_CXX_LINK_FLAGS "-static")

The -static flag is important as we do not want to have to deal with dynamically loading the C runtime in our ELF loader. Fortunately, static libgcc/libstdc++ seems to work just fine for our purpose here.

Step #1 – Parsing an ELF file

Before starting, I did not know much about ELF (Executable and Linkable Format). It is the executable format used on Linux and many other systems. It is surprisingly simple when we just want to run a statically linked executable like this. It is helpful to use the readelf tool (mipsel-linux-gnu-readelf) to study the ELF; this helps us understand what's going on.

# mipsel-linux-gnu-readelf -h mips.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0x400630
  Start of program headers:          52 (bytes into file)
  Start of section headers:          628984 (bytes into file)
  Flags:                             0x1007, noreorder, pic, cpic, o32, mips1
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 37

This is the ELF header, found in the first few bytes of the binary. The structure is defined in the system header "elf.h" on Linux, which is very handy when we want to write a parser. There are a few things we care about here:

  • We can verify that we have a MIPS binary, and in which endianness.
  • What the entry point address is, i.e. where do we start executing.
  • How to find the program headers
  • How to find the section headers

The program headers

The program headers contain information about where data is located in the binary and what to do with it. We only care about LOAD and TLS blobs.

# mipsel-linux-gnu-readelf -l mips.elf                                            

Elf file type is EXEC (Executable file)
Entry point 0x400630
There are 7 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  ABIFLAGS       0x000138 0x00400138 0x00400138 0x00018 0x00018 R   0x8
  REGINFO        0x000150 0x00400150 0x00400150 0x00018 0x00018 R   0x4
  LOAD           0x000000 0x00400000 0x00400000 0x813d4 0x813d4 R E 0x10000
  LOAD           0x082000 0x00492000 0x00492000 0x041e0 0x04f9c RW  0x10000
  NOTE           0x000114 0x00400114 0x00400114 0x00020 0x00020 R   0x4
  NOTE           0x000168 0x00400168 0x00400168 0x00024 0x00024 R   0x4
  TLS            0x0820b0 0x004920b0 0x004920b0 0x00010 0x00030 R   0x4

...

We see two LOAD blocks. One has the read/execute flags set; this is our code segment, which we load from Offset in the ELF into virtual address VirtAddr. The read/write one is the data segment, where global variables live. Note that FileSiz may be != MemSiz. This is to support zero-initialized global variables (.bss); they simply need memory allocated to them.

TLS is a bit special; it marks which global data needs to be thread local. Any data here needs to be copied out to a new allocation for each thread we create (we won't create any, but we still need to implement it).
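As a rough sketch of how the LOAD segments can be mapped into the emulated address space (assuming the ELF blob has already been read into memory; the copy_to_guest/zero_guest helpers are hypothetical stand-ins):

#include <elf.h>
#include <cstdint>
#include <cstring>

// Hypothetical address-space helpers; the real implementation lives in the emulator.
struct VirtualAddressSpace
{
	void copy_to_guest(uint32_t vaddr, const void *data, size_t size);
	void zero_guest(uint32_t vaddr, size_t size);
};

// Walk the program headers and copy the LOAD segments into the guest address space.
// FileSiz bytes come from the file, the rest up to MemSiz is zero-initialized (.bss).
void load_segments(const uint8_t *elf, VirtualAddressSpace &addr_space)
{
	Elf32_Ehdr ehdr;
	memcpy(&ehdr, elf, sizeof(ehdr));

	for (unsigned i = 0; i < ehdr.e_phnum; i++)
	{
		Elf32_Phdr phdr;
		memcpy(&phdr, elf + ehdr.e_phoff + i * ehdr.e_phentsize, sizeof(phdr));
		if (phdr.p_type != PT_LOAD)
			continue;

		addr_space.copy_to_guest(phdr.p_vaddr, elf + phdr.p_offset, phdr.p_filesz);
		if (phdr.p_memsz > phdr.p_filesz)
			addr_space.zero_guest(phdr.p_vaddr + phdr.p_filesz, phdr.p_memsz - phdr.p_filesz);
	}
}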

The section headers

The section headers don’t seem vital to execution, but they contain the symbol table, which is useful for debugging.

The virtual address space

Linux has a virtual address space, so we need to copy all relevant data out to our own virtualized address space. This is simply represented as a page table (4k pages) spanning the entire 32-bit address space. Through our syscall emulation, applications/glibc can use mmap() to either allocate memory dynamically or memory map files. We cannot assume a fixed address space layout if we want to emulate Linux binaries.

Having an address space like this means every memory access becomes indirect, which will certainly be a performance problem. There might be a better way if we abuse mmap(), but that sounds very hard.
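A minimal sketch of such a page table, lazily allocating 4k pages over the full 32-bit space (just the idea, not the actual implementation):

#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

class VirtualAddressSpace
{
public:
	enum : uint32_t { PageSizeLog2 = 12, PageSize = 1u << PageSizeLog2 };

	VirtualAddressSpace() : pages(1u << (32 - PageSizeLog2)) {}

	// Every guest access pays for one indirection: page lookup, then offset within the page.
	// Assumes naturally aligned accesses, which MIPS lw/sw require anyway.
	// (A big-endian guest on a little-endian host would need a byte swap here.)
	uint32_t load32(uint32_t addr)
	{
		uint32_t value;
		memcpy(&value, get_page(addr >> PageSizeLog2) + (addr & (PageSize - 1)), sizeof(value));
		return value;
	}

	void store32(uint32_t addr, uint32_t value)
	{
		memcpy(get_page(addr >> PageSizeLog2) + (addr & (PageSize - 1)), &value, sizeof(value));
	}

private:
	std::vector<std::unique_ptr<uint8_t[]>> pages;

	uint8_t *get_page(uint32_t page_index)
	{
		auto &page = pages[page_index];
		if (!page)
			page.reset(new uint8_t[PageSize]()); // Zero-initialized on first touch.
		return page.get();
	}
};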

Setting up the stack

Before we can call the entry point we must set up a stack. Normally, we would think this stack only contains "argc" and "argv", along the lines of:

int main(int argc, char **argv)

but we actually need to pass a lot more information. All of this extra information is used by the C runtime entry point, which seems to be called __start in MIPS. We allocate the stack space at the top of the virtual address space, and push some data to the stack. The $sp register will point to:

  • argc
  • argv argument #0 (char *)
  • argv argument #1
  • NULL // Terminates argv
  • environment variable key #0 (char *)
  • environment variable value #0 (char *)
  • environment variable key #1
  • environment variable value #1
  • NULL // Terminates envp

The environment variables are passed in like this, and the C runtime will parse them. However, there is more data we need to pass on the stack on Linux, which was rather surprising. glibc will crash deep inside its initialization if this is not done properly.

// ELF AUXV, see <elf.h>
stack_data.push_back(AT_PHDR);
stack_data.push_back(misc.phdr_addr);
stack_data.push_back(AT_PHENT);
stack_data.push_back(ehdr.e_phentsize);
stack_data.push_back(AT_PHNUM);
stack_data.push_back(ehdr.e_phnum);
stack_data.push_back(AT_PAGESZ);
stack_data.push_back(VirtualAddressSpace::PageSize);
stack_data.push_back(AT_ENTRY);
stack_data.push_back(ehdr.e_entry);
stack_data.push_back(AT_UID);
stack_data.push_back(getuid());
stack_data.push_back(AT_EUID);
stack_data.push_back(geteuid());
stack_data.push_back(AT_GID);
stack_data.push_back(getgid());
stack_data.push_back(AT_EGID);
stack_data.push_back(getegid());
stack_data.push_back(AT_RANDOM);
stack_data.push_back(stack_top); // Just point to something. glibc needs this.
stack_data.push_back(AT_NULL);

glibc needs a pointer to its own ELF program headers as well as some other information like user IDs and the page size. The program headers are used to set up TLS properly. It also needs a random number created by the Linux kernel, which it uses in early initialization to set up stack protection canaries.

Now, everything is set up, and we can start executing … I mean generating some LLVM IR … in part 2.

Why coverage based pixel art filtering is terrible

I recently wrote a blog post about how to properly filter pixel art in 3D, where I chose a cosine kernel as the low-pass filter. Please read through it to get a better idea of what I'm discussing here.

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

In this blog, I would like to analyze some alternative filters which we could potentially use, and try to explore the quality of the other methods from a signal processing standpoint.

“Ideal” sinc

The theoretical low-pass filter. Just as a reference.

Cosine

The filter kernel I chose as my filter.

d/dx (smoothstep)

I mentioned smoothstep in the previous post. In many implementations, smoothstep serves as the integral of the filter kernel, so to get the underlying filter to analyze, we take the derivative of smoothstep. Smoothstep is defined as 3x^2 - 2x^3. Its derivative is 6x - 6x^2, and if we rescale the kernel to the [-1, 1] range instead of [0, 1], we get a kernel proportional to 1 - x^2.

Triangle

The LINEAR filter. This is essentially the case we would get if we upscale pixel art a lot with NEAREST, then filter the result.

Rect

This is the worst low-pass filter ever as I mentioned in the last post. Interestingly, this is the filter we would use for coverage-based pixel art filtering if the texture and screen were aligned. (I don’t expect the result for rotated textures to be much different, but the rigorous analysis becomes a bit too hard for me). If our input signal is turned into a series of rects (pixel extent) ahead of time (which is what we do in our implementation), our coverage filter is basically the equivalent of convolving a rect against that signal. Basically, rect is the filter we would end up with if we cranked SSAA up to infinity.

One filter which implements this is (from what I can tell): https://github.com/libretro/slang-shaders/blob/master/retro/shaders/pixellate.slang

Windowed sinc (Lanczos2)

Lanczos2 is a fairly popular filter for image scaling. I’ve included it here to have some reference on how these filters behave in frequency space.

Filter response

Now, it’s time to drone on about the results. All of this on frequency response, stop bands, blah blah, is probably going to be too pedantic for purposes of graphics, but we can, so why not.

First, look at the response around 0.5 in the frequency response plot (the reference script at the end generates it). This is the Nyquist frequency. Our ideal sinc (or, to be pedantic, a 16-lobe Hamming-windowed sinc) falls off immediately after 0.5. This is obviously a completely impractical filter.

Another consideration is how fast we reach the “stop band”, i.e., when the filter response has rolled off completely. Cosine and smoothstep are better here, while Lanczos2 is a bit slower to roll off. This is because of the window function. Applying a window function is the same as “blurring” frequency space, so the perfect sinc is smearing out too far. Linear and Rect don’t hit their first low-point until twice the Nyquist :(.

As for stop-band attenuation, windowed sinc is the clear winner, but again, this filter is impractical for our purposes (analytic windowed sinc integral with negative lobes breaking bilinear optimization? just no). Cosine seems to have a slight edge over smoothstep, with about 2 dB improvement. 2 dB is quite significant (For reference, halving root-mean-square error is ~6 dB). Triangle kernel (LINEAR filter) is about 3 dB worse again, and rect sniffs glue alone in the corner. Triangle and rect look very similar because a triangle kernel is just two rects convolved with each other, and thus its frequency response is squared (the more you know).

Conclusion

Cosine is a solid filter for what it's trying to do, given our constraint that we must be able to analytically integrate whatever filter kernel we choose. The quality should be very similar to smoothstep, but I think I'll consider cosine the winner here with 2 dB better stop-band attenuation. It also has slightly better response in the pass-band, which will preserve sharpness slightly better. This should make sense, because the cosine kernel has a higher peak and rolls off faster to zero than the smoothstep kernel. This frequency response can be expanded or contracted based on how we scale the "d" parameter. Multiply it by 2, for example, and we have the equivalent of LOD bias +1.

Basically, coverage/area based filtering is pretty terrible from a signal processing standpoint. Rect is a terrible filter, and you should avoid it if you can.

Reference script

I captured this with Octave using this script for reference:

cosine_kernel = pi/4 * cos(pi/2 * (-1024:1024) / 1024);
windowed_sinc = sinc((-2048:2048) / 1024) .* sinc((-2048:2048) / 2048);
perfect_windowed_sinc = sinc((-64*1024:64*1024) / 1024) .* hamming(128 * 1024 + 1)';
rect_kernel = ones(1, 1024);
linear_kernel = conv(rect_kernel, rect_kernel) / 1024;

ramp = (0:2048) / 2048;
smoothstep_kernel = 3 * ramp - 3 * (ramp .* ramp);

cosine_fft = fft(cosine_kernel, 1024 * 1024)(1 : 16 * 1024);
windowed_fft = fft(windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
perfect_windowed_fft = fft(perfect_windowed_sinc, 1024 * 1024)(1 : 16 * 1024);
rect_fft = fft(rect_kernel, 1024 * 1024)(1 : 16 * 1024);
linear_fft = fft(linear_kernel, 1024 * 1024)(1 : 16 * 1024);
smoothstep_fft = fft(smoothstep_kernel, 1024 * 1024)(1 : 16 * 1024);

offset = 20.0 * log10(1024);

cosine_fft = 20.0 * log10(abs(cosine_fft)) - offset;
windowed_fft = 20.0 * log10(abs(windowed_fft)) - offset;
perfect_windowed_fft = 20.0 * log10(abs(perfect_windowed_fft)) - offset;
rect_fft = 20.0 * log10(abs(rect_fft)) - offset;
linear_fft = 20.0 * log10(abs(linear_fft)) - offset;
smoothstep_fft = 20.0 * log10(abs(smoothstep_fft)) - offset;

x = linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 16 * 1024);

figure
plot(x, cosine_fft, 'r', x, windowed_fft, 'g', x, rect_fft, 'b', x, linear_fft, 'b--', x, smoothstep_fft, 'k', x, perfect_windowed_fft, 'c-')
legend('cosine', 'lanczos2', 'rect', 'triangle', 'smoothstep', 'ideal sinc')
xlabel('frequency / sampling rate')
ylabel('Filter response (dB)')
xticks(linspace(0, 1024 * 16 * 1024 / (1024 * 1024), 65))

Pseudo-bandlimited pixel art filtering in 3D – a mathematical derivation

Recently, I’ve been playing Octopath Traveler, and I’m very disappointed with the poor texture filtering seen in this game. It is PS1-level, with severe shimmering artifacts, which ruins the nice pixel art. Tastefully merging retro pixel art with 3D environment is a very cool aesthetic to me, so I want it to look right. Taking a signal processing approach to the problem, I wanted to see if I could solve the issue in a mathematically sound way.

Correctly filtering pixel art is a challenge, especially in a 3D environment, because none of the GPU hardware assisted filtering methods work well. We have two GPU filters available to us:

  • NEAREST/POINT: Razor sharp filtering, but exhibits severe aliasing artifacts.
  • LINEAR: Smooth, but way too blurry.

Our goal is to preserve the pixellated nature of the textures, yet get an alias-free result. A classic solution for this problem is to pre-scale the texture by some integer multiple with NEAREST, then sample this texture with LINEAR filtering. While this works reasonably well, it costs a lot of extra VRAM and bandwidth to sample huge textures which mostly contain duplicated pixels. Duplicating pixels also puts a cap on how sharp the pixels can become. On top of that, straight LINEAR still has some level of aliasing, as LINEAR is not a very good low-pass filter with its triangular kernel.

A flawed assumption of LINEAR texture sampling is also that each sample point in the texture is basically a Dirac delta function, i.e. the sample value only exists right at the texel center and thus has infinitesimal area. The pixel art assumption is that each texel has proper area; the sample value exists across the entire texel. For bandlimited signals, i.e. "natural" images, the Dirac delta assumption is how you sample, but pixel art is not a sampled bandlimited signal.

Filtering textures in 3D means we need to work well in many different scenarios:

  • Magnification
  • Minification
  • Rotation
  • Scale
  • Anisotropy / uneven scaling (look at the texture at an angle)

For minification, we are going to rely on the existing texture filtering hardware to do mip-mapping and anisotropic filtering for us, and assume that this is a good enough approximation to the true solution. What we want to focus on is magnification, as this is where the blurring issues with LINEAR become obvious.

Aliasing from hell: it's even worse in motion. Zoomed in 4x with pixel duplication.

Blurry mess: straight LINEAR, in linear space. Zoomed in 4x with pixel duplication.

Smooth and sharp: here's my method. Zoomed in 4x with pixel duplication.

Bandlimited sampling

To understand the derivation, we must understand what bandlimited sampling means. All signals (audio or images) have a frequency response. Nyquist's theorem tells us that if we sample at some frequency F, there must be no frequency component above F / 2 in the signal, or it aliases. (F / 2 is also called the Nyquist frequency.) Our input signal, i.e. a series of Dirac delta functions, has an infinite frequency response. Therefore, we must convolve a low-pass filter with that signal to reject as much energy above the Nyquist frequency as possible before we sample it again. In LINEAR sampling, a triangular kernel is applied, which acts as a low-pass filter, but it is not very good at removing high frequencies. NEAREST effectively applies a rect filter, which is the worst low-pass filter you can have.

Triangle kernel: w(x) = 1 - |x| for |x| <= 1, 0 otherwise.

Rect kernel: w(x) = 1 for |x| <= 1/2, 0 otherwise.

For completeness, there exists a perfect low-pass filter: sinc. However, it is purely theoretical, as its width is infinitely large. A common approach is to limit the extent of the sinc using a cleverly chosen window function, but then we lose the perfect low-pass. We can get as close as we wish to the perfect low-pass by using a larger window, but it's rarely practical to go this far except for image/video resizing, where we are able to use separable filtering in multiple passes with precomputed filter coefficients. We will not have that luxury for texture filtering in a 3D setting.

Windowed sinc also gives us negative filter coefficients, which usually manifest themselves as "ringing" or "haloing" when filtering graphics. This is something we would like to avoid when filtering pixel art as well. We also need to avoid negative values in the kernel because we will rely on that property for a crucial optimization later.

We will just have to come to terms with the fact that we can never get perfect bandlimiting, and we need to be practical about choosing our filter, hence pseudo-bandlimited.

Picking a practical filter kernel

To sample pixel art, we actually need to apply two filters at once, which complicates things. First, we need to apply a rect kernel (NEAREST filter) to give our texel some proper area. We then apply a low-pass filter which will aim to band-limit the rect. Applying two filters after each other is the same as convolving them together. Convolution is an integral, so now we have some constraints on our filter kernel, because it needs to be cheap to analytically integrate. LUTs will be too costly and annoying to use.

A key point is that the low-pass filter kernel needs to adapt to our sampling rate of the texture. Basically, we need to be band-limited in screen space, not texture space. If we have a filter kernel

W(x)

we should be able to rescale it to

(1 / d) * W(x / d)

where d is the screen-space partial derivative of the texel coordinate in either X or Y.

// Get derivatives in texel space.
// Need a non-zero derivative since we need to divide it later.
vec2 d = max(fwidth(uv) * texture_size, 1.0 / 256.0);

For our filter kernel, we could pick between:

  • Triangle (piece-wise integration, annoying)
  • Rect (lol)
  • Polynomial (Easy to integrate)
  • Cosine (Easy to integrate)
  • Gaussian (No analytic integration possible)
  • Windowed sinc (negative lobes and super difficult integral, no thanks)

I chose a cosine kernel. If we think about Taylor expansions, a polynomial kernel and a cosine kernel are basically in the same ballpark. Cosine is not a perfect low-pass by any means, but it's pretty good for our purpose here.

The cosine kernel will be

W(x) = (pi / 4) * cos(pi / 2 * x) for |x| <= 1, 0 otherwise

The normalization factor is there to make sure the area of the kernel is 1. The rect kernel is

R(x) = 1 for |x| <= 1/2, 0 otherwise

To get our filter for a given d, we convolve the rect with the rescaled cosine kernel:

F(x) = integral over t of R(t) * (1 / d) * W((x - t) / d) dt

The integration boundaries need to be limited to the range of W and R. If we solve this, we get

F(x) = 1/2 * (sin(pi/2 * [(x + 1/2) / d]) - sin(pi/2 * [(x - 1/2) / d]))

The brackets denote a signed saturate, i.e. clamp(x, -1, 1).
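For reference, here is how I would evaluate that expression on the CPU to sanity-check it (my reading of the formula; the shader further down effectively evaluates the same thing to compute its bilinear shift):

#include <algorithm>
#include <cmath>

// Convolved rect/cosine kernel: x is the distance from the texel center in texels,
// d is the screen-space derivative. As d shrinks, this approaches a plain rect.
float bandlimited_rect(float x, float d)
{
	const float half_pi = 1.57079632679f;
	float hi = std::clamp((x + 0.5f) / d, -1.0f, 1.0f);
	float lo = std::clamp((x - 0.5f) / d, -1.0f, 1.0f);
	return 0.5f * (std::sin(half_pi * hi) - std::sin(half_pi * lo));
}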

Here’s how the kernel look for different d values:

d = 1:

d = 0.5:

d = 0.25:

d = 0.1:

Nice. As d shrinks, the filter kernel starts off fairly smooth, but sharpens into a smoothly rolling-off rect as we sample at higher and higher resolution.

The 2×2 filter (d <= 0.5)

From the filter kernel above, we can see that if d <= 0.5 (LOD = -1), the extent of the filter kernel is 1 pixel. This means we can implement the kernel as a simple 2×2 tap pattern, or as we shall see, a single bilinear filter tap. For the implementation, we are going to assume we are sampling between two texels, where x is the phase, in the range 0 to 1.

We can implement this as a simple lerp. Instead of evaluating two sines, we just evaluate one which implements the transition from 0 to 1: the lerp weight towards the second texel becomes 0.5 + 0.5 * sin(pi/2 * clamp((x - 0.5) / d, -1.0, 1.0)).

This is very similar to a smoothstep, which explains why smoothstep techniques for this kind of filtering work so well. sin might be rather expensive on your GPU targets, so instead we can use a simple Taylor expansion to get a very good approximation to the sine, e.g. sin(x) ≈ x - x^3 / 6 + x^5 / 120.

In fact, if we use smoothstep, we get a filter kernel which is the derivative of smoothstep. Now, if we compute the result for both the X and Y dimensions, we have lerp factors for both dimensions. Since our filter weights are all positive and our kernel is separable, we can make use of bilinear filtering to get the correct result.

// Get base pixel and phase, range [0, 1).
vec2 pixel = uv * texture_size - 0.5;
vec2 base_pixel = floor(pixel);
vec2 phase = pixel - base_pixel;

// We can resolve the filter by just sampling a single 2x2 block.

mediump vec2 shift = 0.5 + 0.5 * taylor_sin(PI_half * clamp((phase - 0.5) / min(d, vec2(0.5)), -1.0, 1.0));
vec2 filter_uv = (base_pixel + 0.5 + shift) * inverse_texture_size;

As d increases, we should no longer use our filter since we can only support up to d = 0.5, so we implement something akin to trilinear filtering, where we lerp between our ideal LOD = -1 result and LOD = 0, where we fully sample with a normal trilinear/anisotropic filter. This implementation only requires two bilinear texture lookups, one for the d <= 0.5 sampling and one for the d > 0.5 sampling. These two results need to be lerped.

The 4×4 filter (d <= 1.5)

Now, I’m heading into more interesting territory. While the good old smoothstep with fwidth is a well known hack, it cannot deal with larger kernels than 2×2. Using our success with replacing 4 filter taps with a single bilinear we’re going to continue implementing a 4×4 kernel with 4 bilinear taps. If we support a 4×4 filter kernel we can have our nice filter even for some slight minification as well. It’s going to require a lot of ALU, so we can split the implementation into the case where d <= 0.5, use the single tap, d <= 1.5, use this 4 tap method, otherwise, just sample normally. If we remove this case, we effectively have a “speed hack” mode for slower devices.

The filter coefficients for each element in the 4×4 grid are formed from the convolved kernel F derived above. Along one axis, the weights of the four texels at offsets -1, 0, +1 and +2 relative to the base texel are F(u + 1), F(u), F(u - 1) and F(u - 2).

u and v are the phase variables in range [0, 1] as mentioned earlier. Since the filter is separable, we can compute X and Y separately and perform an outer product to complete the kernel.

To evaluate the four F values, we only need to compute 5 sines (or Taylor approximations), not 8, because F(u) shares a sine evaluation with F(u + 1), and so on. For d = 1.5, the one-dimensional filter kernel spans the full four-texel footprint with smoothly rolling-off weights.

Since we compute X and Y separately, we end up with a cost of 10 Taylor approximations per pixel. Ouch, but GPUs crunch this like butter.
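A CPU-side sketch of the per-axis weight computation, under the same reconstruction as above (sharing the 5 boundary evaluations between the 4 weights); the function name and layout are mine, not Granite's:

#include <algorithm>
#include <cmath>

// u is the phase in [0, 1), d the screen-space derivative (0.5 < d <= 1.5 for this path).
// weights[] corresponds to texel offsets -1, 0, +1, +2 relative to the base texel.
void compute_axis_weights(float u, float d, float weights[4])
{
	const float half_pi = 1.57079632679f;

	// Integrated kernel evaluated at the 5 shared half-texel boundaries:
	// u + 1.5, u + 0.5, u - 0.5, u - 1.5, u - 2.5.
	float s[5];
	for (int k = 0; k < 5; k++)
		s[k] = std::sin(half_pi * std::clamp((u + 1.5f - float(k)) / d, -1.0f, 1.0f));

	// Each weight is the difference between two neighboring boundary values,
	// so all four weights are non-negative and sum to 1.
	for (int i = 0; i < 4; i++)
		weights[i] = 0.5f * (s[i] - s[i + 1]);
}

The full 4×4 kernel is then the outer product of the X and Y weight vectors, which the bilinear trick below collapses into 4 taps.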

Now, each 2×2 block of this kernel can be replaced with one bilinear lookup and a weight.

// Given weights, compute a bilinear filter which implements the weight.
// All weights are known to be non-negative, and separable.
mediump vec3 compute_uv_phase_weight(mediump vec2 weights_u, mediump vec2 weights_v)
{
	// The sum of a bilinear sample has combined weight of 1, we will need to adjust the resulting sample
	// to match our actual weight sum.
	mediump float w = dot(weights_u.xyxy, weights_v.xxyy);
	mediump float x = weights_u.y / max(weights_u.x + weights_u.y, 0.001);
	mediump float y = weights_v.y / max(weights_v.x + weights_v.y, 0.001);
	return vec3(x, y, w);
}

Is 4×4 worth it?

The difference between 4×4 and 2×2 is very subtle and hard to show with still images. The difference manifests itself around LOD -0.5 to 0.5, i.e. close to 1:1 sampling. 2×2 is sharper with more aliasing while the 4×4 kernel remains a little blurrier but alias-free. For most use cases, I expect the 2×2-only method to be good enough, i.e. the good old “smoothstep/sine & fwidth”, but now I’ve come to that conclusion through math and not random graphics hackery.

Performance

A quick and dirty check on RX 470 @ 1440p

  • Full screen with all pixels hitting 4×4 sampling case: 441.44 µs
  • Full screen with all pixels hitting 2×2 sampling case: 207.68 µs

Anisotropy

We naturally get support for anisotropy because we compute different filters for X and Y dimensions. However, once max(d.x, d.y) is too large, we must fall back to normal sampling. Either a very blurry trilinear or actual anisotropic filtering in hardware.

Here’s a shot where we gradually go from crisp texels to normal, very blurry trilinear.

The green region is where d <= 0.5 (1 bilinear fetch), blue is where 0.5 < d <= 1.5 (4 taps), and the red region is the regular trilinear fallback. The blue region fades towards the normal trilinear fallback to avoid any abrupt artifacts.

Code

The complete shader implementation (reusable GLSL header) can be found in: https://github.com/Themaister/Granite/blob/master/assets/shaders/inc/bandlimited_pixel_filter.h
A test project to play around with the filter: https://github.com/Themaister/Granite/blob/master/tests/bandlimited_pixel_test.cpp

Go forth and filter your pixels correctly.

 

Improving VK_KHR_display in Mesa – or, let’s make DRM better!

VK_KHR_display was recently added to Mesa, and I was very excited. Finally, we could get direct-to-display, lowest possible latency in Vulkan for programs like RetroArch (VK_KHR_display is not just for VR!). We have had support for KMS/GBM for a long time, and it works really well when you want to get the most direct access to the display at the cost of convenience. Since we have full control of how page flips happen we avoid many of the pitfalls with compositors. When they work well, you should get optimal conditions in full-screen, but it’s very hard to guarantee. We have no control if X or Wayland choose to go into a direct-to-display mode without compositors when we fullscreen the window surface. VK_KHR_display or the EGL equivalent is also the only realistic way to get good display performance on more embedded Linux systems which are fairly popular for emulation boxes. There’s no need for X or Wayland getting in the way when you have a dedicated, 10-foot UI setup.

We can also control how much total buffering we want. Sometimes we want 2 buffers where CPU emulation + GPU rendering needs to complete in 16.66 ms (great for retro console emulation), and sometimes we want 3 buffers where CPU, GPU and display can overlap (usually the case when running GPU-intensive emulation). Controlling this is almost impossible to do reliably with compositors. We synchronize vkAcquireNextImageKHR using a VkFence instead of a VkSemaphore. Most implementations do not actually support async AcquireNextImageKHR, and certainly not Mesa’s implementation of VK_KHR_display.
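A minimal sketch of that fence-based acquire (assuming 'device' and 'swapchain' already exist; a real implementation would reuse the fence, and error handling is omitted):

#include <vulkan/vulkan.h>

uint32_t acquire_blocking(VkDevice device, VkSwapchainKHR swapchain)
{
	VkFenceCreateInfo fence_info = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
	VkFence fence;
	vkCreateFence(device, &fence_info, nullptr, &fence);

	// Pass a fence instead of a semaphore, then block until the image is
	// actually ready. This keeps the number of frames in flight under our control.
	uint32_t image_index = 0;
	vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, VK_NULL_HANDLE, fence, &image_index);
	vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

	vkDestroyFence(device, fence, nullptr);
	return image_index;
}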

In EGL, we made use of GBM to allocate DRM buffers directly and pumped through our own page flips using the super low-level DRM API. In Vulkan however, we cannot go that low-level as we go through VK_KHR_display. There are tradeoffs. Nvidia for example supports VK_KHR_display on Linux, but not GBM. VK_KHR_display is a cleaner abstraction than raw DRM.

I had some issues with Mesa’s implementation of VK_KHR_display however, so I tried to fix them.

Lack of MAILBOX or IMMEDIATE present modes

This is the first glaring omission. In emulators, a key feature is being able to fast-forward. Usually, we also want fast-forwarding to be completely seamless. Are you playing an old console RPG and want to make random encounters less dull? Just hit fast-forward and blast through. Unfortunately, the Mesa implementation of VK_KHR_display does not support the present modes which facilitate this use case. In GL, we trivially support this by calling glXSwapInterval with 0 or 1, and it should "just work".

MAILBOX is preferred because it is tear-free, but IMMEDIATE is a good fallback too.
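A sketch of that selection policy using the standard WSI query (not specific to VK_KHR_display):

#include <vulkan/vulkan.h>
#include <vector>

// Prefer MAILBOX for tear-free fast-forward, fall back to IMMEDIATE,
// and finally FIFO, which is always supported.
VkPresentModeKHR pick_fast_forward_mode(VkPhysicalDevice gpu, VkSurfaceKHR surface)
{
	uint32_t count = 0;
	vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, nullptr);
	std::vector<VkPresentModeKHR> modes(count);
	vkGetPhysicalDeviceSurfacePresentModesKHR(gpu, surface, &count, modes.data());

	for (auto preferred : { VK_PRESENT_MODE_MAILBOX_KHR, VK_PRESENT_MODE_IMMEDIATE_KHR })
		for (auto mode : modes)
			if (mode == preferred)
				return preferred;

	return VK_PRESENT_MODE_FIFO_KHR;
}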

So, I tried patching^H^H^Hhacking this up in Mesa.

MAILBOX

Basically, the FIFO implementation revolves around drmModePageFlip. This queues up a framebuffer to be displayed on the next vblank, once the GPU has completed rendering to it. Apparently, the kernel tracks the rendering state of DRM buffers, and whatever semaphores you pass to vkQueuePresent don't seem to matter at all.

Now, one glaring problem with drmModePageFlip is that once you have queued up a flip, you cannot queue up another one until it has completed on the next vblank. The page flip itself can be polled through the DRM FD. The page flip event will say when the page flip happened, and which image was flipped in.

After the page flip, the VK_KHR_display implementation has a thread which checks if there are any queued-up frames which can be set up with drmModePageFlip, and that keeps the FIFO queue going. If there are multiple frames queued up, the first image queued by the application is selected for the next page flip. So, I implemented MAILBOX using a really basic idea: when queuing up a new page flip, pick the latest image queued by the application. Other frames which were queued up are transitioned back into the IDLE state, because they would never end up being displayed anyway. This worked wonderfully for my use case. Hundreds of FPS without tearing achieved.
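In pseudo-C++ terms, the selection logic boils down to something like this (the types are illustrative stand-ins, not Mesa's actual structures):

#include <cstdint>
#include <vector>

enum class ImageState { Idle, Queued, Displayed };

struct SwapchainImage
{
	ImageState state = ImageState::Idle;
	uint64_t present_serial = 0; // Incremented on every QueuePresent.
};

// Called when the previous flip has completed and we can program a new one.
// Returns the image to flip, or -1 if nothing is queued.
int select_mailbox_image(std::vector<SwapchainImage> &images)
{
	int best = -1;
	for (size_t i = 0; i < images.size(); i++)
		if (images[i].state == ImageState::Queued &&
		    (best < 0 || images[i].present_serial > images[size_t(best)].present_serial))
			best = int(i);

	if (best < 0)
		return -1;

	// Everything older than the winner will never be shown; release it back to the application.
	for (size_t i = 0; i < images.size(); i++)
		if (int(i) != best && images[i].state == ImageState::Queued)
			images[i].state = ImageState::Idle;

	return best;
}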

I also implemented a better AcquireNextImageKHR. If there are multiple IDLE state images, I picked the image which was presented earliest. This way, you get a round-robin-like model, whereas the old model just picked the first IDLE frame. This might work just fine for FIFO, but not for MAILBOX.

Another thing I did was to force at least 4 images for MAILBOX. Unlike the Wayland implementation, I didn’t force a 4-deep swapchain in minImageCount. It’s perfectly fine for a swapchain to return more images than what is being requested. This is probably an API wart. It is a bit awkward to not have different minImageCount queries per present mode …

IMMEDIATE

For this mode, you can have tearing, and I found a particular flag in the DRM API. You can pass down DRM_MODE_PAGE_FLIP_ASYNC flag to drmModePageFlip, and the page flip will just happen when it happens. Unfortunately, this only worked as expected on AMD. Apparently, Intel cannot support this flip mode.

Lack of seamless transition between present modes

One annoying aspect of Vulkan WSI is that you cannot just change the present interval on the fly. Once you have a swapchain, you cannot just call glXSwapInterval or similar to toggle vsync. What you need to do is create a new VkSwapchainKHR, setting oldSwapchain so that the images in the old swapchain can hopefully be reused, and then delete the old swapchain. Assuming a decent implementation, you should get a seamless transition over to the new present mode.
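In code, the application side of this looks roughly like the following (a sketch; 'info' is assumed to be the create-info used for the current swapchain, and error handling is omitted):

#include <vulkan/vulkan.h>

VkSwapchainKHR switch_present_mode(VkDevice device, VkSwapchainCreateInfoKHR info,
                                   VkSwapchainKHR old_swapchain, VkPresentModeKHR new_mode)
{
	info.presentMode = new_mode;
	info.oldSwapchain = old_swapchain; // Lets the implementation reuse images from the old swapchain.

	VkSwapchainKHR new_swapchain;
	vkCreateSwapchainKHR(device, &info, nullptr, &new_swapchain);

	// The old swapchain is retired, but we still have to destroy it ourselves.
	vkDestroySwapchainKHR(device, old_swapchain, nullptr);
	return new_swapchain;
}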

Unfortunately, this did not work well. First of all, the VK_KHR_display implementation did not bother with oldSwapchain. When deleting the old swapchain after creating a new one, the screen would black out for a while before starting up again, usually 3-4 seconds, much like a full mode change. I tracked this down to drmModeRmFB, which deletes the old framebuffer references. Apparently, if you delete a framebuffer which is being displayed in DRM, the screen blacks out. This kind of makes sense, as there is nothing to display anymore. However, for toggling vsync state this is just unacceptable.

The patching of this got a bit hairier.

I ended up with a scheme which can “steal” images from oldSwapchain assuming the formats and dimensions match up. I tried to pilfer the displayed image especially, because its framebuffer cannot be allowed to die or we get a black screen. (Note to self, there might be a ton of edge cases here with synchronization …) To make sure I don’t get stale page flip events, I block to make sure any pending flip has completed before continuing. The reason for this is to avoid a race condition where I end up freeing the oldSwapchain before the page flip handler. The page flip handler has a reference to the swapchain it came from. After being pilfered from, the old swapchain is considered dead, and I return VK_ERROR_OUT_OF_DATE_KHR on any following requests to acquire from it.

This combined with my crude MAILBOX implementation allowed me to get seamless transitions between tear-free fast-forward and butter smooth vsync gameplay with very low latency.

Doing MAILBOX properly in DRM

Unfortunately, the MAILBOX implementation I made is just a crude hack and not a valid implementation. Effectively, when you queue up a page flip in DRM, you're committing to which image will be displayed at the next vblank, long before that vblank actually occurs. This increases latency by quite a lot, and you also risk overshooting: if the GPU cannot complete the frame in time, you cannot flip anything on the next vblank. This is just dumb.

True MAILBOX needs to be handled in the VBlank handler inside DRM somehow, and DRM has no API available which can support this present mode. The way I expect this to work is:

  • drmModePageFlip can be called multiple times, up to N times. If you pass down a flag called say, DRM_MODE_PAGE_FLIP_REPLACE. This would allow multiple pending page flips to be in flight, and not return EBUSY if a pending flip is active.
  • In the vblank handler, the latest entry in the queue, whose rendering is also complete on the GPU, is selected. This frame is programmed into the display controller.
  • The earlier frames in the queue which were available to be presented, but were not selected for page flip, will be reported through a “discard” event, similar to the PAGE_FLIP event. This allows the WSI implementation to release the images to the application.
  • The current page flip callback will report the actual image which was selected for presentation.

Some further improvements to this scheme could be that discard events are returned as early as possible. If a swapchain image completes in the middle of the frame, it knows that it can “discard” earlier completed frames ahead of time. This way, we avoid stalling the GPU at DisplayRate * (SwapchainImages – 1) frame rates, which can happen if we render all available swapchain images, and we need to wait for next vblank to discard all frames.

This is way beyond me unfortunately, but it sounds like a very valuable thing to have. It seems like I will need to maintain my own hacky patch set on top of Mesa for now. 🙂