February 2019 – Maister's Graphics Adventures

Recreating the tone filter from NieR:Automata

Audio tech in games is rarely particularly interesting. Sadly, most of it seems to have been made into commodity over the last couple of decades. Once in while, some games have very clever audio tech, and in this case it was NieR:Automata’s tone filter which caught my attention. The blog post explaining this tech is found here. If you haven’t played the game, it is highly recommended to read the post and watch the videos to understand what it’s doing. That saves me a lot of explanation. Their blog is very sparse on technical implementation details, but I wanted to try recreating it as there was just enough high-level detail in there to get me started.

Outside graphics, I’ve done a fair bit of audio programming in the past. It’s been too long since I did any significant audio DSP programming.

In short, the goal of the filter is to attempt to turn normal high-fidelity soundtracks into something with an 8-bit feel on-demand. Being able to introduce dynamic aspects to the music with a pure filter is interesting.

The tone filter as explained

The goal of the filter is to extract musical notes, and emphasize them. By having a few notes playing with a classic waveform like square or saw waves, we can recreate a retro 8-bit feel.

The blog describes 48 filters, spanning 4 octaves. Each octave in the (western music) scale is divided into 12 tones. What I deduced from this is that the 48 filters should be 48 very sharp bandpass filters.

A theoretically perfect bandpass filter will output a pure sine wave if the tone exists, and nothing if it doesn’t.

The distortion is tough to do well, and I spent a lot of time fiddling with this. We want to try making the sine wave become something like a square wave. I tried many variants but I ended up with something very simple like

https://www.desmos.com/calculator/qvymx5qf8t

If you’ve done HDR tone-mapping, this formula will look very familiar to you. Sometimes cross-domain knowledge comes in handy.

After distorting, there is the levelling stage, which I spent a lot of time fiddling with. Basically, if we run this system as is, we end up with a ton of noise in the signal with all the chromatic tones playing over each other. Needless to say, this sounded absolutely terrible.

There are little to no details on how this should be implemented, so I tried a very crude model which seems to work reasonably well. Basically, each of the 48 channels have a running power estimate which is computed right after the filter. We can compare that against the running power estimate of the unfiltered audio. This lets us get an idea how much of the audio energy is concentrated into each individual tone. If the energy is low enough, it falls off in a power-of-4 fashion to avoid leaking in audio from completely unrelated tones. Percussion sounds will generally have energy in almost the entire audio spectrum, and we need to filter that out as well as we can. If the ratio is too high, we just cap it. This is the fiddly part. There’s a lot of magic constants to tweak to get it sounding pleasing.

At the end we mix our mono output signal back into the original audio, and when it works well, it gives a nice harmonic edge. I believe it’s reasonably close to the original game now. Here’s an example from the NieR:Automata OST. I’m visualizing all the 48 bands, and the colors are:

Blue: Below threshold, severely muted.
Green: Over threshold, should be heard.
Red: Saturated, hitting max threshold.

The tones in one octave form one row, and the four octaves are stacked on top of each other. The top-left starts at A3 – 220 Hz. If you know some music theory, maybe you can figure out which key the tune is in? 🙂

https://www.youtube.com/watch?v=yZvHErcuAUk

Implementation

First we mix stereo down to mono. This is kind of trivial. Just take the average of left and right channels.

Ultra-sharp bandpass, resonance filters

I went through a few failed iterations to get here. My first attempts were to do all of this in the frequency domain with FFTs, but that plan failed very quickly. What I ended up with in the end was a simple biquad resonance filter. This filter is characterized by having two zeroes and two poles in DSP parlance, or in other words, FIR (finite impulse response) and IIR (infinite impulse response). In code, this would look something like:

y[t] = n0 * x[t] + n1 * x[t - 1] + n2 * x[t - 2] - d0 * y[t - 1] - d1 * y[t - 2]

In the Z-domain, this looks like

H(z) = 1/n0 * (1 + z^-1 * n1/g + z^-2 * n2/g) / (1 + z^-1 * d0 + z^-2 * d1)

The zeroes and poles occur where the roots of the polynomials go to zero in the numerator and denominator respectively. Basically, I designed the filter by deliberately placing zeroes and poles in the Z-domain, factoring the expressions out and converting it back to a normal FIR and IIR form.

I placed a zero at DC and the Nyquist frequency (w = pi). The poles were placed very close to the unit circle at w = +/- 2 * pi * freq / samplerate, and amplitude 0.9999. Then I evaluated the filter response at the resonance frequency and adjusted the FIR portion of the filter so that we got an estimated unit gain at the resonance frequency.

Basically, the frequency response at the resonance frequency will be very close to dividing by zero, so near-infinite response, but not quite. Numerical stability can easily throw off the filter if we’re not careful. This is one of the major issues with IIR filters in general. I initially tried an 8-pole filter but it was impossible to get this stable even in FP64, so I just gave up and tried a simple biquad instead which worked just fine.

SIMD

Since we’re doing 48 IIR filters in parallel, this was a perfect case for SIMD optimizations. I made everything into a struct-of-arrays (SoA) form, and just vectorized the scalar IIR filter directly. Normally, small IIR filters are tricky to vectorize since there are inter-dependencies between samples, but not here.

I optimized the filter in NEON, SSE1 and AVX and got a very nice performance boost, more on that later.

This would have been a great case for ISPC, but I considered it a too large dependency for something simple like this.

Distortion

The distortion function must be nicely SIMD-friendly and not too expensive. I landed on the classic x/(1+abs(x)) operator. The divide can be done fast with reciprocal estimations. We didn’t need high accuracy.

Slight low-pass

After we have mixed together the 48 distorted streams, we run a weak low-pass filter on top to remove some of the harshest harmonics. This is done with a trivial 1-pole IIR filter.

Performance

I tested performance on a Ryzen 7 1800x @ 3.8 GHz as well as a high-end phone (Galaxy S9 Exynos) to measure NEON performance. The benchmark pushes 20 million white noise samples through the filter and then times the result. The test doesn’t take that long, so this should be assumed to be absolute peak performance without any thermal / power consideration. The results below are given in samples processed per second. Normal audio clips are 44.1 kHz, so 0.041 M/s should correspond to 1x real-time performance. The C++ version is written without any intrinsics with -O3 -ffast-math. The SIMD versions are written with the standard intrinsics.

Chip	C++	SSE	AVX	AArch64 NEON
Samsung Exynos 9810	1.8 M/s			6.8 M/s
Ryzen 7 1800x @ 3.8 GHz	3.6 M/s	7.1 M/s	11.5 M/s

Basically, we’re 100x realtime performance here, even on a mobile CPU, nice. I’m surprised how close the performance ended up when comparing SSE and NEON. I didn’t see any auto-vectorization activate in the C++ variant, so I wonder what is going on with just 2x scaling in SSE. I got similar results on MSVC and GCC for what it’s worth … NEON gets close to ideal 4x scaling though, nice.

This uses quite a bit of processing power, so we can’t run wild with effects like this right now. But I look forward to being able to take advantage of systems like this for even more precise operations in the future.

The original implementation probably does more work on more gimped CPU hardware (AMD Jaguar consoles), but 100x real-time is pretty fast in my book. 😉

Source

The implementation is out there, but don’t expect to be able to use it as-is. This is a hobby project after all.

https://github.com/Themaister/Granite/blob/master/audio/dsp/tone_filter.cpp

https://github.com/Themaister/Granite/blob/master/tests/tone_filter_bench.cpp

VST plugin

I implemented a simple VST plugin with builds for Windows and macOS, both 64-bit. Feel free to try it out. It’s ultra bare bones.

EDIT: Old links were stale. Found the old code lying around in a private repo and made it public with the VST headers removed.

https://github.com/Themaister/ToneFilterVST

ToneFilterVST (Win64 build, untested)

An unusual recompiler experiment – MIPS to LLVM IR – Part 4

This is the final part in my blog series on my adventure recompiling MIPS to LLVM IR. If you’re new to this series you can read:

Part 1 – Explains the goals, MIPS ELF format, etc.
Part 2 – Explains how to generate code using the LLVM APIs.
Part 3 – Explains how we recompile MIPS code to LLVM.

In this post, I’m going to test performance on some applications and get a feel for how the various different codegen options we have can affect performance.

Due our lack of extensive syscall support, there is a limit to what we can test without going out of our way to port stuff, so I’ll be focusing on some tests which don’t require much beyond simple stdio.

STB PNG read + write

This test is based on the STB library’s PNG implementation. The test will load a PNG file from disk and compress it again.

#include "stb_image.h"
#include "stb_image_write.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
	for (int i = 0; i < 20; i++)
	{
		int x, y, chan;
		stbi_uc *data = stbi_load("/tmp/test.png", &x, &y, &chan, 4);
		if (!data)
			return 1;

		if (!stbi_write_png("/tmp/output.png", x, y, 4, data, 4 * x))
			return 2;
	}

	return 0;
}

Native performance (32-bit)

To make the comparison a bit more fair, we’ll compile this using 32-bit x86 targeting i486 with -O3.

Time: 20.6 s

For reference, this matters quite a lot, in x86-64, we get 15.38 s. I will use the i486 result as a baseline, since both i486 and MIPS I are ancient ISAs from around the same era of computing.

MIPS on-demand JIT (baseline)

To begin our benchmarking, we’re going to test fully on-line JIT-ing. This is what needs to happen at least the first time we’re running an application. The results here will be affected by a balance between optimization in run-time and having to do less work while JIT-ing.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf

In this first test, we will apply the following options:

Function calls will link directly to their targets. This increases JIT workload significantly, since we need to JIT all possible call paths to be able to link code directly. However, runtime should be faster once we have JIT-ed.
No IR optimizations are enabled.

Time: 71.43 s

The up-front cost of JIT-ing is quite long. But overall, 3.5x slower isn’t terrible. Let’s see if we can do it better.

On-demand JIT with optimizations

The JIT-er can perform some in-place optimizations. We’ll see if it helps here.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --optimize

Time: 75.43 s

It seems like the optimization passes made it a bit slower.

On-demand JIT with thunked calls

Rather than aggressively JIT-ing call possible call paths, we can try just JIT-ing functions we are actually calling. All direct calls are translated into indirect calls, and every call requires a lookup. This should reduce the JIT overhead a lot, but potentially have worse runtime performance. Without –optimize, this should be the most efficient option if we want to avoid JIT overhead.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls

Time: 70.0 s

Interesting. This might be the sweet spot for on-demand JIT.

On-demand JIT with thunked all the things

We can also use thunked load-store operations, rather than emit IR code to translate addresses for every memory operation. This should reduce code bloat, and might help when we’re doing on-demand JIT.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls --thunk-load-store

Time: 90.4 s

Ouch.

Assuming well behaved calls and returns?

Unfortunately, my assumption that GCC would generate expected code for returns was wrong, or my implementation was buggy. I couldn’t get it to work for non-trivial test cases, so I can’t test performance here.

Ahead-of-time recompiled IR

Now we’re starting to get into interesting territory which I haven’t seen much of in the past.

We need to run the application here, dump IR code to disk, and recompile into a dynamic library. For this case, we should be able to generate pretty good code and avoid any run-time recompilation. This is the ideal scenario if we can deduce all known call-paths.

Let’s start with the optimal case. No thunking.

~/git/jitter/cmake-build-release/mipsvm --dump-llvm /tmp/llvm stb-test.elf

This dumps out a whopping 68 MB of LLVM IR. Time to turn this ball of mud into a dynamic library.

#!/bin/bash

OUTPUT="$1"
LLDIR="$2"

echo "== Linking LLVM IR =="
llvm-link -o __llvm_linked.bc "$LLDIR"/*.ll
echo "== Optimizing offline LLVM IR =="
opt -O3 -o __llvm_opt.bc __llvm_linked.bc -disable-inlining
#cp __llvm_linked.bc __llvm_opt.bc
echo "== Compiling static library to object file with LLC =="
llc -relocation-model=pic -filetype obj -o __linked.o __llvm_opt.bc -O3
echo "== Linking shared library =="
gcc -o "$OUTPUT" -shared __linked.o

rm -f __llvm_linked.bc
rm -f __llvm_opt.bc
rm -f __linked.o

This operation takes 53.4 seconds and generates a 3.4 MB binary. The original binary is 792 kB due to the statically linked glibc.

This should yield us the absolute best performance we can hope for. So let’s try it.

~/git/jitter/cmake-build-release/mipsvm test --static-lib ~/git/jitter/test_linked.so --static-symbols /tmp/llvm/addr.bin

Time: 46.2 s

That’s a pretty great improvement. Compared to 20 seconds for a native binary with 32-bit/i486. It starts up basically instantly since there is no recompilation necessary. We only need to recompile if we find new code we haven’t looked at yet.

From here, we can get a better idea of what runtime cost we have by removing optimizations, and adding thunking. Let’s see if opt -O3 helps at all by just going straight to llc.

Building the native binary just takes 24 seconds now.

Time: 57.0 s

opt -O3 is clearly doing something well. Let’s add back the optimization and use thunked calls. For thunked calls, the IR dump is just 21 MB. We can see here that we were JIT-ing out a lot of useless code we never had to actually run.

The binary is 984 kB now.

~/git/jitter/cmake-build-release/mipsvm stb-test.elf --static-lib ~/git/jitter/test_unlinked.so --static-symbols /tmp/llvm-nolink/addr.bin

Time: 60.0 s

The win from linking directly is nothing to sneeze at. 46.2s to 60.0 s. Let’s thunk the load-store calls and see where we get.

Time: 92.7 s

Yup. Clearly, we can get a 2x speedup by just inlining the load-store code and directly calling functions rather than rely on thunking. We’re not that far away from 2x differential from native code in the best case!

Best of both worlds codegen?

If we’re dumping code with thunking to disk to improve JIT overhead, we can imagine that we can optimize the thunked calls to direct code off-line if we write our own LLVM optimization pass. Just an idea …

We need to go one level deeper

Let’s try something silly. We will recompile a cross-compiled cross-compiler. What? Well, I’ve built SPIRV-Cross for MIPS big-endian this time around. This was actually a useful exercise, because I can now verify that SPIRV-Cross works for both MIPS and big-endian at the same time 😛 Nice. SPIRV-Cross uses C++11, a fair bit of STL and exceptions. Can we host libstdc++ properly? Let’s see. With a statically linked libstdc++, the binary is 3.2 MB.

Let’s dump some LLVM …

~/git/jitter/cmake-build-release/mipsvm spirv-cross --dump-llvm /tmp/llvm-spirv -- spirv-cross /tmp/test.spv

I tried running this with a test shader in the SPIRV-Cross repository. It makes use of FP64, so we can see if we support doubles. Here we also see that we can pass arguments to argv. We end up with 168 MB of LLVM IR, which sure is intense. Let’s recompile it. This process takes over 2 minutes and creates a 8.6 MB binary.

~/git/jitter/cmake-build-release/mipsvm spirv-cross --static-lib ~/git/jitter/test_spirv.so --static-symbols /tmp/llvm-spirv/addr.bin -- spirv-cross /tmp/test.spv

Now it runs almost instantly and correctly.

Conclusion

This has been a fun little side project. The overhead of JIT-ing is rather high as we would expect, but the peak runtime performance is surprisingly good. We’re in the 2-3x slower ballpark against natively compiled code for ahead-of-time compiled code. I haven’t tested a lot of code out there, but STB’s PNG implementation, SPIRV-Cross, glibc and libstdc++ should represent reasonably complex and varied code.

Release

I’ve released this project on GitHub under an MIT license. Please read the disclaimer.

An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Strap your seatbelts, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top level code fundamentally needs to be able to translate the program counter (short-hand, PC) to an executable function pointer. We can choose a hash map (large address space) or flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to function pointer and do some analysis of code blocks we don’t have a function pointer for yet. Any i-cache invalidations will clear out the relevant function pointers which triggers hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.

For our purpose, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code should be rarer and rarer the more modern hardware we’re dealing with.

Recompiling a function

So, given a PC to execute, we’ll do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks do in LLVM.

Basic block

Basic blocks are represented as a starting PC and an end, where the execution flow is linear. The end of a basic block occurs where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.

Branch delay slots

A very important part of MIPS is the use of a branch delay slot. It is a very unique design aspect of the architecture, which is considered a design flaw today because it was hard-coded to help a very specific micro-architecture. Exposing micro-architecture details like this should be considered bad taste. Whenever a branch is taken, unconditionally or not, the next instruction is always executed. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}

00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is, what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in MIPS ISA, because it is non-sensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if there is any exception or hardware interrupt happening in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix-up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is also a branch instruction, we know that it cannot legally be taken, and branch instructions cannot have side effects (except for jal/jalr, but those always take the branch, illegal!), so we just skip it.

Load delay slots

MIPS I also has a delay for loads. We cannot use the target register in the instruction following a load. However, for recompilation purpose we can ignore this. While clever code might attempt to abuse the fact that a target register for a load instruction hasn’t been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyways, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are

BEQ
BNE
BLEZ
BLTZ
BGEZ
BGTZ
BC1T (floating point compare)
BC1F (floating point compare)

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

Branch to basic block
Tail call to an unrelated function

If we mistakenly treat a J as a basic block where it should have been a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}

00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with was that if the J target refers to a basic block through the use of conditional branches elsewhere, we can assume J refers to a branch to a basic block ala if/else or switch blocks. If not, we assume it’s a tail call.

There are other static branches we can find in MIPS. The conditional branches can become static branches if the $0 register is used. This seems to be mostly useful with position-independent code since we can branch to an address relative to our PC rather than fixed address with J. We should try to detect these “static” branches as well. There is no need in analyzing code which can never be executed.

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR
uint32_t target_pc = registers[instr.register];
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    predicition_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

Returning from a function, almost always using “jr $ra”.
Jump tables
Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}

 00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, and then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is fairly ISA agnostic actually and it’s not that hard to change MIPS into something else once the basic structure is in place. The recompiler itself which sets up this is actually completely MIPS agnostic, it only asks for “given a start PC, where does the basic block end, and what kind of basic block is it.”

Register allocation and branches

While we’re working on registers, we ideally want the MIPS registers to be reflected by our native hardware registers. It’s obvious a 1:1 mapping is not possible. MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not have to really care about register allocation when using LLVM. We just need to make sure we don’t emit CreateStore/CreateLoad as much as possible, and LLVM should take care of the rest. Within a basic block, this is very easy since we always know which SSA value a register refers to as the control flow is linear. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven’t used a register yet, load it from memory, if we modify a register, just replace the SSA value and remember that we eventually have to write it back to memory later, i.e. the register bank.

The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we will have a real hard time handling register dirty tracking. If we have say a basic block C which can be entered from basic block A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every possible register all predecessors of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb. Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block or before a function call, all dirty registers are flushed to memory. On entry of a basic block, we will have to load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register starved architecture to begin with and moving data to and from L1 cache is very common.

Call instructions

MIPS has several ways of “calling” functions. These functions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

JAL
JALR
BLTZAL
BGEZAL
J (deduced tail call)
BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPUs branch predictor. The return instruction is one of the best prediction methods we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch unless we were the subject of a stack smash attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we will just need to deal with it by setting a rational upper limit. In the worst case where our return stack for some reason grows too large, we can use the nuclear option in our arsenal, longjmp! The top level code uses setjmp, and if at any time we’ve reached a hopeless situation, longjmp unwinds the entire stack at once, and we can re-enter with our new PC. However, this is kinda terrible for performance since all return instructions will now fail to optimize to a simple return, and might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.

To thunk or not to thunk

While indirect calls must have a lookup to determine what we are actually calling in runtime, it’s possible for direct call instructions to directly call another function in LLVM. In this case, we avoid any runtime lookups. We risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline anyways later 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}

004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

Doesn’t get much simpler to start with this test. main calls foo through a tail call, let’s see what the LLVM looks like completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state which we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple other hidden registers), 64 FP registers (32 + a couple extra), and finally, the page table. We inline it in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo()

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call to our externally defined function in the VM to check if return stack prediction worked. Either we tail call or just simply return. $ra in this case will be 0, and we just end execution here.

The registers are dumped at the end to read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.

Loads and stores

Let’s have a look what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}

004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to stack, but we’ll see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

Good news is that this is very straight forward code, so the CPU should churn through most of this like butter unless we’re missing the page table reads in L1. It will be interesting to benchmark this code against natively compiled C code later.

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}

00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here, what happens if the run the LLVM through opt first though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq

I suspect some of the issues are related with lack of noalias attributes. LLVM might think that storing to virtual memory might alias with the register bank, and generate very conservative code. Something to have a look at later.

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move to and from integer registers, and fiddle with the control register. The control register controls rounding modes. I haven’t bothered emulating correct rounding modes for now, but the control register is used to deal with floating point conditional branches, so the register needs to be emulated at least. Just like SSE, the actual data type of the FP register can vary depending on the instruction, so we will need a lot of bitcasts, fortunately, this is a native construct in LLVM.

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}

004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things, and keeping the register bank as integer always. Otherwise, the code-gen to LLVM IR is reasonably straight forward. In the generated x86-64 we end up seeing the magic instructions we want to see buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...

Syscalls

To end this post on a less intense note, let’s write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra

; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we simply create a hook into the VM host, and handle it. The syscall number goes in the $v0 register, and arguments follow as normal. To implement the write syscall we simply need to copy over data from our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we’re taking MIPS code and turning it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications vs recompiled MIPS applications, stay tuned!