An unusual recompiler experiment – MIPS to LLVM IR – Part 4

This is the final part in my blog series on my adventure recompiling MIPS to LLVM IR. If you’re new to this series you can read:

  • Part 1 – Explains the goals, MIPS ELF format, etc.
  • Part 2 – Explains how to generate code using the LLVM APIs.
  • Part 3 – Explains how we recompile MIPS code to LLVM.

In this post, I’m going to test performance on some applications and get a feel for how the different codegen options affect performance.

Due to our lack of extensive syscall support, there is a limit to what we can test without going out of our way to port stuff, so I’ll be focusing on some tests which don’t require much beyond simple stdio.

STB PNG read + write

This test is based on the STB library’s PNG implementation. The test will load a PNG file from disk and compress it again.

#include "stb_image.h"
#include "stb_image_write.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
	for (int i = 0; i < 20; i++)
	{
		int x, y, chan;
		stbi_uc *data = stbi_load("/tmp/test.png", &x, &y, &chan, 4);
		if (!data)
			return 1;

		if (!stbi_write_png("/tmp/output.png", x, y, 4, data, 4 * x))
			return 2;
	}

	return 0;
}

Native performance (32-bit)

To make the comparison a bit more fair, we’ll compile this using 32-bit x86 targeting i486 with -O3.

Time: 20.6 s

For reference, this matters quite a lot: on x86-64, we get 15.38 s. I will use the i486 result as a baseline, since both i486 and MIPS I are ancient ISAs from around the same era of computing.

MIPS on-demand JIT (baseline)

To begin our benchmarking, we’re going to test fully on-line JIT-ing. This is what needs to happen at least the first time we run an application. The results here come down to a balance between how well the generated code runs and how much work we do while JIT-ing.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf

In this first test, we will apply the following options:

  • Function calls will link directly to their targets. This increases JIT workload significantly, since we need to JIT all possible call paths to be able to link code directly. However, runtime should be faster once we have JIT-ed.
  • No IR optimizations are enabled.

Time: 71.43 s

The up-front cost of JIT-ing is quite long. But overall, 3.5x slower isn’t terrible. Let’s see if we can do it better.

On-demand JIT with optimizations

The JIT-er can perform some in-place optimizations. We’ll see if it helps here.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --optimize

Time: 75.43 s

It seems like the optimization passes made it a bit slower.

On-demand JIT with thunked calls

Rather than aggressively JIT-ing all possible call paths, we can try just JIT-ing functions we are actually calling. All direct calls are translated into indirect calls, and every call requires a lookup. This should reduce the JIT overhead a lot, but potentially have worse runtime performance. Without --optimize, this should be the most efficient option if we want to avoid JIT overhead.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls

Time: 70.0 s

Interesting. This might be the sweet spot for on-demand JIT.

On-demand JIT with thunked all the things

We can also use thunked load-store operations, rather than emit IR code to translate addresses for every memory operation. This should reduce code bloat, and might help when we’re doing on-demand JIT.

time ~/git/jitter/cmake-build-release/mipsvm stb-test.elf --disable-inline-calls --thunk-load-store

Time: 90.4 s

Ouch.

Assuming well behaved calls and returns?

Unfortunately, my assumption that GCC would generate expected code for returns was wrong, or my implementation was buggy. I couldn’t get it to work for non-trivial test cases, so I can’t test performance here.

Ahead-of-time recompiled IR

Now we’re starting to get into interesting territory which I haven’t seen much of in the past.

We need to run the application here, dump IR code to disk, and recompile into a dynamic library. For this case, we should be able to generate pretty good code and avoid any run-time recompilation. This is the ideal scenario if we can deduce all known call-paths.

Let’s start with the optimal case. No thunking.

~/git/jitter/cmake-build-release/mipsvm --dump-llvm /tmp/llvm stb-test.elf

This dumps out a whopping 68 MB of LLVM IR. Time to turn this ball of mud into a dynamic library.

#!/bin/bash

OUTPUT="$1"
LLDIR="$2"

echo "== Linking LLVM IR =="
llvm-link -o __llvm_linked.bc "$LLDIR"/*.ll
echo "== Optimizing offline LLVM IR =="
opt -O3 -o __llvm_opt.bc __llvm_linked.bc -disable-inlining
#cp __llvm_linked.bc __llvm_opt.bc
echo "== Compiling static library to object file with LLC =="
llc -relocation-model=pic -filetype obj -o __linked.o __llvm_opt.bc -O3
echo "== Linking shared library =="
gcc -o "$OUTPUT" -shared __linked.o

rm -f __llvm_linked.bc
rm -f __llvm_opt.bc
rm -f __linked.o

This operation takes 53.4 seconds and generates a 3.4 MB binary. The original binary is 792 kB due to the statically linked glibc.

This should yield us the absolute best performance we can hope for. So let’s try it.

~/git/jitter/cmake-build-release/mipsvm test --static-lib ~/git/jitter/test_linked.so --static-symbols /tmp/llvm/addr.bin

Time: 46.2 s

That’s a pretty great improvement. For comparison, the native 32-bit/i486 binary takes 20.6 s. The recompiled binary starts up basically instantly since there is no recompilation necessary. We only need to recompile if we find new code we haven’t looked at yet.

From here, we can get a better idea of what runtime cost we have by removing optimizations, and adding thunking. Let’s see if opt -O3 helps at all by just going straight to llc.

Building the native binary just takes 24 seconds now.

Time: 57.0 s

opt -O3 is clearly doing something well. Let’s add back the optimization and use thunked calls. For thunked calls, the IR dump is just 21 MB. We can see here that we were JIT-ing out a lot of useless code we never had to actually run.

The binary is 984 kB now.

~/git/jitter/cmake-build-release/mipsvm stb-test.elf --static-lib ~/git/jitter/test_unlinked.so --static-symbols /tmp/llvm-nolink/addr.bin

Time: 60.0 s

The win from linking directly is nothing to sneeze at: 46.2 s versus 60.0 s. Let’s thunk the load-store calls and see where we get.

Time: 92.7 s

Yup. Clearly, we can get a 2x speedup by just inlining the load-store code and directly calling functions rather than rely on thunking. We’re not that far away from 2x differential from native code in the best case!

Best of both worlds codegen?

If we’re dumping code with thunking to disk to improve JIT overhead, we can imagine that we can optimize the thunked calls to direct code off-line if we write our own LLVM optimization pass. Just an idea …

We need to go one level deeper

Let’s try something silly. We will recompile a cross-compiled cross-compiler. What? Well, I’ve built SPIRV-Cross for MIPS big-endian this time around. This was actually a useful exercise, because I can now verify that SPIRV-Cross works for both MIPS and big-endian at the same time 😛 Nice. SPIRV-Cross uses C++11, a fair bit of STL and exceptions. Can we host libstdc++ properly? Let’s see. With a statically linked libstdc++, the binary is 3.2 MB.

Let’s dump some LLVM …

~/git/jitter/cmake-build-release/mipsvm spirv-cross --dump-llvm /tmp/llvm-spirv -- spirv-cross /tmp/test.spv

I tried running this with a test shader in the SPIRV-Cross repository. It makes use of FP64, so we can see if we support doubles. Here we also see that we can pass arguments to argv. We end up with 168 MB of LLVM IR, which sure is intense. Let’s recompile it. This process takes over 2 minutes and creates an 8.6 MB binary.

~/git/jitter/cmake-build-release/mipsvm spirv-cross --static-lib ~/git/jitter/test_spirv.so --static-symbols /tmp/llvm-spirv/addr.bin -- spirv-cross /tmp/test.spv

Now it runs almost instantly and correctly.

Conclusion

This has been a fun little side project. The overhead of JIT-ing is rather high as we would expect, but the peak runtime performance is surprisingly good. We’re in the 2-3x slower ballpark against natively compiled code for ahead-of-time compiled code. I haven’t tested a lot of code out there, but STB’s PNG implementation, SPIRV-Cross, glibc and libstdc++ should represent reasonably complex and varied code.

Release

I’ve released this project on GitHub under an MIT license. Please read the disclaimer.

An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Strap your seatbelts, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top-level code fundamentally needs to be able to translate the program counter (PC for short) to an executable function pointer. We can choose a hash map (large address space) or flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.
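As a rough illustration of that lookup, here is a minimal sketch of the hash map variant. All names (`MipsState`, `BlockCache`, `stub_compile`, `run`) are made up for this example and are not the project's actual API. The "compiler" is a stub which stands in for the LLVM recompilation step, and stopping at PC 0 is just a convention borrowed for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical guest state, much smaller than the real thing.
struct MipsState
{
	uint32_t pc = 0;
	uint32_t regs[32] = {};
};

using BlockFn = void (*)(MipsState *);

class BlockCache
{
public:
	using Compiler = BlockFn (*)(uint32_t pc);

	explicit BlockCache(Compiler compile) : compile_(compile)
	{
		assert(compile != nullptr);
	}

	// Translate a PC to a function pointer, compiling the first time we see it.
	BlockFn resolve(uint32_t pc)
	{
		auto itr = blocks_.find(pc);
		if (itr != blocks_.end())
			return itr->second;

		BlockFn fn = compile_(pc);
		compile_count++;
		blocks_.emplace(pc, fn);
		return fn;
	}

	unsigned compile_count = 0;

private:
	Compiler compile_;
	std::unordered_map<uint32_t, BlockFn> blocks_;
};

// Stand-in for a recompiled function: it immediately "returns" to PC 0.
static void stub_block(MipsState *state)
{
	state->pc = 0;
}

// Stand-in for the recompiler, which would build and JIT an LLVM module here.
static BlockFn stub_compile(uint32_t)
{
	return stub_block;
}

// Top-level loop: resolve and execute until the guest ends up at PC 0.
static unsigned run(BlockCache &cache, MipsState &state)
{
	unsigned executed = 0;
	do
	{
		cache.resolve(state.pc)(&state);
		executed++;
	} while (state.pc != 0);
	return executed;
}
```

A flat array indexed by `pc >> 2` would avoid the hashing, at the cost of memory proportional to the size of the address space.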

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to function pointer and do some analysis of code blocks we don’t have a function pointer for yet. Any i-cache invalidations will clear out the relevant function pointers which triggers hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.

For our purpose, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code should be rarer and rarer the more modern hardware we’re dealing with.

Recompiling a function

So, given a PC to execute, we’ll do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks do in LLVM.

Basic block

Basic blocks are represented as a starting PC and an end, where the execution flow is linear. The end of a basic block occurs where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.

Branch delay slots

A very important part of MIPS is the use of a branch delay slot. It is a unique design aspect of the architecture, which is considered a design flaw today because it was hard-coded to help a very specific micro-architecture. Exposing micro-architecture details like this should be considered bad taste. Whenever a branch instruction is executed, conditional or not, the instruction that follows it is always executed as well. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}

00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is, what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in the MIPS ISA, because it is nonsensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if there is any exception or hardware interrupt happening in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is also a branch instruction, we know that it cannot legally be taken. Branch instructions also cannot have side effects (the exceptions are jal/jalr, which write $ra, but those always take the branch, which is illegal here), so we just skip it.
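One subtlety worth spelling out, which is my reading of the MIPS semantics rather than something taken from the project's code: the branch condition has to be evaluated on the register values from before the delay slot, because the delay slot is free to overwrite the very register the branch tests. The sketch below hard-codes one pairing, `blez` followed by `sll` in the delay slot, and the function and type names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Regs
{
	uint32_t r[32] = {};
};

// Executes `blez rs, target` at `pc` with an `sll rd, rt, sa` in its delay slot
// and returns the next PC.
static uint32_t exec_blez_with_sll_delay(Regs &regs, uint32_t pc, uint32_t blez, uint32_t sll)
{
	assert((blez >> 26) == 0x06);                    // BLEZ opcode
	assert((sll >> 26) == 0 && (sll & 0x3f) == 0);   // SPECIAL / SLL

	// 1. Evaluate the condition on the pre-delay-slot register value.
	uint32_t rs = (blez >> 21) & 31;
	bool taken = int32_t(regs.r[rs]) <= 0;

	// 2. Execute the delay slot instruction, whether or not we branch.
	uint32_t rt = (sll >> 16) & 31;
	uint32_t rd = (sll >> 11) & 31;
	uint32_t sa = (sll >> 6) & 31;
	if (rd != 0) // $0 is hard-wired to zero
		regs.r[rd] = regs.r[rt] << sa;

	// 3. Only now transfer control. Branch offsets are relative to PC + 4.
	int32_t offset = int32_t(int16_t(blez & 0xffff)) * 4;
	return taken ? pc + 4 + uint32_t(offset) : pc + 8;
}
```

With a1 = 0x40000000, the shift in the delay slot turns a1 into 0. Testing the condition after the delay slot would wrongly take the branch, while testing it first correctly falls through.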

Load delay slots

MIPS I also has a delay for loads. We cannot use the target register in the instruction following a load. However, for recompilation purposes we can ignore this. While clever code might attempt to abuse the fact that a target register for a load instruction hasn’t been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyway, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are:

  • BEQ
  • BNE
  • BLEZ
  • BLTZ
  • BGEZ
  • BGTZ
  • BC1T (floating point compare)
  • BC1F (floating point compare)
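To make the list above a bit more concrete, here is a small sketch of how these instructions could be recognized from the raw 32-bit instruction word, and how the taken target is computed. The encodings are the standard MIPS I ones, but the enum and function names are my own for this example.

```cpp
#include <cassert>
#include <cstdint>

enum class CondBranch { None, BEQ, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BC1F, BC1T };

static CondBranch decode_cond_branch(uint32_t word)
{
	uint32_t op = word >> 26;
	uint32_t rs = (word >> 21) & 31;
	uint32_t rt = (word >> 16) & 31;

	switch (op)
	{
	case 0x04: return CondBranch::BEQ;
	case 0x05: return CondBranch::BNE;
	case 0x06: return CondBranch::BLEZ;
	case 0x07: return CondBranch::BGTZ;
	case 0x01: // REGIMM: the rt field selects the actual instruction.
		if (rt == 0)
			return CondBranch::BLTZ;
		if (rt == 1)
			return CondBranch::BGEZ;
		return CondBranch::None; // BLTZAL/BGEZAL are calls, not handled here.
	case 0x11: // COP1: rs == 8 means "branch on FP condition".
		if (rs == 0x08)
			return (rt & 1) ? CondBranch::BC1T : CondBranch::BC1F;
		return CondBranch::None;
	default:
		return CondBranch::None;
	}
}

// The taken target is relative to the delay slot (PC + 4), with a
// sign-extended 16-bit offset counted in instructions.
static uint32_t cond_branch_target(uint32_t pc, uint32_t word)
{
	assert(decode_cond_branch(word) != CondBranch::None);
	int32_t offset = int32_t(int16_t(word & 0xffff)) * 4;
	return pc + 4 + uint32_t(offset);
}
```

The not-taken path continues at PC + 8, since the delay slot at PC + 4 belongs to the branch.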

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

  • Branch to basic block
  • Tail call to an unrelated function

If we mistakenly treat a J as a basic block where it should have been a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}

00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with was that if the J target refers to a basic block through the use of conditional branches elsewhere, we can assume J refers to a branch to a basic block à la if/else or switch blocks. If not, we assume it’s a tail call.
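A minimal sketch of that heuristic could look like the following. The J target computation follows the MIPS encoding (26-bit word index, combined with the upper four bits of the delay slot address), while `classify_j` and the set of conditional branch targets are hypothetical names for illustration, not the project's code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// J encodes a 26-bit word index. The upper 4 bits come from PC + 4.
static uint32_t j_target(uint32_t pc, uint32_t word)
{
	assert((word >> 26) == 0x02); // J opcode
	return ((pc + 4) & 0xf0000000u) | ((word & 0x03ffffffu) << 2);
}

enum class JumpKind { LocalBranch, TailCall };

// If some conditional branch in the function also targets this address,
// treat J as a jump to a basic block. Otherwise assume a tail call.
static JumpKind classify_j(uint32_t target, const std::set<uint32_t> &cond_branch_targets)
{
	return cond_branch_targets.count(target) != 0 ? JumpKind::LocalBranch : JumpKind::TailCall;
}
```

This assumes the conditional branch targets have been collected in an earlier pass over the function, before any J instruction is classified.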

There are other static branches we can find in MIPS. The conditional branches can become static branches if the $0 register is used. This seems to be mostly useful with position-independent code since we can branch to an address relative to our PC rather than a fixed address with J. We should try to detect these “static” branches as well. There is no need to analyze code which can never be executed.
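Here is a sketch of what such a check might look like. Which combinations to detect is my own enumeration, based on $0 always reading as zero, and the names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

enum class StaticBranch { No, AlwaysTaken, NeverTaken };

// Detects conditional branches whose outcome is fixed because they
// only compare against the hard-wired zero register.
static StaticBranch static_branch(uint32_t word)
{
	uint32_t op = word >> 26;
	uint32_t rs = (word >> 21) & 31;
	uint32_t rt = (word >> 16) & 31;

	switch (op)
	{
	case 0x04: // BEQ $0, $0 is the canonical unconditional "b".
		return (rs == 0 && rt == 0) ? StaticBranch::AlwaysTaken : StaticBranch::No;
	case 0x05: // BNE $0, $0 can never be taken.
		return (rs == 0 && rt == 0) ? StaticBranch::NeverTaken : StaticBranch::No;
	case 0x06: // BLEZ $0: 0 <= 0
		return rs == 0 ? StaticBranch::AlwaysTaken : StaticBranch::No;
	case 0x07: // BGTZ $0: 0 > 0
		return rs == 0 ? StaticBranch::NeverTaken : StaticBranch::No;
	case 0x01: // REGIMM: rt == 1 is BGEZ, rt == 0 is BLTZ.
		if (rs != 0)
			return StaticBranch::No;
		if (rt == 1)
			return StaticBranch::AlwaysTaken;
		if (rt == 0)
			return StaticBranch::NeverTaken;
		return StaticBranch::No;
	default:
		return StaticBranch::No;
	}
}
```

An always-taken branch only needs its target analyzed, and a never-taken one only needs the fall-through path.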

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR (pseudocode; instr.rs is the register field of the instruction)
uint32_t target_pc = registers[instr.rs];
if (!return_prediction_stack.empty() && target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    return_prediction_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

  • Returning from a function, almost always using “jr $ra”.
  • Jump tables
  • Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}

00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, and then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is actually fairly ISA agnostic, and it’s not that hard to change MIPS into something else once the basic structure is in place. The recompiler itself, which sets all this up, is completely MIPS agnostic. It only asks: “given a start PC, where does the basic block end, and what kind of basic block is it?”
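The block discovery for number_of_even can be reproduced with a simple worklist. This is a simplified sketch and not the project's analyzer: it only knows about the terminators which appear in this function (integer conditional branches and JR), and the names are invented for the example. The instruction words are copied from the disassembly above.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// number_of_even, addresses 0x00 through 0x3c.
static const std::vector<uint32_t> kNumberOfEven = {
	0x18a0000d, 0x00052880, 0x00852821, 0x00001025,
	0x8c830000, 0x00000000, 0x30630001, 0x14600002,
	0x24840004, 0x24420001, 0x14a4fff9, 0x00000000,
	0x03e00008, 0x00000000, 0x03e00008, 0x00001025,
};

enum class Terminator { None, Conditional, Indirect };

static Terminator classify(uint32_t word)
{
	uint32_t op = word >> 26;
	if (op == 0 && (word & 0x3f) == 0x08)
		return Terminator::Indirect;    // JR
	if (op >= 0x04 && op <= 0x07)
		return Terminator::Conditional; // BEQ, BNE, BLEZ, BGTZ
	if (op == 0x01 && ((word >> 16) & 31) <= 1)
		return Terminator::Conditional; // BLTZ, BGEZ
	return Terminator::None;            // J, BC1T/BC1F and calls omitted for brevity.
}

// Returns the start PC of every basic block reachable from entry.
static std::set<uint32_t> discover_blocks(const std::vector<uint32_t> &code, uint32_t entry)
{
	assert(entry % 4 == 0);
	std::set<uint32_t> blocks;
	std::vector<uint32_t> worklist{ entry };

	while (!worklist.empty())
	{
		uint32_t start = worklist.back();
		worklist.pop_back();
		if (!blocks.insert(start).second)
			continue; // Already analyzed.

		// Scan linearly until we hit an instruction which ends the block.
		for (uint32_t pc = start; pc / 4 < code.size(); pc += 4)
		{
			uint32_t word = code[pc / 4];
			Terminator term = classify(word);
			if (term == Terminator::None)
				continue;

			if (term == Terminator::Conditional)
			{
				int32_t offset = int32_t(int16_t(word & 0xffff)) * 4;
				worklist.push_back(pc + 4 + uint32_t(offset)); // Branch taken.
				worklist.push_back(pc + 8);                    // Fall through past the delay slot.
			}
			break; // JR terminates the graph, nothing more to follow.
		}
	}
	return blocks;
}
```

Starting from PC 0, this finds block starts at 0x00, 0x08, 0x10, 0x24, 0x28, 0x30 and 0x38, which are the seven blocks in the annotated listing. Like the listing, it does not split a block when a branch targets its middle, so 0x10 and 0x28 show up as separate, overlapping blocks.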

Register allocation and branches

While we’re working on registers, we ideally want the MIPS registers to be reflected by our native hardware registers. It’s obvious a 1:1 mapping is not possible. MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not really have to care about register allocation when using LLVM. We just need to emit as few CreateStore/CreateLoad calls as possible, and LLVM should take care of the rest. Within a basic block, this is very easy since we always know which SSA value a register refers to, as the control flow is linear. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven’t used a register yet, we load it from memory. If we modify a register, we just replace the SSA value and remember that we eventually have to write it back to memory, i.e. the register bank.

The real problem is how to deal with branches. We learned last time that we can use PHI nodes to pass values to other basic blocks. I tried implementing a scheme like this, where I would build a full CFG and link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), which makes register dirty tracking really hard. Say we have a basic block C which can be entered from basic blocks A and B. A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every register any predecessor of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb.

Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block or before a function call, all dirty registers are flushed to memory. On entry to a basic block, we load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register starved architecture to begin with, and moving data to and from L1 cache is very common.
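The load-on-first-use and flush-when-dirty scheme can be sketched without any LLVM at all. In this toy version, a plain `uint32_t` plays the role of the SSA value which the real RegisterTracker would hold, and the load/store counters stand in for emitted CreateLoad/CreateStore calls. The interface is my guess at the general shape, not the project's actual class.

```cpp
#include <cassert>
#include <cstdint>

class RegisterTracker
{
public:
	explicit RegisterTracker(uint32_t *bank) : bank_(bank)
	{
		assert(bank != nullptr);
	}

	// First use in a block loads from the register bank, later uses are free.
	uint32_t read(unsigned reg)
	{
		assert(reg < 32);
		if (reg == 0)
			return 0; // $0 is hard-wired to zero.
		if ((valid_ & (1u << reg)) == 0)
		{
			cache_[reg] = bank_[reg];
			valid_ |= 1u << reg;
			loads++;
		}
		return cache_[reg];
	}

	// Writes only replace the tracked value and mark the register dirty.
	void write(unsigned reg, uint32_t value)
	{
		assert(reg < 32);
		if (reg == 0)
			return;
		cache_[reg] = value;
		valid_ |= 1u << reg;
		dirty_ |= 1u << reg;
	}

	// Called at the end of a basic block or before a call. Everything is
	// forgotten afterwards, since a callee may have changed any register.
	void flush()
	{
		for (unsigned reg = 1; reg < 32; reg++)
		{
			if (dirty_ & (1u << reg))
			{
				bank_[reg] = cache_[reg];
				stores++;
			}
		}
		dirty_ = 0;
		valid_ = 0;
	}

	unsigned loads = 0;
	unsigned stores = 0;

private:
	uint32_t *bank_;
	uint32_t cache_[32] = {};
	uint32_t valid_ = 0;
	uint32_t dirty_ = 0;
};
```

Writing the same register several times within a block still produces a single store at flush time, which is the whole point of the tracking.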

Call instructions

MIPS has several ways of “calling” functions. These functions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

  • JAL
  • JALR
  • BLTZAL
  • BGEZAL
  • J (deduced tail call)
  • BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPU’s branch predictor. The return instruction is one of the best prediction methods we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch unless we were the subject of a stack smash attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we set a reasonable upper limit on its depth. In the worst case where our return stack for some reason grows too large, we can use the nuclear option in our arsenal, longjmp! The top level code uses setjmp, and if at any time we’ve reached a hopeless situation, longjmp unwinds the entire stack at once, and we can re-enter with our new PC. However, this is kinda terrible for performance since all return instructions will now fail to optimize to a simple return, and might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.
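A bounded prediction stack along these lines might look like the sketch below. The class and its methods are hypothetical. In particular, overflow is reported through a return value here, where the real VM would longjmp back to the top-level loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ReturnPredictor
{
public:
	explicit ReturnPredictor(size_t limit) : limit_(limit)
	{
		assert(limit > 0);
	}

	// Called for link instructions (JAL and friends) located at `pc`.
	// Returns false when the stack has hit its limit. The sketch just clears
	// the stack there, which is where the real thing would longjmp instead.
	bool push_link(uint32_t pc)
	{
		if (stack_.size() >= limit_)
		{
			stack_.clear();
			return false;
		}
		stack_.push_back(pc + 8); // Return address: skip the call and its delay slot.
		return true;
	}

	// Called for JR. True means "this is a plain return, just return on the host".
	bool is_predicted_return(uint32_t target)
	{
		if (!stack_.empty() && stack_.back() == target)
		{
			stack_.pop_back();
			return true;
		}
		return false;
	}

	size_t depth() const
	{
		return stack_.size();
	}

private:
	size_t limit_;
	std::vector<uint32_t> stack_;
};
```

When `is_predicted_return` returns false, the JR has to go through the normal lookup path, possibly compiling new code for the target.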

To thunk or not to thunk

While indirect calls must have a lookup to determine what we are actually calling at runtime, it’s possible for direct call instructions to directly call another function in LLVM. In this case, we avoid any runtime lookups. We risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline anyways later 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}

004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

It doesn’t get much simpler than this test to start with. main calls foo through a tail call, let’s see what the LLVM looks like completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state which we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple of other hidden registers), another 64 for the FP registers (32 + a couple extra), and finally there is the page table. We inline it in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo():

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call to our externally defined function in the VM to check if return stack prediction worked. Either we tail call or just simply return. $ra in this case will be 0, and we just end execution here.

The registers are dumped at the end to read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.
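For reference, the state type from the IR can be mirrored on the host side roughly like this. The field names are mine, since the IR struct is anonymous. Only the layout (64 + 64 i32 values followed by 1048576 page pointers) comes from the IR above. The lookup helper splits a 32-bit virtual address into a 20-bit page index and a word offset inside a 4 kB page. Its null check is an addition for the sketch, and how the real VM handles unmapped pages is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct MipsState
{
	uint32_t gpr[64];       // 32 general purpose registers + hidden extras.
	uint32_t fpr[64];       // 32 FP registers + extras.
	void *pages[1u << 20];  // One pointer per 4 kB page of a 32-bit address space.
};

static_assert((1u << 20) == 1048576, "page table covers 2^32 / 4096 pages");

// Translate a word-aligned virtual address to a host pointer.
static uint32_t *translate_word(MipsState &state, uint32_t addr)
{
	assert((addr & 3) == 0);
	void *page = state.pages[addr >> 12];
	if (!page)
		return nullptr; // Unmapped page, assumed to be handled by the caller.
	return static_cast<uint32_t *>(page) + ((addr >> 2) & 1023);
}
```

Note that the struct is large (the page table alone is 8 MB with 64-bit pointers), so it should live in static or heap storage rather than on the stack.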

Loads and stores

Let’s have a look at what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}

004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to stack, but we’ll see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

The good news is that this is very straightforward code, so the CPU should churn through most of it like butter unless we’re missing the page table reads in L1. It will be interesting to benchmark this code against natively compiled C code later.
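
To make the address translation less abstract, here is a minimal C++ sketch of what the IR above is doing for every load and store: shift the address right by 12 to find the page, then use bits 2 to 11 as a word index into that page. The VirtualMemory struct and its member names are made up for illustration; only the shifts and the mask are taken from the generated code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the VM's page table: 2^20 pages of 4 KiB each,
// covering the whole 32-bit MIPS address space.
struct VirtualMemory
{
	static constexpr uint32_t PageBits = 12;
	static constexpr uint32_t WordsPerPage = 1u << (PageBits - 2); // 1024 words

	std::vector<std::unique_ptr<uint32_t[]>> pages;

	VirtualMemory() : pages(1u << (32 - PageBits)) {}

	// Back the page containing addr with zeroed host memory.
	void map_page(uint32_t addr)
	{
		auto &page = pages[addr >> PageBits];
		if (!page)
			page = std::make_unique<uint32_t[]>(WordsPerPage);
	}

	// Same math as the IR: lshr 12 for the page, (lshr 2) & 1023 for the word.
	uint32_t *translate_word(uint32_t addr)
	{
		uint32_t *page = pages[addr >> PageBits].get();
		assert(page && "access to unmapped page");
		return &page[(addr >> 2) & (WordsPerPage - 1)];
	}

	void store_word(uint32_t addr, uint32_t value) { *translate_word(addr) = value; }
	uint32_t load_word(uint32_t addr) { return *translate_word(addr); }
};
```

After mapping a page with map_page(), store_word() and load_word() on any word-aligned address inside it round-trip through the page table. A flat array of page pointers costs one extra memory read per access, which is exactly the read we hope stays in L1.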

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}

00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here. What happens if we run the LLVM IR through opt first, though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq 

I suspect some of the issues are related to the lack of noalias attributes. LLVM might think that stores to virtual memory can alias with the register bank, and generate very conservative code as a result. Something to have a look at later.

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.
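
Spotting jr $ra statically is just a matter of decoding the instruction word. Here is a small sketch, assuming the standard MIPS R-type field layout; the function and constant names are mine, not the recompiler’s. The test words come from the disassembly listings in this series.

```cpp
#include <cassert>
#include <cstdint>

// MIPS R-type layout: opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0].
constexpr uint32_t OpcodeSpecial = 0;
constexpr uint32_t FunctJR = 0x08;
constexpr uint32_t RegRA = 31;

// True for "jr $ra", which shows up as the word 0x03e00008 in the disassembly.
constexpr bool is_jr_ra(uint32_t insn)
{
	return (insn >> 26) == OpcodeSpecial &&
	       ((insn >> 21) & 31) == RegRA &&
	       (insn & 63) == FunctJR;
}

// Sanity check against instruction words from the listings.
inline void check_jr_ra_decoding()
{
	assert(is_jr_ra(0x03e00008u));  // jr ra
	assert(!is_jr_ra(0x00851021u)); // addu v0,a0,a1
}
```

Whether treating such a match as a plain return is safe depends entirely on the application being well-behaved, as noted above.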

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move to and from integer registers, and fiddle with the control register. The control register controls rounding modes. I haven’t bothered emulating correct rounding modes for now, but the control register is also used for floating point conditional branches, so it needs to be emulated at least. Just like SSE, the actual data type of an FP register can vary depending on the instruction, so we will need a lot of bitcasts. Fortunately, this is a native construct in LLVM.
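
In plain C++ terms, the idea looks roughly like the sketch below: the register bank holds raw 32-bit integers, and each FP instruction reinterprets the bits on the way in and out. The FloatRegisters struct and the helper names are hypothetical; memcpy plays the role that bitcast plays in the IR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

// Bit-exact reinterpretation, the C++ counterpart of bitcast i32 -> float.
inline float bits_to_float(uint32_t bits)
{
	float f;
	std::memcpy(&f, &bits, sizeof(f));
	return f;
}

// And the reverse, bitcast float -> i32.
inline uint32_t float_to_bits(float f)
{
	uint32_t bits;
	std::memcpy(&bits, &f, sizeof(bits));
	return bits;
}

// Hypothetical FP register bank, always stored as integers.
struct FloatRegisters
{
	uint32_t f[32] = {};

	// mul.s fd, fs, ft
	void mul_s(unsigned fd, unsigned fs, unsigned ft)
	{
		assert(fd < 32 && fs < 32 && ft < 32);
		f[fd] = float_to_bits(bits_to_float(f[fs]) * bits_to_float(f[ft]));
	}

	// add.s fd, fs, ft
	void add_s(unsigned fd, unsigned fs, unsigned ft)
	{
		assert(fd < 32 && fs < 32 && ft < 32);
		f[fd] = float_to_bits(bits_to_float(f[fs]) + bits_to_float(f[ft]));
	}
};
```

Keeping the bank as integers means moves between integer and FP registers are simple copies, and only the arithmetic instructions pay for the reinterpretation, which compiles down to nothing.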

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}

004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things, and always keeping the register bank as integers. Otherwise, the code-gen to LLVM IR is reasonably straightforward. In the generated x86-64 we end up seeing the magic instructions we want to see buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...

Syscalls

To end this post on a less intense note, let’s write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra

; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we simply create a hook into the VM host, and handle it. The syscall number goes in the $v0 register, and arguments follow as normal. To implement the write syscall we simply need to copy over data from our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.
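
The hook itself can be little more than a switch on $v0. The sketch below is a simplified stand-in, not the actual VM code: MiniVM and its members are invented for illustration, the write handler only counts bytes instead of touching guest memory, and the unknown-syscall path just mirrors how syscall_write() flags errors through $a3. The syscall numbers are the ones from the assembly above.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

enum Register { REG_V0 = 2, REG_A0 = 4, REG_A1 = 5, REG_A2 = 6, REG_A3 = 7 };
enum Syscall : uint32_t { SYSCALL_EXIT = 4001, SYSCALL_WRITE = 4004 };

// Hypothetical, heavily simplified VM state.
struct MiniVM
{
	int32_t scalar_registers[32] = {};
	bool exited = false;
	int exit_code = 0;
	uint32_t bytes_written = 0;

	// Called by the recompiled code whenever it hits a syscall instruction.
	void syscall()
	{
		assert(!exited && "no syscalls expected after exit");
		switch (uint32_t(scalar_registers[REG_V0]))
		{
		case SYSCALL_WRITE:
			// A real handler copies from guest memory and calls write(),
			// like syscall_write() above. Here we only account for the bytes.
			bytes_written += uint32_t(scalar_registers[REG_A2]);
			scalar_registers[REG_V0] = scalar_registers[REG_A2];
			scalar_registers[REG_A3] = 0;
			break;

		case SYSCALL_EXIT:
			exited = true;
			exit_code = scalar_registers[REG_A0];
			break;

		default:
			// Unimplemented syscall: report failure through $a3.
			scalar_registers[REG_V0] = -1;
			scalar_registers[REG_A3] = ENOSYS;
			break;
		}
	}
};
```

Running the hello world program against this would hit the write case once with $a2 = 13 and then the exit case with $a0 = 0.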

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.
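
To give a rough idea of the core of a virtual page allocator, here is a first-fit sketch over a fixed range of pages. This is not the allocator used in the VM; the class and its interface are assumptions, and it ignores everything that makes mmap() painful in practice (fixed addresses, protection flags, file-backed mappings, partial unmaps).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical first-fit allocator over a contiguous range of 4 KiB virtual pages.
class PageAllocator
{
public:
	static constexpr uint32_t PageSize = 4096;

	PageAllocator(uint32_t first_page, uint32_t num_pages)
		: base(first_page), used(num_pages, false)
	{
	}

	// Returns the guest address of `count` contiguous free pages, or 0 on failure.
	uint32_t allocate(uint32_t count)
	{
		uint32_t run = 0;
		for (uint32_t i = 0; count != 0 && i < used.size(); i++)
		{
			run = used[i] ? 0 : run + 1;
			if (run == count)
			{
				uint32_t start = i + 1 - count;
				for (uint32_t j = start; j <= i; j++)
					used[j] = true;
				return (base + start) * PageSize;
			}
		}
		return 0;
	}

	// Gives back pages previously returned by allocate(), munmap()-style.
	void release(uint32_t addr, uint32_t count)
	{
		assert(addr % PageSize == 0);
		uint32_t start = addr / PageSize - base;
		assert(start + count <= used.size());
		for (uint32_t j = start; j < start + count; j++)
			used[j] = false;
	}

private:
	uint32_t base;          // index of the first managed page
	std::vector<bool> used; // one flag per managed page
};
```

An anonymous mmap() request would round the length up to whole pages, call allocate(), and then back those pages with host memory in the page table.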

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we’re taking MIPS code and turning it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications against recompiled MIPS applications. Stay tuned!

An unusual recompiler experiment – MIPS to LLVM IR – Part 2

In part 1 we parsed a MIPS ELF file and set up a stack. We should now consider how to translate the MIPS machine code to LLVM IR. Before we get there, we need to learn how to use the LLVM APIs and how code-gen to LLVM IR works.

If you know LLVM well already, this post is not for you; wait for part 3, where we start considering the design of the recompiler. If you don’t, this post should give you all the understanding needed to generate some code in LLVM yourself and understand the implementation.

This turned out rather dry, but hey, it’s limited how exciting writing C++ to generate IR can be. 😀 At least it’s very educational.

Which LLVM version to target?

The LLVM JIT APIs are unfortunately rather unstable. The APIs appear to break between major versions. A common workaround to this seems to be to simply stick to a version and depend on that unless you’re ready to update. I got it working on the current LLVM 7, and also the upcoming LLVM 8, which seems to shuffle a lot of internal APIs around. For the sample here, we’re going to consider LLVM 7, which is the latest stable major version. The API changes as far as I know are only related to the JIT runtime, not codegen, so we can hide away the ugliness behind a clean interface at least.

LLVM IR basics

To get us started, let’s make a trivial function which adds two arguments together and returns the result. In C, we would represent this as:

int LLVMAdd(int a, int b);

To do this, we need to create an LLVMContext and an LLVM module (sort of like a translation unit), and add a function to that module. We’ll need to create some types as well.

// We'll need a ton of headers for later, might as well just add it now.
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/Core.h"
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"

void test()
{
  llvm::LLVMContext context;
  llvm::Module module("mymodule", context);

  llvm::Type *int32 = llvm::Type::getInt32Ty(context);
  llvm::Type *result_type = int32;
  llvm::Type *argument_types[] = { int32, int32 };
  bool vararg = false;

  llvm::FunctionType *function_type =
    llvm::FunctionType::get(result_type, argument_types, vararg);
  llvm::Function *function =
    llvm::Function::Create(function_type,
                           llvm::Function::ExternalLinkage,
                           "LLVMAdd", module);

  llvm::errs() << module << "\n";
}

; ModuleID = 'mymodule'
source_filename = "mymodule"

declare i32 @LLVMAdd(i32, i32)

For now, we have simply declared the function. If we do nothing more, this is equivalent to using “extern” in C. We can call it, but the linker needs to find the function elsewhere. Let’s expand it with some code.

Basic blocks

Basic blocks are the foundation of LLVM and a function consists of any number of basic blocks. Basic blocks represent linear control flow within a function. At the end of a basic block, the block must be terminated. Here, and only here can we do things like:

  • Branch to another basic block, conditionally or not
  • Return
  • And some other special cases

For our first test, we can get away with one basic block.

Static single assignment (SSA)

Another core principle of LLVM is the use of static single assignment. In short, it means that once you assign a value to a variable, it can never change. If a variable is modified multiple times, each assignment needs to create a new SSA value. You will see this when we start generating some code. This has various benefits for optimization and code-gen later in the pipeline, but that’s LLVM’s problem. Let’s implement the add function.

llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
llvm::IRBuilder<> builder(bb);
llvm::Value *arg0 = &function->arg_begin()[0];
llvm::Value *arg1 = &function->arg_begin()[1];
llvm::Value *value = builder.CreateAdd(arg0, arg1, "added_value");
builder.CreateRet(value);

Here we add a basic block to our function. The first basic block is where the function starts executing. The IRBuilder is the class which lets you build actual code. Here we create an add instruction, and return it. This ends the basic block, and we’re done. Not too bad! If you print it using the existing line of code, you’ll now see:

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32) {
entry:
  %added_value = add i32 %0, %1
  ret i32 %added_value
}
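
The add function is too small to really show off SSA, so here is a plain C++ analogy. The first function reassigns one variable three times; the second is what the same code looks like when every assignment gets its own name, which is how it would be expressed in LLVM IR. The function names and the arithmetic are arbitrary examples.

```cpp
#include <cassert>

// Mutable style: one variable, reassigned as we go.
inline int mutable_style(int a, int b)
{
	int x = a + b;
	x = x * 2;
	x = x - a;
	return x;
}

// SSA style: each assignment creates a fresh value which never changes,
// much like %x0, %x1 and %x2 would in LLVM IR.
inline int ssa_style(int a, int b)
{
	const int x0 = a + b;
	const int x1 = x0 * 2;
	const int x2 = x1 - a;
	return x2;
}

// Both forms compute the same thing.
inline void check_ssa_equivalence()
{
	assert(mutable_style(3, 4) == ssa_style(3, 4));
}
```

Things only get interesting once control flow merges two different definitions of the same variable, which is where phi nodes come in.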

Compiling IR to native code

We could just dump this directly into the JIT engine now, but let’s try compiling this as a dynamic library, using the offline tools. Copy this IR code into a file, e.g. test.ll. Let’s compile this to native code.

$ llc -relocation-model=pic -filetype obj -o test.o test.ll -O2
$ ld -shared -o test.so test.o
$ objdump -d test.so

0000000000001000 <LLVMAdd>:
    1000:       8d 04 37                lea    (%rdi,%rsi,1),%eax
    1003:       c3                      retq

LLVM can usually target other architectures with one binary, so try adding -mtriple=mips to llc, disassemble with the MIPS binutils, and you’ll get:

00000000 <LLVMAdd>:
   0:   03e00008        jr      ra
   4:   00851021        addu    v0,a0,a1

This code might confuse you: isn’t it returning before the add? We’ll get to that later. It’s one of the weirder parts of MIPS, delay slots.
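
As a small preview, the two instructions behave like the sketch below: the jump target is latched first, the instruction in the delay slot still executes, and only then does control actually leave. TinyMIPS is a made-up struct holding just the registers involved, not part of the recompiler.

```cpp
#include <cassert>
#include <cstdint>

// Just enough machine state to model the two-instruction LLVMAdd:
//   jr   ra
//   addu v0, a0, a1
struct TinyMIPS
{
	uint32_t v0 = 0, a0 = 0, a1 = 0, ra = 0, pc = 0;

	void run_llvm_add()
	{
		assert((ra & 3) == 0 && "MIPS instructions are word aligned");
		uint32_t branch_target = ra; // jr ra: latch the target, don't jump yet
		v0 = a0 + a1;                // delay slot: addu v0, a0, a1 still runs
		pc = branch_target;          // only now does the jump take effect
	}
};
```

So the add really does happen before the caller sees the return value, even though it comes after the jump in program order.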

Branches

We’ll need some control flow. Let’s try implementing this function:

int foo(int c, int a, int b)
{
    if (c > 0)
        return a * b;
    else
        return a + b;
}
using namespace llvm;
// ...
// Remember to update this to take 3 arguments!
Type *argument_types[] = { int32, int32, int32 };
// ...
BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *cmp = builder.CreateICmpSGT(
    arg0,
    ConstantInt::get(Type::getInt32Ty(context), 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *mul = builder.CreateMul(arg1, arg2, "multiplied");
builder.CreateRet(mul);

builder.SetInsertPoint(false_path);
Value *add = builder.CreateAdd(arg1, arg2, "added");
builder.CreateRet(add);

Notice that the compare instruction is “signed greater than”. LLVM integer types carry no signedness themselves; it is only encoded in the operations where it matters, such as comparisons and division. To branch, we create a branch instruction, which ends the first basic block.

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %multiplied = mul i32 %1, %2
  ret i32 %multiplied

false:                                            ; preds = %entry
  %added = add i32 %1, %2
  ret i32 %added
}

On x86-64, this becomes (sorry for the AT&T syntax):

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 06                   jle    a <LLVMAdd+0xa>
   4:   0f af f2                imul   %edx,%esi
   7:   89 f0                   mov    %esi,%eax
   9:   c3                      retq   
   a:   01 d6                   add    %edx,%esi
   c:   89 f0                   mov    %esi,%eax
   e:   c3                      retq 
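
To make the “signed greater than” point a bit more concrete, here is a small plain C++ sketch (no LLVM involved) of how the same 32 bits compare differently depending on which compare you pick. The function names are made up for illustration, and the cast assumes the usual two’s complement conversion.

```cpp
#include <cstdint>

// Models `icmp sgt i32 %x, 0`: the bits are read as a two's complement value.
bool icmp_sgt_zero(std::uint32_t bits)
{
    return static_cast<std::int32_t>(bits) > 0;
}

// Models `icmp ugt i32 %x, 0`: the very same bits, read as an unsigned value.
bool icmp_ugt_zero(std::uint32_t bits)
{
    return bits > 0u;
}
```

With all bits set, the signed compare sees -1 and fails, while the unsigned compare sees a huge positive number and succeeds. The type is i32 in both cases, only the operation differs.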

Memory and pointers

We can’t always use pure SSA values, and we have to deal with good old memory. LLVM has an “interesting” approach to pointers, but let’s start easy. Loading and storing memory in LLVM is usually done in two stages:

  • Compute address with “get element pointer”
  • Load/store with CreateLoad/CreateStore

The backend is responsible for translating this into whatever addressing modes the CPU supports. Get element pointer maps very naturally to C-like pointer arithmetic: rather than computing raw byte offsets, we index array elements and struct members. Let’s try an example with structs.

struct Foo
{
    int a;
    float b;
    int c;
    float d;
};

int LLVMAdd(Foo *foo)
{
    return foo[1].c += 10;
}

Type *int32 = Type::getInt32Ty(context);
Type *float32 = Type::getFloatTy(context);

Type *struct_types[] = {
    int32,
    float32,
    int32,
    float32
};

Type *struct_type = StructType::get(context, struct_types, false);
Type *p_struct_type = PointerType::get(struct_type, 0);

Type *result_type = int32;
Type *argument_types[] = { p_struct_type };

// ...
// ...

BasicBlock *bb = BasicBlock::Create(context, "entry", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *indices[] = {
    ConstantInt::get(int32, 1), // pointer -> array index
    ConstantInt::get(int32, 2), // struct -> member index
};
Value *ptr = builder.CreateInBoundsGEP(arg0, indices, "ptr");
Value *loaded = builder.CreateLoad(ptr, "loaded");
Value *added = builder.CreateAdd(
    loaded,
    ConstantInt::get(int32, 10), "added");
builder.CreateStore(added, ptr);
builder.CreateRet(added);

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd({ i32, float, i32, float }*) {
entry:
  %ptr = getelementptr inbounds { i32, float, i32, float }, { i32, float, i32, float }* %0, i32 1, i32 2
  %loaded = load i32, i32* %ptr
  %added = add i32 %loaded, 10
  store i32 %added, i32* %ptr
  ret i32 %added
}

0000000000000000 <LLVMAdd>:
   0:   8b 47 18                mov    0x18(%rdi),%eax
   3:   83 c0 0a                add    $0xa,%eax
   6:   89 47 18                mov    %eax,0x18(%rdi)
   9:   c3                      retq   
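
The 0x18 in the disassembly is the two GEP indices folded into a single byte offset. As a rough sanity check in plain C++, assuming the usual 4-byte int and float (the helper name is just for illustration):

```cpp
#include <cstddef>

struct Foo
{
    int a;
    float b;
    int c;
    float d;
};

// Byte offset of foo[index].c relative to foo. The first GEP index steps over
// whole structs, the second one selects member #2 inside the struct.
constexpr std::size_t offset_of_c(std::size_t index)
{
    return index * sizeof(Foo) + offsetof(Foo, c);
}
```

For index 1 this is 16 + 8 = 24, which matches the 0x18(%rdi) addressing above.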

The mystical PHI node

I can’t really talk about SSA without mentioning PHI nodes. SSA values do not live in memory, so what happens when the control flow changes? Let’s look at an example function:

int foo(int a, int b, int c)
{
    int v;
    if (a > 0)
        v = b + c;
    else
        v = b - c;

    return v + a;
}

If we translate this directly to IR, we’ll see that v suddenly has two versions of it, but SSA says *single* static assignment. How is this resolved? A PHI node is used. The purpose of this node is to pull together multiple versions of a value into one, and you specify all basic blocks which can branch into your block. This is rather bizarre, since it’s kinda like an inverse goto.

Of course, we could translate this to memory allocated on the stack and just store to v rather than deal with this, and have LLVM optimize it. For completeness, let’s try that first.

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *v = builder.CreateAlloca(int32, ConstantInt::get(int32, 1),
    "alloca");
Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
builder.CreateStore(bc_add, v);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
builder.CreateStore(bc_sub, v);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
Value *loaded = builder.CreateLoad(v, "loaded");
Value *added = builder.CreateAdd(loaded, arg0, "added");
builder.CreateRet(added);

source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %alloca = alloca i32
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  store i32 %b_plus_c, i32* %alloca
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  store i32 %b_minus_c, i32* %alloca
  br label %merge

merge:                                            ; preds = %false, %true
  %loaded = load i32, i32* %alloca
  %added = add i32 %loaded, %0
  ret i32 %added
}

If we just give this to llc we see that it is sadly using stack space.

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 0d                   jle    11 <LLVMAdd+0x11>
   4:   01 d6                   add    %edx,%esi
   6:   89 74 24 fc             mov    %esi,-0x4(%rsp)
   a:   03 7c 24 fc             add    -0x4(%rsp),%edi
   e:   89 f8                   mov    %edi,%eax
  10:   c3                      retq   
  11:   29 d6                   sub    %edx,%esi
  13:   89 74 24 fc             mov    %esi,-0x4(%rsp)
  17:   03 7c 24 fc             add    -0x4(%rsp),%edi
  1b:   89 f8                   mov    %edi,%eax
  1d:   c3                      retq

What’s going on? We did give llc an optimization flag. Well, we didn’t optimize the IR first. This problem is extremely common when compiling C and C++ code of course, so we can optimize in the IR domain. Here opt can run the classic “mem2reg” optimization, which replaces stack slots with SSA values, and those are much easier for the backend to deal with.

$ opt -O3 -o test.bc test.ll
$ llvm-dis test.bc -o optimized.ll
$ cat optimized.ll

; ModuleID = 'test.bc'
source_filename = "mymodule"

; Function Attrs: norecurse nounwind readnone
define i32 @LLVMAdd(i32, i32, i32) local_unnamed_addr #0 {
entry:
  %cmp = icmp sgt i32 %0, 0
  %3 = sub i32 0, %2
  %alloca.0.p = select i1 %cmp, i32 %2, i32 %3
  %alloca.0 = add i32 %1, %0
  %added = add i32 %alloca.0, %alloca.0.p
  ret i32 %added
}

attributes #0 = { norecurse nounwind readnone }

0000000000000000 <LLVMAdd>:
   0:   89 d1                   mov    %edx,%ecx
   2:   f7 d9                   neg    %ecx
   4:   85 ff                   test   %edi,%edi
   6:   0f 4f ca                cmovg  %edx,%ecx
   9:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   c:   01 c8                   add    %ecx,%eax
   e:   c3                      retq 

That’s pretty nifty, and it even replaced the branch with select, but it also rewrote the entire function because it found some other optimizations. Let’s not rely on opt and rewrite this without stack space using PHI.

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(bc_add, true_path);
phi->addIncoming(bc_sub, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %b_plus_c, %true ], [ %b_minus_c, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 07                   jle    b <LLVMAdd+0xb>
   4:   01 d6                   add    %edx,%esi
   6:   01 fe                   add    %edi,%esi
   8:   89 f0                   mov    %esi,%eax
   a:   c3                      retq   
   b:   29 d6                   sub    %edx,%esi
   d:   01 fe                   add    %edi,%esi
   f:   89 f0                   mov    %esi,%eax
  11:   c3                      retq

This is a very pure translation of our IR, nice.
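
If the PHI node still feels abstract, it may help to write the same function back in plain C++ in a few shapes. This is only an illustrative sketch: the function names are made up, and unsigned arithmetic is used so that overflow wraps the way LLVM’s plain add and sub do.

```cpp
#include <cstdint>

using u32 = std::uint32_t;

// The original function, with v as a mutable variable.
u32 foo_reference(u32 a, u32 b, u32 c)
{
    u32 v;
    if (static_cast<std::int32_t>(a) > 0)
        v = b + c;
    else
        v = b - c;
    return v + a;
}

// SSA flavor: every name is assigned exactly once. In the IR only one of the
// two values is computed, depending on the branch taken; computing both here
// is harmless since neither has side effects. The PHI is the final pick.
u32 foo_ssa(u32 a, u32 b, u32 c)
{
    const bool cmp = static_cast<std::int32_t>(a) > 0;
    const u32 b_plus_c = b + c;
    const u32 b_minus_c = b - c;
    const u32 phi = cmp ? b_plus_c : b_minus_c;
    return phi + a;
}

// The shape opt -O3 produced earlier: select between c and -c, then add.
u32 foo_select(u32 a, u32 b, u32 c)
{
    const bool cmp = static_cast<std::int32_t>(a) > 0;
    const u32 p = cmp ? c : (0u - c);
    return (b + a) + p;
}
```

All three return the same result for any input, which is why opt is free to swap the branch for a select.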

Calling functions

This is quite straightforward, fortunately. Let’s replace the two branches with actual function calls. You can probably tell by now what the C code was supposed to look like.

FunctionType *function_type = FunctionType::get(result_type, argument_types, vararg);
Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    module);
Function *true_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMTrue",
    module);

Function *false_function = Function::Create(
    function_type, 
    Function::ExternalLinkage,
    "LLVMFalse", module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *call_args[] = { arg0, arg1, arg2 };

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *true_call = builder.CreateCall(true_function, call_args);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *false_call = builder.CreateCall(false_function, call_args);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(true_call, true_path);
phi->addIncoming(false_call, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %3 = call i32 @LLVMTrue(i32 %0, i32 %1, i32 %2)
  br label %merge

false:                                            ; preds = %entry
  %4 = call i32 @LLVMFalse(i32 %0, i32 %1, i32 %2)
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %3, %true ], [ %4, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

declare i32 @LLVMTrue(i32, i32, i32)

declare i32 @LLVMFalse(i32, i32, i32)

Here we’re trying to call functions which haven’t been defined in the module. It is up to the linker to resolve this later. Let’s see what happens if we build this as a dynamic library.

0000000000001000 <.plt>:
    1000:       ff 35 02 30 00 00       pushq  0x3002(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1006:       ff 25 04 30 00 00       jmpq   *0x3004(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    100c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000001010 <LLVMFalse@plt>:
    1010:       ff 25 02 30 00 00       jmpq   *0x3002(%rip)        # 4018 <LLVMFalse>
    1016:       68 00 00 00 00          pushq  $0x0
    101b:       e9 e0 ff ff ff          jmpq   1000 <.plt>

0000000000001020 <LLVMTrue@plt>:
    1020:       ff 25 fa 2f 00 00       jmpq   *0x2ffa(%rip)        # 4020 <LLVMTrue>
    1026:       68 01 00 00 00          pushq  $0x1
    102b:       e9 d0 ff ff ff          jmpq   1000 <.plt>

Disassembly of section .text:

0000000000001030 <LLVMAdd>:
    1030:       53                      push   %rbx
    1031:       89 fb                   mov    %edi,%ebx
    1033:       85 ff                   test   %edi,%edi
    1035:       7e 0b                   jle    1042 <LLVMAdd+0x12>
    1037:       89 df                   mov    %ebx,%edi
    1039:       e8 e2 ff ff ff          callq  1020 <LLVMTrue@plt>
    103e:       01 d8                   add    %ebx,%eax
    1040:       5b                      pop    %rbx
    1041:       c3                      retq   
    1042:       89 df                   mov    %ebx,%edi
    1044:       e8 c7 ff ff ff          callq  1010 <LLVMFalse@plt>
    1049:       01 d8                   add    %ebx,%eax
    104b:       5b                      pop    %rbx
    104c:       c3                      retq  

The noise before the function is dynamic library gunk. If we had linked with --no-undefined, we would see:

ld: test.o: in function `LLVMAdd':
mymodule:(.text+0xa): undefined reference to `LLVMTrue'
ld: mymodule:(.text+0x15): undefined reference to `LLVMFalse'

This is basically how we will call into our emulator host code to do various things: handle syscalls, deal with function calls to other emulated code, and so on. The possibilities are endless now. When we JIT, we can pass function pointers of our own host code into the symbol resolver and the JIT can patch in straight function calls into the code. Nifty!
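
Conceptually, the symbol resolver is little more than a lookup from a name to a host address. Here is a minimal sketch of that idea in plain C++. HostSymbolTable and host_true are hypothetical names made up for illustration; this is not how the LLVM resolver API looks, just the concept behind it.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical host-side symbol table, not an LLVM class.
class HostSymbolTable
{
public:
    // Registers a host function under the name the generated code refers to.
    template <typename Ret, typename... Args>
    void add(const std::string &name, Ret (*fn)(Args...))
    {
        table[name] = reinterpret_cast<std::uintptr_t>(fn);
    }

    // Returns 0 when the name is unknown, i.e. an unresolved symbol.
    std::uintptr_t lookup(const std::string &name) const
    {
        auto itr = table.find(name);
        return itr != table.end() ? itr->second : 0;
    }

private:
    std::unordered_map<std::string, std::uintptr_t> table;
};

// Made-up stand-in for a host function which emulated code calls into.
int host_true(int a, int b, int c)
{
    return a * b + c;
}
```

A JIT asking for "LLVMTrue" would get the address back and emit a direct call to it, while a name that was never registered stays unresolved, just like the linker error above.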

At this point, with some searching around, you should be able to figure out how to generate the IR you want. We’ve covered the basics I think.

JIT compilation

JIT-ing these llvm::Modules is in theory quite straightforward; there’s just a lot of API noise to go through. We create:

  • llvm::LLVMContext
  • llvm::orc::ExecutionSession
  • llvm::orc::RTDyldObjectLinkingLayer
  • llvm::orc::IRCompileLayer<>
  • llvm::TargetMachine
  • llvm::orc::MangleAndInterner
  • llvm::DataLayout

There is too much code to go through here, so just have a look at the gists I made instead for now 😀 This should be reusable if you want to play around with it.

For LLVM 7, you’ll want to define JITTER_LLVM_VERSION_LEGACY. For LLVM 8, don’t.

The API usage should be fairly simple.

int my_add(int a, int b) { return a + b; }

// ...

using namespace llvm;
JITTIR::Jitter jitter;

// Let JIT link against this symbol.
jitter.add_external_symbol("my_add", my_add);

// Allocate a module.
auto module = jitter.create_module("mymodule");
auto &context = module->getContext();

// Build the IR
Type *int32 = Type::getInt32Ty(context);
Type *result_type = int32;
Type *argument_types[] = { int32, int32 };
bool vararg = false;

FunctionType *function_type = FunctionType::get(
    result_type,
    argument_types,
    vararg);

Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    *module);

Function *my_add_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "my_add",
    *module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *args[] = { arg0, arg1 };
Value *added = builder.CreateCall(my_add_function, args, "added");
builder.CreateRet(added);

// Add the module to the JIT, it's immediately compiled.
jitter.add_module(std::move(module));

// Get function pointer to symbol and execute it.
auto *fn_ptr =
    reinterpret_cast<int (*)(int, int)>(jitter.get_symbol_address("LLVMAdd"));
printf("10 + 20 = %d\n", fn_ptr(10, 20));

Which should print 30.

(gdb) disas fn_ptr,fn_ptr+20
Dump of assembler code from 0x7fb9d7bcc000 to 0x7fb9d7bcc014:
   0x00007fb9d7bcc000:	push   %rax
   0x00007fb9d7bcc001:	movabs $0x55d0584c33ea,%rax
   0x00007fb9d7bcc00b:	callq  *%rax
   0x00007fb9d7bcc00d:	pop    %rcx
   0x00007fb9d7bcc00e:	retq   
   0x00007fb9d7bcc00f:	add    %al,(%rax) <- zero memory
   0x00007fb9d7bcc011:	add    %al,(%rax)
   0x00007fb9d7bcc013:	add    %al,(%rax)

This is just the bare minimum, simplest approach we can take to JIT compilation in LLVM, but it works. The code which calls the external function looks a bit strange. It might be this way so that it’s easy to patch in other call addresses later, but I’m not sure.

Conclusion

Hopefully this was informative at least. In part 3, we will use these tools at our disposal and bring up a MIPS.

An unusual recompiler experiment – MIPS to LLVM IR – Part 1

While not graphics per se, recompilers in emulation are an interesting topic. They are notoriously complex and difficult to implement, but promise incredible performance improvements. While emulating weaker hardware is serviced well with interpreters, more powerful hardware needs recompilers to shine. Starting with the PlayStation 1 era of hardware, recompilers are common, and beyond that, recompilers are required to have any hope of reaching real-time performance.

This is a multi-part post. I’m not sure how many posts it will take, but there’s a lot of stuff to write about. In this round, I’ll introduce the experiment, we’ll parse a MIPS ELF file, and make it ready for execution in our emulated Linux environment.

What is the goal of a recompiler

The main purpose of a recompiler is to look at the foreign machine code an application is running, and convert that to equivalent machine code for the hardware you are running on.

In contrast, an interpreter looks at instructions one at a time and performs the action each one requires, which wastes a lot of work on decoding instructions and branching dynamically to whatever code snippet needs to execute.
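
To give a feel for that per-instruction overhead, here is a minimal sketch of what one interpreter step might look like for two MIPS instructions, ADDU and ADDIU. CPUState and interpret_one are hypothetical names for illustration, and a real interpreter would of course handle far more opcodes.

```cpp
#include <cstdint>

struct CPUState
{
    std::uint32_t regs[32] = {};
    std::uint32_t pc = 0;
};

// Executes one instruction word. Returns false for anything this sketch
// does not decode.
bool interpret_one(CPUState &cpu, std::uint32_t instr)
{
    // Decode: this work is repeated every time the instruction runs.
    const std::uint32_t opcode = instr >> 26;
    const std::uint32_t rs = (instr >> 21) & 31;
    const std::uint32_t rt = (instr >> 16) & 31;
    const std::uint32_t rd = (instr >> 11) & 31;
    const std::uint32_t funct = instr & 63;
    // Sign-extend the 16-bit immediate using wrapping unsigned arithmetic.
    const std::uint32_t imm = ((instr & 0xffffu) ^ 0x8000u) - 0x8000u;

    // Dispatch: a dynamic branch to the snippet which does the actual work.
    switch (opcode)
    {
    case 0x00: // SPECIAL, the operation is selected by funct
        if (funct != 0x21) // only ADDU is handled here
            return false;
        if (rd != 0) // register $0 is hardwired to zero
            cpu.regs[rd] = cpu.regs[rs] + cpu.regs[rt];
        break;

    case 0x09: // ADDIU
        if (rt != 0)
            cpu.regs[rt] = cpu.regs[rs] + imm;
        break;

    default:
        return false;
    }

    cpu.pc += 4;
    return true;
}
```

The only useful work is a single add, but each step pays for the shifts, masks and the switch around it. A recompiler does the decoding once and leaves just the add in the generated code.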

Isn’t this what Java and .NET runtimes do?

Basically, yes. Just replace “foreign machine code” with “byte code”.

The portability problem

Since a recompiler needs to target the raw machine code of the hardware you’re running on, this is a serious hazard for portability. Typical recompilers aiming to be portable need to write backends for all kinds of machines you want to run on, and deal with operating-system specific ABIs along the way. Today, the most relevant targets would be:

  • x86
  • x86-64/amd64
  • ARMv7 (older mobile devices)
  • AArch64 (newer mobile devices)

To target something exotic, like homebrew on consoles, you might have recompilers targeting PowerPC, which was very popular as well for two console generations.

This is too much, so I’ve been interested in the prospect of leveraging LLVM instead. If I can just target LLVM IR, LLVM could generate machine code at runtime (ORC JIT) for me, targeting any of these architectures. It is also a good excuse to learn some LLVM internals. We’ll be generating LLVM code with the C++ API, and pass that along to the JIT runtime to generate machine code on the fly.

Ahead-of-time recompilation and re-targeting?

If we target LLVM, we could in theory dump all LLVM IR code we have encountered to disk, optimize the code more aggressively (maybe some LTO), and build everything into a single dynamic library for any target we’d like. Once we have warmed up all the known code paths we should be able to avoid almost all run-time recompilation on subsequent runs. I wonder how practical it is, but it’s something I’d like to experiment with.

Patching speed critical sections with native C?

If we can dump LLVM IR to disk, it doesn’t seem too far-fetched to replace functions at known addresses with our own native versions written in C or something.

Recompilation efficiency

A big advantage of having a dedicated recompiler is how quickly the code can be generated as it barely needs to qualify as a compiler to get the job done. LLVM is a complex beast which needs to target all sorts of use cases, and using it as a just-in-time compiler like this is going to create some performance issues. It will be interesting to see how slow it is in practice.

Why MIPS?

MIPS is found in lots of gaming console hardware from the 90s and early 00s.

  • PlayStation 1
  • Nintendo 64 (CPU and RSP)
  • PlayStation 2
  • PlayStation Portable

MIPS is also a very simple ISA to understand and get started with. The core, original MIPS I ISA is probably the simplest, practical ISA to learn, and it’s often part of the curriculum when studying for an electronics degree. As part of my undergrad digital design course, we hacked on a trivial MIPS CPU core in VHDL, which was very fun and educational.

What should we emulate?

I felt like doing something I could get results from quickly, so rather than emulating a full game console, I wanted to try pretending to be a MIPS Linux kernel, running fully fleshed-out MIPS ELF binaries compiled with GCC 8.2 backed by glibc and libstdc++. That way I could build my way up from running trivial test cases written in assembly without any run-time all the way to emulating complicated programs using STL and the modern C/C++ run-times. We can also get a much better picture of performance differences between a natively compiled C++ application compared to a recompiled one.

The MIPS we’re going to target is a 32-bit MIPS I with whatever extra instructions we need when running real applications. GCC can target MIPS I with -march=mips1, but there are a few instructions from MIPS II and up GCC will use anyways to run any glibc application, due to a couple extra fundamental features:

  • RDHWR – Reads a hardware register, used to query thread local storage (TLS). To run any non-trivial C program, we need this because of errno, which is defined by POSIX to be thread local.
  • LL/SC – Load linked and store conditional serve as the backbone for atomic operations. glibc needs it in a few places. We’re not going to bother with threading, so we can trivially implement it as normal load/store.

As for endianness, MIPS can support both little and big-endian. Little is the easiest to start with since it matches our target hardware, but we’ll want to support big-endian as well, as it is the default, and the only MIPS endianness I know of in the wild.
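
In practice, the endianness only changes how bytes in emulated memory are assembled into words. A small sketch with made-up helper names:

```cpp
#include <cstdint>

// Assembles a 32-bit word from memory holding the least significant byte first.
inline std::uint32_t load32_le(const std::uint8_t *p)
{
    return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
           (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Assembles a 32-bit word from memory holding the most significant byte first.
inline std::uint32_t load32_be(const std::uint8_t *p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}
```

On a little-endian host, the little-endian case collapses into a plain load, which is why it is the easier one to start with. The big-endian case needs a byte swap somewhere.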

User-space Linux application – ELF parsing and syscalls

Our program will need to read an ELF file, set up a virtual 32-bit address space and begin execution. We will need to implement various Linux syscalls required to host a real application. Common Linux syscalls like:

  • open
  • read
  • write
  • exit
  • kill
  • llseek
  • mmap
  • munmap
  • brk

… and so on will be enough to get us over the printf stage. We do not have to concern ourselves with interrupts, CPU exceptions, memory-mapped I/O and other complicated things a game console emulator would have to deal with, fortunately.

Step #0 – Getting a MIPS cross compiler

The first step is getting some code to compile to MIPS. This took a surprisingly long time as the PKGBUILDs for Arch Linux did not work properly with some weird incompatibilities. To cut the long story short, I made some PKGBUILDs for Arch Linux which worked for me to create fully functional cross-compilers for both little and big-endian 32-bit MIPS. https://github.com/Themaister/MIPS-Toolchain-PKGBUILD/

To build a little-endian binary once you build the toolchain:

mipsel-linux-gnu-gcc -o test test.c -static -march=mips1 -O2

# CMake toolchain file for little endian.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR mipsel)
set(CMAKE_C_COMPILER mipsel-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER mipsel-linux-gnu-g++)
set(CMAKE_C_FLAGS "-mgp32 -march=mips1")
set(CMAKE_CXX_FLAGS "-mgp32 -march=mips1")
set(CMAKE_ASM_FLAGS "-mgp32 -march=mips1")
set(CMAKE_C_LINK_FLAGS "-static")
set(CMAKE_CXX_LINK_FLAGS "-static")

The -static flag is important as we do not want to have to deal with dynamically loading the C runtime in our ELF loader. Fortunately, static libgcc/libstdc++ seems to work just fine for our purpose here.

Step #1 – Parsing an ELF file

Before starting I did not know much about ELF (Executable and Linkable Format). It is the executable format used on Linux and many other systems. It is surprisingly simple when we just want to run a statically linked executable like this. The readelf tool (mipsel-linux-gnu-readelf) is helpful for studying the ELF file and understanding what’s going on.

# mipsel-linux-gnu-readelf -h mips.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0x400630
  Start of program headers:          52 (bytes into file)
  Start of section headers:          628984 (bytes into file)
  Flags:                             0x1007, noreorder, pic, cpic, o32, mips1
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 37

These are the first few bytes of the binary. The structure is defined in the system header “elf.h” on Linux, which is very handy when we want to write a parser. There are a few things we care about here:

  • We can verify that we have a MIPS binary, and in which endianness.
  • What the entry point address is, i.e. where do we start executing.
  • How to find the program headers
  • How to find the section headers
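
A rough sketch of these checks, using the structures from elf.h. ElfInfo and parse_mips_elf_header are made-up names for illustration, and the sketch assumes a little-endian ELF on a little-endian host; a big-endian file would need byte swaps on every multi-byte field.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <elf.h>

// Hypothetical summary of the header fields we care about.
struct ElfInfo
{
    std::uint32_t entry; // where execution starts
    std::uint32_t phoff; // file offset of the program headers
    std::uint16_t phnum; // number of program headers
    std::uint32_t shoff; // file offset of the section headers
    std::uint16_t shnum; // number of section headers
};

bool parse_mips_elf_header(const std::uint8_t *data, std::size_t size, ElfInfo &info)
{
    Elf32_Ehdr ehdr;
    if (size < sizeof(ehdr))
        return false;
    std::memcpy(&ehdr, data, sizeof(ehdr));

    // Magic bytes: 0x7f 'E' 'L' 'F'.
    if (std::memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
        return false;
    // 32-bit, little-endian only in this sketch.
    if (ehdr.e_ident[EI_CLASS] != ELFCLASS32)
        return false;
    if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB)
        return false;
    // A statically linked MIPS executable.
    if (ehdr.e_type != ET_EXEC || ehdr.e_machine != EM_MIPS)
        return false;

    info.entry = ehdr.e_entry;
    info.phoff = ehdr.e_phoff;
    info.phnum = ehdr.e_phnum;
    info.shoff = ehdr.e_shoff;
    info.shnum = ehdr.e_shnum;
    return true;
}
```

For the binary dumped above, this would report an entry point of 0x400630 and 7 program headers starting 52 bytes into the file.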

The program headers

The program headers contain information about where data is located in the binary and what to do with it. We only care about LOAD and TLS blobs.

# mipsel-linux-gnu-readelf -l mips.elf                                            

Elf file type is EXEC (Executable file)
Entry point 0x400630
There are 7 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  ABIFLAGS       0x000138 0x00400138 0x00400138 0x00018 0x00018 R   0x8
  REGINFO        0x000150 0x00400150 0x00400150 0x00018 0x00018 R   0x4
  LOAD           0x000000 0x00400000 0x00400000 0x813d4 0x813d4 R E 0x10000
  LOAD           0x082000 0x00492000 0x00492000 0x041e0 0x04f9c RW  0x10000
  NOTE           0x000114 0x00400114 0x00400114 0x00020 0x00020 R   0x4
  NOTE           0x000168 0x00400168 0x00400168 0x00024 0x00024 R   0x4
  TLS            0x0820b0 0x004920b0 0x004920b0 0x00010 0x00030 R   0x4

...

We see two LOAD blocks. One has the read/execute flags set. This is our code segment, which we load from “Offset” in the ELF file into the virtual address “VirtAddr”. The read/write block is the data segment, where global variables live. Note that FileSiz may differ from MemSiz. This supports zero-initialized global variables: they take no space in the file and simply need zeroed memory allocated to them.
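
Loading a LOAD block then boils down to a copy followed by a zero fill. A minimal sketch, where load_segment is a made-up helper and a flat byte vector stands in for the emulated address space for brevity:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies filesz bytes from the file at `offset` to `vaddr` in memory, then
// zero-fills the remaining (memsz - filesz) bytes. Returns false if any of
// the ranges are out of bounds.
bool load_segment(std::vector<std::uint8_t> &memory,
                  const std::vector<std::uint8_t> &file,
                  std::size_t offset, std::size_t vaddr,
                  std::size_t filesz, std::size_t memsz)
{
    if (filesz > memsz)
        return false;
    if (offset > file.size() || filesz > file.size() - offset)
        return false;
    if (vaddr > memory.size() || memsz > memory.size() - vaddr)
        return false;

    std::copy_n(file.begin() + offset, filesz, memory.begin() + vaddr);
    std::fill_n(memory.begin() + vaddr + filesz, memsz - filesz, 0);
    return true;
}
```

For the data segment above, that would mean copying 0x41e0 bytes from the file and clearing the remaining bytes up to 0x4f9c.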

TLS is a bit special, it marks which global data needs to be thread local. Any data here needs to be copied out to a new allocation per-thread if we create a new one (we won’t, but we still need to implement it).

The section headers

The section headers don’t seem vital to execution, but they contain the symbol table, which is useful for debugging.

The virtual address space

Linux has a virtual address space, so we need to copy all relevant data out to our own virtualized address space. This is simply represented as a page table (4k pages) spanning the entire 32-bit address space. Through our syscall emulation, applications/glibc can use mmap() to either allocate memory dynamically or memory map files. We cannot assume a fixed address space layout if we want to emulate Linux binaries.

Having an address space like this means all memory accesses become indirect. This will certainly be a performance problem. There might be a better way if we abuse mmap(), but that sounds very hard.
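
To illustrate the indirection, here is a minimal sketch of such a page table. PagedMemory is a made-up class for illustration, not the actual implementation. The upper 20 bits of an address select the page and the lower 12 bits index into it.

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

class PagedMemory
{
public:
    static constexpr std::uint32_t PageBits = 12;
    static constexpr std::uint32_t PageSize = 1u << PageBits; // 4 KiB
    static constexpr std::uint32_t PageCount = 1u << (32 - PageBits);

    PagedMemory() : pages(PageCount) {}

    // Maps a zero-filled page covering addr, unless one is already mapped.
    void map_page(std::uint32_t addr)
    {
        auto &page = pages[addr >> PageBits];
        if (!page)
            page = std::make_unique<Page>();
    }

    // Every access looks up the page first, then indexes into it.
    // Touching an unmapped page fails, which a real emulator would
    // treat as a segmentation fault.
    bool read8(std::uint32_t addr, std::uint8_t &value) const
    {
        const auto &page = pages[addr >> PageBits];
        if (!page)
            return false;
        value = (*page)[addr & (PageSize - 1)];
        return true;
    }

    bool write8(std::uint32_t addr, std::uint8_t value)
    {
        auto &page = pages[addr >> PageBits];
        if (!page)
            return false;
        (*page)[addr & (PageSize - 1)] = value;
        return true;
    }

private:
    using Page = std::array<std::uint8_t, PageSize>;
    std::vector<std::unique_ptr<Page>> pages;
};
```

The extra pointer load and null check on every access is the cost being discussed here. An emulated mmap() would simply call something like map_page() for each page in the requested range.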

Setting up the stack

Before we can call the entry point we must set up a stack. Normally, we would think this stack only contains “argc” and “argv”, à la:

int main(int argc, char **argv)

but we actually need to pass a lot more information. All of this extra information is used by the C runtime entry point, which seems to be called __start in MIPS. We allocate the stack space at the top of the virtual address space, and push some data to the stack. The $sp register will point to:

  • argc
  • argv argument #0 (char *)
  • argv argument #1
  • NULL // Terminates argv
  • environment variable key #0 (char *)
  • environment variable value #0 (char *)
  • environment variable key #1
  • environment variable value #1
  • NULL // Terminates envp

The environment variables are passed in like this, and the C runtime will parse it. However, there is more data we need to pass on the stack on Linux, which was rather surprising. glibc will crash deep into its initialization if this is not done properly.

// ELF AUXV, see <elf.h>
stack_data.push_back(AT_PHDR);
stack_data.push_back(misc.phdr_addr);
stack_data.push_back(AT_PHENT);
stack_data.push_back(ehdr.e_phentsize);
stack_data.push_back(AT_PHNUM);
stack_data.push_back(ehdr.e_phnum);
stack_data.push_back(AT_PAGESZ);
stack_data.push_back(VirtualAddressSpace::PageSize);
stack_data.push_back(AT_ENTRY);
stack_data.push_back(ehdr.e_entry);
stack_data.push_back(AT_UID);
stack_data.push_back(getuid());
stack_data.push_back(AT_EUID);
stack_data.push_back(geteuid());
stack_data.push_back(AT_GID);
stack_data.push_back(getgid());
stack_data.push_back(AT_EGID);
stack_data.push_back(getegid());
stack_data.push_back(AT_RANDOM);
stack_data.push_back(stack_top); // Just point to something. glibc needs this.
stack_data.push_back(AT_NULL);

glibc needs a pointer to the ELF program headers, along with other information such as user IDs and the page size. The headers are used to set up TLS properly. It also needs a pointer to random bytes supplied by the Linux kernel, which it uses in early initialization to set up stack protection canaries.

Now, everything is set up, and we can start executing … I mean generating some LLVM IR … in part 2.