An unusual recompiler experiment – MIPS to LLVM IR – Part 3

In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Strap your seatbelts, we’re going to MIPS and x86 assembly land.

The top-level run loop

The top level code fundamentally needs to be able to translate the program counter (short-hand, PC) to an executable function pointer. We can choose a hash map (large address space) or flat array (small address space) here.

If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.

Self-modifying code?

An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to function pointer and do some analysis of code blocks we don’t have a function pointer for yet. Any i-cache invalidations will clear out the relevant function pointers which triggers hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.

For our purpose, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code should be rarer and rarer the more modern hardware we’re dealing with.

Recompiling a function

So, given a PC to execute, we’ll do some analysis where we map out all execution paths from that PC. We do this by mapping out all the basic blocks. See part 2 for more detail on what basic blocks do in LLVM.

Basic block

Basic blocks are represented as a starting PC and an end, where the execution flow is linear. The end of a basic block occurs where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these “special” instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.

Branch delay slots

A very important part of MIPS is the use of a branch delay slot. It is a very unique design aspect of the architecture, which is considered a design flaw today because it was hard-coded to help a very specific micro-architecture. Exposing micro-architecture details like this should be considered bad taste. Whenever a branch is taken, unconditionally or not, the next instruction is always executed. Let’s see a trivial example:

int foo(int a)
{
	return a + 10;
}
00000000 <foo>:
   0:	03e00008 	jr	ra
   4:	2482000a 	addiu	v0,a0,10

“jr $ra” jumps to an address stored in a register, and $ra is used for the return address of a function. However, we can see that the add instruction comes afterwards. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see “nop” after a branch if there is nothing useful to do in the delay slot.

One thought you might have now is, what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in MIPS ISA, because it is non-sensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if there is any exception or hardware interrupt happening in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix-up the PC register accordingly.

The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is also a branch instruction, we know that it cannot legally be taken, and branch instructions cannot have side effects (except for jal/jalr, but those always take the branch, illegal!), so we just skip it.

Load delay slots

MIPS I also has a delay for loads. We cannot use the target register in the instruction following a load. However, for recompilation purpose we can ignore this. While clever code might attempt to abuse the fact that a target register for a load instruction hasn’t been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyways, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot because correct code cannot rely on this hack.

Conditional branches

MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don’t. We recursively analyze the new basic blocks we found if their target PCs haven’t been analyzed already. Some instructions to look out for are

  • BEQ
  • BNE
  • BLEZ
  • BLTZ
  • BGEZ
  • BGTZ
  • BC1T (floating point compare)
  • BC1F (floating point compare)

Direct branches

The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:

  • Branch to basic block
  • Tail call to an unrelated function

If we mistakenly treat a J as a basic block where it should have been a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.

#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
	return malloc(size);
}

void *my_malloc(size_t size)
{ 
	return wrapped_malloc(size * 4); // Tail-call
}
00000000 <wrapped_malloc>:
   0:	3c1c0000 	lui	gp,0x0
   4:	279c0000 	addiu	gp,gp,0
   8:	8f990000 	lw	t9,0(gp)
   c:	00000000 	nop
  10:	03200008 	jr	t9
  14:	00000000 	nop

00000018 <my_malloc>:
  18:	08000000 	j	0 <wrapped_malloc>
  1c:	00042080 	sll	a0,a0,0x2

Here we need to see that J is actually a tail call to “wrapped_malloc”, and not a branch to a basic block. The heuristic I ended up with was that if the J target refers to a basic block through the use of conditional branches elsewhere, we can assume J refers to a branch to a basic block ala if/else or switch blocks. If not, we assume it’s a tail call.

There are other static branches we can find in MIPS. The conditional branches can become static branches if the $0 register is used. This seems to be mostly useful with position-independent code since we can branch to an address relative to our PC rather than fixed address with J. We should try to detect these “static” branches as well. There is no need in analyzing code which can never be executed.

Indirect branches

Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:

// JR
uint32_t target_pc = registers[instr.register];
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    predicition_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}

There are a few main use cases for JR:

  • Returning from a function, almost always using “jr $ra”.
  • Jump tables
  • Tail calls in dynamically loaded code.

We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.

Illegal instructions

If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.

Putting it together

Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.

int number_of_even(const int *values, int count)
{
	int res = 0;
	for (int i = 0; i < count; i++)
		if ((values[i] & 1) == 0)
			res++;
	return res;
}
 00000000 <number_of_even>:
  // ConditionalBranch -> 0x38 or 0x8
  // Note that the basic block does not end until after
  // the delay slot has executed.
   0:	18a0000d 	blez	a1,38 <number_of_even+0x38>
   4:	00052880 	sll	a1,a1,0x2

  // ConditionalBranch -> 0x28 or 0x24
   8:	00852821 	addu	a1,a0,a1
   c:	00001025 	move	v0,zero
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x28 or 0x24
  // Should split up here because 0x10 is a branch target.
  // Current implementation does not split up basic blocks
  // to allow branching to the middle of another basic block.
  // Instead we end up duplicating some code.
  10:	8c830000 	lw	v1,0(a0)
  14:	00000000 	nop // A wild load-delay slot appears.
  18:	30630001 	andi	v1,v1,0x1
  1c:	14600002 	bnez	v1,28 <number_of_even+0x28>
  20:	24840004 	addiu	a0,a0,4

  // ConditionalBranch -> 0x30 or 0x10
  24:	24420001 	addiu	v0,v0,1
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // ConditionalBranch -> 0x30 or 0x10
  // Same here w.r.t. code duplication,
  // 0x24 and 0x28 are both branch targets.
  28:	14a4fff9 	bne	a1,a0,10 <number_of_even+0x10>
  2c:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  30:	03e00008 	jr	ra
  34:	00000000 	nop

  // Indirect branch -> terminates graph, tail call or return.
  38:	03e00008 	jr	ra
  3c:	00001025 	move	v0,zero

Once we know all the basic blocks, we can create LLVM basic blocks for them, and then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is fairly ISA agnostic actually and it’s not that hard to change MIPS into something else once the basic structure is in place. The recompiler itself which sets up this is actually completely MIPS agnostic, it only asks for “given a start PC, where does the basic block end, and what kind of basic block is it.”

Register allocation and branches

While we’re working on registers, we ideally want the MIPS registers to be reflected by our native hardware registers. It’s obvious a 1:1 mapping is not possible. MIPS has 32 (well, 31) general purpose registers, 32 floating-point registers and various control registers. This isn’t going to fit on x86 or Arm.

Fortunately, we do not have to really care about register allocation when using LLVM. We just need to make sure we don’t emit CreateStore/CreateLoad as much as possible, and LLVM should take care of the rest. Within a basic block, this is very easy since we always know which SSA value a register refers to as the control flow is linear. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven’t used a register yet, load it from memory, if we modify a register, just replace the SSA value and remember that we eventually have to write it back to memory later, i.e. the register bank.

The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we will have a real hard time handling register dirty tracking. If we have say a basic block C which can be entered from basic block A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we’ll need to create one for every possible register all predecessors of C might have touched. We also don’t really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb. Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block or before a function call, all dirty registers are flushed to memory. On entry of a basic block, we will have to load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize for this, the register bank should be 100% hot in L1 cache, so I’m not too worried about performance. x86 is a very register starved architecture to begin with and moving data to and from L1 cache is very common.

Call instructions

MIPS has several ways of “calling” functions. These functions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.

  • JAL
  • JALR
  • BLTZAL
  • BGEZAL
  • J (deduced tail call)
  • BEQ/BGEZ/BLTZ (deduced position-independent tail call)

The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no “return address is on stack”.

JAL is the easiest one to understand, as it means “call this address”. JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.

Return stack prediction

We want to be as friendly as possible to our CPUs branch predictor. The return instruction is one of the best prediction methods we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch unless we were the subject of a stack smash attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.

Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.

One problem of the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register, and doesn’t actually affect the stack pointer $sp.

To deal with the situation where the return stack grows towards infinity, we will just need to deal with it by setting a rational upper limit. In the worst case where our return stack for some reason grows too large, we can use the nuclear option in our arsenal, longjmp! The top level code uses setjmp, and if at any time we’ve reached a hopeless situation, longjmp unwinds the entire stack at once, and we can re-enter with our new PC. However, this is kinda terrible for performance since all return instructions will now fail to optimize to a simple return, and might have to JIT out random code which followed a call instruction. We’ll hope this never happens for real.

To thunk or not to thunk

While indirect calls must have a lookup to determine what we are actually calling in runtime, it’s possible for direct call instructions to directly call another function in LLVM. In this case, we avoid any runtime lookups. We risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don’t know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline anyways later 😉

Putting it all together

It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        return foo(40, 50);
}
004005ec <main>:
  4005ec:       24050032        li      a1,50
  4005f0:       08100208        j       400820 <foo>
  4005f4:       24040028        li      a0,40

00400820 <foo>:
  400820:       03e00008        jr      ra
  400824:       00851021        addu    v0,a0,a1

Doesn’t get much simpler to start with this test. main calls foo through a tail call, let’s see what the LLVM looks like completely unoptimized:

; ModuleID = '_004005ec'
source_filename = "_004005ec"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004005ec(%0*) {
entry:
  br label %_004005ec

_004005ec:                                        ; preds = %entry
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 40, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 50, i32* %a1Ptr
  tail call void @_00400820(%0* %0)
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %v0 = add i32 %a0Loaded, %a1Loaded
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

The first thing we notice is

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

which expresses the MIPS state which we pass around to our JIT functions. 64 i32 values are reserved for the general purpose registers (32 + a couple other hidden registers), 64 FP registers (32 + a couple extra), and finally, the page table. We inline it in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo()

  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_00400820
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400820
  ret void
}

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

Here we call to our externally defined function in the VM to check if return stack prediction worked. Either we tail call or just simply return. $ra in this case will be 0, and we just end execution here.

The registers are dumped at the end to read:

...
  v0 = 90
  v1 = 0
  a0 = 40
  a1 = 50
...

Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.

Loads and stores

Let’s have a look what happens when we cannot rely on tail calls.

__attribute__((noinline))
int foo(int a, int b)
{
        return a + b;
}

int main(void)
{
        int a = foo(1, 2);
        a += foo(3, 4);
        return a;
}
004005ec <main>:
  4005ec:       27bdffe0        addiu   sp,sp,-32
  4005f0:       24050002        li      a1,2
  4005f4:       afbf001c        sw      ra,28(sp)
  4005f8:       0c100210        jal     400840 <foo>
  4005fc:       24040001        li      a0,1
  400600:       24050004        li      a1,4
  400604:       24040003        li      a0,3
  400608:       0c100210        jal     400840 <foo>
  40060c:       00401825        move    v1,v0
  400610:       8fbf001c        lw      ra,28(sp)
  400614:       00621021        addu    v0,v1,v0
  400618:       03e00008        jr      ra
  40061c:       27bd0020        addiu   sp,sp,32

We only need to load and store to stack, but we’ll see the codegen in action.

_004005ec:                                        ; preds = %entry
  %spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded = load i32, i32* %spPtr
  %sp = add i32 %spLoaded, -32
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %SWAddr = add i32 %sp, 28

; Translate virtual address to page + offset
  %PageIndex = lshr i32 %SWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %SWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  store i32 %raLoaded, i32* %PagePtr

; Flush registers before calling foo
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 2, i32* %a1Ptr
  %spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp, i32* %spPtr1
  %raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195840, i32* %raPtr2

; Predict the return address
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
  call void @_00400840(%0* %0)
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v0Loaded, i32* %v1Ptr
  %a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 3, i32* %a0Ptr3
  %a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4, i32* %a1Ptr4
  %raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 4195856, i32* %raPtr5
  call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
  call void @_00400840(%0* %0)
  %spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  %spLoaded7 = load i32, i32* %spPtr6
  %LWAddr = add i32 %spLoaded7, 28
  %PageIndex8 = lshr i32 %LWAddr, 12
  %Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
  %PageLoaded10 = load i8*, i8** %Page9
  %Page3211 = bitcast i8* %PageLoaded10 to i32*
  %PageOffset12 = lshr i32 %LWAddr, 2
  %PageOffsetMasked13 = and i32 %PageOffset12, 1023
  %PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
  %Loaded = load i32, i32* %PagePtr14
  %v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded16 = load i32, i32* %v0Ptr15
  %v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  %v1Loaded = load i32, i32* %v1Ptr17
  %v0 = add i32 %v1Loaded, %v0Loaded16
  %sp18 = add i32 %spLoaded7, 32
  %v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr19
  %spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
  store i32 %sp18, i32* %spPtr20
  %raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  store i32 %Loaded, i32* %raPtr21
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004005ec
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004005ec
  ret void
}

declare void @__recompiler_predict_return(%0*, i32, i32)

There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?

0000000000000000 <_004005ec>:
   0:   53                      push   %rbx
   1:   48 89 fb                mov    %rdi,%rbx
   4:   8b 47 74                mov    0x74(%rdi),%eax
   7:   8b 4f 7c                mov    0x7c(%rdi),%ecx
   a:   8d 50 e0                lea    -0x20(%rax),%edx
   d:   83 c0 fc                add    $0xfffffffc,%eax
  10:   89 c6                   mov    %eax,%esi
  12:   c1 ee 0c                shr    $0xc,%esi
  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)
  28:   48 b8 01 00 00 00 02    movabs $0x200000001,%rax
  2f:   00 00 00 
  32:   48 89 47 10             mov    %rax,0x10(%rdi)
  36:   89 57 74                mov    %edx,0x74(%rdi)
  39:   c7 47 7c 00 06 40 00    movl   $0x400600,0x7c(%rdi)
  40:   be 40 08 40 00          mov    $0x400840,%esi
  45:   ba 00 06 40 00          mov    $0x400600,%edx
  4a:   e8 00 00 00 00          callq  4f <_004005ec+0x4f>
  4f:   48 89 df                mov    %rbx,%rdi
  52:   e8 79 00 00 00          callq  d0 <_00400840>
  57:   8b 43 08                mov    0x8(%rbx),%eax
  5a:   89 43 0c                mov    %eax,0xc(%rbx)
  5d:   48 b8 03 00 00 00 04    movabs $0x400000003,%rax
  64:   00 00 00 
  67:   48 89 43 10             mov    %rax,0x10(%rbx)
  6b:   c7 43 7c 10 06 40 00    movl   $0x400610,0x7c(%rbx)
  72:   be 40 08 40 00          mov    $0x400840,%esi
  77:   ba 10 06 40 00          mov    $0x400610,%edx
  7c:   48 89 df                mov    %rbx,%rdi
  7f:   e8 00 00 00 00          callq  84 <_004005ec+0x84>
  84:   48 89 df                mov    %rbx,%rdi
  87:   e8 44 00 00 00          callq  d0 <_00400840>
  8c:   8b 43 0c                mov    0xc(%rbx),%eax
  8f:   8b 4b 74                mov    0x74(%rbx),%ecx
  92:   8d 51 1c                lea    0x1c(%rcx),%edx
  95:   89 d6                   mov    %edx,%esi
  97:   c1 ee 0c                shr    $0xc,%esi
  9a:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  a1:   00 
  a2:   c1 ea 02                shr    $0x2,%edx
  a5:   81 e2 ff 03 00 00       and    $0x3ff,%edx
  ab:   8b 34 96                mov    (%rsi,%rdx,4),%esi
  ae:   83 c1 20                add    $0x20,%ecx
  b1:   01 43 08                add    %eax,0x8(%rbx)
  b4:   89 4b 74                mov    %ecx,0x74(%rbx)
  b7:   89 73 7c                mov    %esi,0x7c(%rbx)
  ba:   48 89 df                mov    %rbx,%rdi
  bd:   e8 00 00 00 00          callq  c2 <_004005ec+0xc2>
  c2:   48 85 c0                test   %rax,%rax
  c5:   74 06                   je     cd <_004005ec+0xcd>
  c7:   48 89 df                mov    %rbx,%rdi
  ca:   5b                      pop    %rbx
  cb:   ff e0                   jmpq   *%rax
  cd:   5b                      pop    %rbx
  ce:   c3                      retq   
  cf:   90                      nop

Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:

  15:   48 8b b4 f7 00 02 00    mov    0x200(%rdi,%rsi,8),%rsi
  1c:   00 
  1d:   c1 e8 02                shr    $0x2,%eax
  20:   25 ff 03 00 00          and    $0x3ff,%eax
  25:   89 0c 86                mov    %ecx,(%rsi,%rax,4)

Good news is that this is very straight forward code, so the CPU should churn through most of this like butter unless we’re missing the page table reads in L1. It will be interesting to benchmark this code against natively compiled C code later.

Loops

Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.

__attribute__((noinline))
int number_of_even(const int *values, int count)
{
        int res = 0;
        for (int i = 0; i < count; i++)
                if ((values[i] & 1) == 0)
                        res++;
        return res;
}

int main(void)
{
        static const int values[] = { 1, 2, 3, 4 };
        return number_of_even(values, 4); 
}
00400820 <number_of_even>:
  400820:       18a0000d        blez    a1,400858 <number_of_even+0x38>
  400824:       00052880        sll     a1,a1,0x2
  400828:       00852821        addu    a1,a0,a1
  40082c:       00001025        move    v0,zero
  400830:       8c830000        lw      v1,0(a0)
  400834:       00000000        nop
  400838:       30630001        andi    v1,v1,0x1
  40083c:       14600002        bnez    v1,400848 <number_of_even+0x28>
  400840:       24840004        addiu   a0,a0,4
  400844:       24420001        addiu   v0,v0,1
  400848:       14a4fff9        bne     a1,a0,400830 <number_of_even+0x10>
  40084c:       00000000        nop
  400850:       03e00008        jr      ra
  400854:       00000000        nop
  400858:       03e00008        jr      ra
  40085c:       00001025        move    v0,zero

004005ec <main>:
  4005ec:       3c040047        lui     a0,0x47
  4005f0:       24050004        li      a1,4
  4005f4:       08100208        j       400820 <number_of_even>
  4005f8:       2484a330        addiu   a0,a0,-23760
  4005fc:       00000000        nop
define void @_00400820(%0*) {
entry:
  br label %_00400820

_00400820:                                        ; preds = %entry
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded = load i32, i32* %a1Ptr
  %BLEZ = icmp sle i32 %a1Loaded, 0
  %a1 = shl i32 %a1Loaded, 2
  %a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a1, i32* %a1Ptr1
  br i1 %BLEZ, label %_00400858, label %_00400828

_00400858:                                        ; preds = %_00400820
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

_00400828:                                        ; preds = %_00400820
  %a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded3 = load i32, i32* %a1Ptr2
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded = load i32, i32* %a0Ptr
  %a14 = add i32 %a0Loaded, %a1Loaded3
  %LWAddr = add i32 %a0Loaded, 0
  %PageIndex = lshr i32 %LWAddr, 12
  %Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
  %PageLoaded = load i8*, i8** %Page
  %Page32 = bitcast i8* %PageLoaded to i32*
  %PageOffset = lshr i32 %LWAddr, 2
  %PageOffsetMasked = and i32 %PageOffset, 1023
  %PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
  %Loaded = load i32, i32* %PagePtr
  %v1 = and i32 %Loaded, 1
  %BNE = icmp ne i32 %v1, 0
  %a0 = add i32 %a0Loaded, 4
  %v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 0, i32* %v0Ptr5
  %v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v1, i32* %v1Ptr
  %a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a0, i32* %a0Ptr6
  %a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 %a14, i32* %a1Ptr7
  br i1 %BNE, label %_00400848, label %_00400844

_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded9 = load i32, i32* %a0Ptr8
  %a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded11 = load i32, i32* %a1Ptr10
  %BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
  br i1 %BNE12, label %_00400830, label %_00400850

_00400830:                                        ; preds = %_00400844, %_00400848
  %a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded14 = load i32, i32* %a0Ptr13
  %LWAddr15 = add i32 %a0Loaded14, 0
  %PageIndex16 = lshr i32 %LWAddr15, 12
  %Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
  %PageLoaded18 = load i8*, i8** %Page17
  %Page3219 = bitcast i8* %PageLoaded18 to i32*
  %PageOffset20 = lshr i32 %LWAddr15, 2
  %PageOffsetMasked21 = and i32 %PageOffset20, 1023
  %PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
  %Loaded23 = load i32, i32* %PagePtr22
  %v124 = and i32 %Loaded23, 1
  %BNE25 = icmp ne i32 %v124, 0
  %a026 = add i32 %a0Loaded14, 4
  %v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
  store i32 %v124, i32* %v1Ptr27
  %a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 %a026, i32* %a0Ptr28
  br i1 %BNE25, label %_00400848, label %_00400844

_00400844:                                        ; preds = %_00400830, %_00400828
  %v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  %v0Loaded = load i32, i32* %v0Ptr29
  %v0 = add i32 %v0Loaded, 1
  %a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  %a0Loaded31 = load i32, i32* %a0Ptr30
  %a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  %a1Loaded33 = load i32, i32* %a1Ptr32
  %BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
  %v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 %v0, i32* %v0Ptr35
  br i1 %BNE34, label %_00400830, label %_00400850

_00400850:                                        ; preds = %_00400844, %_00400848
  %raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded37 = load i32, i32* %raPtr36
  %jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
  %jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
  br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40

IndirectJumpPath:                                 ; preds = %_00400858
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_00400858
  ret void

IndirectJumpPath39:                               ; preds = %_00400850
  tail call void %jump_addr38(%0* %0)
  ret void

IndirectJumpReturn40:                             ; preds = %_00400850
  ret void
}

Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 47 14                mov    0x14(%rdi),%eax
  17:   8d 0c 85 00 00 00 00    lea    0x0(,%rax,4),%ecx
  1e:   85 c0                   test   %eax,%eax
  20:   89 4f 14                mov    %ecx,0x14(%rdi)
  23:   7f 19                   jg     3e <_00400820+0x2e>
  25:   8b 73 7c                mov    0x7c(%rbx),%esi
  28:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  2f:   48 89 df                mov    %rbx,%rdi
  32:   e8 00 00 00 00          callq  37 <_00400820+0x27>
  37:   48 85 c0                test   %rax,%rax
  3a:   75 50                   jne    8c <_00400820+0x7c>
  3c:   5b                      pop    %rbx
  3d:   c3                      retq   
  3e:   8b 43 10                mov    0x10(%rbx),%eax
  41:   89 c1                   mov    %eax,%ecx
  43:   c1 e9 0c                shr    $0xc,%ecx
  46:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  4d:   00 
  4e:   89 c2                   mov    %eax,%edx
  50:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  56:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  59:   01 43 14                add    %eax,0x14(%rbx)
  5c:   83 c0 04                add    $0x4,%eax
  5f:   83 e1 01                and    $0x1,%ecx
  62:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  69:   89 4b 0c                mov    %ecx,0xc(%rbx)
  6c:   89 43 10                mov    %eax,0x10(%rbx)
  6f:   75 23                   jne    94 <_00400820+0x84>
  71:   8b 43 14                mov    0x14(%rbx),%eax
  74:   ff 43 08                incl   0x8(%rbx)
  77:   3b 43 10                cmp    0x10(%rbx),%eax
  7a:   75 20                   jne    9c <_00400820+0x8c>
  7c:   8b 73 7c                mov    0x7c(%rbx),%esi
  7f:   48 89 df                mov    %rbx,%rdi
  82:   e8 00 00 00 00          callq  87 <_00400820+0x77>
  87:   48 85 c0                test   %rax,%rax
  8a:   74 06                   je     92 <_00400820+0x82>
  8c:   48 89 df                mov    %rbx,%rdi
  8f:   5b                      pop    %rbx
  90:   ff e0                   jmpq   *%rax
  92:   5b                      pop    %rbx
  93:   c3                      retq   
  94:   8b 43 14                mov    0x14(%rbx),%eax
  97:   3b 43 10                cmp    0x10(%rbx),%eax
  9a:   74 e0                   je     7c <_00400820+0x6c>
  9c:   8b 43 10                mov    0x10(%rbx),%eax
  9f:   89 c1                   mov    %eax,%ecx
  a1:   c1 e9 0c                shr    $0xc,%ecx
  a4:   48 8b 8c cb 00 02 00    mov    0x200(%rbx,%rcx,8),%rcx
  ab:   00 
  ac:   89 c2                   mov    %eax,%edx
  ae:   81 e2 fc 0f 00 00       and    $0xffc,%edx
  b4:   8b 0c 11                mov    (%rcx,%rdx,1),%ecx
  b7:   83 e1 01                and    $0x1,%ecx
  ba:   83 c0 04                add    $0x4,%eax
  bd:   89 4b 0c                mov    %ecx,0xc(%rbx)
  c0:   89 43 10                mov    %eax,0x10(%rbx)
  c3:   85 c9                   test   %ecx,%ecx
  c5:   74 aa                   je     71 <_00400820+0x61>
  c7:   eb cb                   jmp    94 <_00400820+0x84>

Not a lot of register use here, what happens if the run the LLVM through opt first though?

; ....
_00400848:                                        ; preds = %_00400830, %_00400828
  %a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
  %v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
  %BNE12 = icmp eq i32 %a14, %a0Loaded9
  br i1 %BNE12, label %_00400850, label %_00400830

_00400830:                                        ; preds = %_00400848, %_00400844
  %a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
  %v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...

Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …

0000000000000010 <_00400820>:
  10:   53                      push   %rbx
  11:   48 89 fb                mov    %rdi,%rbx
  14:   8b 4f 14                mov    0x14(%rdi),%ecx
  17:   8d 04 8d 00 00 00 00    lea    0x0(,%rcx,4),%eax
  1e:   85 c9                   test   %ecx,%ecx
  20:   89 47 14                mov    %eax,0x14(%rdi)
  23:   0f 8e 84 00 00 00       jle    ad <_00400820+0x9d>
  29:   8b 53 10                mov    0x10(%rbx),%edx
  2c:   01 d0                   add    %edx,%eax
  2e:   89 d6                   mov    %edx,%esi
  30:   8d 4a 04                lea    0x4(%rdx),%ecx
  33:   48 c1 ea 0c             shr    $0xc,%rdx
  37:   48 8b 94 d3 00 02 00    mov    0x200(%rbx,%rdx,8),%rdx
  3e:   00 
  3f:   81 e6 fc 0f 00 00       and    $0xffc,%esi
  45:   8b 34 32                mov    (%rdx,%rsi,1),%esi
  48:   31 d2                   xor    %edx,%edx
  4a:   83 e6 01                and    $0x1,%esi
  4d:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  54:   89 73 0c                mov    %esi,0xc(%rbx)
  57:   89 4b 10                mov    %ecx,0x10(%rbx)
  5a:   89 43 14                mov    %eax,0x14(%rbx)
  5d:   74 2f                   je     8e <_00400820+0x7e>
  5f:   39 c8                   cmp    %ecx,%eax
  61:   74 34                   je     97 <_00400820+0x87>
  63:   89 ce                   mov    %ecx,%esi
  65:   c1 ee 0c                shr    $0xc,%esi
  68:   48 8b b4 f3 00 02 00    mov    0x200(%rbx,%rsi,8),%rsi
  6f:   00 
  70:   89 cf                   mov    %ecx,%edi
  72:   c1 ef 02                shr    $0x2,%edi
  75:   81 e7 ff 03 00 00       and    $0x3ff,%edi
  7b:   8b 34 be                mov    (%rsi,%rdi,4),%esi
  7e:   83 e6 01                and    $0x1,%esi
  81:   83 c1 04                add    $0x4,%ecx
  84:   89 73 0c                mov    %esi,0xc(%rbx)
  87:   89 4b 10                mov    %ecx,0x10(%rbx)
  8a:   85 f6                   test   %esi,%esi
  8c:   75 d1                   jne    5f <_00400820+0x4f>
  8e:   ff c2                   inc    %edx
  90:   39 c8                   cmp    %ecx,%eax
  92:   89 53 08                mov    %edx,0x8(%rbx)
  95:   75 cc                   jne    63 <_00400820+0x53>
  97:   8b 73 7c                mov    0x7c(%rbx),%esi
  9a:   48 89 df                mov    %rbx,%rdi
  9d:   e8 00 00 00 00          callq  a2 <_00400820+0x92>
  a2:   48 85 c0                test   %rax,%rax
  a5:   74 1d                   je     c4 <_00400820+0xb4>
  a7:   48 89 df                mov    %rbx,%rdi
  aa:   5b                      pop    %rbx
  ab:   ff e0                   jmpq   *%rax
  ad:   8b 73 7c                mov    0x7c(%rbx),%esi
  b0:   c7 43 08 00 00 00 00    movl   $0x0,0x8(%rbx)
  b7:   48 89 df                mov    %rbx,%rdi
  ba:   e8 00 00 00 00          callq  bf <_00400820+0xaf>
  bf:   48 85 c0                test   %rax,%rax
  c2:   75 e3                   jne    a7 <_00400820+0x97>
  c4:   5b                      pop    %rbx
  c5:   c3                      retq 

I suspect some of the issues are related with lack of noalias attributes. LLVM might think that storing to virtual memory might alias with the register bank, and generate very conservative code. Something to have a look at later.

Optimizing well-behaved calls

If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and checks for JR. If jr $ra is seen, we statically translate that to a return.

Floating point

In MIPS I, floating point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move to and from integer registers, and fiddle with the control register. The control register controls rounding modes. I haven’t bothered emulating correct rounding modes for now, but the control register is used to deal with floating point conditional branches, so the register needs to be emulated at least. Just like SSE, the actual data type of the FP register can vary depending on the instruction, so we will need a lot of bitcasts, fortunately, this is a native construct in LLVM.

Let’s try implementing an FMA loop for good measure.

__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
        float res = 0.0f;
        for (int i = 0; i < count; i++)
                res += a[i] * b[i];
        return res;
}

int main(void)
{
        const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
        const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
        float result = my_fma(as, bs, 4);
        return (int)result;
}
004008c0 <my_fma>:
  4008c0:       18c0000c        blez    a2,4008f4 <my_fma+0x34>
  4008c4:       00063080        sll     a2,a2,0x2
  4008c8:       44800000        mtc1    zero,$f0
  4008cc:       00863021        addu    a2,a0,a2
  4008d0:       c4820000        lwc1    $f2,0(a0)
  4008d4:       c4a40000        lwc1    $f4,0(a1)
  4008d8:       24840004        addiu   a0,a0,4
  4008dc:       46041082        mul.s   $f2,$f2,$f4
  4008e0:       24a50004        addiu   a1,a1,4
  4008e4:       14c4fffa        bne     a2,a0,4008d0 <my_fma+0x10>
  4008e8:       46020000        add.s   $f0,$f0,$f2
  4008ec:       03e00008        jr      ra
  4008f0:       00000000        nop
  4008f4:       44800000        mtc1    zero,$f0
  4008f8:       03e00008        jr      ra
  4008fc:       00000000        nop

We implement the floating point registers by bitcasting all the things, and keeping the register bank as integer always. Otherwise, the code-gen to LLVM IR is reasonably straight forward. In the generated x86-64 we end up seeing the magic instructions we want to see buried in the noise.

...  
  ea:   f3 0f 59 0c 98          mulss  (%rax,%rbx,4),%xmm1
  ef:   f3 0f 58 c1             addss  %xmm1,%xmm0
...

Syscalls

To end this post on a less intense note, let’s write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.

.data
str:
.ascii "Hello World!\n"

.text
.global __start
__start:
# write syscall is 4004
        li $v0, 4004
        li $a0, 1
        la $a1, str
        li $a2, 13
        syscall

# exit syscall is 4001
        li $v0, 4001
        li $a0, 0
        syscall

# Should never get here.
        jr $ra
; ModuleID = '_004000f0'
source_filename = "_004000f0"

%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }

define void @_004000f0(%0*) {
entry:
  br label %_004000f0

_004000f0:                                        ; preds = %entry
  %v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4004, i32* %v0Ptr
  %a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 1, i32* %a0Ptr
  %a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
  store i32 4260128, i32* %a1Ptr
  %a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
  store i32 13, i32* %a2Ptr
  call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
  %v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
  store i32 4001, i32* %v0Ptr1
  %a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
  store i32 0, i32* %a0Ptr2
  call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
  %raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
  %raLoaded = load i32, i32* %raPtr
  %jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
  %jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
  br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn

IndirectJumpPath:                                 ; preds = %_004000f0
  tail call void %jump_addr(%0* %0)
  ret void

IndirectJumpReturn:                               ; preds = %_004000f0
  ret void
}

declare void @__recompiler_syscall(%0*, i32, i32)

declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)

To handle syscalls, we simply create a hook into the VM host, and handle it. The syscall number goes in the $v0 register, and arguments follow as normal. To implement the write syscall we simply need to copy over data from our virtual address space and call write in our native environment.

void MIPS::syscall_write()
{
	int fd = scalar_registers[REG_A0];
	Address addr = scalar_registers[REG_A1];
	uint32_t count = scalar_registers[REG_A2];
	std::vector<uint8_t> output;
	output.resize(count);

	addr_space.copy_from_user(output.data(), addr, count);

	scalar_registers[REG_V0] = write(fd, output.data(), count);

	if (scalar_registers[REG_V0] < 0)
		scalar_registers[REG_A3] = errno;
	else
		scalar_registers[REG_A3] = 0;
}

Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.

Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.

ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.

Conclusion

We have seen how we’re taking MIPS code and turning it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications vs recompiled MIPS applications, stay tuned!