In part 1 and part 2 we laid the groundwork to start recompiling MIPS code to LLVM IR. Fasten your seatbelts, we're going to MIPS and x86 assembly land.
The top-level run loop
The top-level code fundamentally needs to translate the program counter (PC for short) into an executable function pointer. We can choose a hash map (large address space) or a flat array (small address space) here.
If we need to call a PC we have not seen before, we will need to recompile a new LLVM module, starting at that PC, and then we can execute it.
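As a rough sketch of what this dispatch might look like, here is a minimal version built around a hash map. The names MIPSState, JITFunc, recompile_function and mips_resolve_call_target are mine, not from the actual implementation.

#include <cstdint>
#include <unordered_map>

struct MIPSState;                       // Register bank, page table, etc.
using JITFunc = void (*)(MIPSState *);  // Signature shared by all recompiled functions.

// Hypothetical helper: analyzes basic blocks starting at 'pc', emits an LLVM
// module and hands back the JIT-compiled entry point.
JITFunc recompile_function(uint32_t pc);

static std::unordered_map<uint32_t, JITFunc> block_cache;

JITFunc mips_resolve_call_target(uint32_t pc)
{
    auto itr = block_cache.find(pc);
    if (itr != block_cache.end())
        return itr->second;

    // First time we see this PC: recompile a new LLVM module starting here.
    JITFunc func = recompile_function(pc);
    block_cache.emplace(pc, func);
    return func;
}

void run(MIPSState &state, uint32_t entry_pc)
{
    // Each recompiled function runs until it hands control back to us.
    mips_resolve_call_target(entry_pc)(&state);
}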
Self-modifying code?
An immediate question is self-modifying code. This is a fairly ugly topic to deal with since our previously compiled function might become invalid if the underlying code changes. I think the solution for that is to keep a JIT block cache which translates a hash to a function pointer, and do some analysis of code blocks we don't have a function pointer for yet. Any i-cache invalidation will clear out the relevant function pointers, which triggers re-hashing in some form. Most likely the code for our function in particular did not change, so we can likely reuse the code blocks we generated.
For our purposes, we will not deal with self-modifying code here. A real emulator will have to deal with it, but self-modifying code becomes rarer the more modern the hardware we're dealing with.
Recompiling a function
So, given a PC to execute, we'll do some analysis where we map out all execution paths from that PC by finding all of its basic blocks. See part 2 for more detail on what basic blocks do in LLVM.
Basic block
A basic block is represented as a starting PC and an ending PC, with linear execution flow in between. A basic block ends where we see some kind of branch instruction (except for call instructions!). In this analysis we only care about these "special" instructions. Normal opcodes like arithmetic and load/store are ignored since they cannot affect control flow.
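In practice the analysis only needs to record where a block starts, where it ends, and why it ended. A minimal sketch of such a record (the type and field names are mine, not the actual implementation):

#include <cstdint>

// Why a basic block ended. These mirror the cases discussed in the rest of this section.
enum class Terminator
{
    DirectBranch,       // J (or a conditional branch on $0) to a known block
    ConditionalBranch,  // BEQ/BNE/BLEZ/... with two possible successors
    IndirectBranch,     // JR, target only known at runtime
    TailCall,           // J deduced to be a call to an unrelated function
    Illegal             // Illegal instruction, raises SIGILL
};

struct BasicBlock
{
    uint32_t start_pc;          // First instruction of the block
    uint32_t end_pc;            // One past the terminating instruction and its delay slot
    Terminator terminator;      // What kind of basic block this is
    uint32_t static_targets[2]; // Taken / not-taken successors, when known statically
};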
Branch delay slots
A very important part of MIPS is the use of a branch delay slot. It is a unique design aspect of the architecture, and is considered a design flaw today because it hard-codes the behavior of one specific micro-architecture into the ISA. Exposing micro-architectural details like this should be considered bad taste. Whenever a branch instruction is encountered, the instruction right after it is always executed, whether or not the branch is taken. Let's see a trivial example:
int foo(int a)
{
    return a + 10;
}
00000000 <foo>:
0: 03e00008 jr ra
4: 2482000a addiu v0,a0,10
"jr $ra" jumps to an address stored in a register, and $ra holds the return address of a function. However, we can see that the addiu comes after the jump, and it still executes before control leaves the function. GCC exploits the delay slot here to do the useful computation inside it. Note that if you write MIPS assembly, you can get the assembler to perform this reordering for you. Often you will see "nop" after a branch if there is nothing useful to do in the delay slot.
One thought you might have now is, what happens if you have multiple branches back to back, branching in a delay slot? Well, if you actually thought of that then congratulations, have a cookie. This is explicitly banned in the MIPS ISA, because it is nonsensical and undefined. While the hardware behavior could be well defined for a particular chip, it is still extremely broken, because if any exception or hardware interrupt happens in the middle of this sequence, it is impossible to recover from it. MIPS interrupt handlers typically have to deal with delay slots and fix up the PC register accordingly.
The practical effect of the delay slot for us is that whenever we recompile a branch instruction, we recompile the following instruction first, then perform the branch. If the following instruction is also a branch instruction, we know that it cannot legally be taken, and branch instructions cannot have side effects (except for jal/jalr, but those always take the branch, which would be illegal here!), so we just skip it.
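Something like this, then, assuming hypothetical helpers for fetching, classifying and emitting instructions. One detail worth calling out: the branch condition is captured before the delay slot is recompiled, since the delay slot may overwrite the very registers the branch compares. You can see this ordering in the generated IR later in the post.

#include <cstdint>

// Hypothetical helpers, not the actual implementation:
struct Condition {};                                     // SSA value holding the branch condition
Condition emit_branch_condition(uint32_t instr);         // Emit the icmp for this branch
uint32_t read_instruction(uint32_t pc);                  // Fetch a word from guest memory
bool is_branch(uint32_t instr);                          // Classify an opcode
void recompile_instruction(uint32_t pc, uint32_t instr); // Emit IR for one normal opcode
uint32_t branch_target(uint32_t pc, uint32_t instr);     // Decode the branch offset
void emit_conditional_jump(Condition cond, uint32_t taken_pc, uint32_t not_taken_pc);

void recompile_conditional_branch(uint32_t pc, uint32_t branch_instr)
{
    // Capture the condition first; the delay slot might clobber the registers it reads.
    Condition cond = emit_branch_condition(branch_instr);

    // Recompile the delay slot before performing the branch.
    uint32_t delay_slot = read_instruction(pc + 4);
    if (!is_branch(delay_slot))
        recompile_instruction(pc + 4, delay_slot);
    // A branch in a delay slot is illegal in the MIPS ISA, so we simply skip it.

    emit_conditional_jump(cond, branch_target(pc, branch_instr), pc + 8);
}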
Load delay slots
MIPS I also has a delay for loads. We cannot use the target register in the instruction following a load. However, for recompilation purposes we can ignore this. While clever code might attempt to abuse the fact that the target register of a load instruction hasn't been updated yet, this is also unsafe in the real world. If an interrupt triggers in the middle of this sequence, the register will be updated anyways, breaking the assumption of the clever code. Thus, we simply ignore the existence of the load delay slot because correct code cannot rely on this hack.
Conditional branches
MIPS has a few conditional branch instructions. When we see a conditional branch, we can branch to one of two basic blocks. Either we take the branch, or we don't. We recursively analyze the new basic blocks we found if their target PCs haven't been analyzed already. Some instructions to look out for are:
- BEQ
- BNE
- BLEZ
- BLTZ
- BGEZ
- BGTZ
- BC1T (floating point compare)
- BC1F (floating point compare)
Direct branches
The direct branch in MIPS is the “J” instruction. We need to be careful with this instruction because it is commonly used in two ways:
- Branch to basic block
- Tail call to an unrelated function
If we mistakenly treat a J as a basic block where it should have been a function, we will end up inlining huge functions into our own, where we should have just “called” them instead. Too much inlining will bloat the JIT and make recompilation slower. Let’s see an example.
#include <stdlib.h>

// Make sure we don't get inlining optimizations.
__attribute__((noinline))
static void *wrapped_malloc(size_t size)
{
    return malloc(size);
}

void *my_malloc(size_t size)
{
    return wrapped_malloc(size * 4); // Tail-call
}
00000000 <wrapped_malloc>:
0: 3c1c0000 lui gp,0x0
4: 279c0000 addiu gp,gp,0
8: 8f990000 lw t9,0(gp)
c: 00000000 nop
10: 03200008 jr t9
14: 00000000 nop
00000018 <my_malloc>:
18: 08000000 j 0 <wrapped_malloc>
1c: 00042080 sll a0,a0,0x2
Here we need to see that J is actually a tail call to "wrapped_malloc", and not a branch to a basic block. The heuristic I ended up with: if the J target is already referenced as a basic block by conditional branches elsewhere, we assume J is a branch to a basic block, a la if/else or switch blocks. If not, we assume it's a tail call.
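Expressed as code, the heuristic boils down to something like this (a sketch; the set of conditional branch targets is assumed to have been collected during the basic block analysis):

#include <cstdint>
#include <unordered_set>

// Decide whether a J instruction should be treated as a tail call or as a
// plain branch to a basic block within the current function.
bool j_is_tail_call(uint32_t j_target,
                    const std::unordered_set<uint32_t> &conditional_branch_targets)
{
    // If a conditional branch elsewhere already targets this PC, it behaves like
    // part of an if/else or switch construct, so treat J as a branch to a block.
    // Otherwise, assume J leaves the function entirely, i.e. a tail call.
    return conditional_branch_targets.count(j_target) == 0;
}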
There are other static branches we can find in MIPS. The conditional branches become static branches if the $0 register is used. This seems to be mostly useful for position-independent code, since we can branch to an address relative to our PC rather than to a fixed address as with J. We should try to detect these "static" branches as well. There is no need to analyze code which can never be executed.
Indirect branches
Indirect branches in MIPS are a bit tricky to handle. They are implemented using the JR instruction. The edge case we need to handle is that JR is also used to return from a function. Either way, JR will always end a basic block. The implementation logic will end up being something like this:
// JR
uint32_t target_pc = registers[instr.rs]; // Register holding the jump target.
if (target_pc == return_prediction_stack.top())
{
    // This is actually a return!
    return_prediction_stack.pop();
    return;
}
else
{
    // This might have to recompile new code if we haven't seen target_pc before!
    auto *target = mips_resolve_call_target(target_pc);
    return target(mips_state); // Tail-call.
}
There are a few main use cases for JR:
- Returning from a function, almost always using “jr $ra”.
- Jump tables
- Tail calls in dynamically loaded code.
We might be able to add a few optimizations for “well-behaved” code, where we can safely assume that “jr $ra” always means return, and that $ra always refers to the correct return address. That is not guaranteed, but I think GCC will always generate sane code at least.
Illegal instructions
If we find an illegal instruction, we can call out to the VM host, and request a SIGILL signal to be raised to our thread. This also ends the basic block.
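A minimal sketch of what that hook could look like on the VM side. The name __recompiler_sigill is made up here; only the behavior (raise SIGILL on our thread) comes from the text.

#include <csignal>
#include <cstdint>

struct MIPSState;

// Called from JIT code when an illegal instruction is decoded at 'pc'.
extern "C" void __recompiler_sigill(MIPSState *, uint32_t pc)
{
    (void)pc;      // Could be recorded for the signal handler to inspect.
    raise(SIGILL); // Ask the host to deliver SIGILL to our thread.
}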
Putting it together
Now we have gone through all instructions which can trigger an end of a basic block. Let’s take a more complex function and split it up into basic blocks.
int number_of_even(const int *values, int count)
{
    int res = 0;
    for (int i = 0; i < count; i++)
        if ((values[i] & 1) == 0)
            res++;
    return res;
}
00000000 <number_of_even>:
// ConditionalBranch -> 0x38 or 0x8
// Note that the basic block does not end until after
// the delay slot has executed.
0: 18a0000d blez a1,38 <number_of_even+0x38>
4: 00052880 sll a1,a1,0x2
// ConditionalBranch -> 0x28 or 0x24
8: 00852821 addu a1,a0,a1
c: 00001025 move v0,zero
10: 8c830000 lw v1,0(a0)
14: 00000000 nop
18: 30630001 andi v1,v1,0x1
1c: 14600002 bnez v1,28 <number_of_even+0x28>
20: 24840004 addiu a0,a0,4
// ConditionalBranch -> 0x28 or 0x24
// Should split up here because 0x10 is a branch target.
// Current implementation does not split up basic blocks
// to allow branching to the middle of another basic block.
// Instead we end up duplicating some code.
10: 8c830000 lw v1,0(a0)
14: 00000000 nop // A wild load-delay slot appears.
18: 30630001 andi v1,v1,0x1
1c: 14600002 bnez v1,28 <number_of_even+0x28>
20: 24840004 addiu a0,a0,4
// ConditionalBranch -> 0x30 or 0x10
24: 24420001 addiu v0,v0,1
28: 14a4fff9 bne a1,a0,10 <number_of_even+0x10>
2c: 00000000 nop
// ConditionalBranch -> 0x30 or 0x10
// Same here w.r.t. code duplication,
// 0x24 and 0x28 are both branch targets.
28: 14a4fff9 bne a1,a0,10 <number_of_even+0x10>
2c: 00000000 nop
// Indirect branch -> terminates graph, tail call or return.
30: 03e00008 jr ra
34: 00000000 nop
// Indirect branch -> terminates graph, tail call or return.
38: 03e00008 jr ra
3c: 00001025 move v0,zero
Once we know all the basic blocks, we can create LLVM basic blocks for them, then recompile the blocks directly and link them together with BranchInst. This way of analyzing and recompiling is fairly ISA-agnostic, and it's not that hard to swap MIPS out for something else once the basic structure is in place. The recompiler which sets all this up is completely MIPS-agnostic; it only asks "given a start PC, where does the basic block end, and what kind of basic block is it?"
Register allocation and branches
While executing code, we ideally want the MIPS registers to live in our native hardware registers. Obviously a 1:1 mapping is not possible. MIPS has 32 (well, 31) general-purpose registers, 32 floating-point registers and various control registers. This isn't going to fit on x86 or Arm.
Fortunately, we do not really have to care about register allocation when using LLVM. We just need to avoid emitting CreateStore/CreateLoad as much as possible, and LLVM should take care of the rest. Within a basic block this is very easy, since we always know which SSA value a register refers to as the control flow is linear. I implemented a simple RegisterTracker class which lets me translate registers to SSA values. If we haven't used a register yet, we load it from memory; if we modify a register, we just replace the SSA value and remember that we eventually have to write it back to memory, i.e. the register bank, later.
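The actual RegisterTracker isn't shown in this post, but a minimal sketch of the idea could look roughly like this (register_ptr() is assumed to emit a GEP into the register bank of the MIPS state, and the LLVM API calls may differ slightly between versions):

#include <llvm/IR/IRBuilder.h>
#include <cstdint>

class RegisterTracker
{
public:
    explicit RegisterTracker(llvm::IRBuilder<> &builder_) : builder(builder_) {}

    llvm::Value *read(unsigned reg)
    {
        if (!values[reg]) // First use in this block: load from the register bank.
            values[reg] = builder.CreateLoad(builder.getInt32Ty(), register_ptr(reg));
        return values[reg];
    }

    void write(unsigned reg, llvm::Value *value)
    {
        values[reg] = value; // Just replace the SSA value ...
        dirty[reg] = true;   // ... and remember to write it back later.
    }

    void flush() // At the end of a block, or before a call.
    {
        for (unsigned i = 0; i < 32; i++)
            if (dirty[i])
                builder.CreateStore(values[i], register_ptr(i));
    }

private:
    llvm::Value *register_ptr(unsigned reg); // GEP into the register bank (not shown).
    llvm::IRBuilder<> &builder;
    llvm::Value *values[32] = {};
    bool dirty[32] = {};
};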
The real problem is how to deal with branches. We learned last time that to pass values to other basic blocks we can use PHI nodes. I tried implementing a scheme like this, where I would build a full CFG and try to link up register values using PHI nodes, but I gave up. The biggest complication is that our registers can become invalidated when calling other functions (since they modify registers as well), and we will have a really hard time handling register dirty tracking. If we have, say, a basic block C which can be entered from basic blocks A and B, A might write registers 1 through 15 and B might write registers 16 through 31. If we want to use PHI nodes, we'll need to create one for every possible register any predecessor of C might have touched. We also don't really know which registers are dirty and need to be moved back to memory after the function ends, and emitting branches just to conditionally move registers back to memory is dumb.

Because of all these complications and pathological cases I went with a very simple scheme. At the end of a basic block, or before a function call, all dirty registers are flushed to memory. On entry to a basic block, we have to load all the registers we need from memory. Ideally, LLVM should be able to optimize this back to SSA/PHI form, but it might be rather expensive to do so. Even if LLVM does not optimize this, the register bank should be 100% hot in L1 cache, so I'm not too worried about performance. x86 is a very register-starved architecture to begin with, and moving data to and from L1 cache is very common.
Call instructions
MIPS has several ways of "calling" functions. These instructions do not necessarily end a basic block, since we expect control flow to return to the instruction following the branch delay slot.
- JAL
- JALR
- BLTZAL
- BGEZAL
- J (deduced tail call)
- BEQ/BGEZ/BLTZ (deduced position-independent tail call)
The L stands for link, which means that $pc + 8 is written to the return register $ra before jumping. As we saw earlier, we can return by jumping indirectly to $ra. Unlike x86, there is no "return address on the stack".
JAL is the easiest one to understand, as it means "call this address". JALR is a variant where we call a function pointer. BLTZAL and BGEZAL are very interesting as they conditionally call a function. They are also useful for position-independent calls since they use the PC-relative addressing mode. All of these instructions are fundamentally implemented in the same way.
Return stack prediction
We want to be as friendly as possible to our CPU's branch predictor. The return instruction is one of the best prediction mechanisms we can exploit. When we return, the CPU can be almost 100% sure where we are going to branch, unless we were the subject of a stack smashing attack or something. The CPU keeps an internal stack where it expects a return to go, and that can be used to predict returns perfectly if our code is well behaved.
Of course, we cannot assume the code we’re running is perfect, but we can optimize for it. Whenever we are executing a link instruction, we can push the link target to a prediction stack. Whenever we see a JR instruction later, we check if it equals the top of the prediction stack. If so, we can pop the stack and simply return, no extra JIT compilation necessary. If JR is not a return, we might have to compile some more code.
One problem with the return stack is that MIPS code is free to just call JAL over and over and over, since JAL just writes to the link register and doesn't actually touch the stack pointer $sp.
To deal with the situation where the return stack grows towards infinity, we set a reasonable upper limit. In the worst case, where our return stack for some reason grows too large, we can use the nuclear option in our arsenal: longjmp! The top-level code uses setjmp, and if at any point we've reached a hopeless situation, longjmp unwinds the entire stack at once, and we can re-enter with our new PC. However, this is kinda terrible for performance, since all return instructions will now fail to optimize to a simple return, and we might have to JIT out random code which followed a call instruction. We'll hope this never happens for real.
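The names __recompiler_predict_return and __recompiler_jump_indirect show up in the generated IR later in this post; the VM-side bodies below are my own sketch of how they might behave, with an arbitrary stack limit and the longjmp escape hatch.

#include <csetjmp>
#include <cstdint>
#include <vector>

struct MIPSState;
using JITFunc = void (*)(MIPSState *);
JITFunc mips_resolve_call_target(uint32_t pc); // Lookup / recompile, as sketched earlier.

static std::vector<uint32_t> return_prediction_stack;
static jmp_buf jit_unwind_buf; // setjmp() happens in the top-level run loop.

// Called by JIT code on every link instruction: push the expected return address.
extern "C" void __recompiler_predict_return(MIPSState *, uint32_t target_pc, uint32_t expected_return)
{
    (void)target_pc;
    if (return_prediction_stack.size() >= 1024) // Arbitrary upper limit.
        longjmp(jit_unwind_buf, 1);             // Nuclear option: unwind everything.
    return_prediction_stack.push_back(expected_return);
}

// Called by JIT code on JR: returns null when the prediction stack says this is
// a plain return, otherwise returns code to tail-call into.
extern "C" JITFunc __recompiler_jump_indirect(MIPSState *, uint32_t target_pc)
{
    if (!return_prediction_stack.empty() && return_prediction_stack.back() == target_pc)
    {
        return_prediction_stack.pop_back();
        return nullptr;
    }
    return mips_resolve_call_target(target_pc);
}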
To thunk or not to thunk
While indirect calls must do a lookup at runtime to determine what we are actually calling, it's possible for direct call instructions to directly call another LLVM function. In this case, we avoid any runtime lookups. However, we risk recursively having to recompile the callee functions to be able to link such a function, so the initial JIT step can become really slow. I added an option which lets JAL pretend to be JALR and have all call instructions go through an indirection. LLVM can support lazy JIT to alleviate this problem, but I don't know how to make that work, so, meh. Our grand plan is to optimize all of this stuff offline later anyways :)
Putting it all together
It’s time to look at some real code, real MIPS output and the resulting LLVM IR. In the VM, I added a mode which lets me call any function by name. This is very useful to facilitate small test cases, so I don’t have to go through the entire libc init step just to test some basic arithmetic. $ra will be 0, and I treat returning to PC 0 as “I’m done with the test, dump registers”.
__attribute__((noinline))
int foo(int a, int b)
{
    return a + b;
}

int main(void)
{
    return foo(40, 50);
}
004005ec <main>:
4005ec: 24050032 li a1,50
4005f0: 08100208 j 400820 <foo>
4005f4: 24040028 li a0,40
00400820 <foo>:
400820: 03e00008 jr ra
400824: 00851021 addu v0,a0,a1
It doesn't get much simpler than this to start with. main calls foo through a tail call; let's see what the LLVM IR looks like, completely unoptimized:
; ModuleID = '_004005ec'
source_filename = "_004005ec"
%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }
define void @_004005ec(%0*) {
entry:
br label %_004005ec
_004005ec: ; preds = %entry
%a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 40, i32* %a0Ptr
%a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 50, i32* %a1Ptr
tail call void @_00400820(%0* %0)
ret void
}
declare void @__recompiler_predict_return(%0*, i32, i32)
define void @_00400820(%0*) {
entry:
br label %_00400820
_00400820: ; preds = %entry
%raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
%raLoaded = load i32, i32* %raPtr
%a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
%a1Loaded = load i32, i32* %a1Ptr
%a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
%a0Loaded = load i32, i32* %a0Ptr
%v0 = add i32 %a0Loaded, %a1Loaded
%v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 %v0, i32* %v0Ptr
%jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
%jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn
IndirectJumpPath: ; preds = %_00400820
tail call void %jump_addr(%0* %0)
ret void
IndirectJumpReturn: ; preds = %_00400820
ret void
}
declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)
The first thing we notice is
%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }
which describes the MIPS state we pass around to our JIT functions: 64 i32 values reserved for the general-purpose registers (32 + a couple of hidden registers), 64 for the FP registers (32 + a couple extra), and finally the page table. We inline the page table in the struct to be able to load and store memory as efficiently as possible. The code should be fairly easy to follow until we reach the return in foo():
%jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
%jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn
IndirectJumpPath: ; preds = %_00400820
tail call void %jump_addr(%0* %0)
ret void
IndirectJumpReturn: ; preds = %_00400820
ret void
}
declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)
Here we call out to our externally defined function in the VM to check whether the return stack prediction worked. Either we tail call, or we simply return. $ra in this case will be 0, and we just end execution here.
The registers are dumped at the end to read:
...
v0 = 90
v1 = 0
a0 = 40
a1 = 50
...
Very nice! $v0 is the return register in the MIPS ABI and $a0/$a1 are the first and second arguments respectively.
Loads and stores
Let's have a look at what happens when we cannot rely on tail calls.
__attribute__((noinline))
int foo(int a, int b)
{
    return a + b;
}

int main(void)
{
    int a = foo(1, 2);
    a += foo(3, 4);
    return a;
}
004005ec <main>:
4005ec: 27bdffe0 addiu sp,sp,-32
4005f0: 24050002 li a1,2
4005f4: afbf001c sw ra,28(sp)
4005f8: 0c100210 jal 400840 <foo>
4005fc: 24040001 li a0,1
400600: 24050004 li a1,4
400604: 24040003 li a0,3
400608: 0c100210 jal 400840 <foo>
40060c: 00401825 move v1,v0
400610: 8fbf001c lw ra,28(sp)
400614: 00621021 addu v0,v1,v0
400618: 03e00008 jr ra
40061c: 27bd0020 addiu sp,sp,32
We only need to load and store to the stack here, but we get to see the memory codegen in action.
_004005ec: ; preds = %entry
%spPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
%spLoaded = load i32, i32* %spPtr
%sp = add i32 %spLoaded, -32
%raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
%raLoaded = load i32, i32* %raPtr
%SWAddr = add i32 %sp, 28
; Translate virtual address to page + offset
%PageIndex = lshr i32 %SWAddr, 12
%Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
%PageLoaded = load i8*, i8** %Page
%Page32 = bitcast i8* %PageLoaded to i32*
%PageOffset = lshr i32 %SWAddr, 2
%PageOffsetMasked = and i32 %PageOffset, 1023
%PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
store i32 %raLoaded, i32* %PagePtr
; Flush registers before calling foo
%a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 1, i32* %a0Ptr
%a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 2, i32* %a1Ptr
%spPtr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
store i32 %sp, i32* %spPtr1
%raPtr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
store i32 4195840, i32* %raPtr2
; Predict the return address
call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195840)
; Direct call to foo, no indirection needed here.
call void @_00400840(%0* %0)
%v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
%v0Loaded = load i32, i32* %v0Ptr
%v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
store i32 %v0Loaded, i32* %v1Ptr
%a0Ptr3 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 3, i32* %a0Ptr3
%a1Ptr4 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 4, i32* %a1Ptr4
%raPtr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
store i32 4195856, i32* %raPtr5
call void @__recompiler_predict_return(%0* %0, i32 4196416, i32 4195856)
call void @_00400840(%0* %0)
%spPtr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
%spLoaded7 = load i32, i32* %spPtr6
%LWAddr = add i32 %spLoaded7, 28
%PageIndex8 = lshr i32 %LWAddr, 12
%Page9 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex8
%PageLoaded10 = load i8*, i8** %Page9
%Page3211 = bitcast i8* %PageLoaded10 to i32*
%PageOffset12 = lshr i32 %LWAddr, 2
%PageOffsetMasked13 = and i32 %PageOffset12, 1023
%PagePtr14 = getelementptr inbounds i32, i32* %Page3211, i32 %PageOffsetMasked13
%Loaded = load i32, i32* %PagePtr14
%v0Ptr15 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
%v0Loaded16 = load i32, i32* %v0Ptr15
%v1Ptr17 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
%v1Loaded = load i32, i32* %v1Ptr17
%v0 = add i32 %v1Loaded, %v0Loaded16
%sp18 = add i32 %spLoaded7, 32
%v0Ptr19 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 %v0, i32* %v0Ptr19
%spPtr20 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 29
store i32 %sp18, i32* %spPtr20
%raPtr21 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
store i32 %Loaded, i32* %raPtr21
%jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %Loaded)
%jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn
IndirectJumpPath: ; preds = %_004005ec
tail call void %jump_addr(%0* %0)
ret void
IndirectJumpReturn: ; preds = %_004005ec
ret void
}
declare void @__recompiler_predict_return(%0*, i32, i32)
There is a fair bit of noise here with loading and storing to memory. We have to emulate the virtual address space, so that means translating addresses into pages and offsets. What about the x86-64 output?
0000000000000000 <_004005ec>:
0: 53 push %rbx
1: 48 89 fb mov %rdi,%rbx
4: 8b 47 74 mov 0x74(%rdi),%eax
7: 8b 4f 7c mov 0x7c(%rdi),%ecx
a: 8d 50 e0 lea -0x20(%rax),%edx
d: 83 c0 fc add $0xfffffffc,%eax
10: 89 c6 mov %eax,%esi
12: c1 ee 0c shr $0xc,%esi
15: 48 8b b4 f7 00 02 00 mov 0x200(%rdi,%rsi,8),%rsi
1c: 00
1d: c1 e8 02 shr $0x2,%eax
20: 25 ff 03 00 00 and $0x3ff,%eax
25: 89 0c 86 mov %ecx,(%rsi,%rax,4)
28: 48 b8 01 00 00 00 02 movabs $0x200000001,%rax
2f: 00 00 00
32: 48 89 47 10 mov %rax,0x10(%rdi)
36: 89 57 74 mov %edx,0x74(%rdi)
39: c7 47 7c 00 06 40 00 movl $0x400600,0x7c(%rdi)
40: be 40 08 40 00 mov $0x400840,%esi
45: ba 00 06 40 00 mov $0x400600,%edx
4a: e8 00 00 00 00 callq 4f <_004005ec+0x4f>
4f: 48 89 df mov %rbx,%rdi
52: e8 79 00 00 00 callq d0 <_00400840>
57: 8b 43 08 mov 0x8(%rbx),%eax
5a: 89 43 0c mov %eax,0xc(%rbx)
5d: 48 b8 03 00 00 00 04 movabs $0x400000003,%rax
64: 00 00 00
67: 48 89 43 10 mov %rax,0x10(%rbx)
6b: c7 43 7c 10 06 40 00 movl $0x400610,0x7c(%rbx)
72: be 40 08 40 00 mov $0x400840,%esi
77: ba 10 06 40 00 mov $0x400610,%edx
7c: 48 89 df mov %rbx,%rdi
7f: e8 00 00 00 00 callq 84 <_004005ec+0x84>
84: 48 89 df mov %rbx,%rdi
87: e8 44 00 00 00 callq d0 <_00400840>
8c: 8b 43 0c mov 0xc(%rbx),%eax
8f: 8b 4b 74 mov 0x74(%rbx),%ecx
92: 8d 51 1c lea 0x1c(%rcx),%edx
95: 89 d6 mov %edx,%esi
97: c1 ee 0c shr $0xc,%esi
9a: 48 8b b4 f3 00 02 00 mov 0x200(%rbx,%rsi,8),%rsi
a1: 00
a2: c1 ea 02 shr $0x2,%edx
a5: 81 e2 ff 03 00 00 and $0x3ff,%edx
ab: 8b 34 96 mov (%rsi,%rdx,4),%esi
ae: 83 c1 20 add $0x20,%ecx
b1: 01 43 08 add %eax,0x8(%rbx)
b4: 89 4b 74 mov %ecx,0x74(%rbx)
b7: 89 73 7c mov %esi,0x7c(%rbx)
ba: 48 89 df mov %rbx,%rdi
bd: e8 00 00 00 00 callq c2 <_004005ec+0xc2>
c2: 48 85 c0 test %rax,%rax
c5: 74 06 je cd <_004005ec+0xcd>
c7: 48 89 df mov %rbx,%rdi
ca: 5b pop %rbx
cb: ff e0 jmpq *%rax
cd: 5b pop %rbx
ce: c3 retq
cf: 90 nop
Ouch. A lot of this is noise to deal with register moves. We can see the code sequence which performs loads and stores here:
15: 48 8b b4 f7 00 02 00 mov 0x200(%rdi,%rsi,8),%rsi
1c: 00
1d: c1 e8 02 shr $0x2,%eax
20: 25 ff 03 00 00 and $0x3ff,%eax
25: 89 0c 86 mov %ecx,(%rsi,%rax,4)
The good news is that this is very straightforward code, so the CPU should churn through most of it like butter, unless the page table reads miss in L1. It will be interesting to benchmark this code against natively compiled C code later.
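For reference, this is roughly what that load/store sequence computes, written out as plain C++. The struct mirrors the %0 type we saw earlier (64 + 64 registers, then 1 << 20 page pointers, i.e. a 4 GiB address space split into 4 KiB pages); load32/store32 are just illustrative names.

#include <cstdint>

struct MIPSState
{
    uint32_t scalar_registers[64];
    uint32_t float_registers[64];
    uint8_t *pages[1u << 20]; // 4 GiB address space / 4 KiB pages.
};

// Equivalent of the IR emitted for "sw reg, offset(base)".
static inline void store32(MIPSState &state, uint32_t addr, uint32_t value)
{
    uint32_t page_index = addr >> 12;          // Which 4 KiB page?
    uint32_t word_offset = (addr >> 2) & 1023; // Which 32-bit word inside the page?
    auto *page = reinterpret_cast<uint32_t *>(state.pages[page_index]);
    page[word_offset] = value;
}

// Equivalent of the IR emitted for "lw reg, offset(base)".
static inline uint32_t load32(MIPSState &state, uint32_t addr)
{
    uint32_t page_index = addr >> 12;
    uint32_t word_offset = (addr >> 2) & 1023;
    auto *page = reinterpret_cast<const uint32_t *>(state.pages[page_index]);
    return page[word_offset];
}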
Loops
Let’s try to JIT the number_of_even function we made earlier and see if LLVM can preserve data in registers across loop iterations.
__attribute__((noinline))
int number_of_even(const int *values, int count)
{
    int res = 0;
    for (int i = 0; i < count; i++)
        if ((values[i] & 1) == 0)
            res++;
    return res;
}

int main(void)
{
    static const int values[] = { 1, 2, 3, 4 };
    return number_of_even(values, 4);
}
00400820 <number_of_even>:
400820: 18a0000d blez a1,400858 <number_of_even+0x38>
400824: 00052880 sll a1,a1,0x2
400828: 00852821 addu a1,a0,a1
40082c: 00001025 move v0,zero
400830: 8c830000 lw v1,0(a0)
400834: 00000000 nop
400838: 30630001 andi v1,v1,0x1
40083c: 14600002 bnez v1,400848 <number_of_even+0x28>
400840: 24840004 addiu a0,a0,4
400844: 24420001 addiu v0,v0,1
400848: 14a4fff9 bne a1,a0,400830 <number_of_even+0x10>
40084c: 00000000 nop
400850: 03e00008 jr ra
400854: 00000000 nop
400858: 03e00008 jr ra
40085c: 00001025 move v0,zero
004005ec <main>:
4005ec: 3c040047 lui a0,0x47
4005f0: 24050004 li a1,4
4005f4: 08100208 j 400820 <number_of_even>
4005f8: 2484a330 addiu a0,a0,-23760
4005fc: 00000000 nop
define void @_00400820(%0*) {
entry:
br label %_00400820
_00400820: ; preds = %entry
%a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
%a1Loaded = load i32, i32* %a1Ptr
%BLEZ = icmp sle i32 %a1Loaded, 0
%a1 = shl i32 %a1Loaded, 2
%a1Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 %a1, i32* %a1Ptr1
br i1 %BLEZ, label %_00400858, label %_00400828
_00400858: ; preds = %_00400820
%raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
%raLoaded = load i32, i32* %raPtr
%v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 0, i32* %v0Ptr
%jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
%jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn
_00400828: ; preds = %_00400820
%a1Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
%a1Loaded3 = load i32, i32* %a1Ptr2
%a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
%a0Loaded = load i32, i32* %a0Ptr
%a14 = add i32 %a0Loaded, %a1Loaded3
%LWAddr = add i32 %a0Loaded, 0
%PageIndex = lshr i32 %LWAddr, 12
%Page = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex
%PageLoaded = load i8*, i8** %Page
%Page32 = bitcast i8* %PageLoaded to i32*
%PageOffset = lshr i32 %LWAddr, 2
%PageOffsetMasked = and i32 %PageOffset, 1023
%PagePtr = getelementptr inbounds i32, i32* %Page32, i32 %PageOffsetMasked
%Loaded = load i32, i32* %PagePtr
%v1 = and i32 %Loaded, 1
%BNE = icmp ne i32 %v1, 0
%a0 = add i32 %a0Loaded, 4
%v0Ptr5 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 0, i32* %v0Ptr5
%v1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
store i32 %v1, i32* %v1Ptr
%a0Ptr6 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 %a0, i32* %a0Ptr6
%a1Ptr7 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 %a14, i32* %a1Ptr7
br i1 %BNE, label %_00400848, label %_00400844
_00400848: ; preds = %_00400830, %_00400828
%a0Ptr8 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
%a0Loaded9 = load i32, i32* %a0Ptr8
%a1Ptr10 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
%a1Loaded11 = load i32, i32* %a1Ptr10
%BNE12 = icmp ne i32 %a1Loaded11, %a0Loaded9
br i1 %BNE12, label %_00400830, label %_00400850
_00400830: ; preds = %_00400844, %_00400848
%a0Ptr13 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
%a0Loaded14 = load i32, i32* %a0Ptr13
%LWAddr15 = add i32 %a0Loaded14, 0
%PageIndex16 = lshr i32 %LWAddr15, 12
%Page17 = getelementptr inbounds %0, %0* %0, i32 0, i32 2, i32 %PageIndex16
%PageLoaded18 = load i8*, i8** %Page17
%Page3219 = bitcast i8* %PageLoaded18 to i32*
%PageOffset20 = lshr i32 %LWAddr15, 2
%PageOffsetMasked21 = and i32 %PageOffset20, 1023
%PagePtr22 = getelementptr inbounds i32, i32* %Page3219, i32 %PageOffsetMasked21
%Loaded23 = load i32, i32* %PagePtr22
%v124 = and i32 %Loaded23, 1
%BNE25 = icmp ne i32 %v124, 0
%a026 = add i32 %a0Loaded14, 4
%v1Ptr27 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 3
store i32 %v124, i32* %v1Ptr27
%a0Ptr28 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 %a026, i32* %a0Ptr28
br i1 %BNE25, label %_00400848, label %_00400844
_00400844: ; preds = %_00400830, %_00400828
%v0Ptr29 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
%v0Loaded = load i32, i32* %v0Ptr29
%v0 = add i32 %v0Loaded, 1
%a0Ptr30 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
%a0Loaded31 = load i32, i32* %a0Ptr30
%a1Ptr32 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
%a1Loaded33 = load i32, i32* %a1Ptr32
%BNE34 = icmp ne i32 %a1Loaded33, %a0Loaded31
%v0Ptr35 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 %v0, i32* %v0Ptr35
br i1 %BNE34, label %_00400830, label %_00400850
_00400850: ; preds = %_00400844, %_00400848
%raPtr36 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
%raLoaded37 = load i32, i32* %raPtr36
%jump_addr38 = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded37)
%jump_addr_cmp41 = icmp ne void (%0*)* %jump_addr38, null
br i1 %jump_addr_cmp41, label %IndirectJumpPath39, label %IndirectJumpReturn40
IndirectJumpPath: ; preds = %_00400858
tail call void %jump_addr(%0* %0)
ret void
IndirectJumpReturn: ; preds = %_00400858
ret void
IndirectJumpPath39: ; preds = %_00400850
tail call void %jump_addr38(%0* %0)
ret void
IndirectJumpReturn40: ; preds = %_00400850
ret void
}
Again, pretty noisy output, and this is unoptimized output after all. If we look at the x86-64 output, then as expected, it’s pretty bad:
0000000000000010 <_00400820>:
10: 53 push %rbx
11: 48 89 fb mov %rdi,%rbx
14: 8b 47 14 mov 0x14(%rdi),%eax
17: 8d 0c 85 00 00 00 00 lea 0x0(,%rax,4),%ecx
1e: 85 c0 test %eax,%eax
20: 89 4f 14 mov %ecx,0x14(%rdi)
23: 7f 19 jg 3e <_00400820+0x2e>
25: 8b 73 7c mov 0x7c(%rbx),%esi
28: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx)
2f: 48 89 df mov %rbx,%rdi
32: e8 00 00 00 00 callq 37 <_00400820+0x27>
37: 48 85 c0 test %rax,%rax
3a: 75 50 jne 8c <_00400820+0x7c>
3c: 5b pop %rbx
3d: c3 retq
3e: 8b 43 10 mov 0x10(%rbx),%eax
41: 89 c1 mov %eax,%ecx
43: c1 e9 0c shr $0xc,%ecx
46: 48 8b 8c cb 00 02 00 mov 0x200(%rbx,%rcx,8),%rcx
4d: 00
4e: 89 c2 mov %eax,%edx
50: 81 e2 fc 0f 00 00 and $0xffc,%edx
56: 8b 0c 11 mov (%rcx,%rdx,1),%ecx
59: 01 43 14 add %eax,0x14(%rbx)
5c: 83 c0 04 add $0x4,%eax
5f: 83 e1 01 and $0x1,%ecx
62: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx)
69: 89 4b 0c mov %ecx,0xc(%rbx)
6c: 89 43 10 mov %eax,0x10(%rbx)
6f: 75 23 jne 94 <_00400820+0x84>
71: 8b 43 14 mov 0x14(%rbx),%eax
74: ff 43 08 incl 0x8(%rbx)
77: 3b 43 10 cmp 0x10(%rbx),%eax
7a: 75 20 jne 9c <_00400820+0x8c>
7c: 8b 73 7c mov 0x7c(%rbx),%esi
7f: 48 89 df mov %rbx,%rdi
82: e8 00 00 00 00 callq 87 <_00400820+0x77>
87: 48 85 c0 test %rax,%rax
8a: 74 06 je 92 <_00400820+0x82>
8c: 48 89 df mov %rbx,%rdi
8f: 5b pop %rbx
90: ff e0 jmpq *%rax
92: 5b pop %rbx
93: c3 retq
94: 8b 43 14 mov 0x14(%rbx),%eax
97: 3b 43 10 cmp 0x10(%rbx),%eax
9a: 74 e0 je 7c <_00400820+0x6c>
9c: 8b 43 10 mov 0x10(%rbx),%eax
9f: 89 c1 mov %eax,%ecx
a1: c1 e9 0c shr $0xc,%ecx
a4: 48 8b 8c cb 00 02 00 mov 0x200(%rbx,%rcx,8),%rcx
ab: 00
ac: 89 c2 mov %eax,%edx
ae: 81 e2 fc 0f 00 00 and $0xffc,%edx
b4: 8b 0c 11 mov (%rcx,%rdx,1),%ecx
b7: 83 e1 01 and $0x1,%ecx
ba: 83 c0 04 add $0x4,%eax
bd: 89 4b 0c mov %ecx,0xc(%rbx)
c0: 89 43 10 mov %eax,0x10(%rbx)
c3: 85 c9 test %ecx,%ecx
c5: 74 aa je 71 <_00400820+0x61>
c7: eb cb jmp 94 <_00400820+0x84>
Not a lot of register use here. What happens if we run the LLVM IR through opt first, though?
; ....
_00400848: ; preds = %_00400830, %_00400828
%a0Loaded9 = phi i32 [ %a026, %_00400830 ], [ %a0, %_00400828 ]
%v0Loaded3 = phi i32 [ %v0Loaded4, %_00400830 ], [ 0, %_00400828 ]
%BNE12 = icmp eq i32 %a14, %a0Loaded9
br i1 %BNE12, label %_00400850, label %_00400830
_00400830: ; preds = %_00400848, %_00400844
%a0Loaded14 = phi i32 [ %a0Loaded9, %_00400848 ], [ %a0Loaded31, %_00400844 ]
%v0Loaded4 = phi i32 [ %v0Loaded3, %_00400848 ], [ %v0, %_00400844 ]
; ...
Sure enough, we’re seeing some loads and stores getting promoted to phi nodes, excellent. The x86-64 codegen is improved a bit as well. Still kinda hard to read though …
0000000000000010 <_00400820>:
10: 53 push %rbx
11: 48 89 fb mov %rdi,%rbx
14: 8b 4f 14 mov 0x14(%rdi),%ecx
17: 8d 04 8d 00 00 00 00 lea 0x0(,%rcx,4),%eax
1e: 85 c9 test %ecx,%ecx
20: 89 47 14 mov %eax,0x14(%rdi)
23: 0f 8e 84 00 00 00 jle ad <_00400820+0x9d>
29: 8b 53 10 mov 0x10(%rbx),%edx
2c: 01 d0 add %edx,%eax
2e: 89 d6 mov %edx,%esi
30: 8d 4a 04 lea 0x4(%rdx),%ecx
33: 48 c1 ea 0c shr $0xc,%rdx
37: 48 8b 94 d3 00 02 00 mov 0x200(%rbx,%rdx,8),%rdx
3e: 00
3f: 81 e6 fc 0f 00 00 and $0xffc,%esi
45: 8b 34 32 mov (%rdx,%rsi,1),%esi
48: 31 d2 xor %edx,%edx
4a: 83 e6 01 and $0x1,%esi
4d: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx)
54: 89 73 0c mov %esi,0xc(%rbx)
57: 89 4b 10 mov %ecx,0x10(%rbx)
5a: 89 43 14 mov %eax,0x14(%rbx)
5d: 74 2f je 8e <_00400820+0x7e>
5f: 39 c8 cmp %ecx,%eax
61: 74 34 je 97 <_00400820+0x87>
63: 89 ce mov %ecx,%esi
65: c1 ee 0c shr $0xc,%esi
68: 48 8b b4 f3 00 02 00 mov 0x200(%rbx,%rsi,8),%rsi
6f: 00
70: 89 cf mov %ecx,%edi
72: c1 ef 02 shr $0x2,%edi
75: 81 e7 ff 03 00 00 and $0x3ff,%edi
7b: 8b 34 be mov (%rsi,%rdi,4),%esi
7e: 83 e6 01 and $0x1,%esi
81: 83 c1 04 add $0x4,%ecx
84: 89 73 0c mov %esi,0xc(%rbx)
87: 89 4b 10 mov %ecx,0x10(%rbx)
8a: 85 f6 test %esi,%esi
8c: 75 d1 jne 5f <_00400820+0x4f>
8e: ff c2 inc %edx
90: 39 c8 cmp %ecx,%eax
92: 89 53 08 mov %edx,0x8(%rbx)
95: 75 cc jne 63 <_00400820+0x53>
97: 8b 73 7c mov 0x7c(%rbx),%esi
9a: 48 89 df mov %rbx,%rdi
9d: e8 00 00 00 00 callq a2 <_00400820+0x92>
a2: 48 85 c0 test %rax,%rax
a5: 74 1d je c4 <_00400820+0xb4>
a7: 48 89 df mov %rbx,%rdi
aa: 5b pop %rbx
ab: ff e0 jmpq *%rax
ad: 8b 73 7c mov 0x7c(%rbx),%esi
b0: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx)
b7: 48 89 df mov %rbx,%rdi
ba: e8 00 00 00 00 callq bf <_00400820+0xaf>
bf: 48 85 c0 test %rax,%rax
c2: 75 e3 jne a7 <_00400820+0x97>
c4: 5b pop %rbx
c5: c3 retq
I suspect some of the issues are related to the lack of noalias attributes. LLVM might think that stores to virtual memory can alias the register bank, and generate very conservative code. Something to have a look at later.
Optimizing well-behaved calls
If we know that the application is well-behaved w.r.t. calls and returns, we can remove the thunk calls to __recompiler_predict_return and the indirect-jump checks for JR. If jr $ra is seen, we statically translate it to a return.
Floating point
In MIPS I, floating-point math is handled by coprocessor 1, CP1. We can load 32-bit values directly into the FP registers, move values to and from integer registers, and fiddle with the control register. The control register controls rounding modes. I haven't bothered emulating correct rounding modes for now, but the control register is also used for floating-point conditional branches, so it needs to be emulated at least. Just like with SSE, the actual data type of an FP register can vary depending on the instruction, so we will need a lot of bitcasts. Fortunately, bitcasting is a native construct in LLVM.
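Since the register bank stays i32, every FP operation bitcasts its operands in and out. With the LLVM C++ API that might look roughly like this (a sketch of the idea, not the actual emitter):

#include <llvm/IR/IRBuilder.h>

// Emit a single-precision add.s, given the i32 SSA values currently tracked for
// the two source FP registers. The result is bitcast back to i32 so it can go
// back into the integer register bank.
llvm::Value *emit_add_s(llvm::IRBuilder<> &builder, llvm::Value *fs_i32, llvm::Value *ft_i32)
{
    llvm::Value *fs = builder.CreateBitCast(fs_i32, builder.getFloatTy());
    llvm::Value *ft = builder.CreateBitCast(ft_i32, builder.getFloatTy());
    llvm::Value *sum = builder.CreateFAdd(fs, ft);
    return builder.CreateBitCast(sum, builder.getInt32Ty());
}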
Let’s try implementing an FMA loop for good measure.
__attribute__((noinline))
float my_fma(const float *a, const float *b, int count)
{
    float res = 0.0f;
    for (int i = 0; i < count; i++)
        res += a[i] * b[i];
    return res;
}

int main(void)
{
    const float as[] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const float bs[] = { 10.0f, -2.0f, 50.0f, -4.0f };
    float result = my_fma(as, bs, 4);
    return (int)result;
}
004008c0 <my_fma>:
4008c0: 18c0000c blez a2,4008f4 <my_fma+0x34>
4008c4: 00063080 sll a2,a2,0x2
4008c8: 44800000 mtc1 zero,$f0
4008cc: 00863021 addu a2,a0,a2
4008d0: c4820000 lwc1 $f2,0(a0)
4008d4: c4a40000 lwc1 $f4,0(a1)
4008d8: 24840004 addiu a0,a0,4
4008dc: 46041082 mul.s $f2,$f2,$f4
4008e0: 24a50004 addiu a1,a1,4
4008e4: 14c4fffa bne a2,a0,4008d0 <my_fma+0x10>
4008e8: 46020000 add.s $f0,$f0,$f2
4008ec: 03e00008 jr ra
4008f0: 00000000 nop
4008f4: 44800000 mtc1 zero,$f0
4008f8: 03e00008 jr ra
4008fc: 00000000 nop
We implement the floating-point registers by bitcasting all the things and always keeping the register bank as integers. Otherwise, the codegen to LLVM IR is reasonably straightforward. In the generated x86-64 we end up seeing the magic instructions we want, buried in the noise.
...
ea: f3 0f 59 0c 98 mulss (%rax,%rbx,4),%xmm1
ef: f3 0f 58 c1 addss %xmm1,%xmm0
...
Syscalls
To end this post on a less intense note, let’s write hello world without the support of libc setup and run it in our VM. Unfortunately, we will have to write this in assembly as the C code we generate assumes that libc is up and running (something something $gp register), so raw assembly it is.
.data
str:
.ascii "Hello World!\n"
.text
.global __start
__start:
# write syscall is 4004
li $v0, 4004
li $a0, 1
la $a1, str
li $a2, 13
syscall
# exit syscall is 4001
li $v0, 4001
li $a0, 0
syscall
# Should never get here.
jr $ra
; ModuleID = '_004000f0'
source_filename = "_004000f0"
%0 = type { [64 x i32], [64 x i32], [1048576 x i8*] }
define void @_004000f0(%0*) {
entry:
br label %_004000f0
_004000f0: ; preds = %entry
%v0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 4004, i32* %v0Ptr
%a0Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 1, i32* %a0Ptr
%a1Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 5
store i32 4260128, i32* %a1Ptr
%a2Ptr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 6
store i32 13, i32* %a2Ptr
call void @__recompiler_syscall(%0* %0, i32 4194564, i32 0)
%v0Ptr1 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 2
store i32 4001, i32* %v0Ptr1
%a0Ptr2 = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 4
store i32 0, i32* %a0Ptr2
call void @__recompiler_syscall(%0* %0, i32 4194576, i32 0)
%raPtr = getelementptr inbounds %0, %0* %0, i32 0, i32 0, i32 31
%raLoaded = load i32, i32* %raPtr
%jump_addr = call void (%0*)* @__recompiler_jump_indirect(%0* %0, i32 %raLoaded)
%jump_addr_cmp = icmp ne void (%0*)* %jump_addr, null
br i1 %jump_addr_cmp, label %IndirectJumpPath, label %IndirectJumpReturn
IndirectJumpPath: ; preds = %_004000f0
tail call void %jump_addr(%0* %0)
ret void
IndirectJumpReturn: ; preds = %_004000f0
ret void
}
declare void @__recompiler_syscall(%0*, i32, i32)
declare void (%0*)* @__recompiler_jump_indirect(%0*, i32)
To handle syscalls, we simply hook into the VM host. The syscall number goes in the $v0 register, and arguments follow as normal. To implement the write syscall we simply need to copy the data out of our virtual address space and call write in our native environment.
void MIPS::syscall_write()
{
    int fd = scalar_registers[REG_A0];
    Address addr = scalar_registers[REG_A1];
    uint32_t count = scalar_registers[REG_A2];

    std::vector<uint8_t> output;
    output.resize(count);
    addr_space.copy_from_user(output.data(), addr, count);

    scalar_registers[REG_V0] = write(fd, output.data(), count);
    if (scalar_registers[REG_V0] < 0)
        scalar_registers[REG_A3] = errno;
    else
        scalar_registers[REG_A3] = 0;
}
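The post doesn't show how __recompiler_syscall reaches these handlers, but presumably there is a dispatcher on the VM side that switches on $v0. A sketch of what that could look like; the syscall numbers are from the Linux o32 ABI (exit and write match the assembly above), and the handler names other than syscall_write are made up.

#include <cerrno>

// Hypothetical hook behind __recompiler_syscall: dispatch on the number in $v0.
void MIPS::syscall()
{
    switch (scalar_registers[REG_V0])
    {
    case 4001: syscall_exit(); break;  // exit
    case 4004: syscall_write(); break; // write, implemented above
    case 4045: syscall_brk(); break;   // brk
    case 4090: syscall_mmap(); break;  // mmap
    default:
        // Not implemented: fail with ENOSYS, same convention as syscall_write.
        scalar_registers[REG_V0] = -1;
        scalar_registers[REG_A3] = ENOSYS;
        break;
    }
}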
Of course, to run this code on Windows, we’d have to do a lot of extra work to emulate these syscalls, but meh :p That is boring.
Syscalls are generally easy to deal with, but the exception is mmap() and friends. These interact directly with the virtual address space, and we need to implement our own virtual page allocator. glibc requires this to implement malloc(), so any non-trivial code is going to need a decent mmap() implementation. Getting all the weird edge cases working took a surprising amount of time. We also need to implement the obscure brk() syscall which predates mmap(). brk() is used by glibc until it fails, and then it falls back to mmap() to allocate heap memory. mmap() can also refer to non-memory resources, so we cannot just assume we have a nice, big and flat address space which we allocate from.
ioctl() will also be a nightmare, and I have not bothered with this syscall yet. We cannot translate generic structs between the two completely different ABIs since ioctl() just takes a void *. Fortunately, glibc does not require ioctl to work properly to host a full C++ application.
Conclusion
We have seen how we take MIPS code and turn it into running code through LLVM. In the next post we will bring up a fully-fledged C application and even a C++ application, and do some benchmarking to compare native applications against recompiled MIPS applications. Stay tuned!