An unusual recompiler experiment – MIPS to LLVM IR – Part 2

In part 1 we parsed a MIPS ELF file and set up a stack. Now we should consider how to translate the MIPS machine code to LLVM IR. Before we get there, we need to learn how to use the LLVM APIs and how code-gen to LLVM IR works.

If you know LLVM well already, this post is not for you; wait for part 3, where we start considering the design of the recompiler. If you don’t, this post should give you all the understanding you need to generate some code with LLVM yourself and follow the implementation.

This turned out rather dry, but hey, there’s a limit to how exciting writing C++ to generate IR can be. 😀 At least it’s very educational.

Which LLVM version to target?

The LLVM JIT APIs are unfortunately rather unstable and tend to break between major versions. A common workaround seems to be to simply pick a version and depend on it until you’re ready to update. I got it working on the current LLVM 7, and also the upcoming LLVM 8, which shuffles a lot of internal APIs around. For the sample here, we’re going to use LLVM 7, the latest stable major version. As far as I know the API changes are only related to the JIT runtime, not codegen, so we can at least hide the ugliness behind a clean interface.

LLVM IR basics

To get us started, let’s make a trivial function which adds two arguments together and returns the result. In C, we would declare it as:

int LLVMAdd(int a, int b);

To do this, we need to create an LLVMContext, an LLVM module (sort of like a translation unit), and add a function to that module. We’ll need to create some types as well.

// We'll need a ton of headers for later, might as well add them all now.
#include "llvm/ExecutionEngine/Orc/CompileUtils.h"
#include "llvm/ExecutionEngine/Orc/Core.h"
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/IRCompileLayer.h"
#include "llvm/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ExecutionEngine/SectionMemoryManager.h"

void test()
{
  llvm::LLVMContext context;
  llvm::Module module("mymodule", context);

  llvm::Type *int32 = llvm::Type::getInt32Ty(context);
  llvm::Type *result_type = int32;
  llvm::Type *argument_types[] = { int32, int32 };
  bool vararg = false;

  llvm::FunctionType *function_type =
    llvm::FunctionType::get(result_type, argument_types, vararg);
  llvm::Function *function =
    llvm::Function::Create(function_type,
                           llvm::Function::ExternalLinkage,
                           "LLVMAdd", module);

  llvm::errs() << module << "\n";
}
; ModuleID = 'mymodule'
source_filename = "mymodule"

declare i32 @LLVMAdd(i32, i32)

For now, we have simply declared the function. If we do nothing more, this is equivalent to using “extern” in C. We can call it, but the linker needs to find the function elsewhere. Let’s expand it with some code.

Basic blocks

Basic blocks are the foundation of LLVM IR; a function body consists of one or more basic blocks. A basic block represents linear control flow within a function. At the end of a basic block, the block must be terminated. Here, and only here, can we do things like:

  • Branch to another basic block, conditionally or not
  • Return
  • And some other special cases

For our first test, we can get away with one basic block.

Static single assignment (SSA)

Another core principle of LLVM is the use of static single assignment. In short, it means that once a value has been assigned, it can never change. If a variable is modified multiple times, each assignment has to produce a new SSA value. You will see this when we start generating some code. This has various benefits for optimization and code-gen later in the pipeline, but that’s LLVM’s problem.
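As a quick, hand-written illustration (the value names are chosen for readability, this is not output from the code below), a C fragment like x = a + b; x = x + c; has to become two distinct SSA values in the IR:

  %x.0 = add i32 %a, %b
  %x.1 = add i32 %x.0, %c

With that in mind, let’s implement the add function.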

llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
llvm::IRBuilder<> builder(bb);
llvm::Value *arg0 = &function->arg_begin()[0];
llvm::Value *arg1 = &function->arg_begin()[1];
llvm::Value *value = builder.CreateAdd(arg0, arg1, "added_value");
builder.CreateRet(value);

Here we add a basic block to our function; the first basic block is where the function starts executing. The IRBuilder is the class which lets you build actual code. We create an add instruction and return its result, which terminates the basic block, and we’re done. Not too bad! If you print the module again using the existing line of code, you’ll now see:

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32) {
entry:
  %added_value = add i32 %0, %1
  ret i32 %added_value
}

Compiling IR to native code

We could just dump this directly into the JIT engine now, but let’s try compiling this as a dynamic library, using the offline tools. Copy this IR code into a file, e.g. test.ll. Let’s compile this to native code.

$ llc -relocation-model=pic -filetype obj -o test.o test.ll -O2
$ ld -shared -o test.so test.o
$ objdump -d test.so

0000000000001000 <LLVMAdd>:
    1000:       8d 04 37                lea    (%rdi,%rsi,1),%eax
    1003:       c3                      retq

LLVM can usually target other architectures from a single binary, so we can feed the exact same IR to llc with a MIPS triple and disassemble the result with the MIPS binutils.
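Something along these lines should do it (the output file name is arbitrary, and the objdump prefix depends on how your MIPS binutils are installed):

$ llc -relocation-model=pic -filetype obj -mtriple=mips -o test-mips.o test.ll -O2
$ mips-linux-gnu-objdump -d test-mips.o

which gives: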

00000000 <LLVMAdd>:
   0:   03e00008        jr      ra
   4:   00851021        addu    v0,a0,a1

This code might confuse you: it seems to return before the add. We’ll get to that later; it’s one of the weirder parts of MIPS, delay slots.

Branches

We’ll need some control flow. Let’s try implementing this function:

int foo(int a, int b, int c)
{
    if (a > 0)
        return b * c;
    else
        return b + c;
}
using namespace llvm;
// ...
// Remember to update this to take 3 arguments!
Type *argument_types[] = { int32, int32, int32 };
// ...
BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *cmp = builder.CreateICmpSGT(
    arg0,
    ConstantInt::get(Type::getInt32Ty(context), 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *mul = builder.CreateMul(arg1, arg2, "multiplied");
builder.CreateRet(mul);

builder.SetInsertPoint(false_path);
Value *add = builder.CreateAdd(arg1, arg2, "added");
builder.CreateRet(add);

Notice that the compare instruction is “signed greater than”. LLVM integer types carry no signedness; it only matters for the operations where it changes the result, like comparisons, which come in signed and unsigned variants. To branch, we create a branch instruction, which terminates the first basic block.

; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %multiplied = mul i32 %1, %2
  ret i32 %multiplied

false:                                            ; preds = %entry
  %added = add i32 %1, %2
  ret i32 %added
}

And this is what it becomes on x86-64 (sorry for the AT&T syntax):

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 06                   jle    a <LLVMAdd+0xa>
   4:   0f af f2                imul   %edx,%esi
   7:   89 f0                   mov    %esi,%eax
   9:   c3                      retq   
   a:   01 d6                   add    %edx,%esi
   c:   89 f0                   mov    %esi,%eax
   e:   c3                      retq 

Memory and pointers

We can’t always use pure SSA values, and we have to deal with good old memory. LLVM has an “interesting” approach to pointers, but let’s start easy. Loading and storing memory in LLVM is usually done in two stages:

  • Compute address with “get element pointer”
  • Load/store with CreateLoad/CreateStore

The backend is responsible for translating this into whatever addressing modes the CPU supports. Get element pointer translates very naturally to C-like pointer arithmetic: instead of computing byte offsets and all that gunk, we index by array elements and struct members. Let’s try an example with structs.

struct Foo
{
    int a;
    float b;
    int c;
    float d;
};

int LLVMAdd(Foo *foo)
{
    return foo[1].c += 10;
}
Type *int32 = Type::getInt32Ty(context);
Type *float32 = Type::getFloatTy(context);

Type *struct_types[] = {
    int32,
    float32,
    int32,
    float32
};

Type *struct_type = StructType::get(context, struct_types, false);
Type *p_struct_type = PointerType::get(struct_type, 0);

Type *result_type = int32;
Type *argument_types[] = { p_struct_type };

// ...
// ...

BasicBlock *bb = BasicBlock::Create(context, "entry", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *indices[] = {
    ConstantInt::get(int32, 1), // pointer -> array index
    ConstantInt::get(int32, 2), // struct -> member index
};
Value *ptr = builder.CreateInBoundsGEP(arg0, indices, "ptr");
Value *loaded = builder.CreateLoad(ptr, "loaded");
Value *added = builder.CreateAdd(
    loaded,
    ConstantInt::get(int32, 10), "added");
builder.CreateStore(added, ptr);
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd({ i32, float, i32, float }*) {
entry:
  %ptr = getelementptr inbounds { i32, float, i32, float }, { i32, float, i32, float }* %0, i32 1, i32 2
  %loaded = load i32, i32* %ptr
  %added = add i32 %loaded, 10
  store i32 %added, i32* %ptr
  ret i32 %added
}
0000000000000000 <LLVMAdd>:
   0:   8b 47 18                mov    0x18(%rdi),%eax
   3:   83 c0 0a                add    $0xa,%eax
   6:   89 47 18                mov    %eax,0x18(%rdi)
   9:   c3                      retq   
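As a sanity check on that 0x18 offset: with the natural layout of four 4-byte members, foo[1].c lives at 1 * sizeof(Foo) + offsetof(Foo, c) = 16 + 8 = 24 = 0x18, which matches both the GEP indices and the generated code. If you want to convince yourself, a quick host-side check (this is just for verification, not part of the recompiler):

#include <cstddef>
#include <cstdio>

struct Foo { int a; float b; int c; float d; };

int main()
{
    // Prints 24 (0x18) on a typical target with 4-byte int/float and no padding.
    printf("%zu\n", sizeof(Foo) * 1 + offsetof(Foo, c));
}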

The mystical PHI node

I can’t really talk about SSA without mentioning PHI nodes. SSA values do not live in memory, so what happens when a value depends on which path the control flow took? Let’s look at an example function:

int foo(int a, int b, int c)
{
    int v;
    if (a > 0)
        v = b + c;
    else
        v = b - c;

    return v + a;
}

If we translate this directly to IR, we’ll see that v suddenly has two versions, but SSA only allows a value to be assigned once. How is this resolved? With a PHI node. The purpose of this node is to pull multiple versions of a value together into one, and you specify the value to use for each basic block which can branch into your block. This is rather bizarre, since it’s kind of like an inverse goto.

Of course, rather than dealing with this, we could allocate memory on the stack, store to v there, and have LLVM optimize it. For completeness, let’s try that first.

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *v = builder.CreateAlloca(int32, ConstantInt::get(int32, 1),
    "alloca");
Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0),
    "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
builder.CreateStore(bc_add, v);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
builder.CreateStore(bc_sub, v);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
Value *loaded = builder.CreateLoad(v, "loaded");
Value *added = builder.CreateAdd(loaded, arg0, "added");
builder.CreateRet(added);
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %alloca = alloca i32
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  store i32 %b_plus_c, i32* %alloca
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  store i32 %b_minus_c, i32* %alloca
  br label %merge

merge:                                            ; preds = %false, %true
  %loaded = load i32, i32* %alloca
  %added = add i32 %loaded, %0
  ret i32 %added
}

If we just give this to llc we see that it is sadly using stack space.

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 0d                   jle    11 <LLVMAdd+0x11>
   4:   01 d6                   add    %edx,%esi
   6:   89 74 24 fc             mov    %esi,-0x4(%rsp)
   a:   03 7c 24 fc             add    -0x4(%rsp),%edi
   e:   89 f8                   mov    %edi,%eax
  10:   c3                      retq   
  11:   29 d6                   sub    %edx,%esi
  13:   89 74 24 fc             mov    %esi,-0x4(%rsp)
  17:   03 7c 24 fc             add    -0x4(%rsp),%edi
  1b:   89 f8                   mov    %edi,%eax
  1d:   c3                      retq

What’s going on? We even gave llc an optimization level. Well, we didn’t optimize the IR first. This problem is extremely common when compiling C and C++ code, of course, so we can optimize in the IR domain. Here the classic “mem2reg” optimization applies: it promotes stack slots to SSA values, which is much easier for the backend to deal with.
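If you want to see just that pass in isolation, opt can run it by itself and emit textual IR with -S (this should work with the LLVM 7 tools):

$ opt -S -mem2reg -o mem2reg.ll test.ll

For this example, though, let’s just run the full -O3 pipeline and look at the result: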

$ opt -O3 -o test.bc test.ll
$ llvm-dis test.bc -o optimized.ll
$ cat optimized.ll

; ModuleID = 'test.bc'
source_filename = "mymodule"

; Function Attrs: norecurse nounwind readnone
define i32 @LLVMAdd(i32, i32, i32) local_unnamed_addr #0 {
entry:
  %cmp = icmp sgt i32 %0, 0
  %3 = sub i32 0, %2
  %alloca.0.p = select i1 %cmp, i32 %2, i32 %3
  %alloca.0 = add i32 %1, %0
  %added = add i32 %alloca.0, %alloca.0.p
  ret i32 %added
}

attributes #0 = { norecurse nounwind readnone }

0000000000000000 <LLVMAdd>:
   0:   89 d1                   mov    %edx,%ecx
   2:   f7 d9                   neg    %ecx
   4:   85 ff                   test   %edi,%edi
   6:   0f 4f ca                cmovg  %edx,%ecx
   9:   8d 04 3e                lea    (%rsi,%rdi,1),%eax
   c:   01 c8                   add    %ecx,%eax
   e:   c3                      retq 

That’s pretty nifty, and it even replaced the branch with a select, but it also rewrote the entire function because it found some other optimizations along the way. Let’s not rely on opt, and instead rewrite this without stack space, using a PHI node.

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *bc_add = builder.CreateAdd(arg1, arg2, "b_plus_c");
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *bc_sub = builder.CreateSub(arg1, arg2, "b_minus_c");
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(bc_add, true_path);
phi->addIncoming(bc_sub, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %b_plus_c = add i32 %1, %2
  br label %merge

false:                                            ; preds = %entry
  %b_minus_c = sub i32 %1, %2
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %b_plus_c, %true ], [ %b_minus_c, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

0000000000000000 <LLVMAdd>:
   0:   85 ff                   test   %edi,%edi
   2:   7e 07                   jle    b <LLVMAdd+0xb>
   4:   01 d6                   add    %edx,%esi
   6:   01 fe                   add    %edi,%esi
   8:   89 f0                   mov    %esi,%eax
   a:   c3                      retq   
   b:   29 d6                   sub    %edx,%esi
   d:   01 fe                   add    %edi,%esi
   f:   89 f0                   mov    %esi,%eax
  11:   c3                      retq

This is a very direct translation of our IR, nice.

Calling functions

Fortunately, this is quite straightforward. Let’s replace the two branches with actual function calls. You can probably tell by now what the C code is supposed to look like.

FunctionType *function_type = FunctionType::get(result_type, argument_types, vararg);
Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    module);
Function *true_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMTrue",
    module);

Function *false_function = Function::Create(
    function_type, 
    Function::ExternalLinkage,
    "LLVMFalse", module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
BasicBlock *true_path = BasicBlock::Create(context, "true", function);
BasicBlock *false_path = BasicBlock::Create(context, "false", function);
BasicBlock *merge = BasicBlock::Create(context, "merge", function);

IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *arg2 = &function->arg_begin()[2];
Value *call_args[] = { arg0, arg1, arg2 };

Value *cmp = builder.CreateICmpSGT(arg0, ConstantInt::get(int32, 0), "cmp");
BranchInst::Create(true_path, false_path, cmp, bb);

builder.SetInsertPoint(true_path);
Value *true_call = builder.CreateCall(true_function, call_args);
BranchInst::Create(merge, true_path);

builder.SetInsertPoint(false_path);
Value *false_call = builder.CreateCall(false_function, call_args);
BranchInst::Create(merge, false_path);

builder.SetInsertPoint(merge);
PHINode *phi = builder.CreatePHI(int32, 2, "phi");
phi->addIncoming(true_call, true_path);
phi->addIncoming(false_call, false_path);
Value *added = builder.CreateAdd(phi, arg0, "added");
builder.CreateRet(added);
; ModuleID = 'mymodule'
source_filename = "mymodule"

define i32 @LLVMAdd(i32, i32, i32) {
entry:
  %cmp = icmp sgt i32 %0, 0
  br i1 %cmp, label %true, label %false

true:                                             ; preds = %entry
  %3 = call i32 @LLVMTrue(i32 %0, i32 %1, i32 %2)
  br label %merge

false:                                            ; preds = %entry
  %4 = call i32 @LLVMFalse(i32 %0, i32 %1, i32 %2)
  br label %merge

merge:                                            ; preds = %false, %true
  %phi = phi i32 [ %3, %true ], [ %4, %false ]
  %added = add i32 %phi, %0
  ret i32 %added
}

declare i32 @LLVMTrue(i32, i32, i32)

declare i32 @LLVMFalse(i32, i32, i32)

Here we’re trying to call functions which haven’t been defined in the module. It is up to the linker to resolve this later. Let’s see what happens if we build this as a dynamic library.
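Using the same kind of llc and ld invocation as before:

$ llc -relocation-model=pic -filetype obj -o test.o test.ll -O2
$ ld -shared -o test.so test.o
$ objdump -d test.so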

0000000000001000 <.plt>:
    1000:       ff 35 02 30 00 00       pushq  0x3002(%rip)        # 4008 <_GLOBAL_OFFSET_TABLE_+0x8>
    1006:       ff 25 04 30 00 00       jmpq   *0x3004(%rip)        # 4010 <_GLOBAL_OFFSET_TABLE_+0x10>
    100c:       0f 1f 40 00             nopl   0x0(%rax)

0000000000001010 <LLVMFalse@plt>:
    1010:       ff 25 02 30 00 00       jmpq   *0x3002(%rip)        # 4018 <LLVMFalse>
    1016:       68 00 00 00 00          pushq  $0x0
    101b:       e9 e0 ff ff ff          jmpq   1000 <.plt>

0000000000001020 <LLVMTrue@plt>:
    1020:       ff 25 fa 2f 00 00       jmpq   *0x2ffa(%rip)        # 4020 <LLVMTrue>
    1026:       68 01 00 00 00          pushq  $0x1
    102b:       e9 d0 ff ff ff          jmpq   1000 <.plt>

Disassembly of section .text:

0000000000001030 <LLVMAdd>:
    1030:       53                      push   %rbx
    1031:       89 fb                   mov    %edi,%ebx
    1033:       85 ff                   test   %edi,%edi
    1035:       7e 0b                   jle    1042 <LLVMAdd+0x12>
    1037:       89 df                   mov    %ebx,%edi
    1039:       e8 e2 ff ff ff          callq  1020 <LLVMTrue@plt>
    103e:       01 d8                   add    %ebx,%eax
    1040:       5b                      pop    %rbx
    1041:       c3                      retq   
    1042:       89 df                   mov    %ebx,%edi
    1044:       e8 c7 ff ff ff          callq  1010 <LLVMFalse@plt>
    1049:       01 d8                   add    %ebx,%eax
    104b:       5b                      pop    %rbx
    104c:       c3                      retq  

The noise before the function is dynamic library gunk. If we had linked with --no-undefined, we would have seen:

ld: test.o: in function `LLVMAdd':
mymodule:(.text+0xa): undefined reference to `LLVMTrue'
ld: mymodule:(.text+0x15): undefined reference to `LLVMFalse'

This is basically how we will call into our emulator host code to do various things: handle syscalls, deal with function calls to other emulated code, etc. The possibilities are endless now. When we JIT, we can pass function pointers to our own host code into the symbol resolver, and the JIT can patch straight function calls into the code. Nifty!

At this point, with some searching around, you should be able to figure out how to generate the IR you want. We’ve covered the basics I think.

JIT compilation

JIT-ing these llvm::Modules is in theory quite straightforward; there’s just a lot of API noise to go through. We create:

  • llvm::LLVMContext
  • llvm::orc::ExecutionSession
  • llvm::orc::RTDyldObjectLinkingLayer
  • llvm::orc::IRCompileLayer<>
  • llvm::TargetMachine
  • llvm::orc::MangleAndInterner
  • llvm::DataLayout

There is too much code to go through here; just have a look at the gists I made instead for now. 😀 This should be reusable if you want to play around with it.

For LLVM 7, you’ll want to define JITTER_LLVM_VERSION_LEGACY. For LLVM 8, don’t.

The API usage should be fairly simple.

int my_add(int a, int b) { return a + b; }

// ...

using namespace llvm;
JITTIR::Jitter jitter;

// Let JIT link against this symbol.
jitter.add_external_symbol("my_add", my_add);

// Allocate a module.
auto module = jitter.create_module("mymodule");
auto &context = module->getContext();

// Build the IR
Type *int32 = Type::getInt32Ty(context);
Type *result_type = int32;
Type *argument_types[] = { int32, int32 };
bool vararg = false;

FunctionType *function_type = FunctionType::get(
    result_type,
    argument_types,
    vararg);

Function *function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "LLVMAdd",
    *module);

Function *my_add_function = Function::Create(
    function_type,
    Function::ExternalLinkage,
    "my_add",
    *module);

BasicBlock *bb = BasicBlock::Create(context, "entry", function);
IRBuilder<> builder(bb);
Value *arg0 = &function->arg_begin()[0];
Value *arg1 = &function->arg_begin()[1];
Value *args[] = { arg0, arg1 };
Value *added = builder.CreateCall(my_add_function, args, "added");
builder.CreateRet(added);

// Add the module to the JIT, it's immediately compiled.
jitter.add_module(std::move(module));

// Get function pointer to symbol and execute it.
auto *fn_ptr =
    reinterpret_cast<int (*)(int, int)>(jitter.get_symbol_address("LLVMAdd"));
printf("10 + 20 = %d\n", fn_ptr(10, 20));

Which should print 30.

(gdb) disas fn_ptr,fn_ptr+20
Dump of assembler code from 0x7fb9d7bcc000 to 0x7fb9d7bcc014:
   0x00007fb9d7bcc000:	push   %rax
   0x00007fb9d7bcc001:	movabs $0x55d0584c33ea,%rax
   0x00007fb9d7bcc00b:	callq  *%rax
   0x00007fb9d7bcc00d:	pop    %rcx
   0x00007fb9d7bcc00e:	retq   
   0x00007fb9d7bcc00f:	add    %al,(%rax) <- zero memory
   0x00007fb9d7bcc011:	add    %al,(%rax)
   0x00007fb9d7bcc013:	add    %al,(%rax)

This is just the bare minimum, simplest approach we can take to JIT compilation in LLVM, but it works. The code which calls the external function looks a bit strange, but it might be done this way so that it’s easy to patch in other call addresses later; I’m not sure.

Conclusion

Hopefully this was informative at least. In part 3, we will use the tools at our disposal and start bringing up the MIPS recompiler.

An unusual recompiler experiment – MIPS to LLVM IR – Part 1

While not graphics per se, recompilers in emulation are an interesting topic. They are notoriously complex and difficult to implement, but promise incredible performance improvements. While weaker hardware can be emulated well with interpreters, more powerful hardware needs recompilers to shine. Starting with the PlayStation 1 era of hardware, recompilers are common, and beyond that, recompilers are required to have any hope of reaching real-time performance.

This is a multi-part post. I’m not sure how many posts it will take, but there’s a lot of stuff to write about. In this round, I’ll introduce the experiment, we’ll parse a MIPS ELF file, and make it ready for execution in our emulated Linux environment.

What is the goal of a recompiler?

The main purpose of the recompiler is to look at the foreign machine code an application is running, and convert it to equivalent machine code for the hardware you are actually running on.

An interpreter, on the other hand, looks at one instruction at a time and performs the action it requires, which wastes a lot of work on decoding instructions and branching dynamically to whatever code snippet needs to execute.
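To make the contrast concrete, the heart of a typical interpreter is a fetch/decode/dispatch loop along these lines. This is a deliberately simplified sketch, not code from this project; every single instruction pays for a fetch, a decode and a hard-to-predict indirect branch before any useful work happens:

#include <cstdint>
#include <cstring>
#include <vector>

// Toy interpreter state: program counter, registers and flat guest memory.
struct CPU
{
    uint32_t pc = 0;
    uint32_t regs[32] = {};
    std::vector<uint8_t> memory;
};

static uint32_t fetch32(const CPU &cpu, uint32_t addr)
{
    uint32_t word;
    memcpy(&word, cpu.memory.data() + addr, sizeof(word));
    return word;
}

// Execute a single instruction. A real MIPS interpreter has dozens of cases.
static void step(CPU &cpu)
{
    uint32_t instr = fetch32(cpu, cpu.pc);
    cpu.pc += 4;

    uint32_t op = instr >> 26;              // primary opcode field
    uint32_t rs = (instr >> 21) & 31;
    uint32_t rt = (instr >> 16) & 31;
    int32_t imm = int16_t(instr & 0xffff);  // sign-extended immediate

    switch (op)
    {
    case 0x09: // ADDIU rt, rs, imm
        cpu.regs[rt] = cpu.regs[rs] + uint32_t(imm);
        break;
    // ... and so on for every other opcode ...
    default:
        break;
    }
}

A recompiler does the decode once, emits host machine code for whole blocks of instructions, and then just runs that code.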

Isn’t this what Java and .NET runtimes do?

Basically, yes. Just replace “foreign machine code” with “byte code”.

The portability problem

Since a recompiler needs to target the raw machine code of the hardware you’re running on, this is a serious hazard for portability. Typical recompilers aiming to be portable need to write backends for all kinds of machines you want to run on, and deal with operating-system specific ABIs along the way. Today, the most relevant targets would be:

  • x86
  • x86-64/amd64
  • ARMv7 (older mobile devices)
  • AArch64 (newer mobile devices)

To target something exotic, like homebrew on consoles, you might have recompilers targeting PowerPC, which was very popular as well for two console generations.

This is too much, so I’ve been interested in the prospect of leveraging LLVM instead. If I can just target LLVM IR, LLVM can generate machine code at runtime (ORC JIT) for me, targeting any of these architectures. It is also a good excuse to learn some LLVM internals. We’ll generate LLVM IR with the C++ API and pass it along to the JIT runtime, which generates machine code on the fly.

Ahead-of-time recompilation and re-targeting?

If we target LLVM, we could in theory dump all LLVM IR code we have encountered to disk, optimize the code more aggressively (maybe some LTO), and build everything into a single dynamic library for any target we’d like. Once we have warmed up all the known code paths we should be able to avoid almost all run-time recompilation on subsequent runs. I wonder how practical it is, but it’s something I’d like to experiment with.

Patching speed critical sections with native C?

If we can dump LLVM IR to disk, it doesn’t seem too far-fetched to replace functions at known addresses with our own native versions written in C or something.

Recompilation efficiency

A big advantage of having a dedicated recompiler is how quickly the code can be generated as it barely needs to qualify as a compiler to get the job done. LLVM is a complex beast which needs to target all sorts of use cases, and using it as a just-in-time compiler like this is going to create some performance issues. It will be interesting to see how slow it is in practice.

Why MIPS?

MIPS is found in lots of gaming console hardware from the 90s and early 00s.

  • PlayStation 1
  • Nintendo 64 (CPU and RSP)
  • PlayStation 2
  • PlayStation Portable

MIPS is also a very simple ISA to understand and get started with. The core, original MIPS I ISA is probably the simplest practical ISA to learn, and it’s often part of the curriculum when studying for an electronics degree. As part of my undergrad digital design course, we hacked on a trivial MIPS CPU core in VHDL, which was very fun and educational.

What should we emulate?

I felt like doing something I could get results from quickly, so rather than emulating a full game console, I wanted to try pretending to be a MIPS Linux kernel, running fully fleshed-out MIPS ELF binaries compiled with GCC 8.2, backed by glibc and libstdc++. That way I could build my way up from running trivial test cases written in assembly without any runtime, all the way up to emulating complicated programs using the STL and the modern C/C++ runtimes. We also get a much better picture of the performance difference between a natively compiled C++ application and a recompiled one.

The MIPS we’re going to target is 32-bit MIPS I, plus whatever extra instructions we need when running real applications. GCC can target MIPS I with -march=mips1, but there are a few instructions from MIPS II and up that GCC will use anyway in any glibc application, because a couple of fundamental features require them:

  • RDHWR – Reads a hardware register, used to query thread local storage (TLS). To run any non-trivial C program, we need this because of errno, which is defined by POSIX to be thread local.
  • LL/SC – Load linked and store conditional serve as the backbone for atomic operations. glibc needs it in a few places. We’re not going to bother with threading, so we can trivially implement it as normal load/store.

As for endianness, MIPS can support both little and big-endian. Little is the easiest to start with since it matches our target hardware, but we’ll want to support big-endian as well, as it is the default, and the only MIPS endianness I know of in the wild.

User-space Linux application – ELF parsing and syscalls

Our program will need to read an ELF file, set up a virtual 32-bit address space, and begin execution. We will need to implement the various Linux syscalls required to host a real application. Common Linux syscalls like:

  • open
  • read
  • write
  • exit
  • kill
  • llseek
  • mmap
  • munmap
  • brk

… and so on will be enough to get us over the printf stage. We do not have to concern ourselves with interrupts, CPU exceptions, memory-mapped I/O and other complicated things a game console emulator would have to deal with, fortunately.
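To give a feel for what that layer looks like, here is a rough sketch of a syscall dispatcher. The shape is the interesting part; the register convention (number in $v0, arguments in $a0..$a3) and the o32 syscall numbers used below (4001 for exit, 4004 for write) are assumptions on my part, so double-check them against asm/unistd.h before trusting them:

#include <unistd.h>
#include <cstdint>
#include <vector>

// Toy guest memory with a flat mapping; the real emulator translates guest
// addresses through its virtual address space instead.
struct GuestMemory
{
    std::vector<uint8_t> data;
    void *translate(uint32_t addr) { return data.data() + addr; }
};

// Dispatch a guest syscall to the host. v0 holds the syscall number,
// a0..a2 the first three arguments; the result goes back in $v0.
int32_t handle_syscall(GuestMemory &mem, uint32_t v0,
                       uint32_t a0, uint32_t a1, uint32_t a2)
{
    switch (v0)
    {
    case 4001: // exit(status)
        _exit(int(a0));
    case 4004: // write(fd, buf, count)
        return int32_t(write(int(a0), mem.translate(a1), size_t(a2)));
    default:
        return -1; // not implemented; error handling heavily simplified
    }
}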

Step #0 – Getting a MIPS cross compiler

The first step is getting some code to compile to MIPS. This took a surprisingly long time, as the PKGBUILDs for Arch Linux did not work properly due to some weird incompatibilities. To cut a long story short, I made some PKGBUILDs for Arch Linux which worked for me to create fully functional cross-compilers for both little and big-endian 32-bit MIPS: https://github.com/Themaister/MIPS-Toolchain-PKGBUILD/

To build a little-endian binary once you build the toolchain:

mipsel-linux-gnu-gcc -o test test.c -static -march=mips1 -O2
# CMake toolchain file for little endian.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR mipsel)
set(CMAKE_C_COMPILER mipsel-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER mipsel-linux-gnu-g++)
set(CMAKE_C_FLAGS "-mgp32 -march=mips1")
set(CMAKE_CXX_FLAGS "-mgp32 -march=mips1")
set(CMAKE_ASM_FLAGS "-mgp32 -march=mips1")
set(CMAKE_C_LINK_FLAGS "-static")
set(CMAKE_CXX_LINK_FLAGS "-static")

The -static flag is important as we do not want to have to deal with dynamically loading the C runtime in our ELF loader. Fortunately, static libgcc/libstdc++ seems to work just fine for our purpose here.

Step #1 – Parsing an ELF file

Before starting, I did not know much about ELF (Executable and Linkable Format). It is the executable format used on Linux and many other systems, and it is surprisingly simple when we just want to run a statically linked executable like this. It is helpful to use the readelf tool (mipsel-linux-gnu-readelf) to study the ELF; this will help us understand what’s going on.

# mipsel-linux-gnu-readelf -h mips.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0x400630
  Start of program headers:          52 (bytes into file)
  Start of section headers:          628984 (bytes into file)
  Flags:                             0x1007, noreorder, pic, cpic, o32, mips1
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         7
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 37

These are the first few bytes of the binary. The structure is defined in the system header “elf.h” on Linux, which is very handy when we want to write a parser. There are a few things we care about here:

  • We can verify that we have a MIPS binary, and in which endianness.
  • What the entry point address is, i.e. where do we start executing.
  • How to find the program headers
  • How to find the section headers
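With the structures from elf.h, pulling these fields out of the file is only a handful of lines. A minimal sketch, assuming a little-endian ELF read into a buffer (error handling and byte swapping for big-endian files omitted):

#include <elf.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// The handful of header fields we actually need later.
struct ElfInfo
{
    uint32_t entry;
    uint32_t ph_offset, ph_entry_size, ph_count;
    uint32_t sh_offset, sh_entry_size, sh_count;
};

bool parse_elf_header(const uint8_t *data, size_t size, ElfInfo &info)
{
    Elf32_Ehdr ehdr;
    if (size < sizeof(ehdr))
        return false;
    memcpy(&ehdr, data, sizeof(ehdr));

    // Verify magic, 32-bit class, endianness and machine type.
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
        return false;
    if (ehdr.e_ident[EI_CLASS] != ELFCLASS32 ||
        ehdr.e_ident[EI_DATA] != ELFDATA2LSB ||
        ehdr.e_machine != EM_MIPS)
        return false;

    info.entry = ehdr.e_entry;
    info.ph_offset = ehdr.e_phoff;
    info.ph_entry_size = ehdr.e_phentsize;
    info.ph_count = ehdr.e_phnum;
    info.sh_offset = ehdr.e_shoff;
    info.sh_entry_size = ehdr.e_shentsize;
    info.sh_count = ehdr.e_shnum;
    return true;
}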

The program headers

The program headers contain information about where data is located in the binary and what to do with it. We only care about LOAD and TLS blobs.

# mipsel-linux-gnu-readelf -l mips.elf                                            

Elf file type is EXEC (Executable file)
Entry point 0x400630
There are 7 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  ABIFLAGS       0x000138 0x00400138 0x00400138 0x00018 0x00018 R   0x8
  REGINFO        0x000150 0x00400150 0x00400150 0x00018 0x00018 R   0x4
  LOAD           0x000000 0x00400000 0x00400000 0x813d4 0x813d4 R E 0x10000
  LOAD           0x082000 0x00492000 0x00492000 0x041e0 0x04f9c RW  0x10000
  NOTE           0x000114 0x00400114 0x00400114 0x00020 0x00020 R   0x4
  NOTE           0x000168 0x00400168 0x00400168 0x00024 0x00024 R   0x4
  TLS            0x0820b0 0x004920b0 0x004920b0 0x00010 0x00030 R   0x4

...

We see two LOAD blocks. One has the read/execute flags set; this is our code segment, which we load from Offset in the ELF into the virtual address VirtAddr. The read/write one is the data segment, where global variables live. Note that FileSiz may be != MemSiz. This supports zero-initialized global variables: they simply need memory allocated to them, not data from the file.
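In code, loading a LOAD segment boils down to copying FileSiz bytes from the file to VirtAddr and zero-filling the remainder up to MemSiz. A sketch of the idea, using a flat byte buffer indexed by virtual address for brevity (the real loader writes through the paged address space described below):

#include <elf.h>
#include <cstring>
#include <cstdint>
#include <vector>

// Map a single PT_LOAD program header into guest memory.
void load_segment(const uint8_t *elf_file, const Elf32_Phdr &phdr,
                  std::vector<uint8_t> &guest)
{
    if (phdr.p_type != PT_LOAD)
        return;

    // Copy the file-backed part of the segment to its virtual address ...
    memcpy(guest.data() + phdr.p_vaddr, elf_file + phdr.p_offset, phdr.p_filesz);

    // ... and zero-initialize the rest (zero-initialized globals, i.e. .bss).
    if (phdr.p_memsz > phdr.p_filesz)
        memset(guest.data() + phdr.p_vaddr + phdr.p_filesz, 0,
               phdr.p_memsz - phdr.p_filesz);
}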

TLS is a bit special; it marks which global data needs to be thread local. Any data here needs to be copied out to a new per-thread allocation whenever a new thread is created (we won’t create any, but we still need to implement it).

The section headers

The section headers don’t seem vital to execution, but they contain the symbol table, which is useful for debugging.

The virtual address space

Linux processes have a virtual address space, so we need to copy all relevant data into our own virtualized 32-bit address space. This is simply represented as a page table (4 KiB pages) spanning the entire 32-bit address space. Through our syscall emulation, the application/glibc can use mmap() to allocate memory dynamically or to memory-map files. We cannot assume a fixed address space layout if we want to emulate Linux binaries.

Having an address space like this means all memory access becomes indirect, which will certainly be a performance problem. There might be a better way if we abuse mmap(), but that sounds very hard.
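To make the indirection concrete, here is roughly what such an address space can look like: a table with one host pointer per 4 KiB guest page, allocated lazily on first touch. This is a sketch of the idea only, not the actual VirtualAddressSpace implementation, and it assumes naturally aligned accesses that don’t straddle a page:

#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

class VirtualAddressSpace
{
public:
    static constexpr uint32_t PageSize = 4096;
    static constexpr uint32_t PageCount = uint32_t((1ull << 32) / PageSize);

    VirtualAddressSpace() : pages(PageCount) {}

    // Every guest access pays for this lookup -- the indirection mentioned above.
    uint8_t *page(uint32_t addr)
    {
        auto &p = pages[addr / PageSize];
        if (!p)
            p.reset(new uint8_t[PageSize]());   // allocated and zeroed on demand
        return p.get();
    }

    uint32_t load32(uint32_t addr)
    {
        uint32_t value;
        memcpy(&value, page(addr) + (addr & (PageSize - 1)), sizeof(value));
        return value;
    }

    void store32(uint32_t addr, uint32_t value)
    {
        memcpy(page(addr) + (addr & (PageSize - 1)), &value, sizeof(value));
    }

private:
    std::vector<std::unique_ptr<uint8_t[]>> pages;
};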

Setting up the stack

Before we can call the entry point, we must set up a stack. Normally, we would think this stack only contains “argc” and “argv”, à la:

int main(int argc, char **argv)

but we actually need to pass a lot more information. All of this extra information is used by the C runtime entry point, which seems to be called __start on MIPS. We allocate stack space at the top of the virtual address space and push some data onto it. The $sp register will point to:

  • argc
  • argv argument #0 (char *)
  • argv argument #1
  • NULL // Terminates argv
  • environment string #0 ("KEY=VALUE", char *)
  • environment string #1
  • NULL // Terminates envp

The environment variables are passed in like this, and the C runtime will parse them. However, there is more data we need to pass on the stack on Linux, which was rather surprising: the ELF auxiliary vector. glibc will crash deep into its initialization if this is not done properly.

// ELF AUXV, see <elf.h>
stack_data.push_back(AT_PHDR);
stack_data.push_back(misc.phdr_addr);
stack_data.push_back(AT_PHENT);
stack_data.push_back(ehdr.e_phentsize);
stack_data.push_back(AT_PHNUM);
stack_data.push_back(ehdr.e_phnum);
stack_data.push_back(AT_PAGESZ);
stack_data.push_back(VirtualAddressSpace::PageSize);
stack_data.push_back(AT_ENTRY);
stack_data.push_back(ehdr.e_entry);
stack_data.push_back(AT_UID);
stack_data.push_back(getuid());
stack_data.push_back(AT_EUID);
stack_data.push_back(geteuid());
stack_data.push_back(AT_GID);
stack_data.push_back(getgid());
stack_data.push_back(AT_EGID);
stack_data.push_back(getegid());
stack_data.push_back(AT_RANDOM);
stack_data.push_back(stack_top); // Just point to something. glibc needs this.
stack_data.push_back(AT_NULL);

glibc needs to have a pointer to its own ELF headers and some other information like user IDs, page sizes, and some other things. The headers are used to set up TLS properly. It also needs a random number created by the Linux kernel, which it uses in early initialization to set up stack protection canaries.

Now, everything is set up, and we can start executing … I mean generating some LLVM IR … in part 2.