
2 · The static recompiler

The problem: the game is 32-bit ARMv7 machine code. Modern iPhones can't run 32-bit apps at all, and we also want it on arm64 desktops, Android, and the web — none of which speak ARMv7.

This is the heart of the project: a tool we call armv7re that reads the ARM machine code and rewrites every instruction as equivalent C++, ahead of time. The output is an ordinary C++ program — the entire game, expressed as functions — that any modern compiler can build for any CPU.

The core idea

Each ARMv7 function becomes a C++ function that does exactly what the original instructions did, against a small struct that stands in for the ARM CPU's registers and flags.

flowchart LR
    subgraph IN["ARMv7 (from the .S files)"]
        A["add r0, r1, r2<br/>bx lr"]
    end
    subgraph OUT["Generated C++"]
        B["uint64_t f(ScratchRegisters sc, CpuState* s) {<br/>&nbsp;&nbsp;Cpu cpu{s, sc};<br/>&nbsp;&nbsp;cpu.reg&lt;0&gt;() = op_add(cpu.reg&lt;1&gt;(), cpu.reg&lt;2&gt;());<br/>&nbsp;&nbsp;return pack_ret(cpu.reg&lt;0&gt;(), cpu.reg&lt;1&gt;());<br/>}"]
    end
    A -->|lifted at build time| B

Do that for all ~9,000 functions and you have the whole game as portable C++. Calls between ARM functions become ordinary C++ calls; the original control flow is preserved. There is no interpreter and no JIT at runtime — by the time the app launches, it's just native code.
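
To make "calls become ordinary C++ calls" concrete, here is a minimal sketch of two lifted functions, one calling the other. The struct layouts, the names `guest_add` and `guest_double`, and the choice to put r0 in the low half of the return value are assumptions made for illustration; they are not taken from armv7re's real output.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the project's real types (layouts assumed).
struct ScratchRegisters { std::uint32_t r0, r1, r2, r3; };
struct CpuState { std::uint32_t r[16]; bool n, z, c, v; };

// Assumed packing: r0 in the low 32 bits, r1 in the high 32 bits.
inline std::uint64_t pack_ret(std::uint32_t r0, std::uint32_t r1) {
    return (static_cast<std::uint64_t>(r1) << 32) | r0;
}

// Lifted from:  add r0, r1, r2 ; bx lr
std::uint64_t guest_add(ScratchRegisters sc, CpuState* s) {
    (void)s;                   // this function touches no persistent state
    sc.r0 = sc.r1 + sc.r2;     // unsigned wraparound matches ARM's ADD
    return pack_ret(sc.r0, sc.r1);
}

// Lifted from:  mov r1, r0 ; mov r2, r0 ; b guest_add   (a tail call)
std::uint64_t guest_double(ScratchRegisters sc, CpuState* s) {
    sc.r1 = sc.r0;
    sc.r2 = sc.r0;
    return guest_add(sc, s);   // the guest branch is a plain C++ call
}
```

Calling `guest_double` with r0 = 21 returns a packed value whose low half is 42. The host compiler is free to inline `guest_add` into its caller, which is something an interpreter or JIT boundary would normally prevent.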

Why static, not a JIT?

The usual way to run old machine code is a dynamic recompiler (a JIT): it translates instructions to host code while the program runs. That works, but it was the wrong tool here:

| | Static recompiler (what we built) | Dynamic recompiler / JIT |
|---|---|---|
| Runs on iOS App Store | ✅ ordinary native code | ❌ JITs are forbidden |
| Runs in a web browser | ✅ compiles to WebAssembly | ❌ can't generate code in the sandbox |
| Translation cost | paid once, at build time | paid forever, at runtime |
| Optimization | full -O2/LTO on the generated C++ | limited to the JIT's budget |
| Portability | one C++ codebase → any target | needs a hand-written backend per CPU |
| Debuggability | step through readable C++ | opaque generated code |

Because the output is just C++, the same translated game compiles for arm64, x86-64, and WebAssembly with no per-architecture work on our side — the host compiler handles that. That single property is what lets the game run on phones, desktops, and the browser from one source of truth.

flowchart TB
    S[".S assembly"] --> O["assemble → ARMv7 .o"]
    O --> LIFT["armv7re lifter<br/>(decode → translate → emit)"]
    LIFT --> CPP["generated C++"]
    CPP --> C1["clang → arm64<br/>(iPhone, Apple Silicon, Android)"]
    CPP --> C2["clang → x86-64<br/>(Intel desktops, CI)"]
    CPP --> C3["emscripten → WebAssembly<br/>(the browser)"]

How a function gets translated

The lifter is a small ahead-of-time compiler in its own right. At build time it:

  1. Decodes the ARM/Thumb instructions. We reuse the battle-tested instruction decoder from dynarmic as a front end only, to understand what each instruction means. (We do not use its JIT; nothing from dynarmic ships in the final app.)
  2. Lowers the decoded instructions into our own intermediate representation (an SSA-style IR designed for emitting clean C++).
  3. Optimizes that IR — see below.
  4. Prints it as C++.
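
The steps above can be sketched as a toy pipeline. This is not armv7re's code: the `Inst` type is a hypothetical, already-decoded instruction (dynarmic's real decoder API is not shown), the SSA lowering in step 2 is skipped entirely, and the single "optimization" is a trivial peephole chosen only to show where a pass would sit.

```cpp
#include <string>
#include <vector>

// Hypothetical decoded instruction (step 1 is assumed to have produced this).
enum class Op { Add, Sub, Mov, Ret };
struct Inst { Op op; int rd; int rn; int rm; };

// Step 3 in miniature: drop moves of a register onto itself.
std::vector<Inst> optimize(const std::vector<Inst>& body) {
    std::vector<Inst> out;
    for (const Inst& i : body)
        if (!(i.op == Op::Mov && i.rd == i.rm)) out.push_back(i);
    return out;
}

// Step 4 in miniature: print one instruction as a C++ statement.
std::string emit_stmt(const Inst& i) {
    auto reg = [](int n) { return "r" + std::to_string(n); };
    switch (i.op) {
        case Op::Add: return reg(i.rd) + " = " + reg(i.rn) + " + " + reg(i.rm) + ";";
        case Op::Sub: return reg(i.rd) + " = " + reg(i.rn) + " - " + reg(i.rm) + ";";
        case Op::Mov: return reg(i.rd) + " = " + reg(i.rm) + ";";
        case Op::Ret: return "return pack_ret(r0, r1);";
    }
    return "/* unhandled */";
}

// Wrap the optimized body in a function with the shared signature.
std::string emit_function(const std::string& name, const std::vector<Inst>& body) {
    std::string out = "std::uint64_t " + name + "(ScratchRegisters sc, CpuState* s) {\n";
    for (const Inst& i : optimize(body)) out += "    " + emit_stmt(i) + "\n";
    return out + "}\n";
}
```

Feeding it `mov r1, r1; add r0, r1, r2; ret` yields a function body containing `r0 = r1 + r2;` and the return statement, with the self-move removed. The emitted text is illustrative only and leaves out the register declarations a real emitter would produce.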

The generated functions all share one uniform calling convention so they can call each other and the runtime can call into them:

// The single ABI every lifted function uses.
std::uint64_t f(ScratchRegisters sc, CpuState* s);
  • CpuState* s holds the persistent guest state: registers r4–r15 (which include the stack pointer, link register, and program counter), the VFP/NEON registers, and the condition flags.
  • ScratchRegisters sc carries the scratch registers r0–r3 by value. Passing the hot registers by value (rather than always through memory) lets the host keep them in real CPU registers across calls. The struct is exactly 16 bytes so on arm64 it lands directly in registers x0:x1.
  • Results come back as r0:r1 packed into the 64-bit return value — matching how ARM itself returns values, again so a pass-through needs no shuffling.
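
A minimal sketch of how that ABI might be laid out, assuming field names and a `CpuState` layout that are not given in the text. The helpers `ret_r0`, `ret_r1`, and `call_guest` are hypothetical, and placing r0 in the low half of the return value is likewise an assumption.

```cpp
#include <cstdint>

// Four 32-bit scratch registers, passed by value.
struct ScratchRegisters { std::uint32_t r0, r1, r2, r3; };
static_assert(sizeof(ScratchRegisters) == 16,
              "must fit in two 64-bit host registers");

// Persistent guest state (assumed layout; only r[4]..r[15] are used here).
struct CpuState {
    std::uint32_t r[16];
    std::uint64_t d[32];   // VFP/NEON D registers
    bool n, z, c, v;       // condition flags
};

// The one signature every lifted function shares.
using LiftedFn = std::uint64_t (*)(ScratchRegisters, CpuState*);

// Assumed packing: r0 in the low 32 bits, r1 in the high 32 bits.
inline std::uint64_t pack_ret(std::uint32_t r0, std::uint32_t r1) {
    return (static_cast<std::uint64_t>(r1) << 32) | r0;
}
inline std::uint32_t ret_r0(std::uint64_t ret) { return static_cast<std::uint32_t>(ret); }
inline std::uint32_t ret_r1(std::uint64_t ret) { return static_cast<std::uint32_t>(ret >> 32); }

// Hypothetical host-side helper: call a lifted function with up to four
// 32-bit arguments and read back the guest's r0.
inline std::uint32_t call_guest(LiftedFn fn, CpuState* s,
                                std::uint32_t a0 = 0, std::uint32_t a1 = 0,
                                std::uint32_t a2 = 0, std::uint32_t a3 = 0) {
    return ret_r0(fn(ScratchRegisters{a0, a1, a2, a3}, s));
}

// Example lifted function:  sub r0, r0, r1 ; bx lr
std::uint64_t guest_sub(ScratchRegisters sc, CpuState* s) {
    (void)s;
    sc.r0 = sc.r0 - sc.r1;
    return pack_ret(sc.r0, sc.r1);
}
```

With these definitions, `call_guest(guest_sub, &state, 50, 8)` returns 42. The `static_assert` turns the 16-byte size requirement into a compile-time check instead of a convention.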
Why this exact shape? (for the curious)

A naive translator keeps all CPU state in one struct passed by pointer. That works but is slow: the pointer "escapes," so the compiler must assume every register write could be observed and spill everything to memory.

By splitting the four hot scratch registers into a small by-value bundle and returning results packed in a register, the generated code maps cleanly onto the host's own calling convention — the host compiler keeps guest registers in host registers. That scratch-register bundle alone sped up our LZMA decode benchmark by 34%.

Optimizing the generated C++

A literal one-instruction-to-one-statement translation would be correct but bulky and slow. The lifter runs a series of optimization passes over its IR before printing. Each pass makes the generated C++ smaller, faster, and easier for the host compiler to optimize further. A few examples:

  • Register promotion — keep guest registers in host locals across a function instead of reading/writing the CpuState struct every time.
  • Flag elimination — ARM updates condition flags constantly, but most results are never read. We promote flags to throwaway locals so the host compiler can delete the dead flag math.
  • Frame localization — a function's private stack slots become a host-local array the compiler can put in registers, instead of guest memory.
  • Store/load coalescing — turn ARM's multi-register push/pop sequences into single block copies that the host fuses into wide stp/ldp instructions.

These are toggled with --optimize=regs,frame,flags (or --optimize=all). Turned on together they shrink the binary and make the translated game run almost 84% faster than a naive one-to-one translation — all while keeping behavior bit-identical to the original.
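
As a rough illustration of register promotion and flag elimination, the sketch below translates `adds r0, r0, r1 ; subs r0, r0, r2` twice: once naively and once with registers and flags promoted to locals. To stay short it uses a simplified `void(CpuState*)` signature and keeps r0–r2 inside `CpuState`, neither of which matches the real ABI. All names are hypothetical.

```cpp
#include <cstdint>

struct Flags { bool n, z, c, v; };
struct CpuState { std::uint32_t r[16]; Flags f; };

// NZCV as ARM's ADDS computes them for a + b.
inline Flags add_flags(std::uint32_t a, std::uint32_t b) {
    std::uint32_t r = a + b;
    return Flags{(r >> 31) != 0, r == 0, r < a,
                 ((~(a ^ b) & (a ^ r)) >> 31) != 0};
}

// NZCV as ARM's SUBS computes them for a - b (C set means "no borrow").
inline Flags sub_flags(std::uint32_t a, std::uint32_t b) {
    std::uint32_t r = a - b;
    return Flags{(r >> 31) != 0, r == 0, a >= b,
                 (((a ^ b) & (a ^ r)) >> 31) != 0};
}

// Naive: every register and flag access goes through the state struct.
void naive(CpuState* s) {
    std::uint32_t a = s->r[0], b = s->r[1];
    s->r[0] = a + b;
    s->f = add_flags(a, b);
    std::uint32_t c = s->r[0], d = s->r[2];
    s->r[0] = c - d;
    s->f = sub_flags(c, d);
}

// Promoted: load once, work in locals, write back once.
void promoted(CpuState* s) {
    std::uint32_t r0 = s->r[0], r1 = s->r[1], r2 = s->r[2];
    Flags f = add_flags(r0, r1);   // overwritten below, so this is dead
    r0 = r0 + r1;
    f = sub_flags(r0, r2);
    r0 = r0 - r2;
    s->r[0] = r0;
    s->f = f;
}
```

Both versions leave `CpuState` in the same final state. In `promoted`, the first flag computation only feeds a local that is overwritten before anyone reads it, so an optimizing host compiler can remove it without any ARM-specific knowledge.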

Verified against the real thing

Because every .S file reassembles to the original bytes (see chapter 1), the lifter's output can be checked against ground truth: the recompiled game has to log in to the real server and load the village exactly as the untranslated bytes would. Correctness isn't argued — it's observed.

What this layer produces

The output of armv7re is the complete game logic as portable C++ — but on its own it's inert. It expects an operating system underneath it: something to answer when it calls malloc, opens a socket, or asks OpenGL ES to draw. Supplying that environment is the runtime's job.

[Figure: ARM instructions and the C++ they became. A lifted function: ARMv7 on the left, the generated C++ on the right.]