Deterministic Fully-Static Whole-Binary Translation Without Heuristics
Summary
This paper introduces Elevator, a novel binary translator that performs deterministic, fully-static translation of entire x86-64 binaries to AArch64 without heuristics or runtime fallbacks. It achieves performance comparable to QEMU while enabling pre-deployment validation and certification of the translated code.
# Deterministic Fully-Static Whole-Binary Translation without Heuristics

Source: [https://arxiv.org/html/2605.08419](https://arxiv.org/html/2605.08419)

James McGowan (jwmcgowa@uci.edu), University of California, Irvine, Irvine, California, USA, and Michael Franz (franz@uci.edu), University of California, Irvine, Irvine, California, USA

###### Abstract

We present Elevator, the first binary translator capable of statically translating entire x86-64 binary executables to AArch64 without using debug information, source code, or any assumptions about code patterns or layouts within the original binary. Unlike existing binary translation systems, which rely on heuristics or runtime fallback mechanisms to recover from erroneous code-versus-data decoding decisions, Elevator considers all possible interpretations of every byte in the original executable and creates a separate translation for each feasible one ahead of time. For example, any byte might be interpreted as data, as part of an opcode, or as part of an argument to an opcode. We generate separate control flow paths for all of these interpretations, pruning only those that lead to exceptional program termination. For each such control flow path, a translation is generated by composing code “tiles” that have been automatically derived from a high-level language description of the individual instructions in the source instruction set architecture. This leads to a nimble binary translation framework. Our approach is deterministic and produces complete, self-contained output binaries. Unlike previous solutions, it requires no runtime component in the trusted code base. Elevator offers a different cost/benefit profile than previous solutions, with the principal cost being a substantial code size expansion.
The main benefit in return is that Elevator’s output is the actual code that will run, so it can be tested, validated, certified, and/or cryptographically signed before deployment, reducing risk compared to emulators or just-in-time (JIT) compilers. We demonstrate Elevator’s effectiveness on a diverse corpus of real-world binaries, including the entire SPECint 2006 suite, showing that static full-program binary translation can be made both reliable and practical. Our method achieves performance on par with or better than QEMU’s user-mode emulation with JIT acceleration.

Keywords: whole program static binary translation, binary lifting with cross-compilation, binary-to-binary cross-ISA translation, practical evaluation of a full-scale implementation.

CCS Concepts: Software and its engineering: Translator writing systems and compiler generators; Software reverse engineering; Maintaining software; Software evolution; Software maintenance tools; Compilers.

## 1. Introduction

Hardware transitions from one instruction set architecture (ISA) to another often come with a need to bring along legacy software to the new platform. In many cases, such a legacy software transition cannot be achieved fully mechanically “simply by recompiling” surviving source code. To further complicate matters, when legacy code has been validated or certified, it is typically not the source code that is certified, but a specific well-tested “authoritative binary executable.” Recreating this exact “authoritative binary” bit-by-bit from source code at a later date is often impossible.
Even if there is surviving source code and we manage to get it to compile and build, recreating the identical code is likely to require the exact same version of the compiler, linker, and possibly other parts of the build system that were used in the creation of the original binary. As a consequence, recreating legacy programs from old source code may be risky because we can’t be sure we are faithfully replicating the functionality of the “authoritative binary.” Instead, any surviving archived source code might refer to a subtly different version. There are also reported cases, e.g. (Kolsek, [2017](https://arxiv.org/html/2605.08419#bib.bib6)), in which a manufacturer fixed a software error by manually and skillfully applying a “patch” directly to the binary, bypassing source code altogether. Utilizing the archived source code version might hence bring back unknown programming errors that had already been fixed in the currently running binary. Other than by exhaustive testing, there is typically no easy way of determining whether any software recreated from source code at a later date is actually the intended software version.

Instead of using source code, an alternative approach starts with the existing binary. As we expand on below in our related work section, previous solutions to working with binary code directly have employed combinations of emulation and static and dynamic translation. Common to all of these previous solutions is that they do not fully statically translate entire binary programs from one ISA to another, but that they all require additional system-level components that need to execute alongside the translated program. These additional runtime components must therefore be part of the trusted code base and are implicitly included in all testing.
Ascertaining overall reliability is made more difficult by the possibility that the dynamic behavior of such systems could lead to different results based on the ordering of specific tests or inputs.

In contrast, Elevator makes the following key contributions:

- A fully static, deterministic, heuristics-free cross-ISA binary-to-binary translator. Elevator is, to our knowledge, the first cross-ISA binary-to-binary translator from x86-64 to AArch64 that is entirely static, fully deterministic, and heuristics-free. (We have implemented our system for x86-64 input binaries, but for better readability we mostly abbreviate this as “x64” throughout the text.) It makes no assumptions about the code layout of the input or the toolchain that produced it. Running the same input binary through Elevator twice yields the exact same output bit sequence. Once translation completes, the resulting AArch64 binary is a stand-alone executable that requires no runtime translation support and can be tested and certified in its own right.
- A lightweight, LLVM-backed code generator. Our code generator for AArch64 is built on a lightweight mechanism that leverages LLVM’s mature compiler infrastructure to synthesize cross-ISA translations automatically, rather than being hand-written per instruction. This substantially reduces the engineering effort required to bring up a new back-end, and the same approach extends directly to other target architectures.

We have constructed a full-scale implementation prototype and demonstrate its effectiveness on a comprehensive test corpus. Our evaluation includes the entire SPECint 2006 benchmarking suite (as a proxy for real-world legacy binaries) and a small number of hand-crafted binaries (as exemplified by Listings [1](https://arxiv.org/html/2605.08419#LST1) and [2](https://arxiv.org/html/2605.08419#LST2)) designed to expose the limitations of existing fully static approaches.
We believe that our approach creates a useful new capability that is truly different and complementary to existing methods. There are situations in which our technique is likely to be superior to existing solutions, for example when certain processor models suddenly aren’t available for political reasons or supply-chain issues. Using static translation of an existing binary can provide a rapid temporary “stop-gap” cross-ISA portability in this situation while preserving the ability to exhaustively test the resulting output code before it is deployed. This is less risky than using emulators or just-in-time compilers.

We have no commercial interests in this research and pledge to open-source everything at the end of our research project.

## 2. Background and Related Work

There has long been an interest in modifying existing binaries without access to source code. The most general form of this is called *binary rewriting* and aims to enable the application of various program transformations to a program’s binary form. The eventual goal of such transformations could be instrumentation, security hardening, optimization, or deobfuscation (Wenzl et al., [2019](https://arxiv.org/html/2605.08419#bib.bib60)). In the past few years especially, there has been a renewed interest in binary rewriting for a wide range of downstream applications that include security (Kolsek and Team, [2017](https://arxiv.org/html/2605.08419#bib.bib150)), optimization (Panchenko et al., [2019](https://arxiv.org/html/2605.08419#bib.bib157)), and code debloating (Qian et al., [2019](https://arxiv.org/html/2605.08419#bib.bib158); Agadakos et al., [2019](https://arxiv.org/html/2605.08419#bib.bib159)). The term *binary recompilation* is now frequently used to describe whole-program rewriting techniques that operate by first “lifting” a program to an intermediate representation and then “lowering” it back into a machine-executable form.
The system we describe in this paper performs binary cross-(re)compilation from one ISA to another.

*Static rewriters* operate on a binary file without executing it. Static techniques range from being direct to minimally-invasive (Duck et al., [2020](https://arxiv.org/html/2605.08419#bib.bib188)) to full-translation (Wenzl et al., [2019](https://arxiv.org/html/2605.08419#bib.bib60)). Direct and minimally-invasive schemes target specific tasks such as diverting control flow, inserting trampolines, or performing instruction-level modifications. Full-translation techniques, on the other hand, usually translate programs to specialized intermediate representations (IRs) and eventually reassemble a new binary. IRs used by such rewriters aim to faithfully represent the original program semantics. Examples include VEX IR (Valgrind Project, [2024](https://arxiv.org/html/2605.08419#bib.bib142)), Binary Analysis Platform’s (BAP) IL (Brumley et al., [2011](https://arxiv.org/html/2605.08419#bib.bib109)), and REIL (Dullien and Porst, [2009](https://arxiv.org/html/2605.08419#bib.bib143)). Crucially, full-translation techniques use the expressivity of such powerful IRs to recover higher-level constructs such as control flow, basic blocks, and functions, which enables them to apply complex program-wide analyses and transformations.

*Dynamic rewriters* transform a program during program execution. This is achieved by using an instrumentation engine such as PIN (Luk et al., [2005](https://arxiv.org/html/2605.08419#bib.bib94)) or DynamoRIO (Bruening et al., [2003](https://arxiv.org/html/2605.08419#bib.bib161)) that inserts fine-grained hooks during native execution, or by running the binary inside a virtual environment such as QEMU (Bellard, [2005](https://arxiv.org/html/2605.08419#bib.bib71)) or Valgrind (Nethercote and Seward, [2007](https://arxiv.org/html/2605.08419#bib.bib116)).
Compared to static rewriters, dynamic approaches can perform much more precise and fine-grained modifications as they can observe control flow and program state at runtime. However, modifications performed by such rewriters only persist for the duration of the current execution run.

Existing static binary recompilers have been surprisingly limited in their capabilities until quite recently; most of the published recompilation frameworks are not even able to recompile all of the constituent programs of some of the most basic standard benchmarking suites such as SPECint 2006. This is fundamentally due to the fact that “lifting” is a hard problem: Horspool and Marovac (Horspool and Marovac, [1980](https://arxiv.org/html/2605.08419#bib.bib113)) showed as far back as 1980 that the general problem of “detranslating” (decompiling/disassembling) a binary executable requires being able to differentiate with certainty between code and data, which for most computer architectures is equivalent to the Halting Problem (Turing, [1937](https://arxiv.org/html/2605.08419#bib.bib4)) and is hence unsolvable in general. Our approach overcomes this problem by translating each byte of the executable under all possible interpretations separately, and hence not having to make this determination at all.
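
To make the idea of decoding every byte under every interpretation concrete, the following C sketch attempts a decode at each byte offset of a three-byte buffer. It is only an illustration under stated assumptions: the toy decoder knows just two x64 encodings (`0xB0 ib` for MOV AL, imm8 and `0xC3` for RET), and the names `decode_len` and `superset_decode` are hypothetical, not part of Elevator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy decoder (assumption: only two opcodes are recognized).
 * Returns the length in bytes of the instruction starting at `off`,
 * or 0 if no supported instruction starts there. */
static size_t decode_len(const uint8_t *code, size_t size, size_t off) {
    assert(off < size);
    if (code[off] == 0xC3) return 1;                   /* RET */
    if (code[off] == 0xB0 && off + 1 < size) return 2; /* MOV AL, imm8 */
    return 0;
}

/* Superset pass: attempt a decode at EVERY byte offset instead of
 * committing to one instruction stream. len_at[off] records the length
 * of the candidate instruction at that offset (0 means "no candidate").
 * Returns the number of candidate instruction starts. */
static size_t superset_decode(const uint8_t *code, size_t size, size_t *len_at) {
    size_t candidates = 0;
    for (size_t off = 0; off < size; off++) {
        len_at[off] = decode_len(code, size, off);
        if (len_at[off] != 0) candidates++;
    }
    return candidates;
}
```

For the bytes `B0 C3 C3`, every offset yields a candidate: a two-byte MOV at offset 0 and a one-byte RET at offsets 1 and 2. The byte at offset 1 is therefore kept both as the immediate operand of the MOV and as an instruction in its own right, which is exactly the ambiguity a single-interpretation disassembler would have to resolve by guessing.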
Previous static binary lifters (Anand et al., [2013](https://arxiv.org/html/2605.08419#bib.bib67); Dinaburg and Ruef, [2014](https://arxiv.org/html/2605.08419#bib.bib47); Yadavalli and Smith, [2019](https://arxiv.org/html/2605.08419#bib.bib21); Di Federico et al., [2017](https://arxiv.org/html/2605.08419#bib.bib78); Williams-King et al., [2020](https://arxiv.org/html/2605.08419#bib.bib191); Panchenko et al., [2019](https://arxiv.org/html/2605.08419#bib.bib157); Dinesh et al., [2020](https://arxiv.org/html/2605.08419#bib.bib162)) have attempted to approximate the differentiation between code and data by using imprecise heuristics, which becomes a problem especially when trying to predict targets of indirect control flow transfers (Pang et al., [2020](https://arxiv.org/html/2605.08419#bib.bib98)). For example, LLBT (Shen et al., [2012](https://arxiv.org/html/2605.08419#bib.bib19), [2014](https://arxiv.org/html/2605.08419#bib.bib20)) performs static translation of ARM binaries by lifting ARM instructions to LLVM IR and recompiling to the target architecture. However, like many static binary rewriters, LLBT relies on heuristics to detect potential indirect branch targets, rendering it vulnerable to incorrect translation when processing obfuscated or manually crafted indirect branching code. Additionally, LLBT makes several other assumptions about the input binaries during code identification, which help to shrink the size of their address mapping table and make the output binaries suitable for embedded devices where binary size is a serious concern.

The use of such imprecise heuristics is the main reason why all of the existing recompilers based on static binary lifting have problems handling even relatively simple benchmark programs: even a good heuristic will fail on some inputs, while correct lifting of an entire binary requires that the heuristic gets every single code-versus-data decision right.
Hence, the larger a binary becomes, the higher the chances that at least one heuristics-based decision will come up wrong somewhere.

Conversely, dynamic approaches follow the flow of instructions as they are actually executed (Altinay et al., [2020](https://arxiv.org/html/2605.08419#bib.bib39)). They are thereby able to handle not only precise instruction recovery but also indirect control flows, by design; after all, the processor must also be able to follow and correctly decode the instruction stream. However, dynamic methods can only lift instructions that are reached during concrete executions of the program. Hence, a strictly dynamic lifter may have an incomplete view of the control flow graph contained in a binary program, and the parts of the binary that correspond to “unseen” parts will be omitted from the lifted code. As a consequence, any output binary generated from the lifted code will have to be able to deal with situations in which a dynamically computed branch suddenly jumps into “terra incognita,” i.e., a previously unseen piece of code that was not covered during lifting (and that may have been mistaken as “data”). Static lifters also have this problem when dealing with control flow targets that are reachable only via computed branches, some of which may evade regular static control-flow analysis.

To make matters even more complicated, ISAs such as x64 that have variable-length instructions make it possible to nest instruction sequences within each other. A branch terminating in the middle of a multi-byte instruction will result in the operands of the original instruction being decoded as instructions in their own right; “return-oriented programming” (ROP) attacks (Shacham, [2007](https://arxiv.org/html/2605.08419#bib.bib26)) frequently make use of this fact, but this strategy may also be used for code obfuscation.
Hence, in order to statically cross-compile an entire binary, one needs to not only correctly take into account differentiation between code and data, but also all possible valid instruction sequences that may be embedded, at an offset, within other valid instruction sequences.

Current best practice is to combine static and dynamic approaches to handle precisely these corner cases. For example, commercial companies have employed such hybrid approaches when transitioning between ISAs, combining an interpreter for the old ISA along with a dynamic code generator for the new one. Apple employed a system called Rosetta 2 in their transition from x64 to AArch64 on Mac systems (Cunningham, [2025](https://arxiv.org/html/2605.08419#bib.bib16)), a hybrid dynamic binary translation approach that performs some instruction translation ahead-of-time while translating others dynamically upon first discovery (Nakagawa, [2021](https://arxiv.org/html/2605.08419#bib.bib3)). Similarly, Microsoft has deployed a system called Prism (Microsoft, [2024](https://arxiv.org/html/2605.08419#bib.bib5)) to support their “Windows on Arm (WoA)” (Arm, Inc., [2025a](https://arxiv.org/html/2605.08419#bib.bib1)) initiative, combining ahead-of-time translation with dynamic components to handle undiscovered code and edge cases. Among recent academic contributions, WYTIWYG (Parzefall et al., [2024](https://arxiv.org/html/2605.08419#bib.bib18)) and Polynima (Deshpande et al., [2024](https://arxiv.org/html/2605.08419#bib.bib25)) perform static binary lifting along control flow paths that have previously been identified by dynamic profiling; they also rely on fall-back mechanisms that collect additional control-flow information dynamically whenever branches terminate at previously unseen target addresses.

In contrast to all of these prior works, we present what we believe is the first general, fully static, heuristics-free binary cross-compiler that scales even to large programs.
The key to our approach is that instead of ever making a determination as to whether any byte in the original program binary should be interpreted as code or data, as an instruction word or an argument to such an instruction, we create a separate control flow path under every feasible interpretation of that byte. This is an application of the concept of *superset disassembly* (Bauman et al., [2018](https://arxiv.org/html/2605.08419#bib.bib163)), which was first described in the context of binary rewriting, but which to the best of our knowledge has not been applied to static recompilation, let alone cross-compilation from one ISA to another. Our work here demonstrates that the approach can be made both robust and practical, while our measurements reveal some of the consequences of trading decoding precision for code expansion in this manner.

## 3. System Overview

Elevator operates on the principle of complete x64 state preservation within the translated AArch64 code. The system employs a one-to-one mapping between x64 and AArch64 registers, emulating the state of each x64 register within a corresponding AArch64 register. The x64 stack is emulated directly on the AArch64 stack, allowing the operating system to handle normal stack expansion as the program executes. Rather than analyzing the application binary interface (ABI) of the input x64 binary, Elevator performs ABI translation only when execution transitions to and from external code; at those transition points, the standard rules of the x64 System V ABI and the AArch64 Procedure Call Standard regarding argument placement in registers and on the stack can be applied directly.
The combination of complete state preservation and one-to-one register correspondence is what enables *independent* translation of instructions: each x64 instruction can be translated with no knowledge of what runs before or after it, because the live state it reads and writes sits in the same dedicated AArch64 registers at every program point. Instruction-level isolation in turn allows us to represent the input as a superset control-flow graph (CFG), derived byte by byte from the original binary, and to translate each candidate x64 instruction in the graph into AArch64.

Translating an individual candidate instruction under this discipline is mechanical. The hard part of the pipeline is constructing the superset CFG itself. As mentioned in Section [2](https://arxiv.org/html/2605.08419#S2), distinguishing code from data in a binary is in general equivalent to the Halting Problem (Turing, [1937](https://arxiv.org/html/2605.08419#bib.bib4)) and therefore unsolvable; every static translator that attempts to commit, at each byte of the input, to a single interpretation is forced to rely on heuristics that will be wrong on some inputs. Listings [1](https://arxiv.org/html/2605.08419#LST1) and [2](https://arxiv.org/html/2605.08419#LST2) exhibit two pathological but entirely legal x64 programs that illustrate some of these issues.

Listing 1: Overlapping instruction example.

```asm
OverlappingInstruction:
    xor eax, eax
    mov al, 0xC2
    test rdi, rdi
    jz ReturnC2
ReturnC3:
    .byte 0xB0          # MOV AL, imm8 opcode; the next byte becomes its immediate
ReturnC2:
    .byte 0xC3          # RET when decoded from here; immediate 0xC3 when decoded from ReturnC3
    ret
```

Listing 2: Computed indirect branch.

```asm
WeirdIndirectBranch:
    and rdi, 3
    shl rdi, 1
    xor eax, eax
    call Label
    inc eax
    inc eax
    inc eax
    inc eax
    ret
Label:
    pop rsi
    add rsi, rdi
    jmp rsi
```

Listing [1](https://arxiv.org/html/2605.08419#LST1) illustrates the nested-instruction phenomenon. Starting a decode at the .byte 0xB0 yields MOV AL, 0xC3 followed by RET, whereas starting one byte later at ReturnC2 yields just RET.
Both decodes are reachable from the preceding jz, and any translator that commits to a single interpretation of these two bytes will silently miss one of them. Listing [2](https://arxiv.org/html/2605.08419#LST2) illustrates a computed indirect branch: the call instruction pushes the address of the first inc eax, which serves as the base address of the table and is recovered by pop rsi. An input-dependent offset is then added to it to construct the target of jmp rsi, so the branch may land at any of the four inc eax instructions, which lie two bytes apart in the encoded stream. A translator that rewrites only statically resolvable jump targets has nowhere to land this branch.

Elevator sidesteps the code-versus-data determination altogether through an application of superset disassembly (Bauman et al., [2018](https://arxiv.org/html/2605.08419#bib.bib163)): we simultaneously interpret every executable byte offset in the original binary as (i) data and (ii) the start of a potential instruction sequence beginning at that offset, and we build the superset control flow graph from every one of the resulting candidate decodes. Every potential target of indirect jumps, callbacks, or other runtime dispatch mechanisms that cannot be statically analyzed therefore has a corresponding landing point in the rewritten binary. These targets are resolved at runtime through a lookup table from original instruction addresses to translated code addresses that we embed in the final binary.

Elevator’s translation process falls into two distinct stages. The first, executed once offline and independent of any input binary, constructs a reusable *tile bank*: a set of precompiled AArch64 byte sequences, one for every concrete combination of an x64 instruction and its operand registers. Each compiled sequence reads and writes the AArch64 registers that hold the emulated x64 state directly.
The second, executed on every input binary, performs superset disassembly, selects the appropriate tile from the bank for each candidate x64 instruction it has discovered, concatenates the selected tiles into the body of an AArch64 object file, embeds the address-lookup table alongside, and links the object against the Elevator runtime driver to produce the final stand-alone executable. Section [4](https://arxiv.org/html/2605.08419#S4) develops both stages, beginning with the offline construction of the tile bank.

[Diagram: x86-64 binary → Elevator → AArch64 executable; tile bank (built offline)]

Figure 1. Elevator system overview. For each input x86-64 binary, Elevator consults a reusable tile bank, built once offline from hand-written C tiles compiled through LLVM with a custom calling convention, and emits a stand-alone AArch64 executable.

## 4. Translating the CFG

Elevator separates translation into three stages. An offline stage (Section [4.1](https://arxiv.org/html/2605.08419#S4.SS1)) expresses x64 instruction semantics as C functions, specializes them per operand combination under a fixed x64-to-AArch64 register mapping, and compiles the whole set through a modified LLVM 20 into a reusable *tile bank*. A per-binary stage (Section [4.2](https://arxiv.org/html/2605.08419#S4.SS2)) then rewrites an input x64 binary by looking each candidate instruction’s tile up in the bank by name and stitching the retrieved AArch64 byte sequences together, with a small set of hand-crafted templates for the instruction categories that cannot be expressed as C tiles (control-flow transfers and ABI crossings). A final packaging stage (Section [4.3](https://arxiv.org/html/2605.08419#S4.SS3)) combines the translated code, the original x64 binary, an address-lookup table, and a runtime driver into the stand-alone AArch64 executable.
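
The name-based lookup and concatenation performed by the per-binary stage can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not Elevator's implementation: the bank here is a two-entry array searched linearly, the tile bodies are placeholder NOP words rather than LLVM-compiled code, the flag tile name `ADD8_FLAGS_RCX_RDX` is hypothetical, and `BRK #0` stands in for whichever trapping instruction is emitted for unsupported instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const char *name;       /* tile name derived from opcode and operands */
    const uint32_t *words;  /* precompiled AArch64 instruction words */
    size_t count;           /* number of 32-bit words in the tile */
} Tile;

/* Placeholder bodies (assumption): real tiles hold compiled code. */
static const uint32_t NOP1[] = {0xD503201Fu};              /* NOP */
static const uint32_t NOP2[] = {0xD503201Fu, 0xD503201Fu}; /* NOP; NOP */
static const uint32_t TRAP[] = {0xD4200000u};              /* BRK #0 */

static const Tile BANK[] = {
    {"ADD8_RCX_RCX_RDX", NOP2, 2},
    {"ADD8_FLAGS_RCX_RDX", NOP1, 1}, /* hypothetical flag-tile name */
};

/* Look a tile up by name; returns NULL when the bank has no such tile. */
static const Tile *find_tile(const char *name) {
    for (size_t i = 0; i < sizeof BANK / sizeof BANK[0]; i++)
        if (strcmp(BANK[i].name, name) == 0) return &BANK[i];
    return NULL;
}

/* Concatenate the tiles for `names` back-to-back into `out`.
 * A name missing from the bank is replaced by a single trap word.
 * Returns the number of words written. */
static size_t emit_tiles(const char *const *names, size_t n,
                         uint32_t *out, size_t cap) {
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        const Tile *t = find_tile(names[i]);
        const uint32_t *src = t ? t->words : TRAP;
        size_t cnt = t ? t->count : 1;
        assert(used + cnt <= cap);
        memcpy(out + used, src, cnt * sizeof *src);
        used += cnt;
    }
    return used;
}
```

Because every tile reads and writes emulated state in fixed registers, emission needs no register allocation or scheduling across tiles, only copying. A production bank would presumably use a hash map instead of a linear scan.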
Using an existing compiler to produce target-ISA code snippets that are then extracted and stitched together is an approach shared with template-based just-in-time compilers (Piumarta and Riccardi, [1998](https://arxiv.org/html/2605.08419#bib.bib11); Iliasov, [2003](https://arxiv.org/html/2605.08419#bib.bib8); Wimmer et al., [2013](https://arxiv.org/html/2605.08419#bib.bib12); Xu and Kjolstad, [2021](https://arxiv.org/html/2605.08419#bib.bib7)); what is distinctive in our design is how tiles are specialized per x64 operand, how LLVM is configured so that the compiled tile code operates directly on emulated x64 state, and how the three stages cooperate so that the per-binary stage reduces to byte-level selection and concatenation.

### 4.1. Offline: Building the Tile Bank (Performed only Once)

Writing a translation map from every x64 instruction to an equivalent AArch64 instruction sequence by hand, one assembly sequence at a time, is impractical. A single template such as the ADD Reg8, Reg8 form already expands into 256 concrete register combinations, and the full x64 instruction set has many such templates across its register, memory-operand, and immediate addressing-mode variants; hand-writing an AArch64 encoding for each one would require dual expertise in both ISAs and an error-prone amount of effort. Elevator therefore does not write such a map directly. We express the semantics of each x64 instruction as a small C function, specialize it per concrete operand combination, and let LLVM compile the resulting set into AArch64. As a concrete example, Listing [3](https://arxiv.org/html/2605.08419#LST3) shows the template tile for the x64 instruction ADD Reg8, Reg8, and Listing [4](https://arxiv.org/html/2605.08419#LST4) shows the specialized tile that emerges for the particular instruction ADD RCX, RDX.
Listing 3: Template tile for ADD Reg8, Reg8.

```c
/* Low 8 bits take the sum; the upper 56 bits of R1 pass through unchanged. */
uint64_t ADD8_R1_R1_R2(uint64_t R1, uint64_t R2) {
    return ((R1 + R2) & MASK8ULL) | (R1 & ~MASK8ULL);
}
```

Listing 4: Specialized tile for ADD RCX, RDX.

```c
/* Return value and first argument live in X3 (RCX); second argument in X2 (RDX). */
__attribute__((aarch64_custom_reg("X3:X3,X2")))
uint64_t ADD8_RCX_RCX_RDX(uint64_t R1, uint64_t R2) {
    return ((R1 + R2) & MASK8ULL) | (R1 & ~MASK8ULL);
}
```

The template in Listing 3 returns the new value of the destination register: the lower eight bits are updated with the 8-bit sum while the upper 56 bits remain unmodified, matching the x64 semantics for partial-register writes. The x64 ADD Reg8, Reg8 instruction also affects the RFLAGS register, modifying the Carry, Parity, Auxiliary Carry, Zero, Sign, and Overflow flags; since a C function is constrained to return a single value, we capture the flag updates in a separate, dedicated flag tile that runs alongside the arithmetic one. A single x64 instruction may therefore correspond to one or several tiles, which we concatenate back-to-back at emission time to recover the full semantics.

Two things change between the template in Listing 3 and the specialized form in Listing 4. The function name has been rewritten from ADD8_R1_R1_R2 to ADD8_RCX_RCX_RDX, pinning the template’s positional register arguments to the concrete x64 operands RCX and RDX. An aarch64_custom_reg attribute has also been attached, declaring the AArch64 registers in which LLVM is to place the return value and each argument: in this example the return value and first argument both bind to X3, and the second argument binds to X2, reflecting the mapping RCX ↦ X3 and RDX ↦ X2. The body of the function, which operates on the local variables R1 and R2, is unchanged; the attribute is what causes LLVM to read those locals out of X3 and X2 on entry and to write the result back into X3 on exit.
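
The split between an arithmetic tile and its companion flag tile can be illustrated in portable C. The sketch below is an assumption-laden illustration rather than the paper's code: `add8_tile` restates the arithmetic from Listing 3 with `MASK8ULL` defined as `0xFF`, while `add8_flags` is a hypothetical flag tile that packs the six affected flags at their architectural RFLAGS bit positions. How Elevator actually lays the flags out in its dedicated register is not specified here.

```c
#include <assert.h>
#include <stdint.h>

#define MASK8ULL 0xFFULL /* assumed value of the mask used in Listing 3 */

/* Arithmetic tile: 8-bit add with x64 partial-register-write semantics. */
static uint64_t add8_tile(uint64_t r1, uint64_t r2) {
    return ((r1 + r2) & MASK8ULL) | (r1 & ~MASK8ULL);
}

/* Architectural RFLAGS bit positions of the flags ADD modifies. */
enum { CF = 1u << 0, PF = 1u << 2, AF = 1u << 4,
       ZF = 1u << 6, SF = 1u << 7, OF = 1u << 11 };

/* Hypothetical flag tile: flags produced by the 8-bit add of r1 and r2. */
static uint32_t add8_flags(uint64_t r1, uint64_t r2) {
    uint32_t a = (uint32_t)(r1 & MASK8ULL);
    uint32_t b = (uint32_t)(r2 & MASK8ULL);
    uint32_t wide = a + b;        /* 9-bit result, bit 8 is the carry-out */
    uint32_t res = wide & 0xFFu;
    uint32_t flags = 0;

    if (wide > 0xFFu) flags |= CF;                 /* unsigned overflow */
    uint32_t p = res;                              /* fold parity into bit 0 */
    p ^= p >> 4; p ^= p >> 2; p ^= p >> 1;
    if ((p & 1u) == 0) flags |= PF;                /* even number of 1 bits */
    if ((a ^ b ^ res) & 0x10u) flags |= AF;        /* carry out of bit 3 */
    if (res == 0) flags |= ZF;
    if (res & 0x80u) flags |= SF;
    if ((a ^ res) & (b ^ res) & 0x80u) flags |= OF; /* signed overflow */
    return flags;
}
```

Adding 1 to a register whose low byte is `0xFF` leaves the upper 56 bits untouched and wraps the low byte to zero, while the flag tile reports carry, parity, auxiliary carry, and zero. Keeping the two computations in separate functions is what lets the flag half be dropped independently when its result is never read.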
Everything else in this subsection is about how the template is turned into the specialized form, and how that specialized form is compiled and packaged into the tile bank.

A fixed x64-to-AArch64 register mapping is realized by a custom LLVM calling convention applied per tile, chosen under three constraints:

1. Volatility preservation. An x64 register that is callee-saved under System V maps to an AArch64 register callee-saved under AAPCS64, and symmetrically for caller-saved registers, so that calls into AAPCS64 libraries preserve emulated x64 state across the boundary.
2. Argument-position alignment. The register holding the n-th integer argument under System V maps to the register holding the n-th integer argument under AAPCS64, so that a call from translated code to an AArch64 library restates its positional arguments rather than reshuffling them across the call.
3. Minimality. We consume as few AArch64 callee-saved registers as the first two constraints allow. AArch64 offers twelve such registers against x64’s seven, and we keep the surplus free for future shadow state without perturbing the existing mapping.

Beyond the general-purpose registers, x64’s RFLAGS bits and XMM register file are held in dedicated AArch64 registers under the same one-to-one discipline, keeping the full emulated state resident in the register file.

Producing the tile bank itself is mostly mechanical from here. A modified LLVM 20 honors the aarch64_custom_reg attribute on a per-function basis and reclassifies the AArch64 registers backing emulated x64 state as callee-saved inside the register allocator, so that neither argument placement nor scratch usage inside a tile can corrupt emulated state. A small source-to-source tool, TileGen, walks each C template and emits one specialized copy per admissible operand combination, synthesizing the attribute mechanically from the template’s parameter positions using the register mapping above.
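
The specialization step can be sketched as plain string synthesis. The following C fragment is a hypothetical stand-in for TileGen, not its actual code: `specialize` handles only two-operand templates whose destination is also the first source, and the register table covers just the six integer argument registers. Only RCX ↦ X3 and RDX ↦ X2 are stated in the text; the remaining four entries are inferred from the argument-position constraint and should be read as assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct { const char *x64; const char *a64; } RegMap;

/* n-th System V integer argument register -> n-th AAPCS64 one.
 * Only the RDX and RCX rows are confirmed by the text (assumption). */
static const RegMap ARG_REGS[] = {
    {"RDI", "X0"}, {"RSI", "X1"}, {"RDX", "X2"},
    {"RCX", "X3"}, {"R8", "X4"},  {"R9", "X5"},
};

/* Map an x64 register name to its AArch64 home, or NULL if unmapped. */
static const char *map_reg(const char *x64) {
    for (size_t i = 0; i < sizeof ARG_REGS / sizeof ARG_REGS[0]; i++)
        if (strcmp(ARG_REGS[i].x64, x64) == 0) return ARG_REGS[i].a64;
    return NULL;
}

/* Build the specialized tile name (e.g. ADD8_RCX_RCX_RDX) and the
 * attribute string "ret:arg1,arg2" (e.g. X3:X3,X2) for a template of
 * the form MNEMONIC_R1_R1_R2. Returns 0 on success, -1 if a register
 * is unmapped or an output buffer is too small. */
static int specialize(const char *mnemonic, const char *dst, const char *src,
                      char *name, size_t name_cap,
                      char *attr, size_t attr_cap) {
    assert(mnemonic && dst && src && name && attr);
    const char *d = map_reg(dst);
    const char *s = map_reg(src);
    if (!d || !s) return -1;
    int n = snprintf(name, name_cap, "%s_%s_%s_%s", mnemonic, dst, dst, src);
    int a = snprintf(attr, attr_cap, "%s:%s,%s", d, d, s);
    if (n < 0 || (size_t)n >= name_cap) return -1;
    if (a < 0 || (size_t)a >= attr_cap) return -1;
    return 0;
}
```

Calling `specialize("ADD8", "RCX", "RDX", ...)` reproduces the name and attribute string of Listing 4. Enumerating every admissible destination and source pair with such a routine is what turns one template into its full family of specialized tiles.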
Compiling the specialized file through the modified LLVM, with a short post-pass that makes tile bodies concatenation-safe, yields the tile bank: a map from tile name to AArch64 byte sequence, built once offline and consumed by the per-binary stage described next.

### 4.2. Rewriting an x64 Binary (Performed Once for Each Binary)

Given an input x64 binary, the per-binary stage performs superset disassembly and walks the resulting CFG. For each node, a formatter derives the tile name from the decoded instruction’s opcode and operands, composing multiple names for instructions whose effects span several tiles.

While x64 imposes no restriction on stack pointer alignment, AArch64 requires strict 16-byte alignment when using the stack pointer in memory operands. Although we emulate the x64 stack on the AArch64 stack, directly mapping RSP to SP creates several complications. The absence of alignment requirements in x64 creates frequent violations in regular code patterns. For example, consecutive PUSH instructions in function prologues guarantee non-16-byte-aligned memory accesses, which would trigger exceptions on AArch64. We address this by having tiles access the stack through a separate register, X25, and only materializing SP in it when tiles actually require it. Additionally, since our tiles compiled by LLVM expect 16-byte SP alignment upon entry, we align SP down prior to executing any tile detected as allocating spill space, restoring it either from a saved register or from X25, depending on whether the tile modified RSP.

We also implement a targeted optimization to eliminate unnecessary flag computation tiles, which we identify as being relatively expensive when compared to other tiles. If the flags are overwritten prior to a read in a post-dominating instruction, the flag computation part of the current node’s tile can be removed.
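
The flag-tile elimination can be illustrated on the simplest possible case. The sketch below is a deliberate simplification and not Elevator's analysis: it handles one straight-line instruction sequence with a backward liveness scan instead of a post-dominator query on the superset CFG, it treats every flag-writing instruction as overwriting all flags (which is not true of, for example, INC), and the `Insn` type and `prune_flag_tiles` name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool reads_flags;    /* e.g. JZ, ADC, SETcc */
    bool writes_flags;   /* e.g. ADD, XOR, TEST (assumed to write all flags) */
    bool need_flag_tile; /* output: must the flag tile be emitted? */
} Insn;

/* Backward scan over a straight-line sequence. Flags are conservatively
 * assumed live at the end, since code after the sequence may read them.
 * Returns the number of flag tiles that can be dropped. */
static size_t prune_flag_tiles(Insn *seq, size_t n) {
    assert(seq != NULL || n == 0);
    bool live = true;
    size_t removed = 0;
    for (size_t i = n; i-- > 0;) {
        if (seq[i].writes_flags) {
            /* The flag tile is needed only if a later read can see it. */
            seq[i].need_flag_tile = live;
            if (!live) removed++;
            live = false; /* this write hides all earlier flag values */
        } else {
            seq[i].need_flag_tile = false;
        }
        if (seq[i].reads_flags) live = true; /* the read happens before the write */
    }
    return removed;
}
```

Applied to a sequence shaped like the start of Listing 1 (XOR, MOV, TEST, JZ), the scan keeps the flag tile of TEST, because JZ reads it, and drops the one belonging to XOR, whose flags are overwritten by TEST before anything reads them.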
When we encounter unsupported instructions, which currently consist principally of x64’s AVX2 and later wide-vector extensions, we insert an interrupt instruction in place of a tile. This has no practical impact: superset disassembly inherently decodes numerous invalid or spurious instruction sequences at arbitrary byte offsets, but these occur on program paths that are never reached during normal control flow. Our evaluation across all of SPECint 2006 demonstrates that our supported set, comprising the full x86-64 integer ISA and the SSE subset exercised by SPECint, is sufficient to execute every benchmark. Furthermore, Elevator is designed to be an extensible framework, so that adding new tiles to support additional instructions is a straightforward process; however, the extra engineering work is highly unlikely to yield any additional scientific insights.

#### 4.2.1. Control-Flow Instructions

Call, return, and branch instructions cannot be expressed as C tiles: their semantics depend on architectural decisions (return-address location, program counter, conditional-flags layout) that differ between x64 and AArch64, so a naive mapping such as x64 CALL onto AArch64 BL would break the emulated x64 stack. We hand-craft the translation for each category.

Call. Direct calls need no ABI translation: we push the original x64 return address (the x64 call site plus five) onto the emulated stack and branch to the translated tile of the callee. The stacked address stays in the original x64 address space; the branch target is its translated counterpart. Indirect calls, whose target is known only at runtime and may land either inside the translated binary or inside an external library, emit a bounds check against the embedded x64 binary and branch accordingly. In both cases the original x64 return address is pushed first. For internal targets, the x64-offset-to-tile table translates the target before an unconditional branch to the corresponding tile.
For external targets, we install the address of a reverse ABI-translation gadget in X30 (where the AArch64 library will return to), perform the exit ABI translation, and branch to the external target.

Return. Returns pop the 8-byte return address from the emulated stack and compare it against the embedded x64 binary’s bounds. Internal returns translate the address through the lookup table and branch to the corresponding tile; external returns perform the return-side ABI translation before branching to the target.

Branch. We split branches by whether the target is statically resolvable (direct branches) or computed at runtime (indirect branches). Direct branches encode their target as an immediate offset relative to RIP, so the target is known at translation time. Unconditional JMP Imm8/Imm32 maps to AArch64 B with the offset re-encoded from x64’s 8 or 32 bits into AArch64’s 26-bit range. Conditional branches translate to the AArch64 conditional branches that test the x64 flag bits held in X14. In both cases we emit the branch with a placeholder offset and defer address fixup to the linker, which patches the final offset once all tiles have been placed and inserts a veneer if the 26-bit range is exceeded.

Figure 2. Indirect branch handling and ABI translation at exit boundaries. (a) Indirect branch leaving the translated binary. (b) Exit ABI translation with $n=4$ stack arguments.

Indirect branches may target either the translated binary or an external library; the latter typically arises when a call to an external library at the end of a function is optimized into a tail call, as in Figure [2(a)](https://arxiv.org/html/2605.08419#S4.F2.sf1). We emit the same bounds check used for indirect calls and returns, and perform exit ABI translation when the target is external. The indirect jump differs from the indirect call only in that the ABI-translation path assumes a preceding relative-call tile has already placed the return address at [RSP].
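The bounds check shared by indirect calls, returns, and indirect branches can be modeled in a few lines. This is a simplified sketch under stated assumptions: the function names, the Python list standing in for the x64-offset-to-tile table, and the example addresses are all hypothetical, and the real check is emitted as AArch64 code rather than executed by a dispatcher.

```python
def dispatch(target, base, size, lookup_table):
    """Classify a runtime x64 target address.

    base and size describe the embedded x64 image. A target inside the image
    is translated through the offset-indexed table; anything else is treated
    as an external library address that needs exit ABI translation.
    """
    if base <= target < base + size:
        return ("internal", lookup_table[target - base])
    return ("external", target)


def indirect_call(target, return_address, emulated_stack, base, size, lookup_table):
    """Model an indirect CALL: push the x64 return address, then dispatch."""
    emulated_stack.append(return_address)  # stays in the x64 address space
    return dispatch(target, base, size, lookup_table)


# Hypothetical image: 0x100 bytes loaded at 0x400000, one tile for offset 0x10.
BASE, SIZE = 0x400000, 0x100
table = [0] * SIZE
table[0x10] = 0x9000  # hypothetical tile address in the translated binary

stack = []
print(indirect_call(0x400010, 0x400055, stack, BASE, SIZE, table))
print(indirect_call(0x7F0000001000, 0x400060, stack, BASE, SIZE, table))
```

The first call resolves to the tile recorded for offset 0x10; the second falls outside the image and is classified as external, which is where the exit ABI translation would run.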
#### 4.2.2. Crossing the x64/AArch64 ABI Boundary

Elevator supports only dynamically linked binaries. This side-steps the need to translate architecture-specific instructions such as CPUID, which statically linked binaries inline directly but dynamically linked binaries delegate to libc. To facilitate this, Elevator supports transitioning between the x64 and AArch64 Linux ABIs when interacting with dynamically linked libraries. There are four distinct cases where ABI translation is necessary as execution crosses between our emulated x64 environment and native AArch64 library code. The two aspects requiring translation are argument placement and return address location.

The System V x64 ABI (used by x64 Linux) designates six registers, RDI, RSI, RDX, RCX, R8, and R9, as argument registers, with additional arguments passed on the stack starting at [RSP+8]. The x64 CALL instruction stores the return address on the stack below any arguments at [RSP]. In contrast, the AArch64 Procedure Call Standard defines eight argument registers (X0-X7) with remaining arguments on the stack at [SP], while also storing the return address in register X30.

Calls to External Libraries. When a translated x64 call instruction targets an external library, we must change the argument layout to respect AArch64’s calling conventions. First, we subtract 8 from SP to realign the stack to a 16-byte boundary, leaving the x64 return address, which was already on the stack, at [SP+0x8]. We then load two values from stack positions [SP+0x10] and [SP+0x18] into registers X6 and X7, giving AArch64 libraries access to potential arguments 7 and 8 that the translated x64 code would have placed on the stack if they exist. However, any potential remaining stack arguments are left starting in the wrong location, [SP+0x20]. Ideally we would have popped the x64 return address off the stack, as well as the values stored into X6 and X7.
Unfortunately, this is unsafe, as we cannot be sure that the popped values are arguments rather than caller spill space or part of a structure allocated on the caller’s stack. Instead, we leave the caller’s stack layout entirely untouched and allocate an additional $n \times 8$ bytes of stack space. In this new space, we copy in $n$ potential 8-byte arguments from their current locations, starting at [SP+0x20+n×8]. This stack copy, starting at the new adjusted SP, now holds any potential stack arguments (arguments numbered 9 and above) that would be passed to the callee. Figure [2(b)](https://arxiv.org/html/2605.08419#S4.F2.sf2) depicts the transformation applied to the stack and register file upon control flow leaving the translated binary, with black being the layout before, and blue being the updated layout after. The maximum number of arguments Elevator allows any function call in the input binary to pass is $n$ stack arguments plus 6 register arguments, with $n$ defaulting to 10. However, $n$ is fully configurable and can easily be increased to support input binaries that call external library functions with more than 16 total arguments. Finally, we store the address of a gadget in X30 for the external library to return to.

Returns from External Libraries. When control returns to the gadget whose address was stored in X30 before the call to the external library, the previously copied stack arguments are cleaned up by adding $n \times 8$ to the stack pointer. We then move the external library’s return value from X0 into X9 (RAX), where the emulated x64 code expects it. Finally, we pop the original x64 return address and its associated padding from the stack, translate the address, and branch to it, thereby resuming execution after the original CALL.

Callbacks Into Translated Code. When native AArch64 code calls into our translated binary, we must convert from AArch64’s calling convention to x64’s. The emulated x64 code expects arguments 7 and 8 on the stack rather than in registers X6 and X7.
We push X7 first, then X6, placing them at the stack positions where x64 would expect these arguments. If the callee does not actually expect a 7th and 8th argument, these pushed values will not have any effect. Finally, we push the return address, which the AArch64 branch-and-link instruction in the external library will have put in X30, onto the stack where the x64 return instructions will expect it.

Callback Returns from Translated Code. When translated code returns from a callback to an external library, we reverse the entry process. The return address is popped off the stack, and the stack space allocated by pushing X6 and X7 is cleaned up by adding 0x10 to the stack pointer. Since our translated callback code will have put the return value in X9 (RAX), but the external library will expect it in X0, we move it from X9 into X0.

Architecture-dependent structures. A few data structures have different layouts on x64 and AArch64 and must be translated in addition to the argument and return-address marshalling above. The most prevalent is va_arg, whose layout reflects the ABI’s argument-register count (six for System V, eight for AAPCS64). We install intercept stubs on the affected library routines (vsprintf and vsprintf_chk) that perform the extra x64-to-AAPCS64 va_arg rewrite as control leaves the binary.

### 4.3. The Translated Executable

Once the superset CFG has been rewritten into an AArch64 code stream, the remaining work is to turn that stream, together with its auxiliary structures, into a stand-alone ELF executable. The translated binary embeds four components: the translated code itself, the original x64 binary preserved verbatim as read-only data, the address-lookup table that the runtime uses to resolve computed indirect branches, and a small runtime driver that installs memory protections and a signal handler at startup.
#### 4.3.1. Binary Layout

Translated Code. The translated superset CFG is included in the new file by inserting the tiles into symbols for the linker to place. The system employs a greedy algorithm that traces static fall-through edges in the superset CFG forward and merges consecutive tiles into single symbols. Heuristics are used to identify likely call targets in the superset CFG, with the algorithm starting at these positions to merge likely executed tiles together. When a branching instruction is encountered, forward merging is stopped and restarted at the branch target(s). Labels to each tile within the merged symbols are preserved, as other tiles outside the symbol may still need to jump into them. When an instruction is encountered that has already been visited by the algorithm, meaning it already exists in another merged symbol, a branch to it is inserted and the linear merging stops. The symbols are passed to the linker (LLD), which places them into the final executable and adds relocations to the x64-offset-to-tile lookup table with the final tile positions.

Original Binary. We preserve the original x64 binary, embedding it in a new section within the translated binary. This embedded section contains a mapped view of the original binary with all segments expanded and positioned as they would appear if the ELF loader had loaded it into memory for execution. Initially, the embedded binary resides in a read-only section; however, during program startup, we parse the original program headers and apply the correct memory protections to each segment. Crucially, we intentionally avoid marking any originally executable sections as executable in the translated binary. None of the original code will be executed, and the fact that a signal will be raised if execution is attempted allows us to detect external calls into the binary, as will be explained later.
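The greedy symbol-merging pass described under Translated Code above can be sketched as follows. This is an illustrative reconstruction, not Elevator's code: the CFG encoding (a dictionary from x64 offset to a `fallthrough` offset and a list of branch `targets`) is hypothetical, and two details are assumptions of this sketch, namely that a conditional branch also restarts merging at its fall-through successor, and that offsets never reached from the seeds are swept in ascending order afterwards so that every superset node lands in some symbol.

```python
from collections import deque


def merge_tiles(cfg, seeds):
    """Group tiles into linker symbols by following fall-through edges."""
    visited = set()
    symbols = []

    def drain(worklist):
        while worklist:
            start = worklist.popleft()
            if start in visited or start not in cfg:
                continue
            symbol, cur = [], start
            while cur is not None and cur in cfg:
                if cur in visited:
                    # Already placed in another symbol: branch there and stop.
                    symbol.append(("branch", cur))
                    break
                visited.add(cur)
                symbol.append(("tile", cur))
                node = cfg[cur]
                if node["targets"]:
                    # Branching instruction: stop and restart at its successors.
                    worklist.extend(node["targets"])
                    if node["fallthrough"] is not None:
                        worklist.append(node["fallthrough"])
                    break
                cur = node["fallthrough"]
            symbols.append(symbol)

    drain(deque(seeds))            # likely call targets first
    for offset in sorted(cfg):     # then any superset node not yet placed
        drain(deque([offset]))
    return symbols


# Tiny superset CFG: real instructions at 0, 1, 3, 5 and an overlapping
# decode at offset 2 that falls through into the instruction at 3.
cfg = {
    0: {"fallthrough": 1, "targets": []},
    1: {"fallthrough": 3, "targets": []},
    2: {"fallthrough": 3, "targets": []},
    3: {"fallthrough": 5, "targets": [0]},   # conditional branch back to 0
    5: {"fallthrough": None, "targets": []},  # return
}
for symbol in merge_tiles(cfg, seeds=[0]):
    print(symbol)
```

The seed at offset 0 yields one symbol holding tiles 0, 1, and 3; the fall-through at 5 becomes its own symbol; and the overlapping decode at 2 becomes a one-tile symbol that ends in a branch to the already-placed tile 3.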
Lookup Table. The x64-offset-to-tile lookup table enables efficient translation between original x64 addresses and their corresponding translated code locations. The table is an array of 8-byte offsets into the translated binary, and is indexed into using offsets in the original binary.

Driver. The driver program is a small segment of C code linked into the translated binary that executes before the translated x64 code. The driver code takes on many of the responsibilities the original x64 loader would have performed, such as applying proper segment memory protections, processing PLT relocations, resolving external library symbols and populating GOT entries. It intercepts ABI-incompatible library calls, such as vsprintf, redirecting them to separate functions to translate their architecture-specific structures.

#### 4.3.2. Signal Handler

Range checks added to indirect call and jump instructions enable straightforward detection of when ABI transitions are necessary upon leaving the translated binary. However, detecting when control flow *enters* the translated binary, which also necessitates ABI translation, is more difficult. Elevator solves this by installing signal handlers that catch attempts to execute code within the embedded original x64 binary, and subsequently perform ABI translation and redirect control flow to the corresponding tiles.

Escaping Pointers. Code pointers referencing our translated binary can escape to external libraries in only two ways: (1) function pointers passed as parameters to external libraries, and (2) exported functions. In both cases, these escaped pointers reference addresses within the original x64 binary, which in the translated binary will be non-executable. With this, we can be sure that every attempt to execute code within our binary, from an outside source, will result in attempted execution inside the embedded x64 binary.
When this happens, a hardware exception is raised and execution is subsequently transferred to our signal handler where an ABI transition can be performed.

## 5. Evaluation

To evaluate Elevator, we first validate that Elevator preserves the functionality of the original x86-64 binaries. We then measure the static translation cost (Section [5.2](https://arxiv.org/html/2605.08419#S5.SS2)), followed by runtime performance compared against native AArch64 binaries and against two dynamic binary translators.

The two translators cover the current state of the art along different dimensions. QEMU 8.2.2 (Bellard, [2005](https://arxiv.org/html/2605.08419#bib.bib71)) in user-mode emulation is the de facto reference for portable dynamic translation and the baseline used by nearly every prior binary-rewriting study on Linux. Box64 (Chevalier and Box64 Contributors, [2024](https://arxiv.org/html/2605.08419#bib.bib317)) is a mature, actively developed x86-64 to AArch64 dynamic translator originating in the gaming and Wine ecosystem, and represents the upper end of performance currently achievable for this ISA pair. We emphasize up front that Box64 is a production-quality engineering artifact refined over many years by a large open-source community, with extensive hand-tuned AArch64 dynarec code paths for common x86 instruction patterns; it is not directly comparable, on an engineering-effort basis, to a research prototype such as Elevator.

It is also not directly comparable on a structural basis. Box64 executes the x86-64 input through a runtime engine that JIT-compiles basic blocks on first encounter and that itself remains resident on every run, whereas Elevator produces a self-contained AArch64 ELF that replaces the input entirely at translation time. No dynamic translator of which we are aware can produce such an artifact, and the obstacle is fundamental to the dynamic approach rather than an engineering gap to close.
Dynamic translators discover code by following the program counter at runtime, so computed branches, indirect calls, and any code introduced through dlopen, self-modification, or JIT compilation have no form that an ahead-of-time pipeline can consume. Elevator’s superset disassembly is what makes whole-program static translation tractable, and is the reason Elevator has a shippable artifact to produce at all. We nevertheless include Box64 as an aspirational reference point: the gap between Elevator and Box64 quantifies the optimization headroom that more sophisticated code generation could eventually deliver on top of our static, superset-based approach. Finally, we analyze the binary size expansion introduced by the superset-based rewriting approach that underlies Elevator.

Benchmarks. Our evaluation uses the SPEC CPU2006 integer suite (SPECint 2006). We focus exclusively on the integer benchmarks because full translation of the SPECfp 2006 binaries would require additional engineering to handle the SSE and x87 floating-point repertoire without yielding proportional scientific insight into Elevator’s design. For each benchmark we compile x86-64 inputs with gcc 13.3.0 at both -O2 and -O3. The corresponding native AArch64 baselines are produced by recompiling the same sources with gcc 13.3.0 at the matching optimization level. All benchmarks use the reference (ref) input set.

We deliberately use SPECint 2006 rather than migrating to the newer SPEC CPU 2017 suite, and we motivate the choice carefully because both suites now appear in the recent binary-translation literature.
SPECint 2006 remains the benchmark of record for the static binary rewriters and cross-ISA translators from whose designs Elevator most directly inherits or against which it is most directly comparable, including Multiverse (Bauman et al., [2018](https://arxiv.org/html/2605.08419#bib.bib163)), Egalito (Williams-King et al., [2020](https://arxiv.org/html/2605.08419#bib.bib191)), BinRec (Altinay et al., [2020](https://arxiv.org/html/2605.08419#bib.bib39)), HQEMU (Hong et al., [2012](https://arxiv.org/html/2605.08419#bib.bib312)), and MAMBO-X64 (D’Antras et al., [2017](https://arxiv.org/html/2605.08419#bib.bib314)). Among these, Multiverse is the closest intellectual antecedent: its superset disassembly is the direct origin of Elevator’s superset-based rewriting pipeline. Retaining SPECint 2006 therefore places Elevator on directly comparable footing with the prior work that shares its technical foundation. We acknowledge that a separate line of recent work, including Biotite (Chen et al., [2025](https://arxiv.org/html/2605.08419#bib.bib315)) and the LLVM-based DBT of Engelke (Engelke et al., [2021](https://arxiv.org/html/2605.08419#bib.bib316)), reports on SPEC CPU 2017. We regard the two benchmark choices as complementary rather than competing. x86-64 SPECint 2006 built with gcc 13.3.0 demonstrates Elevator’s ability to handle modern code generation while preserving direct comparability with previous work in the field of static binary translation.

Experimental Setup. All benchmark evaluations were performed on an AArch64 server running Ubuntu 24.04.2 LTS with Linux kernel 6.8.0. The system is a GIGABYTE 1U Mt. Snow 1S chassis housing an Ampere Altra processor built on the Arm Neoverse-N1 microarchitecture, with 64 single-threaded cores running at 3.0 GHz, 64 GB of DDR4-3200 ECC memory in a single NUMA node, and a 1 TB Samsung NVMe SSD. Each core has 64 KB L1-d, 64 KB L1-i, and a private 1 MB L2 cache.
We run every (benchmark, input, execution mode) combination three times under the system’s default frequency governor and report the median wall-clock time measured by /usr/bin/time -v. Hardware performance counters (cycles, instructions, cache references, cache misses, branches, and branch mispredictions) are collected on every run via the Linux perf stat interface; for each configuration we extract the counters from the run whose wall-clock time equals the reported median.

Special Cases and Modifications. Benchmark 471.omnetpp unconditionally throws a C++ exception to terminate itself; however, Elevator does not yet support exception handling. We therefore applied a minimal modification to a single function, replacing two throw statements with equivalent return values. This preserves identical control-flow semantics and does not affect the benchmark’s computational characteristics. Additionally, we crafted several input test binaries that perform unconventional control-flow tricks, including non-standard indirect branching and overlapping instruction sequences. These were created to probe edge cases that only an assumption-less superset-based CFG can handle, and thus to demonstrate the effectiveness of superset disassembly as the rewriting substrate underlying Elevator.

### 5.1. Correctness

We validate Elevator’s correctness at both the individual-instruction level and the whole-binary level. We verify each tile by comparing its output against the corresponding x86 instruction executed on native x86 hardware. Each instruction has over 100,000 test inputs, combining randomly generated values with carefully selected edge cases that include boundary values, arithmetic overflow and underflow conditions, and flag combinations. We also verify end-to-end correctness by comparing outputs of translated binaries against their original x64 counterparts.
Across all SPECint 2006 benchmarks (at both -O2 and -O3 optimization levels) and our custom programs featuring unconventional control flow, all outputs match. This demonstrates that Elevator correctly preserves program semantics throughout the entire translation process.

### 5.2. Elevator Translation Speed

Figure 3. (a) Translation time. (b) x86-64 .text size; labels give total ELF size. Elevator’s translation time (a) tracks input .text size (b) at Pearson $r = 0.9993$; only 403.gcc and 483.xalancbmk swap between the two orderings.

Figure [3](https://arxiv.org/html/2605.08419#S5.F3)(a) reports the translation time for each SPECint 2006 benchmark at both -O2 and -O3. Translating the entire suite takes 140 s at -O2 and 167 s at -O3, and the per-benchmark cost is strongly correlated with the size of the input x86-64 .text section shown in Figure [3](https://arxiv.org/html/2605.08419#S5.F3)(b) (Pearson $r = 0.9993$). The annotations above each bar report the full ELF binary size, which can be substantially larger than .text when a benchmark bundles static data. 445.gobmk, for instance, ships with a $\sim$4.5 MB pattern database. Elevator disassembles and retranslates only .text, so it is that section’s size that governs the pipeline’s cost.

### 5.3. Runtime Performance

Figure 4. (a) Wall-clock runtime and (b) executed instructions on SPECint 2006 at -O2/-O3 for native, Box64, Elevator, and QEMU. A broken y-axis compresses the 403.gcc and 464.h264ref QEMU outliers so the main data range stays at high resolution.

Table 1. Geometric-mean runtime slowdown vs. native AArch64 on SPECint 2006. Lower is better.

Table 2. Geometric-mean executed-instruction inflation vs. native AArch64 on SPECint 2006. Lower is better.
#### 5.3.1. Observed Runtime

Figure [4](https://arxiv.org/html/2605.08419#S5.F4)(a) and Table [1](https://arxiv.org/html/2605.08419#S5.T1) report wall-clock runtime on SPECint 2006. Elevator slows execution down by a geometric-mean factor of $4.88\times$ at -O2 and $4.79\times$ at -O3 relative to the native AArch64 baseline, with per-benchmark slowdowns between $2.34\times$ and $9.85\times$ at -O2 and between $2.50\times$ and $7.85\times$ at -O3. QEMU user-mode is both slower on average and substantially more variable: geometric means of $7.24\times$ and $7.69\times$, with per-benchmark ranges of $3.33\times$–$33.20\times$ and $3.43\times$–$34.81\times$. Box64, which we include as a reference to the performance achievable by a mature production dynamic translator, slows execution down by $1.58\times$ and $1.62\times$ on average. Elevator executes faster than QEMU on 7 of 12 benchmarks at -O2 and on 8 of 12 at -O3. The rest of this section decomposes where Elevator’s overhead comes from and what would move the numbers.

#### 5.3.2. Decomposing the Overhead

Runtime can be factored exactly into the executed instruction count and the average cycles per instruction:

$$\frac{T_{\text{sys}}}{T_{\text{native}}} \;=\; \underbrace{\frac{I_{\text{sys}}}{I_{\text{native}}}}_{\text{instruction inflation}} \;\times\; \underbrace{\frac{\text{CPI}_{\text{sys}}}{\text{CPI}_{\text{native}}}}_{\text{per-instruction ratio}}.$$
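The factorization above follows from the definition of CPI. As a short derivation, assume both runs execute at the same clock frequency $f$ (an assumption of this sketch, which holds only approximately under a dynamic frequency governor), and write $C$ for the cycle count of a run:

```latex
\begin{align*}
T &= \frac{C}{f}, \qquad \text{CPI} = \frac{C}{I}
  \quad\Longrightarrow\quad T = \frac{I \cdot \text{CPI}}{f}, \\[4pt]
\frac{T_{\text{sys}}}{T_{\text{native}}}
  &= \frac{I_{\text{sys}} \cdot \text{CPI}_{\text{sys}} / f}
          {I_{\text{native}} \cdot \text{CPI}_{\text{native}} / f}
   = \frac{I_{\text{sys}}}{I_{\text{native}}}
     \times \frac{\text{CPI}_{\text{sys}}}{\text{CPI}_{\text{native}}}.
\end{align*}
```

The frequency cancels, so the runtime ratio depends only on how many instructions are executed and how many cycles each one takes on average.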
Figure [4](https://arxiv.org/html/2605.08419#S5.F4)(b) and Table [2](https://arxiv.org/html/2605.08419#S5.T2) report the first term. Elevator inflates the instruction stream by $7.12\times$ at -O2 and $7.05\times$ at -O3, with a tight per-benchmark range of $5.04\times$–$8.79\times$ (-O2) and $4.78\times$–$9.05\times$ (-O3). QEMU averages $9.36\times$ and $9.56\times$ but spans nearly an order of magnitude across benchmarks ($4.91\times$–$33.03\times$ at -O2, $4.95\times$–$33.36\times$ at -O3). Box64 inflates by $1.94\times$ and $1.96\times$.

Figure [5](https://arxiv.org/html/2605.08419#S5.F5)(a) reports the second term. Elevator’s geometric-mean CPI is $0.443$ at -O2 and $0.445$ at -O3, against a native baseline of $0.645$ and $0.655$; the CPI ratio is therefore $0.69\times$ and $0.68\times$. Combining the two factors, Elevator’s predicted slowdown at -O3 is $7.05 \times 0.68 \approx 4.8\times$, which is consistent with the $4.79\times$ observed at -O3. Applying the same decomposition to QEMU gives $9.56 \times 0.81 \approx 7.7\times$. Both factors contribute, but for Elevator the per-instruction ratio is below unity and partially offsets the inflation. In other words, the overhead Elevator incurs relative to native sits almost entirely in the size of the translated instruction stream, not in how quickly that stream retires on the target pipeline.

#### 5.3.3. Where the Instructions Come From

The $7\times$ instruction inflation is produced by a small number of structural sources that every x86–AArch64 translator must address. First, x86 computes six condition flags (PF, AF, SF, ZF, CF, OF) on most arithmetic operations, while AArch64 provides only the four NZCV flags natively; faithful emulation of PF, AF, and the x86-specific interactions with shifts and rotates requires multi-instruction sequences for every flag-writing x86 op.
Second, x86’s complex addressing modes (e.g. [base + index\*scale + disp]) decompose into a short address-computation sequence in AArch64 whenever the source operand cannot be expressed directly. Third, the AArch64 footprint of a translated x86 instruction depends on its semantics. Constructs such as REP-prefixed string operations and implicit partial-register updates sit at the heavier end of this spectrum, consistently producing more target instructions than their plainer counterparts. The tightness of Elevator’s inflation range ($5.04\times$–$8.79\times$) across the suite indicates that these sources scale roughly proportionally with the source instruction count rather than with any particular program pattern, which makes them a good target for per-instruction optimization.

Figure 5. Microarchitectural behavior of Elevator against native AArch64 on SPECint 2006 at -O2 and -O3: (a) cycles per instruction (CPI); (b) branch-miss rate. Log y-axes absorb the 429.mcf (CPI) and 473.astar (branch-miss) outliers. QEMU and Box64 are omitted because their per-instruction rates are diluted by translator-internal instructions and are not directly comparable on a per-instruction basis.

#### 5.3.4. Microarchitectural Behavior

Figure [5](https://arxiv.org/html/2605.08419#S5.F5) reports per-benchmark CPI and branch-miss rate for Elevator and native. CPI is below native on 11 of 12 benchmarks at each optimization level, including the memory-bound 429.mcf, whose native CPI of $\approx 3.18$ drops to $\approx 1.67$ under translation. We interpret this not as a performance win but as a corollary of the translation: lowering x86 CISC sequences to AArch64 produces regular streams of simple, independent $\mu$ops that the out-of-order back-end can schedule in parallel more easily than the denser native code, even when the total amount of work is higher.
The single exception is 456.hmmer, whose hot kernel is already a tight native AArch64 loop with CPI below $0.34$ and little remaining headroom; the translated stream of simple $\mu$ops, while wider, retires slightly less densely (CPI $0.345$ at -O2, $0.360$ at -O3).

Branch and cache behavior is broadly preserved across translation. Elevator’s geometric-mean branch-miss rate is $1.01\%$ at -O2 and $1.04\%$ at -O3, compared with $1.21\%$ and $1.27\%$ native; the rate is at or below native on 8 of 12 benchmarks at -O2 and 9 of 12 at -O3. The handful of benchmarks whose Elevator branch-miss rate marginally exceeds native (403.gcc, 456.hmmer, 462.libquantum, and at -O2 also 483.xalancbmk) all sit at very low absolute rates below $1\%$, where the percentage comparison is dominated by small absolute counts rather than by a systematic predictor degradation.

The conclusion from this subsection is negative. The pipeline, the branch predictor, and the cache hierarchy are not where Elevator’s overhead lives; it lives in the length of the translated instruction stream. That stream is determined at translation time, so Elevator’s cost on any input is a property of the shipped binary rather than its execution history, and any optimization that does not shorten that stream is unlikely to change the headline numbers.

#### 5.3.5. Predictability

The translated instruction stream is fixed at translation time. No translator state, code cache, or dispatch machinery runs alongside the translated binary, and no input can cause the stream to grow, shift, or respecialize. Elevator’s runtime cost on a given input is therefore a static property of the translated artifact: determined at translation time, inspectable without execution, and bounded above by whatever qualification inputs have already demonstrated. This property is one only a fully static, heuristic-free translator can deliver.
The per-benchmark spread in Figure [4](https://arxiv.org/html/2605.08419#S5.F4)(a) is its empirical shadow: QEMU’s first-encounter translation, dispatch, and code-cache management work is part of the measured runtime, paid whenever an input drives execution into previously unseen code. Box64 compresses this spread substantially, but no amount of engineering can drive the first-encounter cost to zero.

### 5.4. Code Size Expansion

Figure 6. Code-size cost of Elevator’s superset translation on SPECint 2006 at -O2 and -O3. (a) Translated .text expansion relative to natively-compiled AArch64. (b) Average x86-64 instruction length measured on each source binary.

Table 3. Multiplicative decomposition of Elevator’s geometric-mean .text expansion. Product matches the gmean in Figure [6](https://arxiv.org/html/2605.08419#S5.F6)(a).

Elevator emits an AArch64 sequence at every valid source-byte offset of the x86-64 .text, and no post-translation size reduction is applied. This follows from the assumption-free stance the paper has taken throughout (Bauman et al., [2018](https://arxiv.org/html/2605.08419#bib.bib163)). Every candidate-reduction path we are aware of (CFG-directed pruning, probabilistic disassembly (Miller et al., [2019](https://arxiv.org/html/2605.08419#bib.bib2))) either requires ground-truth information that the target setting of stripped legacy binaries does not supply, or introduces heuristics that can fail to translate x86 instructions. Optimizing the footprint is a choice we defer to deployments where such information is present, or where the static binary size itself is a hard constraint. This subsection reports what that deferral costs.
Elevator’s translated .text is 47.5× to 62.5× larger than natively-compiled AArch64 .text across SPECint 2006 (Figure [6](https://arxiv.org/html/2605.08419#S5.F6)(a)), with geometric means of 53× at -O2 and 54× at -O3: for every AArch64 instruction the native compiler emits, Elevator produces roughly 53. The 7× per-instruction lowering measured in Section [5.3.3](https://arxiv.org/html/2605.08419#S5.SS3.SSS3) accounts for only part of this cost; the remaining ≈7.5× follows from superset translation. Elevator emits a tile for every valid byte offset, yielding roughly 3.71 tiles per real x86 instruction given the suite’s average 4.06-byte encoding length (Figure [6](https://arxiv.org/html/2605.08419#S5.F6)(b)) and a measured valid-decode rate of ≈91%. The average tile is itself roughly twice the size of a real-instruction tile, because decodes starting at non-real offsets land on more complex x86 operations than compilers typically emit. Table [3](https://arxiv.org/html/2605.08419#S5.T3) summarizes the three factors, whose product 7 × 3.71 × 2.04 ≈ 53 recovers the gmean.

Across the suite the expansion range is bounded. The three lowest-bloat benchmarks (471.omnetpp, 483.xalancbmk, 445.gobmk) use shorter x86-64 encodings than the suite mean on both density and amplification axes; the two highest-bloat benchmarks switch with optimization level, with 464.h264ref at -O3 (62.5×) driven by vectorized long-encoding hot paths and 429.mcf at -O2 (59.2×) driven by a high amplification factor on a very small .text. Optimization level is otherwise immaterial: the geometric means differ by ≈3% and all per-benchmark swings from -O2 to -O3 are within ±3× except 464.h264ref (+5.8×). All expansion ratios reported in this section are for .text.
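
The multiplicative decomposition quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the rounded geometric-mean figures stated in the text; the variable names are ours, and the "implied" decode rate is simply the ratio of two quoted numbers, not an independently measured value.

```python
# Rounded geometric-mean figures quoted in Section 5.4.
lowering = 7.0          # AArch64 instructions per real x86 instruction (Section 5.3.3)
tiles_per_insn = 3.71   # tiles emitted per real x86 instruction
tile_size_ratio = 2.04  # average tile size relative to a real-instruction tile
avg_insn_len = 4.06     # average x86-64 encoding length in bytes

# Cost attributable to superset translation alone (density times amplification).
superset_factor = tiles_per_insn * tile_size_ratio

# Overall .text expansion relative to natively compiled AArch64.
total_expansion = lowering * superset_factor

# Fraction of byte offsets that must decode validly to give 3.71 tiles per instruction.
implied_decode_rate = tiles_per_insn / avg_insn_len

print(f"superset factor:           {superset_factor:.2f}x")
print(f"total .text expansion:     {total_expansion:.1f}x")
print(f"implied valid-decode rate: {implied_decode_rate:.1%}")
```

The superset factor comes out near the ≈7.5× quoted in the text, the total near 53×, and the implied decode rate near the reported ≈91%, so the three factors are mutually consistent at the stated precision.
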
The overall binary footprint grows by a smaller factor on benchmarks that bundle substantial static data, since Elevator passes data sections through unchanged (445.gobmk is the clearest case, carrying a ∼4 MB pattern database alongside its 623 KB .text). The three decomposition factors admit independent optimization paths discussed together with the runtime-side options in Section [5.5](https://arxiv.org/html/2605.08419#S5.SS5).

### 5.5. Design Choices and Potential Optimizations

Since the instruction stream dominates overhead, every natural avenue for improving Elevator comes down to emitting fewer instructions per translated x86 instruction.

Flag computation is the clearest case. Elevator currently runs a backward flag-liveness pass over linear instruction chains, which handles the common chained-arithmetic pattern inside a single block. The flag computation work could be reduced further by making the analysis per-flag rather than per-EFLAGS. An x86 arithmetic instruction writes up to six condition flags (PF, AF, SF, ZF, CF, OF), but the branch or condition that eventually reads them almost always reads only one or two, most often ZF for equality tests. Computing and materializing only the subset of flags that is actually live at each flag-writing site, rather than all six, would shrink the code generated for every live flag write.

A second potential optimization lies at the boundary between the translated binary and external libraries. An x64 call into a shared library expects arguments laid out in the x64 System V convention (Figure [2(b)](https://arxiv.org/html/2605.08419#S4.F2.sf2)); AArch64 libraries expect the AAPCS64 convention. Elevator bridges the two by conservatively copying up to *n* potential stack argument slots from x64 positions to AArch64 positions on every external call, because the translator cannot determine statically how many arguments the callee consumes, nor what size and type each one has.
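
As a rough illustration of why the two conventions do not line up, the sketch below maps integer arguments from their x86-64 System V locations to their AAPCS64 locations. It assumes every argument is an 8-byte integer or pointer, which is the simplest case only. The helper names and the `x64_stack[...]`/`a64_stack[...]` labels are hypothetical, and the code does not describe how Elevator implements its bridge.

```python
# Integer argument registers, in order, for each calling convention.
X64_INT_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")  # x86-64 System V
A64_INT_REGS = tuple(f"x{i}" for i in range(8))           # AAPCS64: x0..x7


def x64_location(index):
    """Where the index-th 8-byte integer argument lives on the x86-64 side."""
    if index < len(X64_INT_REGS):
        return X64_INT_REGS[index]
    return f"x64_stack[{index - len(X64_INT_REGS)}]"


def a64_location(index):
    """Where the same argument must be placed for an AArch64 callee."""
    if index < len(A64_INT_REGS):
        return A64_INT_REGS[index]
    return f"a64_stack[{index - len(A64_INT_REGS)}]"


def bridge_plan(num_args):
    """(source, destination) pairs for a callee known to take num_args integer arguments."""
    return [(x64_location(i), a64_location(i)) for i in range(num_args)]


def stack_slots_read(num_args):
    """x86-64 stack slots that actually carry arguments for such a callee."""
    return max(0, num_args - len(X64_INT_REGS))


for src, dst in bridge_plan(9):
    print(f"{src:>13} -> {dst}")
print("stack slots actually needed:", stack_slots_read(9))
```

With a known nine-argument integer signature only three x86-64 stack slots matter, two of which land in registers on the AArch64 side. Without the signature, a translator has to copy a fixed upper bound of slots on every call.
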
The size question matters as much as the count: both ABIs classify each argument into register or stack slots based on its type and byte size (scalars, small composites, and larger-than-16-byte aggregates are handled differently), so even recovering argument counts through heuristic dataflow analysis is not sufficient to elide the copy. Resolving callee signatures at translation time, from the dynamic symbol table, library headers, or debug information, would provide both pieces and allow the copy to shrink to the actual number of argument bytes that the callee expects on the stack. We leave this as future work, since Elevator is designed to handle stripped legacy binaries where such signature information is typically unavailable.

The larger determinant of Elevator’s instruction count is how the tiles (Section [4.1](https://arxiv.org/html/2605.08419#S4.SS1)) themselves are written. Elevator’s code generator is deliberately architecture-agnostic: it emits portable C that faithfully encodes the semantics of each x86 instruction, with no architecture-specific intrinsics, no inline assembly, and no backend-specific library calls. The host compiler is responsible for lowering the resulting C to target machine code. For instance, specializing the vector-heavy tiles to emit AArch64 intrinsics directly would produce shorter sequences than the compiler generates from portable C today, closing part of the inflation gap to Box64. We have not pursued that specialization for several reasons. The tiles are the correctness-validated core of Elevator (Section [5.1](https://arxiv.org/html/2605.08419#S5.SS1)), and replacing them with target-specific variants would reopen that validation surface on a new architecture.
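
The per-flag liveness refinement discussed earlier in this subsection, and the idea of a tile as a plain encoding of instruction semantics, can be illustrated together. The Python model below is only a sketch under stated assumptions: Elevator's actual tiles are portable C, the `(name, reads, writes)` chain representation is hypothetical, all flags are conservatively assumed live at the end of the chain, and `add64_tile` models just one instruction (64-bit ADD on unsigned operands).

```python
ALL_FLAGS = frozenset({"CF", "PF", "AF", "ZF", "SF", "OF"})
MASK64 = (1 << 64) - 1


def needed_flags(chain, live_out=ALL_FLAGS):
    """Backward per-flag liveness over a linear chain of (name, reads, writes).

    Returns, for each instruction, the subset of the flags it writes that a
    later instruction (or the chain exit) may still read.
    """
    live = set(live_out)
    needed = [frozenset()] * len(chain)
    for i in range(len(chain) - 1, -1, -1):
        _name, reads, writes = chain[i]
        needed[i] = frozenset(writes) & live
        live -= writes  # flags overwritten here are dead above this point...
        live |= reads   # ...unless this instruction itself reads them
    return needed


def add64_tile(a, b, want=ALL_FLAGS):
    """Model of a 64-bit ADD that materializes only the flags listed in `want`."""
    total = a + b
    res = total & MASK64
    flags = {}
    if "CF" in want:
        flags["CF"] = int(total > MASK64)               # carry out of bit 63
    if "ZF" in want:
        flags["ZF"] = int(res == 0)
    if "SF" in want:
        flags["SF"] = res >> 63                         # sign bit of the result
    if "OF" in want:
        flags["OF"] = ((a ^ res) & (b ^ res)) >> 63     # signed overflow
    if "AF" in want:
        flags["AF"] = ((a ^ b ^ res) >> 4) & 1          # carry out of bit 3
    if "PF" in want:
        flags["PF"] = int(bin(res & 0xFF).count("1") % 2 == 0)  # even parity, low byte
    return res, flags


NONE = frozenset()
chain = [
    ("add", NONE, ALL_FLAGS),
    ("add", NONE, ALL_FLAGS),
    ("jz", frozenset({"ZF"}), NONE),
    ("add", NONE, ALL_FLAGS),
]
needed = needed_flags(chain)
for (name, _reads, _writes), flags in zip(chain, needed):
    print(name, sorted(flags))
print(add64_tile(MASK64, 1, want=needed[1]))
```

In this chain the first ADD needs no flags at all and the second needs only ZF, whereas an all-or-nothing treatment of EFLAGS would compute all six for the second ADD.
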
The portable form also means that support for new x86 extensions lands on every backend simultaneously, without per-target engineering work. Elevator itself is purely static and follows ABI conventions only at external-library boundaries, so the correctness of the tile set directly determines whether the translated binary behaves the same as the input. The resulting tiles may not emit the most compact AArch64 sequences the hardware allows, but they are correctness-tested against the original x86 semantics, and they remain the natural default if a target-specialized variant is later introduced as an optional overlay.

The question this paper answers is whether superset disassembly with tile-based lowering produces a binary translator that is correct, delivers microarchitectural behavior close to native, and is competitive with a mature dynamic translator at the whole-program level. The data in this section supports each of those claims. Measuring the same benchmarks under an AArch64-specialized tile set, or retargeting the tiles to another ISA, is useful follow-up work rather than evidence for or against the approach itself.

Finally, given that Elevator already inflates binary size by a significant amount, adding specialized instruction sequences for hot code paths to achieve better performance is also a worthwhile direction for future optimizations.

## 6. Current Implementation Limitations

ABI Differences. Our zero-assumption argument-reorganizing ABI translation falls short for rare fundamental incompatibilities between source and target, such as structure-layout or argument-passing-convention mismatches. Identifying these and writing the interceptor functions that bridge them remains manual.

Multi-Threaded Binaries. Elevator currently supports only single-threaded binaries. The framework is largely designed to accommodate multi-threading, but two challenges remain.
First, the tcbhead_t structure that underlies thread-local storage (TLS) differs significantly between x64 and AArch64. Second, x64 uses a stronger (TSO) memory model than AArch64, and optimal fence placement to recover x86 ordering is undecidable at the binary level (Deshpande et al., [2024](https://arxiv.org/html/2605.08419#bib.bib25)); conservative placement degrades performance (Beck et al., [2023](https://arxiv.org/html/2605.08419#bib.bib15)). Hardware supporting the RCpc memory model (AArch64 v8.3 (Arm, Inc., [2025b](https://arxiv.org/html/2605.08419#bib.bib311))) addresses this mismatch directly.

Exception Handling. Elevator does not yet support binaries that use exception handling, which primarily affects C++ exceptions (as in benchmark 471.omnetpp). The main technical requirement is a stack unwinder that fetches x64 return addresses from the stack, since x64 stores them on the stack while AArch64 places them in a register. We have not implemented this because the engineering effort would not yield proportional scientific insight; this is a current implementation restriction, not a fundamental limitation of our approach.

x64 Extensions. Elevator supports a substantial portion of x64’s instruction set, but several extensions remain unsupported. While supporting the remainder of SSE would be straightforward, AVX2 and later expand the existing 128-bit registers to 256-bit and 512-bit widths. Both exceed AArch64’s SIMD register width of 128 bits and would require either implementing an additional memory-backed register context or using AArch64’s SVE extension.

Self-Modifying and JIT-Compiled Code. Elevator, like all fully static binary rewriters, does not support self-modifying or just-in-time-compiled code.

## 7. Future Work

Key priorities for extending this work to enhance its scope and applicability include support for multi-threaded input binaries and expansion to additional target instruction set architectures.
Beyond these architectural expansions, we plan to implement several optimization strategies, including dead code elimination and optimized flag computation, to reduce both binary size overhead and runtime performance impact.

## 8. Summary and Conclusion

We have presented what we believe is the first fully static whole-program binary cross-compiler from one ISA to another. Previous static translators have all relied on heuristics; such approaches become less and less practical as the size of the input program grows, since successful translation depends on successive heuristics getting *every* decision right. Our approach doesn’t use any heuristics at all and hence is practical for input programs of any size and complexity from any toolchain; our prototype implementation is mature enough to handle the entire SPECint 2006 benchmark suite, enabling a realistic evaluation on a range of input programs closely resembling real-world requirements. We can also translate input programs containing exotic overlapping/nested/obfuscated code constructs that existing binary translators have not been able to handle correctly.

Performance-wise, our current, not yet fully optimized implementation already matches or outperforms the state-of-the-art QEMU JIT-accelerated emulation framework on a majority of SPECint 2006 benchmarks.

From a practical perspective, our approach lowers the risk of deploying cross-ISA translation since the code that will ultimately be executed is generated in its entirety ahead of time. It can therefore be rigorously tested, certified, and possibly cryptographically signed in the same manner as traditional native binaries. In contrast, approaches based on emulation and JIT compilation implicitly depend on additional runtime components, and any tests validated under any specific version of these runtime components don’t necessarily transfer to any other version of these same components.

## References

- I\. Agadakos, D\. Jin, D\.
Williams\-King, V\. P\. Kemerlis, and G\. Portokalidis \(2019\)Nibbler: debloating binary shared libraries\.InProceedings of the 35th Annual Computer Security Applications Conference,ACSAC ’19,New York, NY, USA,pp\. 70–83\.External Links:ISBN 9781450376280,[Link](https://doi.org/10.1145/3359789.3359823),[Document](https://dx.doi.org/10.1145/3359789.3359823)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p1.1)\. - A\. Altinay, J\. Nash, T\. Kroes, P\. Rajasekaran, D\. Zhou, A\. Dabrowski, D\. Gens, Y\. Na, S\. Volckaert, C\. Giuffrida, H\. Bos, and M\. Franz \(2020\)BinRec: dynamic binary lifting and recompilation\.InProceedings of the Fifteenth European Conference on Computer Systems,EuroSys ’20,New York, NY, USA\.External Links:ISBN 9781450368827,[Link](https://doi.org/10.1145/3342195.3387550),[Document](https://dx.doi.org/10.1145/3342195.3387550)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p7.1),[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - K\. Anand, M\. Smithson, K\. Elwazeer, A\. Kotha, J\. Gruen, N\. Giles, and R\. Barua \(2013\)A compiler\-level intermediate representation based binary analysis and rewriting system\.InProceedings of the 8th ACM European Conference on Computer Systems,EuroSys ’13,New York, NY, USA,pp\. 295–308\.External Links:ISBN 9781450319942,[Link](https://doi.org/10.1145/2465351.2465380),[Document](https://dx.doi.org/10.1145/2465351.2465380)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - Arm, Inc\. \(2025a\)*Glossary: What is Windows on Arm \(WoA\)?*\.Note:[https://www\.arm\.com/glossary/windows\-on\-arm](https://www.arm.com/glossary/windows-on-arm)\(accessed 2025\-08\-19\)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p9.1)\. - Arm, Inc\. 
\(2025b\)*The Armv8\.3 Architecture Extension*\.Note:[https://developer\.arm\.com/documentation/109697/2025\_09/Feature\-descriptions/The\-Armv8\-3\-architecture\-extension](https://developer.arm.com/documentation/109697/2025_09/Feature-descriptions/The-Armv8-3-architecture-extension)\(accessed 2025\-12\-10\)Cited by:[§6](https://arxiv.org/html/2605.08419#S6.p2.1)\. - E\. Bauman, Z\. Lin, K\. W\. Hamlen,et al\.\(2018\)Superset disassembly: statically rewriting x86 binaries without heuristics\.\.InSymposium on Network and Distributed System Security \(NDSS\),Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p11.1),[§3](https://arxiv.org/html/2605.08419#S3.p4.1),[§5\.4](https://arxiv.org/html/2605.08419#S5.SS4.p1.1),[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - M\. Beck, K\. Bhat, L\. Stričević, G\. Chen, D\. Behrens, M\. Fu, V\. Vafeiadis, H\. Chen, and H\. Härtig \(2023\)AtoMig: automatically migrating millions lines of code from tso to wmm\.InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2,ASPLOS 2023,New York, NY, USA,pp\. 61–73\.External Links:ISBN 9781450399166,[Link](https://doi.org/10.1145/3575693.3579849),[Document](https://dx.doi.org/10.1145/3575693.3579849)Cited by:[§6](https://arxiv.org/html/2605.08419#S6.p2.1)\. - F\. Bellard \(2005\)QEMU, a fast and portable dynamic translator\.In2005 USENIX Annual Technical Conference \(USENIX ATC 05\),Anaheim, CA\.External Links:[Link](https://www.usenix.org/conference/2005-usenix-annual-technical-conference/qemu-fast-and-portable-dynamic-translator)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p3.1),[§5](https://arxiv.org/html/2605.08419#S5.p1.1)\. - D\. Bruening, T\. Garnett, and S\. Amarasinghe \(2003\)An infrastructure for adaptive dynamic optimization\.InProceedings of the International Symposium on Code Generation and Optimization: Feedback\-Directed and Runtime Optimization,CGO ’03,USA,pp\. 
265–275\.External Links:ISBN 076951913X,[Link](https://dl.acm.org/doi/10.5555/776261.776290)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p3.1)\. - D\. Brumley, I\. Jager, T\. Avgerinos, and E\. J\. Schwartz \(2011\)BAP: a binary analysis platform\.InProceedings of the 23rd International Conference on Computer Aided Verification,CAV’11,Berlin, Heidelberg,pp\. 463–469\.External Links:ISBN 9783642221095,[Link](https://dl.acm.org/doi/10.5555/2032305.2032342)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p2.1)\. - C\. Chen, S\. Sugita, Y\. Nada, H\. Irie, S\. Sakai, and R\. Shioya \(2025\)Biotite: a high\-performance static binary translator using source\-level information\.InProceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction,CC ’25,New York, NY, USA,pp\. 167–179\.External Links:ISBN 9798400714078,[Link](https://doi.org/10.1145/3708493.3712693),[Document](https://dx.doi.org/10.1145/3708493.3712693)Cited by:[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - S\. Chevalier and Box64 Contributors \(2024\)Box64: linux userspace x86\_64 emulator with a twist, targeted atARM64hosts\.Note:[https://github\.com/ptitSeb/box64](https://github.com/ptitSeb/box64)Accessed: 2026\-04\-20Cited by:[§5](https://arxiv.org/html/2605.08419#S5.p1.1)\. - A\. Cunningham \(2025\)Apple details the end of intel mac support and a phaseout for rosetta 2\.External Links:[Link](https://arstechnica.com/gadgets/2025/06/apple-details-the-end-of-intel-mac-support-and-a-phaseout-for-rosetta-2/)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p9.1)\. - A\. D’Antras, C\. Gorgovan, J\. Garside, and M\. Luján \(2017\)Low overhead dynamic binary translation on arm\.SIGPLAN Not\.52\(6\),pp\. 333–346\.External Links:ISSN 0362\-1340,[Link](https://doi.org/10.1145/3140587.3062371),[Document](https://dx.doi.org/10.1145/3140587.3062371)Cited by:[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - C\. Deshpande, F\. Parzefall, F\. Hetzelt, and M\. 
Franz \(2024\)Polynima: practical hybrid recompilation for multithreaded binaries\.InProceedings of the Nineteenth European Conference on Computer Systems,EuroSys ’24,New York, NY, USA,pp\. 1126–1141\.External Links:ISBN 9798400704376,[Link](https://doi.org/10.1145/3627703.3650065),[Document](https://dx.doi.org/10.1145/3627703.3650065)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p10.1),[§6](https://arxiv.org/html/2605.08419#S6.p2.1)\. - A\. Di Federico, M\. Payer, and G\. Agosta \(2017\)Rev\.ng: a unified binary analysis framework to recover cfgs and function boundaries\.InProceedings of the 26th International Conference on Compiler Construction,CC 2017,New York, NY, USA,pp\. 131–141\.External Links:ISBN 9781450352338,[Link](https://doi.org/10.1145/3033019.3033028),[Document](https://dx.doi.org/10.1145/3033019.3033028)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - A\. Dinaburg and A\. Ruef \(2014\)McSema: static translation of x86 instructions to LLVM\.Note:Presented at*REcon 2014*\(Montreal, Canada\)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - S\. Dinesh, N\. Burow, D\. Xu, and M\. Payer \(2020\)RetroWrite: statically instrumenting cots binaries for fuzzing and sanitization\.In2020 IEEE Symposium on Security and Privacy \(SP\),Vol\.,pp\. 1497–1511\.External Links:[Document](https://dx.doi.org/10.1109/SP40000.2020.00009)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - G\. J\. Duck, X\. Gao, and A\. Roychoudhury \(2020\)Binary rewriting without control flow recovery\.InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation,PLDI 2020,New York, NY, USA,pp\. 151–163\.External Links:ISBN 9781450376136,[Link](https://doi.org/10.1145/3385412.3385972),[Document](https://dx.doi.org/10.1145/3385412.3385972)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p2.1)\. - T\. Dullien and S\. 
Porst \(2009\)REIL: a platform\-independent intermediate representation of disassembled code for static code analysis\.Proceeding of CanSecWest\.Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p2.1)\. - A\. Engelke, D\. Okwieka, and M\. Schulz \(2021\)Efficient llvm\-based dynamic binary translation\.InProceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments,VEE 2021,New York, NY, USA,pp\. 165–171\.External Links:ISBN 9781450383943,[Link](https://doi.org/10.1145/3453933.3454022),[Document](https://dx.doi.org/10.1145/3453933.3454022)Cited by:[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - D\. Hong, C\. Hsu, P\. Yew, J\. Wu, W\. Hsu, P\. Liu, C\. Wang, and Y\. Chung \(2012\)HQEMU: a multi\-threaded and retargetable dynamic binary translator on multicores\.InProceedings of the Tenth International Symposium on Code Generation and Optimization,CGO ’12,New York, NY, USA,pp\. 104–113\.External Links:ISBN 9781450312066,[Link](https://doi.org/10.1145/2259016.2259030),[Document](https://dx.doi.org/10.1145/2259016.2259030)Cited by:[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - R\. N\. Horspool and N\. Marovac \(1980\)An approach to the problem of detranslation of computer programs\.The Computer Journal23\(3\),pp\. 223–229\.Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p4.1)\. - A\. Iliasov \(2003\)Templates\-based portable just\-in\-time compiler\.SIGPLAN Not\.38\(8\),pp\. 37–43\.External Links:ISSN 0362\-1340,[Link](https://doi.org/10.1145/944579.944588),[Document](https://dx.doi.org/10.1145/944579.944588)Cited by:[§4](https://arxiv.org/html/2605.08419#S4.p1.1)\. - M\. Kolsek and 0\. Team \(2017\)Did microsoft just manually patch their equation editor executable? why yes, yes they did\. 
\(cve\-2017\-11882\)\.Note:[https://blog\.0patch\.com/2017/11/did\-microsoft\-just\-manually\-patch\-their\.html](https://blog.0patch.com/2017/11/did-microsoft-just-manually-patch-their.html)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p1.1)\. - M\. Kolsek \(2017\)*Did Microsoft Just Manually Patch Their Equation Editor Executable? Why Yes, Yes They Did\. \(CVE\-2017\-11882\)*\.Note:[https://blog\.0patch\.com/2017/11/did\-microsoft\-just\-manually\-patch\-their\.html](https://blog.0patch.com/2017/11/did-microsoft-just-manually-patch-their.html)\(accessed 2025\-08\-20\)Cited by:[§1](https://arxiv.org/html/2605.08419#S1.p2.1)\. - C\. Luk, R\. Cohn, R\. Muth, H\. Patil, A\. Klauser, G\. Lowney, S\. Wallace, V\. J\. Reddi, and K\. Hazelwood \(2005\)Pin: building customized program analysis tools with dynamic instrumentation\.InProceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation,PLDI ’05,pp\. 190–200\.External Links:ISBN 1595930566,[Link](https://doi.org/10.1145/1065010.1065034),[Document](https://dx.doi.org/10.1145/1065010.1065034)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p3.1)\. - Microsoft \(2024\)*How emulation works on arm*\.Note:[https://learn\.microsoft\.com/en\-us/windows/arm/apps\-on\-arm\-x86\-emulation](https://learn.microsoft.com/en-us/windows/arm/apps-on-arm-x86-emulation)\(accessed 2025\-08\-19\)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p9.1)\. - K\. Miller, Y\. Kwon, Y\. Sun, Z\. Zhang, X\. Zhang, and Z\. Lin \(2019\)Probabilistic disassembly\.InProceedings of the 41st International Conference on Software Engineering,ICSE ’19,pp\. 1187–1198\.External Links:[Link](https://doi.org/10.1109/ICSE.2019.00121),[Document](https://dx.doi.org/10.1109/ICSE.2019.00121)Cited by:[§5\.4](https://arxiv.org/html/2605.08419#S5.SS4.p1.1)\. - K\. M\. 
Nakagawa \(2021\)Project Champollion: Reverse engineering Rosetta 2\.Note:[https://github\.com/FFRI/ProjectChampollion](https://github.com/FFRI/ProjectChampollion)\(accessed 2025\-08\-19\)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p9.1)\. - N\. Nethercote and J\. Seward \(2007\)Valgrind: a framework for heavyweight dynamic binary instrumentation\.InProceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation,PLDI ’07,New York, NY, USA,pp\. 89–100\.External Links:ISBN 9781595936332,[Link](https://doi.org/10.1145/1250734.1250746),[Document](https://dx.doi.org/10.1145/1250734.1250746)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p3.1)\. - M\. Panchenko, R\. Auler, B\. Nell, and G\. Ottoni \(2019\)BOLT: a practical binary optimizer for data centers and beyond\.InProceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization,CGO 2019,pp\. 2–14\.External Links:ISBN 9781728114361,[Link](https://dl.acm.org/doi/10.5555/3314872.3314876)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p1.1),[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - C\. Pang, R\. Yu, Y\. Chen, E\. Koskinen, G\. Portokalidis, B\. Mao, and J\. Xu \(2020\)SoK: all you ever wanted to know about x86/x64 binary disassembly but were afraid to ask\.External Links:2007\.14266,[Link](https://arxiv.org/abs/2007.14266)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - F\. Parzefall, C\. Deshpande, F\. Hetzelt, and M\. Franz \(2024\)What you trace is what you get: dynamic stack\-layout recovery for binary recompilation\.InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2,ASPLOS ’24,New York, NY, USA,pp\. 1250–1263\.External Links:ISBN 9798400703850,[Link](https://doi.org/10.1145/3620665.3640371),[Document](https://dx.doi.org/10.1145/3620665.3640371)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p10.1)\. - I\. Piumarta and F\. 
Riccardi \(1998\)Optimizing direct threaded code by selective inlining\.SIGPLAN Not\.33\(5\),pp\. 291–300\.External Links:ISSN 0362\-1340,[Link](https://doi.org/10.1145/277652.277743),[Document](https://dx.doi.org/10.1145/277652.277743)Cited by:[§4](https://arxiv.org/html/2605.08419#S4.p1.1)\. - C\. Qian, H\. Hu, M\. Alharthi, P\. H\. Chung, T\. Kim, and W\. Lee \(2019\)RAZOR: a framework for post\-deployment software debloating\.In28th USENIX Security Symposium \(USENIX Security 19\),Santa Clara, CA,pp\. 1733–1750\.External Links:ISBN 978\-1\-939133\-06\-9,[Link](https://www.usenix.org/conference/usenixsecurity19/presentation/qian)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p1.1)\. - H\. Shacham \(2007\)The geometry of innocent flesh on the bone: return\-into\-libc without function calls \(on the x86\)\.InProceedings of the 14th ACM Conference on Computer and Communications Security,CCS ’07,New York, NY, USA,pp\. 552–561\.External Links:ISBN 9781595937032,[Link](https://doi.org/10.1145/1315245.1315313),[Document](https://dx.doi.org/10.1145/1315245.1315313)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p8.1)\. - B\. Shen, J\. Chen, W\. Hsu, and W\. Yang \(2012\)LLBT: an llvm\-based static binary translator\.InProceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems,CASES ’12,New York, NY, USA,pp\. 51–60\.External Links:ISBN 9781450314244,[Link](https://doi.org/10.1145/2380403.2380419),[Document](https://dx.doi.org/10.1145/2380403.2380419)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - B\. Shen, W\. Hsu, and W\. Yang \(2014\)A retargetable static binary translator for the arm architecture\.ACM Trans\. Archit\. Code Optim\.11\(2\)\.External Links:ISSN 1544\-3566,[Link](https://doi.org/10.1145/2629335),[Document](https://dx.doi.org/10.1145/2629335)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\. - A\. M\. 
Turing \(1937\)On computable numbers, with an application to the entscheidungsproblem\.Proceedings of the London Mathematical Societys2\-42\(1\),pp\. 230–265\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1112/plms/s2-42.1.230)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p4.1),[§3](https://arxiv.org/html/2605.08419#S3.p2.1)\. - Valgrind Project \(2024\)Vex IR\.Note:[https://github\.com/smparkes/valgrind\-vex/blob/master/pub/libvex\_ir\.h](https://github.com/smparkes/valgrind-vex/blob/master/pub/libvex_ir.h)\(accessed 2025\-08\-19\)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p2.1)\. - M\. Wenzl, G\. Merzdovnik, J\. Ullrich, and E\. Weippl \(2019\)From hack to elaborate technique–a survey on binary rewriting\.ACM Computing Surveys \(CSUR\)52\(3\)\.External Links:[Link](https://doi.org/10.1145/3316415)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p1.1),[§2](https://arxiv.org/html/2605.08419#S2.p2.1)\. - D\. Williams\-King, H\. Kobayashi, K\. Williams\-King, G\. Patterson, F\. Spano, Y\. J\. Wu, J\. Yang, and V\. P\. Kemerlis \(2020\)Egalito: layout\-agnostic binary recompilation\.InProceedings of the Twenty\-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems,ASPLOS ’20,New York, NY, USA,pp\. 133–147\.External Links:ISBN 9781450371025,[Link](https://doi.org/10.1145/3373376.3378470),[Document](https://dx.doi.org/10.1145/3373376.3378470)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1),[§5](https://arxiv.org/html/2605.08419#S5.p4.1)\. - C\. Wimmer, M\. Haupt, M\. L\. Van De Vanter, M\. Jordan, L\. Daynès, and D\. Simon \(2013\)Maxine: an approachable virtual machine for, and in, java\.ACM Trans\. Archit\. Code Optim\.9\(4\)\.External Links:ISSN 1544\-3566,[Link](https://doi.org/10.1145/2400682.2400689),[Document](https://dx.doi.org/10.1145/2400682.2400689)Cited by:[§4](https://arxiv.org/html/2605.08419#S4.p1.1)\. - H\. Xu and F\. 
Kjolstad \(2021\)Copy\-and\-patch compilation: a fast compilation algorithm for high\-level languages and bytecode\.Proc\. ACM Program\. Lang\.5\(OOPSLA\)\.External Links:[Link](https://doi.org/10.1145/3485513),[Document](https://dx.doi.org/10.1145/3485513)Cited by:[§4](https://arxiv.org/html/2605.08419#S4.p1.1)\. - S\. B\. Yadavalli and A\. Smith \(2019\)Raising binaries to llvm ir with mctoll \(wip paper\)\.InProceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems,LCTES 2019,New York, NY, USA,pp\. 213–218\.External Links:ISBN 9781450367240,[Link](https://doi.org/10.1145/3316482.3326354),[Document](https://dx.doi.org/10.1145/3316482.3326354)Cited by:[§2](https://arxiv.org/html/2605.08419#S2.p5.1)\.