Disassembly

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

Disassembly involves recreating assembly code from binary machine code. In this article we discuss concepts surrounding disassembly and a basic algorithm used for disassembly.

Table of contents.

Introduction.
Instructions and data.
Disassembly with indirection.
Disassembly with relocation.
Summary.
References.

Prerequisite.

Assembler, Linker and Loader

Introduction.

Assemblers are simpler compared to compilers as their only function is to translate assembly code to binary machine code via a one-to-one correspondence.
Since assembly is simple, disassembly is often just as simple because the assembly instructions have a one-to-one correspondence with machine code.

A disassembler is the opposite of an assembler, it reverses the process and recreates assembly code from binary machine code.
A basic assembler can be implemented by reading in bytes and table lookups.

Instructions and data.

Assembly languages have separate code and data segments however an assembled program may contain data in the code segment.

The first step in disassembly is to distinguish between instructions and data in the sequence of bytes presented to the disassembler, specifically the addresses instructions start so as to decode them and data types so as to decode values correctly.

Initially the only thing we have is the entry point - start address of the binary program which points to an instruction, we analyze this instruction and from it we draw conclusions about other addresses, we repeat this until no new conclusions can be drawn.

A basic disassembly algorithm.

Data definitions

$A_{I}$ is the set of addresses where an instruction starts, each is associated with a label.
$A_{D}$ is the set of addresses where data items start, each is associated with a label and a type.

Initialization
$A_{I}$ is filled with the start address of the binary program.
$A_{D}$ is empty.

Inference rules
For each address A in $A_{I}$ , decode instruction at A and call it I.

If I is not a jump instruction, the address following I must be $A_{I}$
If I is an unconditional jump, the conditional jump or routine call instruction to address L, L must be in $A_{I}$ associated with a different label.
If I is a conditional jump of routine call instruction, the following address of I must be in $A_{I}$ .
If I accesses data at address L and uses its type T, L must be in $A_{D}$ , associated with different label from other labels and type T.

Jump instructions include conditional and unconditional jumps, routine calls and return
We use information in $A_{I}$ and $A_{D}$ to convert the binary sequence into assembly code starting from the beginning.
For each address A in $A_{I}$ we produce symbolic code for the instruction I preceded by a label if it has one. If I has more than one addresses, they will be in $A_{I}$ or $A_{D}$ and have labels therefore the labels can be produced in I.
For each address A in $A_{D}$ , we produce formatted data for the bit pattern preceded by the label.
If the external symbol table is available we can identify addresses and use their original names instead of labels to improve readability.

Address locations in analyzed segments won't be in either $A_{I}$ or $A_{D}$ since they point to the middle of an instruction or data item, also they may be addressing unused data or unreachable code. Others may occur multiple times. It is also a possibility that an address is in both $A_{I}$ and $A_{D}$ which means that the program uses instructions as data and/or vice versa.

Such analysis allows a program to be disassembled correctly, however it is convenient to flag these occurrences for manual inspection.

Disassembly with indirection.

Programs use indirect addresses which are obtained by computation rather than a direct derivation from an instruction. The discussed approach cannot discover indirect addresses.

Translation of switches and computed routine calls are the main sources of indirect instruction addresses.

An example
We have the following C switch code.

switch (ch) {
    case ’ ’ : code to handle space; break;
    case ’! ’ : code to handle exclamation mark; break;
    .
    .
    .
    case ’~’: code to handle tilde ; break;
}

Two possible intermediate translations are as follows.

/* common code */
                reg := ch;
                IF reg < 32 GOTO L_default;
                IF reg > 127 GOTO L_default;
                reg := reg − 32; /* slide to zero */
                
/* jump table */                    /* address table */
                                    reg := reg + L_address_table;
GOTO_INDEXED reg, L_jump_table; GOTO_INDIRECT reg;
L_jump_table:                       L_address_table:
GOTO L032;                              L032;
GOTO L033;                              L033;
.                                       .
.                                       .
.                                       .
GOTO L127;                              L127;
        L032: code to handle space; GOTO L_default;
        L033: code to handle exclamation mark; GOTO L_default;
              .
              .
              .
        L127: code to handle tilde; GOTO L_default;
            L_default:

From the above we can see that both the left and right translations use switch tables while the middle column is common to both.

The left translation uses a table of jump instructions into which flow of control is led.

The right translation uses a table of addresses which is picked and applied in an indirect jump.

GOTO_INDEXED reg, L_jump_table jumps to L_jump_tabl[reg];
GOTO_INDIRECT reg jumps to mem[reg].
Generally the problem of obtaining the information that L032, L033,...,L127, LO, L1, L_show_landscape and L_show_portrait are addresses and should be in $A_{I}$ is unsolvable however we will see two techniques that produce the desired results.
One is for switch tables another is for routine pointers both of which require a form of control flow analysis, however, obtaining a control flow graph is an issue since at this point we don't have the full code.

A switch table is signaled by the occurrence of an indexed jump J on register R and is preceded by code to load R.

A program slice of R at J is the program segment determining the value of R at position J. These are useful for understanding the program, debugging it an optimizing it.

We first perform a backwards scan through the disassembled code so as to find instruction $I_{R}$ that set R. This is repeated for registers in $I_{R}$ from which R is set and we stop after a register is loaded from memory or when we arrive at the beginning of the routine or program.

We now perform a forward scan, symbolically interpreting instructions so as to create symbolic expressions for the registers.

An example

Assuming the forward scan results in the following instruction sequence;

Load_Mem SP−12,R1
Load_Const 8,R2
Add_Reg R2,R1

It is rewritten as;

R1 := mem[SP−12];
R2 := 8;
R1 := R1 + R2

Then by forward substitution it is converted to;

R1 := mem[SP−12] + 8;

If this is successful, we are left with a short sequence of conditional jumps followed by an indexed jump all with expressions as parameters.

The function of this sequence is similar in all case-testing boundaries, finding the switch table and indexing it - there exist few patterns for it and a simple pattern match will suffice so as to find the right one.

Constants in the sequence are matched to the pattern's parameters. This gives the position and size of the switch table and so we can extract the addresses from the table and insert them in $A_{I}$ .

Otherwise the code is flagged for manual inspection.

Translation of a computed routine call

    reg1 := pic.width − pic.height;
    IF reg1 > 0 GOTO L0;
    reg2 := L_show_portrait;
    GOTO L1;
L0: reg2 := L_show_landscape;
L1: LOAD_PARAM pic;
    CALL_REG reg2;

The above code loads addresses of L_show_portrait or L_show_landscape into a register meaning they occur as addresses in Load_Addr instructions.
These instructions are used to load data addresses therefore we need symbolic interpretation so as to find the use of the loaded values.

Again with the issue of an incomplete control-flow graph, to complete it we need the information we are trying to extract - a chicken-and-egg problem.

This is handled by using an unknown node or hell node in the graph which will be the source and destination of jumps we don't have information on.

All jumps on registers will follow edges leading into this node, for inter-procedural analysis, outgoing edges will lead to all code positions after routine jumps. For incoming edges we assume all registers are live and for outgoing we assume all registers have unknown clients.

This node enables us to complete the control-flow graph after which we can run a symbolic interpretation algorithm to obtain value sets for all registers at all positions.

Disassembly with relocation information.

Relocation information is provided by the assembler.

The relocation bit map as described in the prerequisite article states that for every byte position if it is relocatable and if it is, whether it pertains to the code or data segment. By scanning it we can find addresses in instructions and insert them in $A_{I}$ or $A_{D}$ after which the disassembly algorithm does the rest.

Most assemblers will still reconstruct the control-flow graph so as to obtain better information on routine boundaries and data types.

Summary.

Disassembly involves recreating assembly code from binary machine code
Decompilation on the other hand works its way from an executable binary or assembly code to yield a higher-level source code. The idea is to examine the program to gain understanding of it functioning then recompiling it.
Reasons for disassembly could be we need to change a legacy program for which we don't have the source code for therefore we disassemble it, make changes then reassemble it, we can also disassemble binary code so as to build a dependency graph, apply optimizations, apply security checks and measurement code then reassemble, this is referred to as binary rewriting an example of a binary rewriting system is valgrind, disassembly can also be used for malware analysis since basic static and dynamic methods are not good enough, disassembly is also used for optimizing code size and power consumption.