Decompilation

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

Decompilation is the process of converting executable machine code into human readable code. In this article we discuss the steps involved in the decompilation process.

Table of contents.

Introduction.
Disassembly.
Lifting and data flow analysis.
Control flow analysis.
Type analysis.
Summary.
References.

Prerequisites.

Disassembly.

Introduction.

Decompilation is the process of converting executable code(machine readable) into high level code(human readable).

A decompiler also referred to as a reverse compiler is the program that performs the inverse process of compiling, it takes a program compiled in a high level language and produces a high-level language program that performs the same functions as the executable. Its input is machine dependent and output is language dependent.
Note that some languages are easily decompiled and therefore produce code that is close in resemblance to the original e.g Java, in that, java stores high-level information in executables which is why it is portable.

Although being portable source code recovery is much easier since there are alot of details of how the application functions.

Decompilation involves four steps, namely,

disassembly - whereby machine code is transformed to assembly code.
Lifting and data flow analysis whereby we transform the assembly code into a high-level internal representation.
Control flow analysis where we recover the control flow structure information e.g if and while statements and their nesting levels.
Type analysis where we recover the types of the used variables, functions etc.

Disassembly.

Since we have discussed disassembly in depth in a previous article, we shall go over a few details inorder to get the basic idea.

The mapping between assembly and machine code is a one-to-one correspondence and thereby the translation should be feasible. However as we have discussed it is difficult for the disassembler to distinguish between code and data.

Two strategies taken to solve this are as follows,

Disassembling sections filled with code then treat the rest as data.
Considering starting addresses provided in the binary's header and recursively disassembling reachable code from the address.

Disassembly has various pitfalls for example, programs get obfuscated and thus many steps are targeted to fool the disassembler and without correct disassembly, we cannot proceed with decompilation.

Lifting and data flow analysis.

Assuming we have a correct disassembly, the structure cannot be analyzed. The decompiler will perform actions resembling a backwards form of instruction selection but they cannot tile sequences of assembly instruction with sequences of abstract instructions since different compilers will produce different assembly code for the same sequence of abstract instructions.

Additionally a single abstract instruction can expand into a long sequence of real instructions.

To deal with this, one approach would be to translate for example, a complex x86_64 into a simpler RISC instruction set.

Another approach would be to translate it into a semantics preserving complex instruction set which has cross platform methods for its analysis.

To translate x86_64 to a three-address code representation we perform reverse instruction selection.

Once we have this representation we eliminate machine specific details using data flow analysis so as to recover variables, expressions and statements.

We then replace registers with a temp and perform a function-call expansion step in reverse where we replace sequences of moves into arguments registers followed by a parameterized call.
For this we make passes over all functions so as to determine the number of arguments each take in order to deal with certain moves being optimized out.

We then perform a modified version of SSA analysis then an extended copy-propagation pass so as to collapse expressions.

Control flow analysis.

At this stage we have obtained a control flow graph with real variables and can produce C code equivalent to the original machine code.

Although producing C code here is undesirable since few programs are written with as much abuse of the goto keyword similar to the code we obtain here.
Control flow graphs are generated by structured programs using if, for and while statements, only then is it desirable for a compiler to recover original structure so as to arrive at the original code.

Dominator nodes are a primary element of this analysis.
Given a start node a, a node b is said to dominate a node c if all paths from a to c go through b.

An immediate dominator of c is b such that for every node v, if d dominated c, then either d=b or d dominates b.

Loops.

In this section we shall see how to structure three classes loops namely;

While loops where the starting node is a conditional and the latching an unconditional.
Repeat loops where the latching node is a conditional.
Endless loops where both the start and latching nodes are unconditional.

A latching node is a node with the back-edge to the start node.
To structure loops we consider intervals on a digraph.
If h is a node in G, interval l(h) is the maximal subgraph whereby h is the only entry node in which all closed paths contains h.
In theory there exists a set {h1,...,hk} of header nodes such that the set {l(h1,...,l(hk)} is a partition of the graph.
An algorithm exists to find such a partition.
Then we define derived graphs of G as follows,

G' = G
$G^{n + 1}$ - graph formed by contracting every interval of $G^{n}$ into a single node.

Eventually we will reach a fixed point where the resulting graph is irreducible.

For any interval l(h) there is a loop rooted at h if there is a back-edge from h to a node z $\in$ l(h).
To find such a node we perform a DFS algorithm on the interval.
To find nodes in the loop we define h as part of the loop and proceed by noting that a node k is in the loop iff its immediate dominator is in the loop and h is reachable from k.
To find loops, we compute the derived graphs of G until we arrive at a fixed point then find loops in each derived graph.
If a node is found to be a latching node for two loops, one loop is labeled with a goto instead.

ifs.

An if statement is a 2-way conditional branch with a common end node.
The final node is referred to as the follow node and it is dominated by the header node.

To structure such, we perform a post-ordering and post-order traversal of the graph which will guarantee analysis of the internal nested ifs before the external ones.

To find if statements,

For every conditional node a, we find the set of nodes dominated by a.
Produce G' from G by reversing arrows and filter nodes from the above set the don't dominate a in G'.
Find the closest node to a in the resulting set by taking into consideration the one with the highest post-order number.
The resulting node is the follow node of a.

Type analysis.

At this stage we determine what the types of variables are, each variable and function is assigned a type as part of the process of recovering the structure of the program.

Type analysis is an open problem meaning research is being done, however a model of a simple type analysis is as follows;

Multiplication, subtraction, division, bitwise operations such as xor, binary or are forced to be integers.
Dereferencing forces a parameter to be a pointer.
Return values of standard library functions are maintained.
A variable that is branched on becomes a boolean.
When two variables are added together and one is a pointer, the other will be an integer.
When two variables are added together and one is an integer, the other will be either a pointer or integer.
When two variables are compared with == or !=, they will have the same type.
When two variables are compared with <, >, >=, <=, they are both integers.
When a value is returned from main() function, it is an integer.
When the value of a variable is moved into another variable then they have the same type.
If the dereferenced value of a pointer has type T then the pointer is of type T*.
The sum of a pointer of type T* and an integer is a pointer, not necessarily of type T*.

Analysis is performed across function boundaries so as to to get high-quality types.
This analysis however cannot distinguish structures and arrays.

Summary.

Decompilation has various uses for example, assuming there exists a legacy system whose source code is lost and we need to fix bugs and port into a modern architecture, decompilation will aid such a problem.
Decompilation can also be applied in malware analysis whereby we don't have the source code, we therefore decompile the executable to get an approximation of the original code.

References.

A Structuring Algorithm for Decompilation