Data-flow analysis in Compiler Design

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

All optimizations in a compiler depend on data-flow analysis.
We have three instances of data-flow problems, reaching definitions, live variables and available expressions.

Table of contents.

Introduction.
Abstraction.
Schemas.
3.1. Statement semantics.
3.2. Flow of control.
Schemas on basic blocks.
Reaching definitions.
Live-variable analysis.
Available expressions.
Summary.
References.

Introduction.

Data-flow analysis is the body of techniques involved in deriving information about the flow of data along program execution paths.
For example, to implement global common sub-expression elimination, we are required to determine whether two textually identical expressions evaluate the same value along any execution path of the program.

Abstraction.

The execution of a program is viewed as a series of transformations of the program state that consists of values of all variables in the program including the ones associated with stack frames below the top of the run-time stack.

When an intermediate-code statement is executed, the input state is transformed into a new output state.
To analyze this phenomenon, we consider all possible sequences of paths the program takes during execution after which we extract the information we need to solve a data-flow analysis problem.

The flow graph tells us the following about an execution path:

Within a basic block, the program point after a statement is the same as the program point before the next statement.
If there is an edge from a block $B_{1}$ to $B_{2}$ , then the program point after the last statement of B1 is followed immediately by the program point before the first statement of B2.

We therefore define an execution path from point $p_{1}$ to $p_{n}$ as the sequence of points p1, p2, ..., pn such that for each i = 1, 2, ..., n-1, either;

$p_{i}$ is the point that immediately precedes a statement and $p_{i}$ + 1 the point immediately following the same statement.
pi is the end of a block and pi + 1 is the beginning of the following block.

Generally, there exists an infinite number of possible execution paths through a program and there is no finite upper bound on the length of an execution path.

Analysis of a program summarizes all possible states of the program that can occur into a finite set of facts.
Note that, different analyses may choose to abstract different information and therefore there is no perfect representation of a state in data-flow analysis.

Schemas.

For every data-flow analysis, a program is assigned a data flow value representing an abstraction of the set of all possible program states that can be observed for that point.
This set of values is the domain for this application. E.g, the domain of data-flow values for reaching definitions is the set of all subsets of definitions in the program.
Now, we want to associate with each point in the program the set of definitions that can reach that point.
As the choice of abstraction depends on the analysis goal, we only track relevant information.
We denote data-flow values before and after each s statement by IN[s] and OUT[s] respectively.
Now, the data-flow problem is to find a solution to a set of constraints on the IN[s] and OUT[s] for all statements s.

We have two sets of constraints, the first is based on statements semantics of the statements while the second is based on the flow of control.

Statement semantics.

An example: Assuming the analysis involves determining the constant value of variables at points. If a variable a has a value v before executing statement b = a, both a and b will have value v after the statement.
This relationship is referred to as a transfer function.

Such a function can be of two flavors:
For information propagated forward along execution paths, the transfer function of a statement s is denoted by $f_{s}$ , it takes the data-flow value before the statement an produces a new data-flow value after the statement:
OUT[s] = $f_{s}$ (IN[s])

For information flowing backwards up the execution paths, the transfer function $f_{s}$ for a statement s converts a data-flow value after the statement to a new data-flow value before the statement:
IN[s] = $f_{s}$ (OUT[s])

Flow of control.

Within a basic block, the control flow is simple, if a block B has statements s1, s2, ..., sn, then the control-flow value out of $s_{i}$ is similar to the control-flow value into $s_{i+1}$ , that is:
IN[ $s_{i+1}$ ] = OUT[ $s_{i}$ ], for all i = 1, 2, ..., n-1.

Control flow edges between basic blocks create complex constraints between the last statement of a basic block and the first statement of the following block. E.g is we want to collect definitions reaching a point in a program, the set of definitions after the last statement of the block is the union of definitions after the last statements of each predecessor block.

Schemas on basic blocks.

A data-flow schema involves data-flow values at each point in the program.
Control flows from the start to the end of a block without interrupting a branch.

We restate this schema in terms of these data-flow values that enter and leave the blocks.
We also represent these data-flow values immediately before and after basic block B by IN[B] and OUT[B] respectively.
Constraints involving IN[B] and OUT[B] are derived from those involving IN[s] and OUT[s] for various statements s in B as follows.

Assuming block B has statements s1, ..., sn, if $s_{1}$ is the first statement, of block B, then IN[B] = IN[ $s_{1}$ ].
In a similar way, of $s_{n}$ is the last statement of block B denoted by $f_{B}$ , is derived by composing the transfer functions of the statements in the block.
That is, if $f_{si}$ is the transfer function of statement si, then:
$f_{B}$ = $f_{sn}$ , ... $f_{s2}$ , $f_{s1}$

The relationship between the start and end of the block is
OUT[B] = $f_{B}$ (IN[B])

We can rewrite the constraints between blocks dur to control flow by substituting IN[B] and OUT[B] for IN[ $s_{1}$ ] and OUT[ $s_{n}$ ] respectively.

Unlike linear arithmetic equations, data-flow equations don't have a unique solution and therefore the goal is to find the most precise solution that satisfies two sets of constraints.
We need a solution in addition to supporting code improvements, it doesn't try to justify unsafe transformations, this excludes those transformations that change what a program computes.

Reaching definitions.

This is a common and useful data-flow schema.
The idea is this, by knowing the point in the program where a variable x may be defined when control reaches point p, we can gather a lot of information about x.

In addition to a compiler being able to determine if x is a constant at point p in instances, a debugger, can tell whether it is possible for constant x to be undefined if it is used at point p.

A 'definition of variable x' is the statement assigning or may assign a value to x.
We state that a definition d reaches a point p if there exists a path from the point following d to p such that definition d is not killed along the path, on the other hand, If there exists another definition of x that is along this path we kill variable x.
And so, If the definition d of variable x reaches point p, then d might be at the place at which the value of x that was used at p was last defined.

Procedure parameters, array accesses, and indirect references all have one thing in common, they have aliases therefore it is difficult to tell if a statement refers to a variable x.

We also get to kill a definition of variable x if there exists another definition of x along the path.
We kill a definition variable, that is if there exists another path definition of x along this path.

Program analysis is conservative if we don't know whether statement s assigns a value to x, we assume that it may assign to it, that is, variable x after statement s might either have its original value before s or a new value created by s.

Live-variable analysis.

Code-improving transformations depend on the information computed in the direction opposite to the current flow of control.
In live variable analysis, we ask, for a variable x and a point p, can the value of x at p be used along the same path in the flow graph starting at p?
If the value x can be used, we refer to x as live otherwise x is dead.

Live-variable analysis is used in register allocation for basic blocks, where after
a value is computed in a register and presumably used within a block, there is no need to store that value in the register if it is dead at the end of the block.

Also if all available registers are full, and we need another register, we will preferably use a register containing a dead value since the said value does not have to be stored.

Available expressions.

Given an expression, x + y is available at point p if, for every path from the entry node to p, the latter evaluates to x + y, and after the last similar evaluation before reaching p, there aren't any subsequent assignments to x or y, therefore for every available-expressions data-flow schema, we state that a block kills expression x + y if it assigns x or y and doesn't subsequently need to recompute x + y.

A block generates expression x + y if it definitely evaluates x + y and doesn't subsequently define x or y.

The idea of killing and generating an available expression is not similar to that of reaching definitions. Nevertheless, they behave essentially as they do for reaching definitions.

To conclude, available expressions can also be used in detecting global common subexpressions.

Summary.

All optimizations in the compiler depend on data-flow analysis.
We have discussed three instances of data-flow problems namely; reaching definitions, live variables and available expressions.

The definition of each of the above problems is given by the domain of data-flow values, the direction the data is flowing, the family of transfer functions, the boundary condition, and the meet operator.

References.

Modern Compiler Design. Chapter 5, Dick Grune, Koen Langendoen.
Compilers Principles, Techniques, & Tools. Chapter 9, Ravi Sethi, Jeffrey D. Ullman.