Compiler Architecture

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we discuss compiler architectures based on two important aspects defining a compiler's architecture, these are, the granularity of data passing through the compiler and the flow of control between compiler modules, we have also gone over the properties of a good compiler and the generated code, portability and retargetability in compiler design.

Table of contents.

Introduction.
Compiler width.
Flow of control.
Properties of good compilers
Properties of generated code.
Portability and retargetability.
Summary.
References.

Introduction.

A compiler is a program that accepts source code e.g a java program and generates machine code specific to a computer architecture. That is, if we intend to write compilers for N programming languages to run of M different computer architectures, then we need to write NxM compilers for each language-architecture combination.

In this article we discuss architectural questions involving compiler architectures, these are granularity of data passed between the compiler modules and the flow of control between compiler modules. We also discuss compiler portability and retargetability, the former deals with how to write N+M compilers instead of NxM, the latter deals with the ease with which a compiler can be made to generate code for another machine architecture.

Compiler width.

A compiler will have modules which transform, refine and pass information between the, i.e from a module Mn to Mn+1.

Sizes of information passed between these modules impacts the structure of a compiler which leads to two types of compiler, we shall refer to them as narrow and broad.

A narrow compiler reads a small part of the program(tokens), processes them and produces object code, it then discards information about the tokens and repeats until the end of the program text is arrived at.

These compilers need memory linear to the length of the program although the proportionality constant is slower since they gather information at a much slower rate.

An example

while not Finished:
    Read some data D from the source code;
    Process D and produce the corresponding object code, if any;

Narrow compilers consist of an imperative loop as shown above.

'Real' compilers are implemented as narrow compilers, however narrow compilers also have an option to incorporate a broad component, for example, a C compiler reading each routine entirely then discarding everything except the obtained global information.

A broad compiler will read the entire program and puts it through a series of transformations such as lexical, syntactic, contextual, optimization, code generation after which we get the desired object code which is written to a file.
This type of compiler will need memory proportional to program size.
In an educational, theoretical view, these are preferable as they represent a simpler model inline with a functional programming paradigm.

An example

Object code ←
    Assembly(
        CodeGeneration(
            ContextCheck(
                Parse(
                    Tokenize(
                        SourceCode
                    )
                )
            )
        )
    );

Broad compilers consist of function calls as can be seen above.

Flow of control.

In broad compilers each module will be in control of both the processor and the data when it is currently running.

In narrow compilers data moves from module to module and thus control will move forward and backward so as to activate the appropriate module at the appropriate time.

We shall focus on narrow compilers seeing as the demonstrate some complexity in the flow of control.

Modules are essentially filters in that, they read, process and produce results and thus are programmed as loops that execute function calls to obtains chunks of information from a previous module and routine calls to write the chunks to the next module.

An example - a filter

while ObtainedFromPreviousModule(Ch):
    if Ch = 'a':
        //check for other 'a's:
        if ObtainedFromPreviousModule(Ch1):
            if Ch1 = 'a':
                //we have 'aa':
                OutputToNextModule('b');
            else −− Ch1 != 'a':
                OutputToNextModule('a');
                OutputToNextModule(Ch1);
        else // No more characters:
            OutputToNextModule('a');
            exit;
    else −− Ch != 'a':
        OutputToNextModule(Ch);

The above filter copies input characters to the output while placing the sequence aa by b. It obtains its input by calling its predecessors in the module sequence, this call may yield a character or fail.

Transformed characters are passed to the next module.

Control will remain inside the while loop all the time except for routine calls to the previous and next module.

Main loops are easily programmable and understandable except they cannot be used as a universal programming model for compiler modules in that they don't interface well with other main loops.

Assume we want to connect the main loop above that converts aa -> b to another which converts bb -> c, such that the output of the first becomes the input of the second (just like Linux pipes). In such a case we need a transfer of control which will leave both environments intact.

This is solved by coroutine calls which involve separating stacks for two loops so as to preserve both their environments.

Properties of good compilers.

A good compiler should generate correct code.
A good compiler should completely conform to the language specifications. Programs developed with such a compilers will exhibit portability.
Good compiler should be able to handle programs of arbitrary size as far as memory permits. Keep in mind that code can be written programmers and also generated by a program.
Speed, compiler writers keep their compilers linear in the input meaning the compilation time will be a linear function of the length of the input file.
Size, this is never a factor since computers nowadays have Gbs of primary memory.

Speed and size are important factors to be considered when programs call a compiler again at run time e.g for just-in-time compilation.

Properties of generated code.

Correctness
This is the most important property of generated code while being the most vulnerable one.
The main weapon against incorrect code is small semantics-preserving transformation - the huge transformation from source code to binary object code which is decomposed into several semantics-preserving transformations each small enough to be understood and proven correct locally.
Speed.
To produce faster code,

We design code transformations which yield faster code, even better no code at all and do the analysis required for their correct applications.
Partial Evaluation whereby we evaluate part of the program during compilation.
Duplication of code segments in line rather than making jumps to them. e.g replacing function calls by the body of the called function and unrolling a loop statement by repeating the loop body. The improvements although minor allow the expanded code to be optimized.
Other optimization techniques outside compilers are efficient algorithms and writing code is assembly language.

Size.
The size of the code matters, i.e code for embedded applications e.g in remote controllers, smart cards, cars where memory restricts code size.
Techniques to reduce code size in such applications involve aggressive suppression of unused code, assorted code compression, threaded code, procedural abstraction and use of special hardware.
Power consumption.
Power management consists of saving of energy so as to increase operation time while battery-powered, secondly limiting peak heat dissipation to protect the processor.
If a program is faster, it finishes sooner and uses less energy.

Portability, retargetability.

A program is portable if it takes a reasonable amount of effort to make it run on different machine types.

Many programs today can be ported by editing of the makefile so as to reflect the local situation and compiling. Oftenly the task of adapting to the local situation is automated e.g by using GNU's autoconf.

With compilers, machine dependence not only resides in the program but also in the output and thus we have to consider a further form of machine independence.
The ease with which a program can be made to generated code for another machine is referred to as retargetability. This is achieved by replacing the entire compiler back-end. Now retarbetability will involve the ease of coming up with a new back-end.

Note that replacing the back-end doesn't mean writing another from scratch since some of the back-end code is machine-independent.

Summary.

A compiler is a program that accepts source code e.g a java program and generates machine code specific to a computer architecture.

We have discussed architectural questions involving compiler architectures, granularity of data passed between the compiler modules and the flow of control between compiler modules.

A program is portable if it takes a reasonable amount of effort to make it run on different machine types.

Retargetability is the ease with which a program can be made to generated code for another machine.

References.

Compilers Principles, Techniques, & Tools, by A.V.Aho, R.Sethi & J.D.Ullman, Pearson
Education
Principle of Compiler Design, A.V.Aho and J.D. Ullman, Addition – Wesley
Modern Compiler Design 2nd Edition Dick Grune • Kees van Reeuwijk • Henri E. Bal