How to compile a compiler? [Bootstrapping]

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we will learn the process and types of bootstrapping. Bootstrapping involves using compilers to compile other compilers or themselves.

Table of contents.

Introduction.
Full bootstrap.
Using an interpreter.
Incremental bootstrapping.
Summary.
References.

Prerequisites.

Bootstraping a compiler

Introduction.

The idea of bootstrapping involves using compilers to compile compilers or themselves. For this, we need a solid foundation or a machine that we will use to run compilers on.

Oftentimes, a compiler does not exist for a source language, it just does not run on the desired machine. Assuming we want a compiler for ML to x86 machine code and want it to run on an x86 machine architecture.

We have access to an ML-compiler that generates ARM machine code that runs on ARM machines. A way of obtaining the desired compiler would be to perform a binary translation, that is, write a compiler from an ARM machine code to x86 code. This allows the translated compiler to run on an x86 and at the same time generate ARM machine code that runs on ARM machine architecture.

We use the ARM-to-x86 compiler to translate this into x86 code. This introduces the following problems;

Extra passes make compilation time-consuming.
Efficiency lost during translation.
We still have to make the ARM-to-x86 compiler run on an x86 machine.

Another better solution is to write an ML-to-x86 compiler in ML and then compile it using the ML compiler on the ARM;

bc6

We can now run the ML-to-x86 compiler on the ARM and let it compile itself.

bc7

We now have the desired compiler it can compile itself on the x86 platform. This is useful if the compiler is later extended or as a partial test of correctness.
If the compiler when it compiles itself, yields a different object code than the one obtained from the process, it must have an error. The converse is not true, That is, even if the same target is obtained, there are errors in the compiler.

We can combine the above diagrams into a single diagram that covers both executions;

bc8

Above, the ML-to-x86 compiler written in ARM has two functions, it is the output from the first compilation and the compiler that will perform the second compilation.
The compiler that is input to the second compilation step looks like the output of the left most compiler. In such a case, we avoid confusion since the left most compiler is not running, and also, languages don't match.

Full bootstrap.

In the prerequisites article, we learned about a bootstrapping process that relies on the existing compiler for the desired programming language. More precisely, it is referred to as half-bootstrapping. What about the times there is no pre-existing compiler, for example in developing a new programming language?
In such situations, we perform a full bootstrap. A commonly used method is writing and using a QAD(Quick and Dirty) compiler using a pre-existing programming language. The compiler should allow programs written in the new language to be executed.
Also, a 'real' compiler is written using the new language and bootstrapped using the QAD compiler.

Assuming we designed a language M+ and write a compiler from M+ to ML in ML. First, we compile it so it can run on a target;

bc9

The QAD compiler is then used to compile the real compiler;

bc10

The outcome is an ML program that requires compilation;

bc11

The output of the above is a compiler with the needed functionalities, however, it is still slow, this is because it has been compiled using the QAD compiler. An improved outcome can be obtained by allowing the generated compiler to compile itself.

bc12

The output is a faster compiler but with the same needed functionalities. Note that bootstrapping might not be complete even if a compiler with the right functionality is produced.

Using an interpreter.

We can also opt to write an interpreter instead of writing a QAD compiler. From our example, we write an M+ interpreter in ML, first we need to compile;

bc13

Then we use the above to run the M+ compiler directly;

bc14

We used the real compiler for the compilations therefore nothing is gained when we use the generated compiler to compile itself, this step can also be used as a test and also for extensions.

Incremental bootstrapping.

We can also build a new programming language and also build its compiler incrementally. Here the first step involves writing a compiler for a small subset of the language, then using the same subset to write it.
The first compiler is bootstrapped in any of the methods described in previous sections, thereafter the following process is performed repeatedly;

Slight language subset extension.
Compiler extension to compile extended subset, without new features.
Compile a new compiler using the previous one.

At each step, features introduced in a previous step are used in the compiler. Even when the whole language is compiled, the process is continued to improve the compiler quality.

Summary.

Note that bootstrapping might not be complete even if a compiler with the right functionality is produced.
In situations whereby there is no pre-existing compiler, for example in developing a new programming language, we perform a full bootstrap. A commonly used method is writing and using a QAD(Quick and Dirty) compiler using a pre-existing programming language. The compiler should allow programs written in the new language to be executed.
Incremental bootstrapping involves building a new programming language and building its compiler incrementally.

References.

Basics of Compiler Design - Torben Ægidius Mogensen