Code Generation from AST to LLVM IR

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we will learn how to transform an AST (Abstract Syntax Tree) into an LLVM IR (Intermediate Representation).

Table of contents.

Introduction.
Initial setup.
Code generation for expressions.
Code generation for functions.
Summary.
References.

Prerequisites.

Creating AST nodes using LLVM

Introduction.

In the prerequisite article, we built a class hierarchy that represents different AST nodes in a programming language whereby the node type has a class where the name of the class represents a syntax element of the programming language.
In this article, we will transform the AST into an LLVM IR(Intermediate Representation).

Initial setup.

First, we define virtual code generation methods in each class of the AST as follows;

// the base class for all nodes of the AST
class ExprAST{
    public:
      virtual ~ExprAST() = default;
      virtual Value *codegen() = 0;
};

/// class for numeric literals
class NumberExprAST : public ExprAST{
    double Val;

    public:
        NumberExprAST(double d) : Val(d) {}
        Value *codegen() override;
};

We use the codegen() method to emit an IR for the AST node along with its dependencies.

All codegen() methods return an LLVM value object. Value is a class representing a SSA(Static Single Assignment) register or SSA value in LLVM.
SSA values are computed while the related instructions execute and don't get a new value until the instruction executes again. Therefore, we cannot change an SSA value.

The next step in the setup is to set up an error reporting mechanism that reports errors during code generation;

static LLVMContext TheContext;
static IRBuilder<> Builder(TheContext);
static std::unique_ptr<Module> TheModule;
static std::map<std::string, Value *> NamedValues;

Value *LogErrorV(const char *Str){
    LogError(Str);
    return nullptr;
}

The static variables above are used during code generation.
TheContext is an opaque object that owns core LLVM data structures such as type and constant value tables.

Builder is a helper object that allows us to generate LLVM instructions easily. We assume it has been set up to generate code into something.

TheModule is an LLVM construct containing functions and global variables. It is the top-level structure LLVM IR uses to contain the code. It owns memory for all generated IR, this is why the codegen() method returns a raw Value rather than a smart pointer unique_ptr<Value>

NamedValues tracks the values defined in the current scope and their LLVM representations. We can view it as a symbol table for the code.

Now we can start generating code for each expression in the programming language.

Code generation for expressions.

In this article, we will discuss code snippets used to generate code for the four expressions discussed in the prerequisite article.

Numeric literals

Value *NumberExprAST::codegen(){
    return ConstantFP::get(TheContext, APFloat(Val));
}

We use the ConstantFP class to represent numeric constants. This class holds numeric values in an APFloat that is capable of holding floating-point constants of arbitrary precision.

The above code creates and returns ConstantFP

Variables

Value *VariableExprAST::codegen() {
    // Look up variable in function.
    Value *V = NamedValues[Name];
    if (!V)
        LogErrorV("Unknown variable name");
    return V;
}

The above code checks for the specified name in the map and returns its value. If the name is not in the map, then an unknown variable is being referenced.

Binary expressions

Value *BinaryExprAST::codegen(){
    Value *L = LHS->codegen();
    Value *R = RHS->codegen();
    if (!L || !R)
        return nullptr;

    switch (Op){
        case '+':
            return Builder.CreateFAdd(L, R, "addtmp");
        case '-':
            return Builder.CreateFSub(L, R, "subtmp");
        case '*':
            return Builder.CreateFMul(L, R, "multmp");
        case '<':
            L = Builder.CreateFCmpULT(L, R, "cmptmp");
            // Convert bool 0/1 to double 0.0 or 1.0
            return Builder.CreateUIToFP(L, Type::getDoubleTy(TheContext), "booltmp");
        default:
            return LogErrorV("invalid binary operator");
    }
}

The idea here is to recursively produce code for the left side of the expression followed by the right side, then compute the result of the binary expression.
Above, we perform a simple switch on the opcode to create the correct LLVM instruction.

The builder class starts to show its value. Since IRBuilder already knows where to insert the created instruction, all we do is specify the instruction to create, e.g, CreateFAdd, operands to use, and a name for the generated instruction.

Function calls

Value *CallExprAST::codegen(){
    // name lookup in global module table.
    Function *CalleeF = TheModule->getFunction(Callee);
    if (!CalleeF)
        return LogErrorV("Unknown function referenced");

    // argument mismatch error.
    if (CalleeF->arg_size() != Args.size())
        return LogErrorV("Incorrect # arguments passed");

    std::vector<Value *> ArgsV;
    for(unsigned i = 0, e = Args.size(); i != e; ++i){
        ArgsV.push_back(Args[i]->codegen());
        if (!ArgsV.back())
          return nullptr;
    }

    return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
}

The above code looks up a function name in the LLVM module symbol table. Once we have the function to call, we recursively codegen each argument that is supposed to be passed in the function and create an LLVM call instruction. Since LLVM uses native C calling conventions, this allows calls to also call into the standard library functions without any additional effort.

Code generation for functions

Prototypes are used both for function bodies and other external function declarations. We first start with the following code snippet;

Function *PrototypeAST::codegen() {
  // Make function type: double(double,double) etc.
  std::vector<Type*> Doubles(Args.size(), Type::getDoubleTy(TheContext));
  FunctionType *FT = FunctionType::get(Type::getDoubleTy(TheContext), Doubles, false);

  Function *F = Function::Create(FT, Function::ExternalLinkage, Name, TheModule.get());

The above function returns a Function* instead of a Value*. Since a prototype involves the external interface for a function, it is sensible for it to return the LLVM Function it corresponds to during code generation.

FunctionType::get creates a FunctionType that should be used for a prototype. In this programming language, all function arguments are doubles, therefore, in the first line, we create a vector N LLVM double types.

The code uses FunctionType::get to create a function type that takes N doubles as arguments and returns a single double as the result.

In the last line in the above code snippet, we create an IR Function that corresponds to the Prototype. It indicates the type, linkage, name to be used and module to insert into.

ExternalLinkage means that the function can be defined outside the module. We pass a Name which represents the specified user.

Now for all function arguments we set their names according to the given names in the function Prototype.

// Set names for all arguments.
unsigned Idx = 0;
for (auto &Arg : F->args())
  Arg.setName(Args[Idx++]);

return F;

The above step is not mandatory although maintaining consistency with names not only makes the IR more readable but also allows subsequent code to directly reference arguments for their names, instead of looking them up in the Prototype AST.

We now have a function prototype without a body. This is how LLVM IR represents function declarations. For function definitions we codegen and attach a function body.

Function *FunctionAST::codegen(){
   // check for an existing function from previous 'extern' declaration.
  Function *TheFunction = TheModule->getFunction(Proto->getName());

  if (!TheFunction)
    TheFunction = Proto->codegen();

  if (!TheFunction)
    return nullptr;

  if (!TheFunction->empty())
    return (Function*)LogErrorV("Function cannot be redefined.");

We first search TheModule's symbol table for an existing version of the function. If we get a null return, it means no previous version exists therefore we codegen on from the Prototype.

At this point the Builder is already setup.

// Create a new basic block to start insertion into.
BasicBlock *BB = BasicBlock::Create(TheContext, "entry", TheFunction);
Builder.SetInsertPoint(BB);

// Record the function arguments in the NamedValues map.
NamedValues.clear();
for (auto &Arg : TheFunction->args())
    NamedValues[Arg.getName()] = &Arg;

Above we first create a basic block and then insert it into TheFunction. We then inform the Builder that the new instructions should be inserted at the end of the basic block in the second line.

We don't have control flow therefore, our function only has a single block.

We now add function arguments to NamedValues map so they are accessible to VariableExprAST

if (Value *RetVal = Body->codegen()) {
    // Finish off the function.
    Builder.CreateRet(RetVal);

    // Validate the generated code, checking for consistency.
    verifyFunction(*TheFunction);

    return TheFunction;
}

When the insertion point is set up and NamedValues map is populated, we call codegen(). If no errors occur, it emits code to compute the expression into the entry block and return the computed value.

If still no errors, we create an LLVM ret instruction that completes the function.
When the function is built, we call verifyFunction which does consistency checks on the generated code to determine if the compiler is working OK. This is essential since it catches many bugs.

When done we validate and return the function.

Now to handle error cases. We handle them simply by deleting the produced function using eraseFromParent method.

  // Error reading body, remove function.
  TheFunction->eraseFromParent();
  return nullptr;
}

The user can now be able to redefine a function that was incorrectly typed before.

Summary.

An LLVM IR is what connects the front-end of a compiler with its back-end. This allows LLVM to parse multiple source languages and then generate code for multiple targets.

In general, front-ends produce the Intermediate representation while the compiler back-ends consume it.