Create AST nodes using LLVM

Do not miss this exclusive book on Binary Tree Problems. Get it now for free.

In this article, we learn how to build a class hierarchy that represents different AST nodes in a programming language whereby the node type has a class where the name of the class represents a syntax element of the programming language.

Table of contents.

  1. Introduction.
  2. The base class.
  3. Numeric literals.
  4. Variables.
  5. Binary operators.
  6. Function calls.
  7. Function prototypes.
  8. Function definitions.
  9. Summary.
  10. References.

Introduction.

LLVM is a library used for the construction, optimization, and production of intermediate and binary machine code. It is also used as a compiler framework whereby we build the compiler front-end, that is the parser and the lexer, then build a back-end that converts the LLVM output to machine code.

In addition to being a compiler framework, it is also used as a JIT(Just In Time) compiler.

We will be creating different classes which will represent different AST(Abstract Syntax Tree) nodes for a programming language. Each AST node type is represented by a class whereby the class name represents the type of syntax element in the programming language.

We have the following code that computes the nth Fibonacci number written in this programming language;

// Get the nth fibonacci number
def fib(n)
  if n < 3 then
    1
  else
    fib(n - 1) + fib(n - 2)

From the above we can see identifiers, keywords, operators and comments. Each of these is a part of the syntax of this programming language and therefore needs to be represented in the Abstract Syntax Tree. That is, the if-then identifiers need an AST type/class that represents them, so as all other elements such as variables, and function definitions.

The base class.

The following is the base class for all other expression nodes in the AST;

class ExprAST {
    public:
      virtual ~ExprAST() {}
};

All other classes extend the above class so they can be a specialized AST node representing a specific type.

Above, we define a virtual destructor which will be useful when we might want to delete an instance of a derived class using a pointer to this base class.

Numeric literals.

Now to represent all numbers in the programming language.

class NumberExprAST : public ExprAST {
    double Val;

    public:
      NumberExprAST(double d) : Val(d) {}
};

class NumberExprAST : public ExprAST means that NumberExprAST extends the base class we had previously defined.

public makes it publicly known that we have inherited from ExprAST base class. We can make it private by removing public.

We also have a constructor for the NumberExprAST class which is public. This is similar to the following Java code;

public NumberExprAST(double d){
    this.d = d;
}

Variables.

To represent variable references in the programming language;

class VariableExprAST : public ExprAST{
    std::string Name;

    public:
        VariableExprAST(const std::string &Name) : Name(Name) {}
};

From the above, const std::string &Name is a reference to an object of type std::string

Binary operators.

class BinaryExprAST : public ExprAST{
    char Op;
    std::unique_ptr<ExprAST> LHS, RHS;

    public:
        BinaryExprAST(char op, 
                      std::unique_ptr<ExprAST> LHS,
                      std::unique_ptr<ExprAST> RHS)
                  :   Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};

Char Op could be programming language operators such as +, -, *, /, >, <.

std::unique_ptr represents a smart pointer. This pointer owns and manages another object through a pointer. When it goes out of scope it disposes of the object. We create smart pointers for the left and right sides of the AST using LHS(left-hand side) and RHS(right-hand side).

We use std::move to indicate that an object might be moved from where it came from so it can be owned by another object. In other words, we are transferring ownership from one part of the code to another. This allows for the efficient transfer of resources. In this case, we are transferring ownership from outside the class to inside the class.

Function calls.

class CallExprAST : public ExprAST{
    std::string Callee;
    std::vector<std::unique_ptr<ExprAST>> Args;

    public:
        CallExprAST(const std::string &Callee,
                  std::vector<std::unique_ptr<ExprAST>> Args)
                : Callee(Callee), Args(std::move(Args)) {}
};

Here we state that a function call such as Fib(10) that returns the 10th Fibonacci number. It has a Callee which is the name of the function, Fib, and an argument list Args [10], a vector of unique pointers to a node in the AST.

Function prototypes.

class PrototypeAST{
    std::string Name;
    std::vector<std::string> Args;

    public:
        PrototypeAST(const std::string &name, std::vector<std::string> Args)
        : Name(name), Args(std::move(Args)) {}

        const std::string &getName() const { return Name; }
};

This is the signature of the function definition. For example in this programming language it is the fib(n) part of the function definition def fib(n)

Function definitions.

class FunctionAST{
    std::unique_ptr<PrototypeAST> Proto;
    std::unique_ptr<ExprAST> Body;

    public:
        FunctionAST(std::unique_ptr<PrototypeAST> Proto,
                  std::unique_ptr<ExprAST> Body)
        : Proto(std::move(Proto)), Body(std::move(Body)) {}
};

In this programming language, this is def fib(n). The above has its function prototype and the function body. The former holds the name of the function and its arguments list fib(n).

Summary.

The front-end of a compiler consists of; lexical analysis, semantic analysis, syntax analysis, and intermediate code generation.

We have learned how to build a class hierarchy that represents the different AST nodes in a programming language where each AST node type is represented by a class where the name of the class represents a syntax element of the programming language such as binary operators, variables, function definitions, function calls, etc.

References.

  1. Kaleidoscope Programming Language
  2. Introduction to compiler design
  3. Intermediate representations (IR) in Compiler Design.
  4. Code generation from AST to LLVM IR.

Sign up for FREE 3 months of Amazon Music. YOU MUST NOT MISS.