×

Search anything:

Constructing a Parser in Compiler Design

Internship at OpenGenus

Get this book -> Problems on Array: For Interviews and Competitive Programming

In this article, we will be constructing a recursive descent parser for the Kaleidoscope programming language in Compiler Design.

Table of contents.

  1. Introduction.
  2. The parser.
  3. Summary.
  4. References.

Prerequisites.

  1. Building a lexer

Introduction.

Parsing in compiler design gets its input from the lexer which produces tokens given a high-level programming language as input.

Parsing can be divided into two types, the first is bottom-up and the other in top-down parsing.

Top-down parsing.
It is also referred to as recursive or predictive parsing.
A top-down parser generates a parse tree for an input string using grammar productions by expanding non-terminals.
We can further divide top-down parsing into two types, recursive-descent parsing which we cover here, and non-recursive descent parsing.

Bottom-up parsing.
A bottom-up parser generates a parse tree for an input string using grammar productions by compressing non-terminals.
There are two types of bottom-up parsing, these are, LR-parsing and Operator precedence parsing.

In the prerequisite article, we created classes that are a representation of AST nodes. In this article w will be constructing a recursive descent parser for the Kaleidoscope programming language.

A recursive-descent parser is a top-down parser built from mutually recursive procedures where each procedure implements the non-terminals of a programming language grammar.

The parser.

First we have to get the tokens from the input.

static int currTok; // current token
static int getNextToken() { 
    return currTok = getTok();
}

We declare a current token, this can be identifiers, for characters the token returned is their ASCII representation, number tokens.
getNextToken function saves the current token from the stream to currTok variable.

Parsing binary expressions

// PARSING BINARY EXPRESSIONS
static map<char, int> BinopPrecedence;

// get the precedence of the pending binary operator token.
static int GetTokPrecedence(){
    switch(currTok){
        case '<':
        case '>':
            return 10;
        case '+':
        case '-':
            return 20;
        case '*':
        case '/':
            return 40; // highest precedence
        default:
            return -1;
    }
}

Operator precedence using shunting yard algorithm. The shunting yard algorithm is used for parsing arithmetic or logical expressions or both.

With binary expressions, the parser will have to implement operator precedence. In Kaleidoscope, we handle operator precedence by assigning each expression a value that represents its precedence.

We use a map to store the precedence operators. The higher the number the higher the precedence.

Reporting errors.

// error reporting for expressions
void LogError(const char *Str) {
    fprintf(stderr, "LogError: %s\n", Str); // print error
}

The above routine will enable the parser to print out errors encountered during parsing.

Parsing number expressions

// PARSING NUMBER EXPRESSIONS
static unique_ptr<ExprAST> ParseNumberExpr() {
    auto Result = make_unique<NumberExprAST>(numStr); // create and allocate
    getNextToken(); // consume the number
    return move(Result);
}

make_unique is a library function used to create a unique pointer and allocate memory to store an object.
make_unique, is used for directly calling unique_ptr constructors.
std::move changes the ownership of the smart pointer from the function so that the caller gets and owns the smart pointer.

Parsing parenthesis

// PARSING PARENTHESIS EXPRESSIONS
static unique_ptr<ExprAST> ParseParenExpr(){
    getNextToken(); // eat (. --> we expect '(' to come first
    auto V = ParseExpression();
    if (!V) return nullptr; // the above statement failed
    
    if(currTok == ')'){ // if we previously ate '(' we expect ')'
        getNextToken(); // eat ).
        return V; // return expression
    }else{
        LogError("expected ')'"); // not got what was expected
        return nullptr;
    }
    return V;
}

By 'eat (' we change the value of the current token - currTok to be the next token after the current token.

Parsing a identifier or function call expression

We now handle variable references and function calls.

// PARSING IDENTIFIERS AND FUNCTION CALL EXPRESSIONS
static unique_ptr<ExprAST> ParseIdentifierOrCallExpr(){
    string IdName = identifierStr;

    getNextToken();  // eat identifier.

    if(currTok == '('){ // parse args list
        getNextToken();  // eat (. -> eat open paren
        vector<unique_ptr<ExprAST>> Args; // strore arguments
        while(1){
            auto Arg = ParseExpression();
            if (Arg)
                Args.push_back(move(Arg)); // push to vector
            else
                return nullptr; // return null pointer

            if(currTok == ')'){
                getNextToken(); // eat
                break;
            }else if(currTok == ','){ // more arguments
                getNextToken(); //eat
                continue;
            }else{ // report errors
                LogError("Expected ')' or ',' in argument list \n");
            }
        }
    }
    return make_unique<VariableExprAST>(IdName);
}

We implement error reporting and recursion. We also use a look-ahead that determines if the current identifier is a stand-alone variable reference or if it is a function call expression.

Parsing primaries
A primary can be a number or an identifier variable, it can also be an expression within a parenthesis.

// PARSING PRIMARIES
static unique_ptr<ExprAST> ParsePrimary() {
    switch (currTok) {
        case tok_identifier: // identifiers
            return ParseIdentifierOrCallExpr(); // parse identifier
        case tok_number: // number literal
            return ParseNumberExpr(); // parse number literal
        case '(': // parenthesis
            return ParseParenExpr(); // parse parenthesis
        default: // report error
            LogError("Unknown token. expected an expression \n");
            return nullptr;
    }
}

Parsing binary expressions in the right-hand side
The function below parses a sequence of pairs. It takes a precedence and s pointer to an expression as its function arguments.

// PARSE RIGHT-HAND SIDE
static unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, unique_ptr<ExprAST> LHS){
    // If this is a binop, find its precedence.
    while (1) { // keep parsing right hand side
        int TokPrec = GetTokPrecedence(); // get precedence

        // If this is a binop that binds at least as tightly as the current binop,
        // consume it, otherwise we are done.
        if (TokPrec < ExprPrec) // precedence is < than curr precedence
            return LHS; // return left-hand side
        else{
            int BinOp = currTok;
            getNextToken();  // eat binop

            // Parse the primary expression after the binary operator.
            auto RHS = ParsePrimary(); // parse right-hand side
            if(RHS){
                int NextPrec = GetTokPrecedence();
                if(TokPrec < NextPrec){ // get next
                    RHS = ParseBinOpRHS(TokPrec+1, move(RHS));
                    if(!RHS) return nullptr;
                }
                // merge curr LHS, curr RHS to make a new binary expression AST as new LHS
                LHS = make_unique<BinaryExprAST>(BinOp, move(LHS), move(RHS));
            }else return nullptr;
            
        }
    }
}

Above we pass the function a precedence number, it is the minimal operator precedence the function can eat.

Parsing expressions

An expression in Kaleidoscope is a primary expression followed by a sequence of [binop, primaryexpr] pairs.

// PARSE EXPRESSION
static unique_ptr<ExprAST> ParseExpression() {
    auto LHS = ParsePrimary();

    if(LHS){
        return ParseBinOpRHS(0, move(LHS)); // parse left side
    }else return nullptr;
}

Parsing function prototypes
We now parse function prototypes in Kaleidoscope. An example is the following;
In the function prototype foobar(a b), we have the 5 tokens, the function name foobar, the left and right parenthesis, and the parameter list.

// PARSING FUNCTION PROTOTYPES - function signature
static unique_ptr<PrototypeAST> ParsePrototype(){
    if (currTok != tok_identifier){ // current token, not token identfier
        LogError("Expected function name in prototype \n"); // report error
        return nullptr;
    }

    string fnName = identifierStr;
    getNextToken(); // eat identifier

    if(currTok != '('){ // report error
        LogError("Expected '(' in prototype \n");
         return nullptr;
    }

    // Read the list of argument names.
    vector<string> argNames; // srore argument names
    while (getNextToken() == tok_identifier)
        argNames.push_back(identifierStr); // add to vector
    if(currTok != ')'){ // report error
        LogError("Expected ')' in prototype \n");
         return nullptr;
    }

    // success.
    getNextToken();  // eat ')'.

    return make_unique<PrototypeAST>(fnName, move(argNames)); // unique pointer to a prototype AST
}

Parsing function definitions
A function definition is a function prototype with an expression the implement the body. We parse function definitions as follows;

// PARSING FUNCTION DEFINITIONS
static unique_ptr<FunctionAST> ParseDefinition(){
    getNextToken();  // eat 'def' token
    auto Proto = ParsePrototype();
    if(!Proto) return nullptr;

    auto E = ParseExpression();
    if(E) return make_unique<FunctionAST>(move(Proto), move(E)); // unique pointer to a new function AST

    return nullptr; // otherwise return null pointer
}

Parsing extern keyword
extern is a keyword in Kaleidoscope, we support it to declare functions such as sin, cos and also support forward declarations of user functions.

// PARSING THE EXTERN KEYWORD
static unique_ptr<PrototypeAST> ParseExtern(){
    getNextToken();  // eat extern token
    return ParsePrototype();
}

extern is just a prototype without a body.

Parsing top-level expressions
Additionally we allow a user to type an arbitrary top-level expressions and evaluate them in the fly, for this we define anonymous nullary(zero argument) functions.

// PARSING TOP-LEVEL EXPRESSIONS
static unique_ptr<FunctionAST> ParseTopLevelExpr(){
    auto E = ParseExpression();
    if(E){
        // Make an anonymous proto.
        auto proto = make_unique<PrototypeAST>("", vector<string>());
        return make_unique<FunctionAST>(move(proto), move(E));
    }
    return nullptr;
}

Top-level parsing

// TOP_LEVEL PARSING
static void handleDefinition(){
    if(ParseDefinition()) fprintf(stderr, "Parsed a function definition. \n");
    else getNextToken();
}

static void handleExtern(){
    if(ParseExtern()) fprintf(stderr, "Parsed an extern. \n");
    else getNextToken();
}

static void handleTopLevelExpression(){
    if(ParseTopLevelExpr()) fprintf(stderr, "Parsed a top-level expression. \n");
    else getNextToken();
}

The driver code

// DRIVER CODE - repl
static void run(){
  while (1) {
    fprintf(stderr, "ready> ");
    switch(currTok){
        case tok_eof:
            return;
        case ';': // ignore top-level semicolons.
            getNextToken();
        break;
        case tok_def:
            handleDefinition();
        break;
        case tok_extern:
            handleExtern();
        break;
        default:
            handleTopLevelExpression();
        break;
    }
  }
}

int main(){
    fprintf(stderr, "ready> ");
    getNextToken();
    run();
    return 0;
}

Compilation and execution

$ clang++  main.cpp -o main.bin
$ ./main.bin

Testing
Once compiled and executed the code we pass some Kaleidoscope syntax and have the following results;

$ ./main.bin
ready> def foo(x y) x+foo(y, 4.0);
Parsed a function definition.
ready> def foo(x y) x+y y;
Parsed a function definition.
Parsed a top-level expr
ready> def foo(x y) x+y );
Parsed a function definition.
Error: unknown token when expecting an expression
ready> extern sin(a);
ready> Parsed an extern
ready> ^D
$

Summary.

Parsing in compiler design gets its input from the lexer which produces tokens given a high-level programming language as input. Parsing can be bottom-up or top-down.
The shunting yard algorithm is used for parsing arithmetic or logical expressions or both.

We have implemented a recursive descent parser for the Kaleidoscope programming language.

References.

  1. Code Generation from AST to LLVM IR
  2. Parsing in compiler design
  3. Syntax Analysis
  4. Shunting Yard Algorithm
  5. Full code
Constructing a Parser in Compiler Design
Share this