Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
In this article, we will be constructing a recursive descent parser for the Kaleidoscope programming language in Compiler Design.
Table of contents.
- Introduction.
- The parser.
- Summary.
- References.
Prerequisites.
Introduction.
Parsing in compiler design gets its input from the lexer which produces tokens given a high-level programming language as input.
Parsing can be divided into two types, the first is bottom-up and the other in top-down parsing.
Top-down parsing.
It is also referred to as recursive or predictive parsing.
A top-down parser generates a parse tree for an input string using grammar productions by expanding non-terminals.
We can further divide top-down parsing into two types, recursive-descent parsing which we cover here, and non-recursive descent parsing.
Bottom-up parsing.
A bottom-up parser generates a parse tree for an input string using grammar productions by compressing non-terminals.
There are two types of bottom-up parsing, these are, LR-parsing and Operator precedence parsing.
In the prerequisite article, we created classes that are a representation of AST nodes. In this article w will be constructing a recursive descent parser for the Kaleidoscope programming language.
A recursive-descent parser is a top-down parser built from mutually recursive procedures where each procedure implements the non-terminals of a programming language grammar.
The parser.
First we have to get the tokens from the input.
static int currTok; // current token
static int getNextToken() {
return currTok = getTok();
}
We declare a current token, this can be identifiers, for characters the token returned is their ASCII representation, number tokens.
getNextToken function saves the current token from the stream to currTok variable.
Parsing binary expressions
// PARSING BINARY EXPRESSIONS
static map<char, int> BinopPrecedence;
// get the precedence of the pending binary operator token.
static int GetTokPrecedence(){
switch(currTok){
case '<':
case '>':
return 10;
case '+':
case '-':
return 20;
case '*':
case '/':
return 40; // highest precedence
default:
return -1;
}
}
Operator precedence using shunting yard algorithm. The shunting yard algorithm is used for parsing arithmetic or logical expressions or both.
With binary expressions, the parser will have to implement operator precedence. In Kaleidoscope, we handle operator precedence by assigning each expression a value that represents its precedence.
We use a map to store the precedence operators. The higher the number the higher the precedence.
Reporting errors.
// error reporting for expressions
void LogError(const char *Str) {
fprintf(stderr, "LogError: %s\n", Str); // print error
}
The above routine will enable the parser to print out errors encountered during parsing.
Parsing number expressions
// PARSING NUMBER EXPRESSIONS
static unique_ptr<ExprAST> ParseNumberExpr() {
auto Result = make_unique<NumberExprAST>(numStr); // create and allocate
getNextToken(); // consume the number
return move(Result);
}
make_unique is a library function used to create a unique pointer and allocate memory to store an object.
make_unique, is used for directly calling unique_ptr constructors.
std::move changes the ownership of the smart pointer from the function so that the caller gets and owns the smart pointer.
Parsing parenthesis
// PARSING PARENTHESIS EXPRESSIONS
static unique_ptr<ExprAST> ParseParenExpr(){
getNextToken(); // eat (. --> we expect '(' to come first
auto V = ParseExpression();
if (!V) return nullptr; // the above statement failed
if(currTok == ')'){ // if we previously ate '(' we expect ')'
getNextToken(); // eat ).
return V; // return expression
}else{
LogError("expected ')'"); // not got what was expected
return nullptr;
}
return V;
}
By 'eat (' we change the value of the current token - currTok to be the next token after the current token.
Parsing a identifier or function call expression
We now handle variable references and function calls.
// PARSING IDENTIFIERS AND FUNCTION CALL EXPRESSIONS
static unique_ptr<ExprAST> ParseIdentifierOrCallExpr(){
string IdName = identifierStr;
getNextToken(); // eat identifier.
if(currTok == '('){ // parse args list
getNextToken(); // eat (. -> eat open paren
vector<unique_ptr<ExprAST>> Args; // strore arguments
while(1){
auto Arg = ParseExpression();
if (Arg)
Args.push_back(move(Arg)); // push to vector
else
return nullptr; // return null pointer
if(currTok == ')'){
getNextToken(); // eat
break;
}else if(currTok == ','){ // more arguments
getNextToken(); //eat
continue;
}else{ // report errors
LogError("Expected ')' or ',' in argument list \n");
}
}
}
return make_unique<VariableExprAST>(IdName);
}
We implement error reporting and recursion. We also use a look-ahead that determines if the current identifier is a stand-alone variable reference or if it is a function call expression.
Parsing primaries
A primary can be a number or an identifier variable, it can also be an expression within a parenthesis.
// PARSING PRIMARIES
static unique_ptr<ExprAST> ParsePrimary() {
switch (currTok) {
case tok_identifier: // identifiers
return ParseIdentifierOrCallExpr(); // parse identifier
case tok_number: // number literal
return ParseNumberExpr(); // parse number literal
case '(': // parenthesis
return ParseParenExpr(); // parse parenthesis
default: // report error
LogError("Unknown token. expected an expression \n");
return nullptr;
}
}
Parsing binary expressions in the right-hand side
The function below parses a sequence of pairs. It takes a precedence and s pointer to an expression as its function arguments.
// PARSE RIGHT-HAND SIDE
static unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, unique_ptr<ExprAST> LHS){
// If this is a binop, find its precedence.
while (1) { // keep parsing right hand side
int TokPrec = GetTokPrecedence(); // get precedence
// If this is a binop that binds at least as tightly as the current binop,
// consume it, otherwise we are done.
if (TokPrec < ExprPrec) // precedence is < than curr precedence
return LHS; // return left-hand side
else{
int BinOp = currTok;
getNextToken(); // eat binop
// Parse the primary expression after the binary operator.
auto RHS = ParsePrimary(); // parse right-hand side
if(RHS){
int NextPrec = GetTokPrecedence();
if(TokPrec < NextPrec){ // get next
RHS = ParseBinOpRHS(TokPrec+1, move(RHS));
if(!RHS) return nullptr;
}
// merge curr LHS, curr RHS to make a new binary expression AST as new LHS
LHS = make_unique<BinaryExprAST>(BinOp, move(LHS), move(RHS));
}else return nullptr;
}
}
}
Above we pass the function a precedence number, it is the minimal operator precedence the function can eat.
Parsing expressions
An expression in Kaleidoscope is a primary expression followed by a sequence of [binop, primaryexpr] pairs.
// PARSE EXPRESSION
static unique_ptr<ExprAST> ParseExpression() {
auto LHS = ParsePrimary();
if(LHS){
return ParseBinOpRHS(0, move(LHS)); // parse left side
}else return nullptr;
}
Parsing function prototypes
We now parse function prototypes in Kaleidoscope. An example is the following;
In the function prototype foobar(a b), we have the 5 tokens, the function name foobar, the left and right parenthesis, and the parameter list.
// PARSING FUNCTION PROTOTYPES - function signature
static unique_ptr<PrototypeAST> ParsePrototype(){
if (currTok != tok_identifier){ // current token, not token identfier
LogError("Expected function name in prototype \n"); // report error
return nullptr;
}
string fnName = identifierStr;
getNextToken(); // eat identifier
if(currTok != '('){ // report error
LogError("Expected '(' in prototype \n");
return nullptr;
}
// Read the list of argument names.
vector<string> argNames; // srore argument names
while (getNextToken() == tok_identifier)
argNames.push_back(identifierStr); // add to vector
if(currTok != ')'){ // report error
LogError("Expected ')' in prototype \n");
return nullptr;
}
// success.
getNextToken(); // eat ')'.
return make_unique<PrototypeAST>(fnName, move(argNames)); // unique pointer to a prototype AST
}
Parsing function definitions
A function definition is a function prototype with an expression the implement the body. We parse function definitions as follows;
// PARSING FUNCTION DEFINITIONS
static unique_ptr<FunctionAST> ParseDefinition(){
getNextToken(); // eat 'def' token
auto Proto = ParsePrototype();
if(!Proto) return nullptr;
auto E = ParseExpression();
if(E) return make_unique<FunctionAST>(move(Proto), move(E)); // unique pointer to a new function AST
return nullptr; // otherwise return null pointer
}
Parsing extern keyword
extern is a keyword in Kaleidoscope, we support it to declare functions such as sin, cos and also support forward declarations of user functions.
// PARSING THE EXTERN KEYWORD
static unique_ptr<PrototypeAST> ParseExtern(){
getNextToken(); // eat extern token
return ParsePrototype();
}
extern is just a prototype without a body.
Parsing top-level expressions
Additionally we allow a user to type an arbitrary top-level expressions and evaluate them in the fly, for this we define anonymous nullary(zero argument) functions.
// PARSING TOP-LEVEL EXPRESSIONS
static unique_ptr<FunctionAST> ParseTopLevelExpr(){
auto E = ParseExpression();
if(E){
// Make an anonymous proto.
auto proto = make_unique<PrototypeAST>("", vector<string>());
return make_unique<FunctionAST>(move(proto), move(E));
}
return nullptr;
}
Top-level parsing
// TOP_LEVEL PARSING
static void handleDefinition(){
if(ParseDefinition()) fprintf(stderr, "Parsed a function definition. \n");
else getNextToken();
}
static void handleExtern(){
if(ParseExtern()) fprintf(stderr, "Parsed an extern. \n");
else getNextToken();
}
static void handleTopLevelExpression(){
if(ParseTopLevelExpr()) fprintf(stderr, "Parsed a top-level expression. \n");
else getNextToken();
}
The driver code
// DRIVER CODE - repl
static void run(){
while (1) {
fprintf(stderr, "ready> ");
switch(currTok){
case tok_eof:
return;
case ';': // ignore top-level semicolons.
getNextToken();
break;
case tok_def:
handleDefinition();
break;
case tok_extern:
handleExtern();
break;
default:
handleTopLevelExpression();
break;
}
}
}
int main(){
fprintf(stderr, "ready> ");
getNextToken();
run();
return 0;
}
Compilation and execution
$ clang++ main.cpp -o main.bin
$ ./main.bin
Testing
Once compiled and executed the code we pass some Kaleidoscope syntax and have the following results;
$ ./main.bin
ready> def foo(x y) x+foo(y, 4.0);
Parsed a function definition.
ready> def foo(x y) x+y y;
Parsed a function definition.
Parsed a top-level expr
ready> def foo(x y) x+y );
Parsed a function definition.
Error: unknown token when expecting an expression
ready> extern sin(a);
ready> Parsed an extern
ready> ^D
$
Summary.
Parsing in compiler design gets its input from the lexer which produces tokens given a high-level programming language as input. Parsing can be bottom-up or top-down.
The shunting yard algorithm is used for parsing arithmetic or logical expressions or both.
We have implemented a recursive descent parser for the Kaleidoscope programming language.
References.
- Code Generation from AST to LLVM IR
- Parsing in compiler design
- Syntax Analysis
- Shunting Yard Algorithm
- Full code