Compiler Front-end: Building a lexer
We will build a lexical analyzer for the Kaleidoscope programming language which converts source code into tokens. This is the first program to interact with the source during compilation.
Table of contents.
- Introduction.
- The lexical analyzer.
- Summary.
- References.
Prerequisites.
Introduction.
Lexical analysis is the first phase of a compiler, and the lexical analyzer is the first program to process code written in a high-level programming language such as Java or C/C++. A lexical analyzer breaks the source text into tokens, discarding whitespace and comments along the way.
A lexeme is the sequence of characters in the source code that matches the pattern for a token.
A pattern is a rule, often written as a regular expression, that specifies which lexemes form a given token. Tokens in a programming language include constants, keywords, operators, identifiers, and so on.
A lexical analyzer scans through the whole source code and identifies tokens.
A drawback of generated, table-driven lexical analyzers is the runtime overhead of building lexer tables and producing tokens; the hand-written lexer below avoids tables entirely.
In this article, we implement a lexical analyzer for the Kaleidoscope programming language. The following Kaleidoscope program returns the nth Fibonacci number;
# nth fibonacci number.
def fib(x)
if x < 3 then
1
else
fib(x-1)+fib(x-2)
# 10th fibonacci number - 55
fib(10)
We have comments, identifiers, keywords, variables, etc. We will convert them to their corresponding tokens.
The lexical analyzer.
We first start by defining the different tokens that we will have in this programming language;
enum Token{
// end of file token
tok_eof = -1,
// def keyword
tok_def = -2,
// extern keyword
tok_extern = -3,
// function names and variable names
tok_identifier = -4,
// numbers
tok_number = -5,
};
Above we have specified tokens for identifiers, keywords, variable names, and numbers for the Kaleidoscope programming language.
Global variables are bad practice, but we use them here to keep the example simple. They are easy to redeclare by accident, and they make code harder to follow than local variables do.
// global variables
static std::string IdentifierStr; // identifier string saved here
static double numVal; // number saved here
When called, getTok() reads characters from the input, recognizes a token, and returns that token as an integer.
static int getTok(){
static int lastChar = ' ';
// remove whitespace
while(isspace(lastChar))
lastChar = getchar();
lastChar is initialized to a single space character, so the whitespace loop runs on the first call.
getchar() reads a single character from standard input. Standard input could be a file or input being typed into the terminal.
The function getTok() calls getchar() to read characters one at a time, consuming them as it recognizes a token; the last character read is kept in lastChar but not yet processed.
We also ignore whitespace in the source code using the while loop, which keeps reading characters until it finds one that is not whitespace.
The next step involves recognizing identifiers and keywords such as the def keyword.
We accomplish this task using the following code;
// recognize identifiers and keywords - gets identifiers
if(isalpha(lastChar)){ // [a-zA-Z][a-zA-Z0-9]* - specifies valid identifiers
IdentifierStr = lastChar;
while(isalnum((lastChar = getchar()))) // while next letter is alphanumeric
IdentifierStr += lastChar;
if(IdentifierStr == "def") return tok_def; // def keyword, return the corresponding token
if(IdentifierStr == "extern") return tok_extern; // extern keyword, return the corresponding token
return tok_identifier;
}
Now we recognize numbers in the source code;
// recognizing numbers
if(isdigit(lastChar) || lastChar == '.'){ // if input is a digit or dot (.)
string numStr; // holds the digits
do{
numStr += lastChar; // append to numStr
lastChar = getchar(); // get next character
}while(isdigit(lastChar) || lastChar == '.'); // while digits/dots are available
numVal = strtod(numStr.c_str(), nullptr); // convert the collected string to a double
return tok_number; // return number token
}
Above we use the strtod C function to convert the collected string to a numeric value, which we store in numVal.
Now we handle comments in the code;
// process comments
if(lastChar == '#'){ // '#' sign starts comments
do
lastChar = getchar();
while(lastChar != EOF && lastChar != '\n' && lastChar != '\r'); // not end of file, new line or carriage return, read
if(lastChar != EOF) return getTok(); // recursively find other tokens
}
Comments start with a pound sign (#) in this programming language. We read characters until we reach the end of the file, a newline, or a carriage return.
Finally, if the last character is not the end of file, we look for other tokens in the source code.
We check for the end of the file;
// check the end of file
if(lastChar == EOF) return tok_eof;
// return character in ASCII code
int currChar = lastChar;
lastChar = getchar();
return currChar;
If the input does not match any of the cases above and we have reached the end of file, we return the tok_eof token.
Otherwise the character is an operator such as +, so we save it, read the next character into lastChar, and return the saved character's ASCII code.
Compilation and execution;
$ g++ lexer.cpp -o lexer.bin; ./lexer.bin
Above we compile the code into a binary; we can inspect the binary by executing the command;
$ hexdump lexer.bin
When executed;
- A number, e.g. 1, 2, ..., returns -5, the tok_number token.
- An identifier such as x returns -4.
- A keyword such as def returns -2 as specified.
- The extern keyword returns -3.
- Other characters such as $, %, &, etc return their corresponding ASCII code.
Summary.
A lexical analyzer is the first program that interacts with source code during compilation.
A lexical analyzer scans through the whole source code and identifies tokens.
A lexeme is the sequence of characters in the source code that matches the pattern for a token.
Following is the complete code in C++:
#include<iostream>
#include<string>
#include<cctype> // isspace, isalpha, isalnum, isdigit
#include<cstdio> // getchar, EOF
#include<cstdlib> // strtod
using std::string;
using std::cout;
using std::endl;
enum Token{
// end of file token
tok_eof = -1,
// def keyword
tok_def = -2,
// extern keyword
tok_extern = -3,
// function names and variable names
tok_identifier = -4,
// numbers
tok_number = -5,
};
// global variables
static std::string identifierStr; // identifier string saved here
static double numVal; // number saved here
// get tokens, remove white space
static int getTok(){
static int lastChar = ' '; // initialized to a space
// remove whitespace
while(isspace(lastChar))
lastChar = getchar();
// recognize identifiers and keywords - gets identifiers
if(isalpha(lastChar)){ // [a-zA-Z][a-zA-Z0-9]* - specifies valid identifiers
identifierStr = lastChar;
while(isalnum((lastChar = getchar()))) // while next letter is alphanumeric
identifierStr += lastChar;
if(identifierStr == "def") return tok_def; // def keyword, return the corresponding token
if(identifierStr == "extern") return tok_extern; // extern keyword, return the corresponding token
return tok_identifier; // return identifier token
}
// recognizing numbers
if(isdigit(lastChar) || lastChar == '.'){ // if input is a digit or dot (.)
string numStr; // holds the digits
do{
numStr += lastChar; // append to numStr
lastChar = getchar(); // get next character
}while(isdigit(lastChar) || lastChar == '.'); // while digits/dots are available
numVal = strtod(numStr.c_str(), nullptr); // convert the collected string to a double
return tok_number; // return number token
}
// process comments
if(lastChar == '#'){ // '#' sign starts comments
do
lastChar = getchar();
while(lastChar != EOF && lastChar != '\n' && lastChar != '\r'); // not end of file, new line or carriage return, read
if(lastChar != EOF) return getTok(); // recursively find other tokens
}
// check the end of file
if(lastChar == EOF) return tok_eof;
// return character in ASCII code
int currChar = lastChar;
lastChar = getchar(); // reset lastChar
return currChar;
}
int main(){
while(true){
int tok = getTok();
cout << "Token: " << tok << endl;
if(tok == tok_eof) break; // stop at end of input instead of looping forever
}
return 0;
}
// compilation and execution
// g++ lexer.cpp -o lexer.bin; ./lexer.bin
References.