Compiler Front-end: Building a lexer
We will build a lexical analyzer for the Kaleidoscope programming language which converts source code into tokens. This is the first program to interact with the source during compilation.
Table of contents.
- Introduction.
- The lexical analyzer.
- Summary.
- References.
Prerequisites.
Introduction.
Lexical analysis is the first phase of a compiler, and the lexical analyzer is the first program to process code written in a high-level programming language such as Java or C/C++. A lexical analyzer breaks the source text into tokens, discarding whitespace and comments along the way.
A lexeme is the sequence of characters in the source code that matches the pattern for a token.
A pattern is a rule, often written as a regular expression, that specifies which lexemes form a given token. Tokens in a programming language include constants, keywords, operators, identifiers, and so on.
A lexical analyzer scans through the whole source code and identifies tokens.
A drawback of generated, table-driven lexical analyzers is the runtime overhead of building lexer tables and producing tokens; the hand-written lexer below avoids tables entirely.
In this article, we implement a lexical analyzer for the Kaleidoscope programming language. The following Kaleidoscope program returns the nth Fibonacci number;
# nth fibonacci number.
def fib(x)
if x < 3 then
1
else
fib(x-1)+fib(x-2)
# 10th fibonacci number - 55
fib(10)
We have comments, identifiers, keywords, variables, etc. We will convert them to their corresponding tokens.
The lexical analyzer.
We first start by defining the different tokens that we will have in this programming language;
enum Token{
// end of file token
tok_eof = -1,
// def keyword
tok_def = -2,
// extern keyword
tok_extern = -3,
// function names and variable names
tok_identifier = -4,
// numbers
tok_number = -5,
};
Above we have specified tokens for identifiers, keywords, variable names, and numbers for the Kaleidoscope programming language.
Global variables are bad practice, but we use them here to keep the example simple. They are easy to redeclare by accident, and they make code harder to follow than local variables do.
// global variables
static std::string IdentifierStr; // identifier string saved here
static double numVal; // number saved here
When called, getTok() reads characters from the input, recognizes a token, and returns that token as an integer.
static int getTok(){
static int lastChar = ' ';
// remove whitespace
while(isspace(lastChar))
lastChar = getchar();
lastChar is initialized to a single space character, so the whitespace loop runs on the first call.
getchar() reads a single character from standard input. Standard input could be a file or input being typed into the terminal.
The function getTok() calls getchar() to read characters one at a time, consuming them as it recognizes a token; the last character read is kept in lastChar but not yet processed.
We also ignore whitespace in the source code using the while loop, which keeps reading characters until it finds one that is not whitespace.
The next step involves recognizing identifiers and keywords such as the def keyword.
We accomplish this task using the following code;
// recognize identifiers and keywords - gets identifiers
if(isalpha(lastChar)){ // [a-zA-Z][a-zA-Z0-9]* - specifies valid identifiers
IdentifierStr = lastChar;
while(isalnum((lastChar = getchar()))) // while next letter is alphanumeric
IdentifierStr += lastChar;
if(IdentifierStr == "def") return tok_def; // def keyword, return the corresponding token
if(IdentifierStr == "extern") return tok_extern; // extern keyword, return the corresponding token
return tok_identifier;
}
Now we recognize numbers in the source code;
// recognizing numbers
if(isdigit(lastChar) || lastChar == '.'){ // if input is a digit or dot (.)
string numStr; // holds the digits
do{
numStr += lastChar; // append to numStr
lastChar = getchar(); // get next character
}while(isdigit(lastChar) || lastChar == '.'); // while digits/dots are available
numVal = strtod(numStr.c_str(), nullptr); // convert the collected string to a double
return tok_number; // return number token
}
Above we use the strtod C function to convert the collected string to a numeric value, which we store in numVal.
Now we handle comments in the code;
// process comments
if(lastChar == '#'){ // '#' sign starts comments
do
lastChar = getchar();
while(lastChar != EOF && lastChar != '\n' && lastChar != '\r'); // not end of file, new line or carriage return, read
if(lastChar != EOF) return getTok(); // recursively find other tokens
}
Comments start with a pound sign (#) in this programming language. We read characters until we reach the end of the file, a newline, or a carriage return.
Finally, if the last character is not the end of file, we look for other tokens in the source code.
We check for the end of the file;
// check the end of file
if(lastChar == EOF) return tok_eof;
// return character in ASCII code
int currChar = lastChar;
lastChar = getchar();
return currChar;
If the input does not match any of the cases above and we have reached the end of file, we return the tok_eof token.
Otherwise the character is an operator such as +, so we save it, read the next character into lastChar, and return the saved character's ASCII code.
Compilation and execution;
$ g++ lexer.cpp -o lexer.bin; ./lexer.bin
Above we compile the code into a binary; we can inspect the binary by executing the command;
$ hexdump lexer.bin
When executed;
- A number, e.g. 1, 2, ..., returns -5, the tok_number token.
- An identifier such as x returns -4.
- A keyword such as def returns -2 as specified.
- The extern keyword returns -3.
- Other characters such as $, %, &, etc return their corresponding ASCII code.
Summary.
A lexical analyzer is the first program that interacts with source code during compilation.
A lexical analyzer scans through the whole source code and identifies tokens.
A lexeme is the sequence of characters in the source code that matches the pattern for a token.
Following is the complete code in C++:
#include<iostream>
#include<string>
#include<cctype> // isspace, isalpha, isalnum, isdigit
#include<cstdio> // getchar, EOF
#include<cstdlib> // strtod
using std::string;
using std::cout;
using std::endl;
enum Token{
// end of file token
tok_eof = -1,
// def keyword
tok_def = -2,
// extern keyword
tok_extern = -3,
// function names and variable names
tok_identifier = -4,
// numbers
tok_number = -5,
};
// global variables
static std::string identifierStr; // identifier string saved here
static double numVal; // number saved here
// get tokens, remove white space
static int getTok(){
static int lastChar = ' '; // initialized to a space
// remove whitespace
while(isspace(lastChar))
lastChar = getchar();
// recognize identifiers and keywords - gets identifiers
if(isalpha(lastChar)){ // [a-zA-Z][a-zA-Z0-9]* - specifies valid identifiers
identifierStr = lastChar;
while(isalnum((lastChar = getchar()))) // while next letter is alphanumeric
identifierStr += lastChar;
if(identifierStr == "def") return tok_def; // def keyword, return the corresponding token
if(identifierStr == "extern") return tok_extern; // extern keyword, return the corresponding token
return tok_identifier; // return identifier token
}
// recognizing numbers
if(isdigit(lastChar) || lastChar == '.'){ // if input is a digit or dot (.)
string numStr; // holds the digits
do{
numStr += lastChar; // append to numStr
lastChar = getchar(); // get next character
}while(isdigit(lastChar) || lastChar == '.'); // while digits/dots are available
numVal = strtod(numStr.c_str(), nullptr); // convert the collected string to a double
return tok_number; // return number token
}
// process comments
if(lastChar == '#'){ // '#' sign starts comments
do
lastChar = getchar();
while(lastChar != EOF && lastChar != '\n' && lastChar != '\r'); // not end of file, new line or carriage return, read
if(lastChar != EOF) return getTok(); // recursively find other tokens
}
// check the end of file
if(lastChar == EOF) return tok_eof;
// return character in ASCII code
int currChar = lastChar;
lastChar = getchar(); // reset lastChar
return currChar;
}
int main(){
while(true){
int tok = getTok();
cout << "Token: " << tok << endl;
if(tok == tok_eof) break; // stop at end of input instead of looping forever
}
return 0;
}
// compilation and execution
// g++ lexer.cpp -o lexer.bin; ./lexer.bin
References.