Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
In programming, we use loops to repeat a sequence of instructions until a specified condition is met. In this article, we further extend Kaleidoscope to support for loops.
Table of contents.
- Introduction.
- The for expression.
- Lexer extensions.
- AST extensions.
- Parser extensions.
- LLVM IR extensions.
- Code generation extensions.
- Summary.
- References.
Prerequisites.
LLVM Control Flow: If-then-else.
Introduction.
In the prerequisite article, we learned how to extend support for if-then-else statements in the Kaleidoscope programming language. In this article, we further add looping features.
A loop is used when we want to perform an operation a desired number of times until a condition is met. We can have for loops, while loops, do while loops, infinite loops, nested loops for the above loops.
We also have loop control statements such as break and continue, the former terminates a loop immediately while the latter jumps to the next iteration skipping any code between.
The for expression.
For example let's print 100 stars * in Kaleidoscope using for loops:
extern putchard(char);
def printstar(n)
for i = 1, i < n, 1.0 in
putchard(42); # ascii 42 = '*'
# print 100 '*' characters
printstar(100);
In this case, the value 42 is the ASCII value representing the asterisks. We define a variable i that starts the iteration. While the condition - i < n is true, we will increment it by a step value. In this case, our step value is 1.0. In other cases, if the step is not specified, the default is 1.0. The body expression is executed while the loop evaluates to true.
Lexer extensions.
The lexer takes source code and produces a series of tokens. In this section, we will add support for for loops by creating new tokens and using if else statements to recognize tokens;
... in enum Token ...
// control
tok_if = -6, tok_then = -7, tok_else = -8,
tok_for = -9, tok_in = -10
... in gettok ...
if (IdentifierStr == "def")
return tok_def;
if (IdentifierStr == "extern")
return tok_extern;
if (IdentifierStr == "if")
return tok_if;
if (IdentifierStr == "then")
return tok_then;
if (IdentifierStr == "else")
return tok_else;
if (IdentifierStr == "for")
return tok_for;
if (IdentifierStr == "in")
return tok_in;
return tok_identifier;
The output of the lexical analysis is a stream of tokens which are then sent to the parser which makes sure we are using the correct syntax for the language, in this case, Kaleidoscope.
AST extensions.
Abstract trees are a form of intermediate code, they are used to represent the structure of source code in a hierarchical form. In our case, it boils down to capturing variable names and constituent expressions in the node;
/// ForExprAST - Expression class for for/in.
class ForExprAST : public ExprAST {
std::string VarName;
std::unique_ptr<ExprAST> Start, End, Step, Body;
public:
ForExprAST(const std::string &VarName, std::unique_ptr<ExprAST> Start,
std::unique_ptr<ExprAST> End, std::unique_ptr<ExprAST> Step,
std::unique_ptr<ExprAST> Body)
: VarName(VarName), Start(std::move(Start)), End(std::move(End)),
Step(std::move(Step)), Body(std::move(Body)) {}
Value *codegen() override;
};
Parser extensions.
The parser's input is a string of tokens from the lexical analysis. The parser is also responsible for reporting any syntax errors in the source code in regards to the for statement in this case. In well-formed programs, the parser constructs a parse tree and returns it as its output which is passed on to the following stages of compilation.
In our case, the only question is, how do we handle an optional step value, for this, we check to see if the second comma is present, if not, it sets the step value to null in the AST node;
/// forexpr ::= 'for' identifier '=' expr ',' expr (',' expr)? 'in' expression
static std::unique_ptr<ExprAST> ParseForExpr() {
getNextToken(); // eat the for.
if (CurTok != tok_identifier)
return LogError("expected identifier after for");
std::string IdName = IdentifierStr;
getNextToken(); // eat identifier.
if (CurTok != '=')
return LogError("expected '=' after for");
getNextToken(); // eat '='.
auto Start = ParseExpression();
if (!Start)
return nullptr;
if (CurTok != ',')
return LogError("expected ',' after for start value");
getNextToken();
auto End = ParseExpression();
if (!End)
return nullptr;
// The step value is optional.
std::unique_ptr<ExprAST> Step;
if (CurTok == ',') {
getNextToken();
Step = ParseExpression();
if (!Step)
return nullptr;
}
if (CurTok != tok_in)
return LogError("expected 'in' after for");
getNextToken(); // eat 'in'.
auto Body = ParseExpression();
if (!Body)
return nullptr;
return std::make_unique<ForExprAST>(IdName, std::move(Start),
std::move(End), std::move(Step),
std::move(Body));
}
Finally we hook it up as a primary expression as follows;
static std::unique_ptr<ExprAST> ParsePrimary() {
switch (CurTok) {
default:
return LogError("unknown token when expecting an expression");
case tok_identifier:
return ParseIdentifierExpr();
case tok_number:
return ParseNumberExpr();
case '(':
return ParseParenExpr();
case tok_if:
return ParseIfExpr();
case tok_for:
return ParseForExpr();
}
}
LLVM IR extensions.
The final step in developing a compiler front-end is the generation of an IR(Intermediate Representation) that can later be converted into machine-dependent instructions using the back-end. Referencing the loop in the second section - where we print 100 asterisks characters, we have the following corresponding LLVM IR;
declare double @putchard(double)
define double @printstar(double %n) {
entry:
; initial value = 1.0 (inlined into phi)
br label %loop
loop: ; preds = %loop, %entry
%i = phi double [ 1.000000e+00, %entry ], [ %nextvar, %loop ]
; body
%calltmp = call double @putchard(double 4.200000e+01)
; increment
%nextvar = fadd double %i, 1.000000e+00
; termination test
%cmptmp = fcmp ult double %i, %n
%booltmp = uitofp i1 %cmptmp to double
%loopcond = fcmp one double %booltmp, 0.000000e+00
br i1 %loopcond, label %loop, label %afterloop
afterloop: ; preds = %loop
; loop always returns 0.0
ret double 0.000000e+00
}
The above is not optimized for clarity. The loop has all the same constructs as seen before, a phi node, expressions, and a basic block.
Code generation extensions.
Finally, to generate the IR itself. First we output the start expression for the loop value;
Value *ForExprAST::codegen() {
// Emit the start code first, without 'variable' in scope.
Value *StartVal = Start->codegen();
if (!StartVal)
return nullptr;
We then set up the LLVM basic block for the start of the body of the loop. Here, the loop body is a single block, the body could also have multiple blocks.
We create a new block and insert as follows;
// Make the new basic block for the loop header, inserting after current
// block.
Function *TheFunction = Builder.GetInsertBlock()->getParent();
BasicBlock *PreheaderBB = Builder.GetInsertBlock();
BasicBlock *LoopBB =
BasicBlock::Create(TheContext, "loop", TheFunction);
// Insert an explicit fall through from the current block to the LoopBB.
Builder.CreateBr(LoopBB);
We create an actual block that starts the loop and creates an unconditional branch for fall-through between two blocks.
// Start insertion in LoopBB.
Builder.SetInsertPoint(LoopBB);
// Start the PHI node with an entry for Start.
PHINode *Variable = Builder.CreatePHI(Type::getDoubleTy(TheContext),
2, VarName.c_str());
Variable->addIncoming(StartVal, PreheaderBB);
Since the for loop preheader is already set, we switch to emitting code for the body of the for loop. First, we move the insertion point and create the PHI node for the loop induction variable.
We know the incoming value for the starting value so we add it to the phi node. Phi eventually gets a second value for the back edge;
// Within the loop, the variable is defined as equal to the PHI node. If it
// shadows an existing variable, we have to restore it, so save it now.
Value *OldVal = NamedValues[VarName];
NamedValues[VarName] = Variable;
// Emit the body of the loop. This, like any other expr, can change the
// current BB. Note that we ignore the value computed by the body, but don't
// allow an error.
if (!Body->codegen())
return nullptr;
The for loop now introduces a new variable to the symbol table meaning that the symbol table can have either function arguments or loop variables.
For this, it requires that before we codegen the body of the loop, we first add the loop variable as the current value for its name.
To handle this correctly, we remember the value that we are potentially shadowing in OldVal above.
Once the loop variable is set in the symbol table, the code recursively codegens the body.
This allows the body to use the loop variable and any references to it find it in the symbol table.
// Emit the step value.
Value *StepVal = nullptr;
if (Step) {
StepVal = Step->codegen();
if (!StepVal)
return nullptr;
} else {
// If not specified, use 1.0.
StepVal = ConstantFP::get(TheContext, APFloat(1.0));
}
Value *NextVar = Builder.CreateFAdd(Variable, StepVal, "nextvar");
The body is now emitted, we compute the next value of the iteration by adding the step value or 1.0. NextVar will be the value of the loop variable on the next loop iteration. The code is shown below;
// Compute the end condition.
Value *EndCond = End->codegen();
if (!EndCond)
return nullptr;
// Convert condition to a bool by comparing non-equal to 0.0.
EndCond = Builder.CreateFCmpONE(
EndCond, ConstantFP::get(TheContext, APFloat(0.0)), "loopcond");
Now we evaluate the exit value of the loop, here we are determining whether we should exit or proceed;
// Create the "after loop" block and insert it.
BasicBlock *LoopEndBB = Builder.GetInsertBlock();
BasicBlock *AfterBB =
BasicBlock::Create(TheContext, "afterloop", TheFunction);
// Insert the conditional branch into the end of LoopEndBB.
Builder.CreateCondBr(EndCond, LoopBB, AfterBB);
// Any new code will be inserted in AfterBB.
Builder.SetInsertPoint(AfterBB);
Finally, for the cleanups. Since we have NextVar's value, we can add the incoming value to the loop PHI node. Then, we remove the final loop variable from the symbol table so that it is not in scope after the for loop.
Finally, code generation of the for loop will always return 0.0.
Summary.
We use loops to repeat a sequence of instructions until a specified condition is met, we learned how to add for loops in Kaleidoscope programming language.
In the next article, we add support for user-defined operators.