Get this book -> Problems on Array: For Interviews and Competitive Programming
In this article, we discuss the different phases of a Complier such as Lexical Analysis, Syntax Analysis, Intermediate Code Generation and others.
Table of contents:
- Phases of Compiler
- Analysis (Machine independent/Language dependent)
- Lexical Analysis
- Syntax Analysis
- Semantic Analysis
- Intermediate code generation
- Synthesis (Machine Dependent/Language Independent)
- Code optimizer
- Code Generator
- Symbol table
- Error handling routine
Phases of Compiler
A compiler is a program that translates source code written in a high level programming language eg. java, C++, into machine code for some computer architecture.
The compiler machine code can then be executed against different sets of data.
An interpreter reads an executable source program written in a high level programming language and runs it against the data/input provided for the program to produce some output.
Both compilers and interpreters are written in high level programming languages such as java.
Note that, the interpreters source program does not produce machine code therefore it is machine independent.
A token is a sequence of characters representing a lexical unit matching with patterns such as operators, identifiers and keywords.
A lexeme is a sequence of characters in the source program that matches the patterns for a token. Simply, it is an instance of a token.
A pattern describes the rule lexemes take, for keyword as a token, the pattern is the sequence of characters forming the keyword, for identifiers it is matched by strings.
Phases of compiler
There are two main categories of compilation;
(A). Analysis (Machine independent/Language dependent)
This is the front end of a compiler where an intermediate representation of the source code is created.
Data from the source program is collected and saved to a data structure called a symbol table.
Analysis can be;
Linear - Scanning where a stream of characters is read from left to right and grouped into tokens with a meaning
Hierarchical - Tokens are categorized hierarchically into nested groups.
Semantic - Evaluates the meaning of the components of a source program.
Analysis comprises of lexical analysis, syntax analysis, semantic analysis and intermediate code generation which we discuss below.
1. Lexical Analysis
This is the first phase in th compiler. Source code is scanned from left to right, character by characters and grouped into tokens.
Character stream is grouped into meaningful sequences by the identified tokens(lexemes).
The lexical analyzer is also referred to as a scanner.
- Identification of illegal tokens.
- Identification of lexical units e.g, in source code.
- Classification of lexical units e.g, constants, keywords into different tables. (it ignores comments)
2. Syntax Analysis
Converts the stream of tokens into a parse tree.
A syntax analyzer can also be referred to as the parser.
All tokens are checked against the grammar of the source code to ensure correctness.
- Report syntax errors.
- Construction of a parse tree.
- Obtaining tokens from the lexical analyzer
- Checking for syntax errors.
3. Semantic Analysis
A semantic analyzer determines the validity of the parse tree.
An annotated syntax tree is the output from this phase.
- Type checking.
- Checking if source language permits provided operands or not.
- Collection of type information.
- Saving gathered information to symbol table or syntax tree.
- Report semantic errors.
- Checking for semantic errors.
4. Intermediate code generation
Now the parse tree is semantically verified, an intermediate code generator generates three address code(assembly-like instructions with three operands per instruction). each operand acts like a register.
The code is intermediate, that is, it is neither high-level or machine code.
This intermediate code will later be translated to machine code.
This phase acts as a bridge from analysis to synthesis.
- Maintaining precedence ordering of the source language.
- Translation of intermediate code into target language.
- Holding values computed during translation process.
- Holding operands of an instruction.
(B). Synthesis (Machine Dependent/Language Independent)
This is the back end of a compiler where an equivalent target program is created from the intermediate source code.
Synthesis constitutes of the code optimizer and code generator which we discuss below.
5. Code optimizer
It reduces the size of the program by reducing the number unnecessary of lines of code in the three address code so that the program takes the least amount of memory and exhibits a fast execution time, this improves performance.
Note that this alteration will not lose the meaning of the code.
- Removal of unused variables and unreachable code.
- Improving runtime and performance of program.
- Establishing trade offs between execution and compilation speed.
- Generates streamlined code from its intermediate representation.
- Removal of unaltered statements from a loop.
6. Code Generator
In this phase assembly code is generated from the optimized code.
For each variable used by the program a memory location is allocated.
- Converting intermediate code to target code.
- Selection and allocation of memory locations and registers.
This is the compilers data structure that stores identifiers with their name and types therefore enabling easier search and retrieval.
It interacts with the error handler and all phases of the compiler for updates.
It is responsible for scope management.
- Literal constants and strings.
- Compiler generated temporaries.
- Function names.
- Variable names and constants.
- Labels in source languages.
Functions in various phases of the compiler
- lexical analysis - Creation of a new table.
- Syntax analysis - Add information regarding types scope. etc.
- Semantic analysis - Use already stored information to check semantics and update accordingly.
- Intermediate code generation - Reference for run time allocation and storage of temporary variable information.
- Code optimization - Uses symbol table for machine-dependent optimization.
- Code generation - Uses the address information of identifiers stored in symbol table to generate code.
Error handling routine
This routine is responsible for detecting an error, reporting it and implement a recovery strategy for handling the error.
Common errors that happen are;
- Invalid character sequences during scanning.
- Invalid token sequences.
- Scope error.
- Parsing in semantic analysis.
Some of the error that can happen during the phases are;
- Lexical analyzer - Wrong spelling of tokens.
- Syntax analyzer - Missing parenthesis.
- Intermediate code generator - Mismatch of operands for an operator.
- Code optimizer - Unreachable statements.
- Code generator - Improper allocation of registers or full memory.
- Symbol table - Multiple declared identifiers.
An example of the compilation process.
Compiler and its phases slides
Phases of compiler
Syntax analysis in depth