In this article we discuss the process through which a compiler checks for correct syntactic and semantic conversions for a source language.
Table of contents.
- Design spaces for types.
- Environments for type checking.
- Type checking expressions.
- Type checking of function declarations.
- Type checking a program.
- Other type checking aspects.
Type checking is the process of verifying and enforcing constraints of types in values.
Lexical analysis and parsing phases in the compiler filter out many texts however many programing languages with well-formed requirements cannot be handled by the techniques used in these two phases because more often than not, they are not context free and thus cannot be able check by membership of a context-free grammar.
The type checking phase in compiler design is interleaved with the syntax analysis phase therefore it is done before execution or translation of a program(static typing) and thus the information is gathered for use by subsequent phases, for example, the translator will exploit type information for it to naturally combine calculation of types with the actual translation.
Syntax of a simple programming language.
Program → Funs Funs → Fun Funs → Fun Funs Fun → TypeId(TypeIds) = Exp TypeId → int id TypeId → bool id TypeIds → TypeId TypeIds → TypeId, TypeIds Exp → num Exp → id Exp → Exp + Exp Exp → Exp = Exp Exp → if Exp then Exp else Exp Exp → id(Exps) Exp → let id = Exp in Exp Exps → Exp Exps → Exp, Exps
It is a first-order functional programming language with recursive definitions.
A program is a list of function declarations.
Functions are mutually exclusive and no function can be declared more than once.
Each function declares its result type and the types and names of its arguments.
Functions and variables have separate name spaces.
Parameters should not be repeated.
The body of a function is an expression either an integer constant, variable names, sum expression, comparison, a function call or a locally declared expression.
Comparison is defined for booleans and integers.
Addition is defined for integers.
A program must contain main with an integer as its argument.
main returns an integer when called for execution.
Design space for types.
Statically typed languages are those programming languages which perform type checking at compile-time, these include C, C++, java.
Strongly typed languages are language implementations whereby whenever an operation is performed, arguments to the operation are of the specified type defined by the operation.
e.g You cannot concatenate a string type and a floating point number.
Weakly typed languages are implementations whereby there is no explicit specification of types of objects or variables.
These are mainly used for systems programming whereby you manipulate data(moving, encrypting, compressing) without regard to what the data represents.
Some programming languages such as C will combine both static and dynamic typing i.e, some types are checked before execution while others during execution.
The design space for static verses dynamic, and weak verse strong typing
Type-checking operates on the abstract syntax tree(AST) and can make several recursive passes on this tree each time gathering information or using information gathered from previous phases.
This information is what we call attributes.
Attributes can be Synthesized, (attributes passed up the AST) or Inherited, (attributes passed down the AST).
Attributes synthesized on one subtree can be inherited in another subtree for example a symbol table that is synthesized by a declaration and inherited by the declaration's scope.
A syntactic category represents a type in the data structure for the AST or a group of related non-terminal in a grammar.
Syntactic categories will have their own set of attributes.
Writing a type checker as a set of mutually recursive functions will result in one or more such functions for each syntactic category.
Environments for type checking.
We use the simple language in the previous section for static type checking.
A symbol table is needed to bind variables and functions to their types.
We use two symbol tables, one for variables and the other for functions.
A variable can be bound to either int or bool types.
A function is bound to its type(types of its arguments and result).
We write function types as a parenthesized list of argument types.
(int, bool) → int
This represents a function taking two parameters of types int and bool and a result type int which will be the return type for the function separated by an arrow.
We assume a stack-like behavior, that is, we don't preserve symbol tables for inner scopes once they are exited but preserve them for outer scopes so as no action is required to restore them.
Type checking expressions.
Symbol tables for variables and functions become inherited attributes when expressions are type checked.
int and bool expressions will re returned ad synthesized attributes.
The type checker function will use a concrete syntax for pattern matching purposes so that the presentation is independent of any specific data structure for the abstract syntax.
We assume that for terminals(variable names and numeric constants) with attributes there are predefined functions for extracting them.
Therefore id will have a function getname that extracts the name of the identifier and num has a function getvalue that extracts the value of the number.
For non-terminals we define functions which take an AST subtree and inherited attributes as arguments and return the synthesized attributes.
Type checking function for expressions
is the function responsible for type checking.
vtable is the symbol table for variables.
ftable is the symbol table for functions.
error function is responsible for reporting errors, if we let this function return, then the type-checker will continue reporting errors.
Cases handled by
- A number has type int.
- A variable type is found be a lookup in the variables symbol table. If not found, the lookup function returns unbound special value then an error is reported and the type checker guesses that the type is an int otherwise a type is returned.
- Plus(+) expression requires that both arguments have the same type(integer) and that the result is also an integer.
- Comparison requires both arguments to have the same type and in either case the result is of boolean type.
- Conditional expression must be of boolean type and both branches should have same types. The result of the condition is one of its branches. If branches have different types, an error is reported and a type is arbitrarily chosen for the whole expression which will be the type for then branch.
- When a function is called, the function name is looked up in the function environment to find the number of arguments, types of arguments and the return type. Number of arguments must correspond with the expected number and types must match declared types. The resulting type is the function's return type. If function name is absent in the symbol table for functions ftable an error is reported and the type checker guesses an int type for the result.
- Let-expression declares a new variable with the type of which is that of the expression defining the value of the variable. Bind function is used to extend the symbol table which is then used for checking the body of the expression and finding its type which is the type for the whole expression. Type errors cannot be caused by let-expression therefore no testing is done.
mentions non-terminal Exps and its related type-checking function .
builds a list of types in the expression list.
A list is written in [ ] with commas between elements.
The :: operator is used to add elements to the front of the list.
Type checking function declarations.
A function declaration explicitly declares the types of its arguments and this information is used to build a symbol table for variables used when type checking the body of a function.
The declared result type must match the function body type.
type checking funcion declarations
has an inherited attribute in the symbol table for functions which is passed down to the type check function for expressions.
uses TypeId and TypeIds functions to check for internal errors and returns no information.
returns a pair(name, type) of the declared type.
will build a symbol table from the (name, type) pair and also check if parameters have different names.
Type checking a program.
A program is said to be correct if all functions are type correct and there are no two definitions defining the same function name.
In addition, the function main should have one integer argument and one integer result.
Functions are type checked by use of the symbol table where they will be bound to their types.
Two passes are required for this, the first to construct the symbol table and the second to check function definitions from the constructed table, therefore there will be two functions operating over Fun and two functions operating over Funs.
One of the functions for Funs can be seen from image in the previous section.
The other returns the pair(name, type) of the declared function which consists of type arguments and result type and are obtained by the auxilliary function.
Functions for the syntactic category Funs are which builds the symbol table and checks for duplicate definitions while the functions calls the function for all functions.
The is function is the main function.
Other type checking aspects.
The language we defined at the begining of this article does not cover all aspects of type checking therefore in this section we consider other features and how they are handled.
When a variable is assigned a value, the type checker ensures that this value is the same type with the declared type of the variable.
A data structure might define a value with several components e.g struct or a value with different type at different times.
A type checker will need the data structure to describe complex types so as to be able to represent them.
This data structure is similar to the data structure used in an AST of declarations.
Overloading is whereby a similar name is used for several operations with several different types.
We can see this in the sample language defined in previous sections whereby the = operator is used for comparisons for both integers and boolean values, similarly + and - operators are used for both boolean an integer operations in most programming languages.
When these operators are predefined they only cover a finite number of cases and therefore all cases are tried however this requires disjoint argument types for different instances.
e.g if there is a function read that is defined to read either integers or booleans values from a text stream, the type checker must pass the expected type of each expression as an inherited attribute so as to pick the correct instance of the overloaded operator.
A language might have operators which are used for converting a type to another type, e.g converting an integer into a floating type.
To handle type checking in such cases, if the type checker discovers that arguments don't have the correct type, it will try to convert one or both of the arguments.
In some languages polymorphism(definition of a function over a large class of similar types) is allowed.
For type checking a function will explicitly declare generic parts and the type checker will insert the actual types at every use of the generic/polymorphic function so as to create instances of this type.
Some languages e.g Haskell require that programs are well typed but don't require explicit variable or function type declarations, therefore a type inference algorithm is used to gather information about the uses of functions and variables which is used for inference of these types.
A type error is reported when there exists inconsistent inferences.
By now we know differences between static, dynamic, strongly and weakly typed languages each with their properties and advantages.
Lexing and parsing also perform type checking however some languages have weel-formed requirements which cannot be handled by the techniques used in these two phases.
- Compilers, Principles,Techniques and Tools Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman.
- Basics of Compiler Design Torben Ægidius Mogensen Chapter 6.