Basics of YACC and Bison

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article we discuss YACC and Bison, both tools used to generate parsers for context free languages that are LALR(1).

Table of contents.

Introduction.
YACC.
Bison.
Summary.
References.

yacc-2

Introduction.

YACC and Bison are tools used to generate parsers. For a language to parsed by YACC or Bison, it has to be described by a context-free grammar and only those languages that are LALR(1) can be parsed by YACC or Bison.

In this article we discuss the two tools and show an example of generating a parser for a simple calculator.

YACC.

Inputs to computer programs will always have a specific structure, as a matter of fact we can say that a program defines the input language which it accepts.
An input language can be simple for example a sequence of numbers or as complex as a programming language.

YACC (Yet Another Compiler Compiler) is a tool used for describing input to a computer program.
First the user defines specifications:
These include;

Rules that describe the input elements.
The code that is to be invoked when a rule is recognized.
A definition or declaration of a low-level scanner that examines the input.

YACC then converts the specifications into subroutines that are responsible for examining the input stream.
This is referred to as parsing and it works by a call to a low-level scanner.

The scanner also referred to as a lexical analyzer take items from the input stream - tokens and compares them to the construct rules - grammar.
If a rule is identified, a subroutine is invoked - action, which is a fragment of C-language code that returns values and uses values returned by other actions.

Workings of YACC.

YACC is designed to be used with C code and the parser generated it also written in C programming language.

First we provide a file specifying the grammar with a .y extension to its name.
We then invoke yacc on this file and in turn YACC creates two files, y.tab.h file and y.tab.c both of which have hundreds of lines of C code which implement a LARL(1) parser for the specified grammar and the code for the specified actions.
This file provides an extern function yyparse.y which attempts to parse a valid sentence successfully.

We compile the C file as usual then link with the rest of our code and voilà! we have a parser.

By default this parser reads from standard input and writes to standard output similar to how a lex-generated scanner does.
Here are the steps above in list form:

yacc testFile.y - y.tab.c with C code is created for the parser.
gcc -c y.tab.c - normal C code compilation for the parser code.
gcc -o parse y.tab.o lex.yy.o -ll -ly - we link the parser, scanner and libraries.
./parse - we invoke the parser which reads from standard input.

We organize the input file as follows:

%{
Declarations
%}
Definitions
%%
Productions
%%
User subroutines

The first and last(declarations and user subroutines) sections are used for the C code that we want copied exactly to the generated C file, that is, declarations are copied at the top while subroutines at the bottom.

The definitions section is optional, we use it to configure parser features such as to define token codes, specify the precedence and associativity of operators or setting up global variables for the communication between the parser and scanner.

Let's look at an example to get a grip of the whole concept.
The below is a yacc input file for a simple calculator that not only recognizes binary postfix expressions and also evaluates them using a stack data structure.

%{  
    #include <stdio.h>
    #include <assert.h>
    
    static int Pop();
    static int Top();
    static void Push(int val);
%}

%token T_Int

%%
S    :    S E '\n' { printf("= %d\n", Top()); }
     |    
     ;
     
E    :  E E '+' { Push(Pop() + Pop()); }
     |  E E '-' { int op2 = Pop(); Push(Pop() - op2); }
     |  E E '*' { Push(Pop() * Pop()); }
     |  E E '/' { int op2 = Pop(); Push(Pop() / op2); }
     |  T_Int   { Push(yylval); }    
     ;
%%

static int stack[100], count = 0;

static int Pop(){
    assert(count > 0);
    return stack[--count];
}
static int Top(){
    assert(count > 0);
    return stack[count-1];
}
static void Push(int val){
    assert(count < sizeof(stack)/sizeof(*stack));
    stack[count++] = val;
}

int main(){
    return yyparse();
}

Note the following from the example above:

We have defined all token types returned from the scanner using %token in the definitions section so as to establish numeric codes to be used by the scanner to tell the parser about each scanned token. In addition to this, the global variable yyval stores additional information about lexeme.
We have used colons instead of arrows. The vertical bar is used to separate productions. The semi-colon terminates a rule.
C code specifying an action associated with a production is written within the braces.
The first rule identifies the start symbol of a grammar.
yyparse will be generated by yacc. It reads input from standard input then works its way back to the start symbol. It returns 0 is successful, 1 if not. yyerror is called when it encounters an error.

Now to create a scanner for the parser. The following is a lex file we use:

%{  
    #include "y.tab.h"
%}
%%
[0-9]+    { 
              yylval = atoi(yytext); 
              return T_Int;
          }
[-+*/\n]  { 
              return yytext[0];
          }
.         { // ignore the rest }

With the above specification, yylex returns ASCII of the calculator operators, identifies integers and ignores other characters.
After it assembles digits into an integer, it converts to a numeric value and stores it in yyval which is a global variable.
And finally T-Int type token is returned.

We tie this together by running yacc on the grammar to generate y.tab.c, then run lex on the scanner specification to generate lex.yy.c after which we compile the two .c files and link them to obtain our calculator.

The makefile:

calc:        lex.yy.o y.tab.o
             gcc -o calc lex.yy.o y.tab.o  -ly -ll
        
lex.yy.c:    calc.l y.tab.c
             flex calc.l
             
y.tab.c:     calc.y
             bison -vdty calc.y

Bison.

This is a general-purpose parser generator used to convert a grammar description for a LALR(1) context-free grammar into a parser which is written in C and parses the grammar.

Bison parsers are bottom-up parsers that is they start from the leaf nodes of a tree and work their way upward until a root node is arrived at.
It is an upgrade from yacc and one of the most commonly used LALR tools.

The workings of bison.

Bison is built to be used with C code and thus generates code in C.
First we give it a grammar specification file which we name using the .y extension then invoke bison on the .y file and it in turn creates y.tab.y and y.tab.c files with hundreds of C code implementing a LALR(1) parser for the specified grammar together with code for the specified actions.
This file also provides an extern function yyparse that parses a valid sentence.
We compile it as usual using C compiler.
The generated parser by default will read from standard input and write to standard output.

Here are the steps in list form:

bison textFile.y - a y.tab.c file containing C code for a parser is created.
gcc -c y.tab.c - compile parser code normally using gcc C compiler.
gcc -o parse y.tab.o lex.yy.o -ll -ly - we link the parser, scanner and libraries.
./parse - we execute the parser that reads from standard input.

The bison input file format is similar to the yacc file format therefore we shall not repeat it here.

Following the example calculator we defined in the previous section the makefile in this case would resemble the one below:

calc:        lex.yy.o
             y.tab.o gcc -o calc lex.yy.o y.tab.o  -ly -ll
         
lex.yy.c:    calc.l y.tab.c
             flex calc.l
             
y.tab.c:     calc.y
             bison -vdty calc.y

Summary.

Bison and YACC parser generators are to parsers what lex/flex is to scanners, we give it a grammar specification and it generates LALR(1) parser which identifies sentences in the grammar.

In this article we have discussed the basics of both and given a simple calculator example to demonstrate how they work.

References.

bison documentation
The lexical analyzer generator.
Compilers Principles, Techniques, & Tools 2nd Edition, Chapter 4.9.