Pixelate:Issue 15/Compiler

From Allegro Wiki

Jump to: navigation, search

Contents

Section 1: Introduction

Let's begin knowing the answers for some common questions, a couple of reasons to read this paper, and the software we will be using.

What a compiler is

I am sure you are absolutely bored about this question. How many times I have heard it cannot be approximated. Will not hurt one more.

Source code

When you are going to write a program, you first decide the language (C, C++, Basic, etc), and write the code for it. After you finished, you want it to become an executable.

Scanner

The first step of the compiler is the scanner. The scanner reads the source and converts strings into tokens. The program, then, becomes a list of tokens. The scanner checks if the source code has no syntax errors.

In C, an assign operation could be:

my_number = 100 + 120;

The scanner converts it into the tokens:

  • TK_VARIABLE
  • TK_ASSIGN
  • TK_NUMBER
  • TK_ADD
  • TK_NUMBER
  • TK_SEMICOLON

Sometimes, some symbols have different meaning depending in their place in a sentence. In example,'<' means 'lower than', while '<=' means 'lower or equal than'.

The scanner reads '<' and could generate the 'TK_LOWER_THAN' token. Then reads '=' and generates the 'TK_ASSIGN' token. However, there is a section inside the scanner, known as thelexical analyzer, which is in charge of selecting what to do when finding certain circumstances.

In this example, after reading '<', the analyzer could say to the scanner "I recognize two tokens begining with '<'. Show me the next character, and I will tell you which of those you should output". If the scanner tells the analyzer '=' is next, the analyzer could tell the scanner "Ah! You must generate the 'TK_LOWER_EQUAL_THAN' token!".

Parser

The second step is parsing the tokens. This is done by reading a token, selecting the right rule for that token, reading the next token, and verifying if the new token fits the selected rule.

In example, the assign operation rule in C could be:

assign_sentence:
 TK_VARIABLE TK_ASSIGN expression TK_SEMICOLON

And the 'expression' could be defined to support an addition:

expression:
 TK_NUMBER
 TK_VARIABLE
 expression	TK_ADD	expression

Here, the parser gets a TK_VARIABLE. There are many rules matching one with a variable in the begining. After reading the second (TK_ASSIGN), few rules match it, being 'assign_sentence' one.

Compiler

The compiler gets the sentences left by the parser and generates code in a temporary file, usually known as object file.

Linker

Sometimes, there is extra code that is not included by the original source code, because it has already been written by others. In this case, the linker searches for those pieces of codes in special files known as libraries, and inserts them into the object file, generating the executable file.

Some examples of applications to be used in each step could be:

  1. Vim, emacs, MSVC IDE, edit, pico, cat, etc.
  2. Lex, flex, etc.
  3. Bison, yacc, etc.
  4. Gcc, cl, g++, fp, tc, etc.
  5. Ld, link, lib, etc.

Why write a compiler

Writing a compiler implies creating or implementing a language, writing its scanner and parser (usually you can "melt" both of them, in example, having the scanner generate operation codes directly), and the supporting system to allow the code execution. Too much. Why to bother?

I have asked such question several times. I believe the best reason is expansibility. You are able to continue adding, changing and removing code without compiling a new executable, just changing scripts. Also, once you let the players know about the language, they will be able to add their own scripts to the program, thus having not only the programmer's but also the player's experience reflected on the scripts.

But, is the effort worth? It depends on what is your script engine target. If every script you write will give life to a character, and your game is a role playing game, I think it is justified. If it gives life a character and your game is a simulator, it might be justified. If it gives life to a chracter and your game is a tetris-like game, it is not. In the later, if your game has bonuses, and you want to use your scripting engine to implement them, it is fine. But if you want to implement a bonus system and your game is a role playing game, it might not. Of course, these are just personal measures. Coding a whole system is hard if you have never done so before, or if you don't have patience enough to debug it when an error arises.

Using tools like Flex/Lex and Bison/Yacc makes the process much easier as well. Implementing the scanner using a state machine and the parser using another takes some time, more if you have never heard about automatas.

So, answering the section title: In some circumstances, the scripting engine is a good project but which would only add too much complexity and will make the developer spend a good amount of time for a reward which is minimal. In some others, writing one will solve many problems. With time, you will easily know when it is worth and when it is not.

Needed software

For the first lessons we will be needing Bison, the Yacc interpreter of the Free Software Foundation. Note that a couple of instructions from Bison don't work for Yacc, so if you try to run the examples with the later, you might get some troubles.

For the final lessons, I will introduce Flex as a lexical analyzer. I will have to explain something about regular expressions as well, but that won't be a problem.

Bison and Flex generate C code, therefore you will need a C compiler, usually GCC, being it Linux standard gcc, Windows MinGW or DOS djgpp.

For building examples, I added both makefiles and batch files. If you don't have the make executable in your system you can edit use the .bat files.

To try the examples, I used the following software. If you have troubles with yours, please send me the different versions and I will try to fix them as soon as possible.

  • DOS djgpp
  • GCC version 2.952
  • MAKE version 3.78.1
  • BISON version 1.25
  • FLEX version 2.5.4

After installing, set your environment variables (should be explained in each distribution documentation) and run the programs with the '--version' flag. If they all execute, you are done.

See the last section for links.

Section 2: Overview

Writing a parser involves learning about grammars. Theory is always boring, but sometimes is necessary. Unfortunately there is just too much theory, so I will just cover the minimun necessary for understanding this.

Grammars

When we write something or speak to someone, we use a language. A language can be analyzed from two different but necessary points of view: syntax and semantic. Since I hate long speeches, I will put a couple of examples to show you the difference.

1. Syntax

We are tired of getting syntax errors while writing programs. They are easy to recognize: a syntax error is modifying a word so that it loses all meaning. In example, if you want to write "merciful" but wrote "mrciful", you just did a syntax error, because it is not a valid word. You cannot find "mrciful" in a dictionary.

In other words, a syntax error happens whenever you find a word which is not in your language. Now, you wanted to write "cat", but instead wrote "car". Is that a syntax error? Not at all. "cat" is a valid word in english. "car" as well. If we were writing a sentence, like "The cat has tiny legs", we end with "The car has tiny legs". Funny, but there are no syntax errors there. We recognize those five words in our vocabulary..

2. Semantic

As we said, "The car has tiny legs". I would like to see one of those cars. Anyway, we knew such sentence has no syntax errors. Does it have semantic errors? To know, we need to check our semantic rules. From our language classes back in primary school, we can divide the sentence in particles, something like:

     The -> article
     car -> noun
     has -> verb
     tiny -> adjective/modifier
     legs -> noun

This is a valid sentence because we were taught when in primary school that these particles in this order (article, noun, verb, modifier, noun) is valid. It is te format of an affirmative sentence. What if we get the following: "Has the car tiny legs"?

     Has -> verb
     the -> article
     car -> noun
     tiny -> adjective/modifier
     legs -> noun

Is this valid? Yes, it is still valid. The adjective is modifying a noun, and the article is modifying another noun. A question sentence is formatted this way. Finally, what if we have the following: "Legs car tiny the has"?

     Legs -> noun
     car -> noun
     tiny -> adjective/modifier
     the -> article
     has -> verb

Ack! This is awful! The adjective is modifying an article, and the article is modifying a verb. This is not right. We cannot fit this format in any of the formats we were taught.

Therefore, when you read "Legs car tiny the has", you don't understand. The syntax is right, you recognize those words, but the semantic is not, you cannot recognize the sentence. This is a semantic error.

The scanner reads from a stream and transforms it into list of tokens. This is the syntax check. If some symbols could be interpreted in two different ways, the lexical analyzer (usually inside the scanner, sometimes outside it as a new module) tells the scanner which token is. The parser, based on rules written by the programmer, decides if it is valid or not. This is the semantic check.

Coding a scanner is easy. 'If you get this string, generate this token'. It is straightforward. However, the parser is not that easy. The parser must take the token and find all the rules begining with that token. Then read a new token and find between the selected rules the ones that continue working with the new token. It must continue reading tokens and checking the rule until the rule either accepts the input or fails. How to write those rules?

Maybe the most recognized way of writing rules is the Backus-Naur Form, or BNF. According to its principles, each and all rules can be written as a sucession of terminal and non-terminal symbols.

Non-terminal symbol 
A non-terminal symbol is a symbol that can be divided into new terminal and non-terminal symbols.
Terminal symbol 
A terminal symbol is a symbol that cannot be divided into new symbols. To let the parser know when a symbol cannot be divided, we must declare them all before using them.

A rule has the following format:

RULE_NAME : RULE_BODY_1 | RULE_BODY_2 | ... | RULE_BODY_N

Since rules don't mind blank characters, the same rule could be written as follows:

RULE_NAME	: RULE_BODY_1
| RULE_BODY_2
| ...
| RULE_BODY_N

Or, as I use from now on:

RULE_NAME	:
	RULE_BODY_1
|	RULE_BODY_2
|	...
|	RULE_BODY_N

RULE_NAME is the name of the rule, and is a non-terminal symbol. Why a rule is a non-terminal symbol? Because as we said before, a non-terminal symbol can be divided into new non-terminal and terminal symbols. In this case, it can be divided into 'RULE_BODY_1', 'RULE_BODY_2', and so until 'RULE_BODY_N'.

Can a rule contain itself?

Yes, it can. Recursivity is one of the keys of a parser. If we want to recognize a number, we need recursivity. Suppose we define the terminal symbol NUMBER. A rule to recognize any number could be as follows:

number_rule	:
	NUMBER
|	NUMBER number_rule

Let's trace what this rule would do with the input '1'. The parser takes the 'number_rule' rule, and matches '1' with the first two sentences. Either of them is fine, so how does the parser know which one it must follow? For now, the parser "peeks" the next token. Since we had only '1' as input, it finds nothing. 'number_rule' was not defined to handle nothing, so the second sentence cannot be used. The first sentence expects a NUMBER and then nothing, so is is used. I will explain how the parser "peeks" later.

Now, what happens when the parser reads '5 10'? It reads 5 and peeks 10. The first sentence expects to find only a number, so it won't handle this input. The second sentence expects to find a number and a 'number_rule'. The later recursively could expand into a new number or again a number and a 'number_rule' sentence. Since we have 10, a number, the first sentence fits in this second passage. So, '5 10' can be parsed using in the first pass the second sentence of 'number_rule', and using in the second pass the first sentence of 'number_rule'.

I hope you understand recursivity :)

If you do, you would ask... can I put the recursivity this way:

number_rule	:
 	NUMBER
|	NUMBER number_rule

The answer is yes, you can write this kind of recursivity. In fact, you are enforced to use this one, known as 'left recursivity', as opposite as the 'right recursivity'. Why enforced?

Imagine 'number_rule' is a function. The left recursivity would look as this...

void number_rule(int level) {
   if (level < MAX_LEVEL)
       number_rule(level + 1);
   stack[level] = get_number(level);
}

...while right recursivity would look like...

void number_rule(int level) {
   stack[level] = get_number(level);
   if (level < MAX_LEVEL)
       number_rule(level + 1);
}

In the first case, we recursively call 'number_rule' until all elements are unwinded from the rule, and then we process each of them. In other words, we process them when we are "coming back" from the recursion. In the second case we process while "advancing" in the recursion.

What is the problem? When going deeper in the recursion, we are filling the stack with information we are not needing right now. When the recursion breaks, the parser won't be able to reduce elements from the stack, because they are all already in the stack. When using left recursivity, we add to the stack when coming back, thus the parser is able to reduce whenever possible.

Look ahead token

I said the parser "peeks" the next token. And I already gave an example why it would do that. Now, how does it peek? Basically, it asks for the next token. This token is not processed, but instead is a temporary token. This one is known as the look-ahead token.

Sometimes, the parser needs to check one token ahead to identify the current one. Sometimes two, and so on. When you write a parser, you set the number of ahead tokens you need like '(k)', where 'k' is such number. In example, Bison can generate a parser for any LALR(1) language. There are some tricks to be able to write a LALR(n) parser with Bison, but I will not speak about them.

Regular expressions

Suppose we get a list of words, and we want to recognize which ones are numbers, and which ones are letters. First, we need to tell the scanner when it should output a number token and when a letter token.

NUMBER :- [0-9]
LETTER :- [a-zA-Z]

We wrote two different sentences, one to recognize numbers and one to recognize letters. On the left side, the name of the token. On the right side, the regular expressions. The first one, '[0-9]', tells the scanner a token NUMBER is any number from 0 to 9. Just one. If we had defined a NUMBER had to have two numbers, we could have said '[0-9][0-9]'. We also define that a letter is a letter from a to z, both lower and uppercase. The '[' ']' mean a range. A range of numbers from 0 to 9, or a range of letters from a to z and from A to Z.

Now, we want to define how to recognize a C valid variable. This means it must begin with either a letter or '_', followed by any number of letters, numbers and '_'. A new regular expression will define it:

VARIABLE :- [a-zA-Z_][a-zA-Z_0-9]*

Here '*' means repetition between 0 and infinite. '+' means repetition between 1 and infinite. Both modify the range given just before them.

There are a lot of them. I will cover them later.

Section 3: Bison

The previous section should have given you some ideas. Now we will implement them using Bison.

Bison file layout

A file in bison is divided in four sections, two related to C code and other two related to bison code. Once compiled, the bison file generates a C file which contains the parser and which must be compiled and linked to your executable.

The general layout of a bison file is as follows:


 %{                                
     C declarations (optional)     
 %}                                
                                   
     Bison declarations (optional) 
                                   
 %%                                
                                   
     Bison definitions (required)  
                                   
 %%                                
                                   
     C definitions (optional)      

The %{ indicates where the C declarations section begins. The %} indicates where it ends, and at the same time, where the bison declarations section begins. The first %% (this is the only separator that MUST appear in a bison file) is the bison declarations section end mark, and at the same time, the bison definitions section begin mark. Finally, the second %% tells where the bison definitions section ends and where the C definitions section begins. The end of file tells where the C definitions section ends.

Comments can be both C and C++ comments. However, it is better to use C comments, as the generated code will be C and not C++.

1. C declarations section

In this section you setup variables for later use, include C headers like stdio.h and string.h, define the structs you might use, and setup definitions and constants. This section is copied to the C generated file as is.

2. Bison eclarations section

Here you can configure the behaviour of Bison. Some things you can define are terminal and non-terminal tokens, precedence, associativity, etc.

3. Bison definitions section

This is the most important part of the Bison file, in fact is the only section which is required to appear. Here you define the rules for your grammar. The rules will allow the parser to (duh!) parse the source file of your language.

4. C definitions section

Finally, support functions written in C are placed in this section. Between others: error reporting and the lexical analyzer, explained in the next section. This section is also copied to the C generated file as is.

Once you have finished writing your bison file, you execute the bison executable and get a C file which contains the parser code. This can be compiled by GCC to generate a program which will parse your grammar. Bison scripts follow the standard convent of suffixing them with '.y'. In other words, your parser could be named 'parser.y' (keep in mind that bison is a new version of Yacc). When compiled, bison generates a file with the suffix '.tab.c'. In other words, 'parser.y' becomes 'parser.tab.c'. If specified by command line, bison generates a header in 'parser.tab.h'. For DOS and Windows, bison generates a file suffixed with '_tab.c', in example, 'parser_tab.c' and 'parser_tab.h'.

This C generated file contains several #line directives. If you are compiling such file, the error messages will refer to the position of the error in the bison file instead of the C file. This comes handy when you have written a parser with several rules and you need to find out which one is not compiling.

Bison functions

Once bison compiles a parser file, it generates a C file with one important function: 'yyparse'. 'yyparse' must be called from your program in order to parse an input. It gets no parameters and returns 0 if the input was accepted or a number other than 0 if the input could not be parsed or if there was an internal processing problem.

The prototype for 'yyparse' is the following:

int yyparse();

(In fact, that is not the real prototype. However, it is more than suitable for us).

To get new tokens from the input, 'yyparse' will call a function named 'yylex'. The prototype for 'yylex' similar to 'yyparse':

int yylex();

'yylex' must be either code by ourselves (as in most of our examples) or by another program (Flex and Lex both generate the 'yylex' function for us). 'yylex' returns 0 whenever the end of input has been reached, or the last read token.

As the user file is being parsed by 'yyparse', an error may occur. Therefore, we need a function for error reporting. Both 'yyparse' and 'yylex' call 'yyerror', which prototype is as follows:

int yyerror(char *string);

'string' is the error description. You can do whatever you think useful for you, including just echoing the error onto the screen, writing it down into a file or just nothing. The return value is dismissed by the callers.

Note that Linux can throw different errors (depending on the version you have installed).

As you easily noticed, all functions (and soon you shall notice variables as well) are prefixed with 'yy'. So, if you are writing code, try not to write names begining with 'yy', because it might be reserved.

A first example

Our first example will be trivial: recognizing numbers and echoing them on screen. Not much science, but will do.

Let's begin filling the C declarations section. We will be using 'isdigit' from 'ctype.h' to recognize if a character is a number, plus 'printf' and 'getchar', from 'stdio.h'.

%{
#include 
#include 
%}

That was all we needed. Now, we must define the terminal tokens we are needing for this script. Since we only want to recognize numbers, a token representing a number is enough.

To declare tokens, we use (in the bison declarations section) the '%token' keyword. '%token' is followed by the name of the token. In our bison file, we will use NUMBER as the token representing numbers:

%token NUMBER
%%

We set the end of bison declarations section mark below the terminal token definition. Now comes the interesting part: writing the rules for our parser. Rules are written with a notation similar to the Backus-Naur Form, known as "BNF", the main formal system for context-free grammars.

By default, the entry point is the first rule to be found in the bison definitions section. Our rule will be named 'input', and will recognize when it read a valid line.

input:
       /* empty */
     | line input
;

First, remember that, since Bison generates a C parse, you can insert C and C++ comments. The 'input' rule is used when the parser recognizes one of two sentences: either 'empty' or a 'line' followed by more 'input'. Note that it Bison notation is fairly similar to the BNF notation.

What you write becomes the 'input' (since it is the first rule). How the parser knows when it has enough input to create a line? We tell it with a 'line' rule:

line:
       '\n'              {  printf("< ");            }
     | expression '\n'   {  printf("> %d\n< ", $1);  }
; 

We easily see two sentences, each with an action. The first sentence matches a line with a '\n' alone. This is what we call 'empty line'. The action to be taken when this kind of line is found is straightforward: write the input prompt ('< ') on screen. You don't usually need to display anything, but I do in order to let the user know the ENTER he pressed has been read, processed and executed (in this case, nothing has happened).

The second sentence is a bit more complex. It says that a line is an expression with an ENTER at the end. Let's check the action:

printf("> %d\n< ", $1);

It is a common C printf. The argument, though, is $1. What it means? Let's go back to the sentence:

expression '\n'

Here 'expression' is the first argument ($1) and '\n' is the second argument ($2). In other words, the printf will print the result of expression. But what returns expression?

expression:
       NUMBER              {  $$ = $1;  }
;
%%

How can we read this rule? Let's translate $$ as 'return value' and $1 as 'first argument of the rule'.

The rule 'expression' expectes to find a token of type NUMBER. If it finds such NUMBER token, it returns ($$) the 'first argument of the rule' (note that it only expects to find a NUMBER).

So, our parser recognizes a number. But it just does that: recognizes. We must tell it to display the number. Which we have already done in our previous rule. (Remember?)

We are missing something still: How Bison knows what a NUMBER is?

That is the task of the lexical analyzer. When you use Yacc, the lexical analyzer is done with Lex. With Bison, though, you can write it straightforward.

int yylex() {
   int value;
   while (((value = getchar()) == ' ') &&
                                     (value == '\t'));
   if (isdigit(value)) {
       ungetc(value, stdin);
       scanf("%d", &yylval);
       return (NUMBER);
   }
   return (value);
}

It should be easy to understand. Note that yylex() reads from the standard input with getchar(). Skips all the white values (spaces and tabs). Then, checks if the character is a digit with the function isdigit(). If the character is a number, we put the value back into the stream and use scanf() to read the number from stdin. Note that the value just read is put into the global 'yylval' variable, while the function returns NUMBER. As you should remember, NUMBER was declared as a token. In other words, our lexical analyzer recognized a number, and set its value into 'yylval'.

If the character read is not a number, we return the character.

The error handling function, yyerror(), is very simple:

int yyerror(char *string) {
   puts(string);
   return (0);
}

Whenever there is an error, yyparse() calls yyerror(), passing as argument a string with the error it caught. This function just outputs on screen what we get from yyparse().

Finally, our main function should only call the yyparse() method:

int main() {
   printf("Lesson 01: Recognizing a number\n< ");
   yyparse();
   return (0);
}

The flow

When called, yyparse() will call yylex(). yylex() returns a token. yyparse() checks if the token fits any given rule. If so, continues calling yylex() until either a rule has been completed or fails. If completed, executes the action for that rule. Let's see:

  1. We execute the program. yyparse() finds the first rule, 'input'. It says it is a 'line' followed by more 'input'.
  2. yyparse() resolves 'line', being it either a literal ENTER character or an 'expression'.
  3. yyparse() calls yylex(), which reads from the standard input.
  4. We press 1. Nothing happens (getchar() needs an ENTER to be called).
  5. We press 0. Nothing happens.
  6. We press ENTER. getchar() in yylex() is called, which reads the first character we typed (1).
  7. yylex() recognizes a digit. The 1 is set into the stream again with ungetc() and the digit is read with scanf(). 10 is set into yylval. yylex() returns NUMBER.
  8. yyparse() resolves 'expression' to be a NUMBER. Matches the expression rule, and executes its action (returning the number stored in yylval).
  9. To be a valid 'line', knowing it has already an 'expression', there must come an ENTER. yyparse() calls yylex() for the next character.
  10. yylex() reads the next character, an ENTER, and returns its value.
  11. yyparse() matches the ENTER with the ENTER needed to make a valid 'line', and executes the action for that 'line' (printing '>' and the result of the 'expression'). The screen output is
> 10 <

If we go back to step 2 and press ENTER directly, yyparse() declares it a valid 'line' (ENTER alone) and executes the action for that line (in this case, prints '< ' on screen).

If we type 'a' and ENTER, yylex() will return 'a'. Since there is no rule for an 'a' letter, yyparse() will catch a 'parser error', and call yyerror() with "parse error" as parameter. yyerror() outputs the message, and the program finishes.

In 'ex01.y' you will find the commented code for this example.

Section 4: Links

What you need

A list of things you need:

  • A GCC compiler. I tested with:
    • djgpp 2.95.2
    • MinGW 3.2.3
    • Linux GCC 3.2.2
    • VC++ 6.0
  • Bison. I tested with
    • version 1.25 for DOS/djgpp
    • version 1.35 for Linux
    • version 1.875 for Windows/MinGW
    • version 1.25 for MSVC

Here are some URLs:

As for GCC, djgpp can be obtained in Delorie' site (check Bison for djgpp URL) and MinGW from http://www.mingw.org.

Section 5: Conclusion

I hope the explanation was good enough for everyone to follow this tutorial. It is not difficult to understand Bison once you have written one or two examples, but until them it is discouraging.

Most of the examples were tested with DJGPP. After I switched to Windows 2000, I developed them with MinGW. You should have no problems working making it work under Linux, as I tested them in such configuration as well.

Any doubt, of course, just either send me a private message in Allegro.cc, or mail me at reybrujo@hotmail.com. Until the next tutorial!

Rey Brujo

Personal tools