Compiler Construction

Languages

A language is a set of strings; each string is a finite sequence of symbols taken from a finite aplhabet. For parsing, the strings are source programs, the symbols are lexical tokens, and the alphabet is the set of token types returned by the lexical analyzer.

A leftmost derivation is one in which the leftmost nonterminal symbol is always the one expanded; in a rightmost derivation, the rightmost nonterminal is always next to be expanded.

 

A grammar is ambiguous if it can derive a sentence with two different parse trees.

Grammars whose predictive parsing tables contain no duplicate entries are called LL(1).

 

Recursive descent, or predictive, parsing works only on grammars where the first terminal symbol of each subexpression provides enough information to choose which production to use.

First(y) is the set of all terminal symbols that can be any string derived from y.

Follow(x) is the set of terminals that can immediately follow x.

When computing first set of (X Y) it x is nullable you also have to take into account Y.

Eg for the grammar:

Z -> d Y -> X->Y
Z -> X Y Z Y -> c X -> a

 

  Nullable FIRST FOLLOW
X yes A c A c d
Y yes c A c d
Z no A c d  

 

Eliminating left recursion:

The two productions

E -> E + T

E -> T

 

are certain to cause duplicate entries in the LL(1) parsing table, since any token in FIRST(T) will also be in FIRST(E + T). The problem is that E appears as the first right-hand-side symbol in an E production; this is called left recursion. To eliminate left recursion, rewrite using right recursion.

The common way to rewrite is

A -> A a.
A -> b.

 |
 v

A -> b A'.
A'-> a A'.
A'-> .

 

To re-write using right recursion

Simliarly we can rewrite

S -> if E then S else S

S -> if E then S

As:

S -> if E then S X

X ->

X -> else S

 

From old notes

3.1 Type 2 grammar

In the Chomsky type 2 grammar the left hand side of every production is restricted to just a single

non-terminal symbol. Such symbols are often called syntactic categories. Type 2 grammars are known as context-free grammars and have been used frequently in the specification of the syntax of programming languages, most notably Algol 60 where it was first used. The notation is sometime called Backus Naur Form or BNF after two of the designers of Algol 60. A simple example of a type 2 grammar is as follows:

S −> A B

A −>! a

A −>A B b

B −> b c

B −> B a

A slightly more convenient way of writing the above grammar is:

S −> A B

A −> a | A B b

B −> b c | B a

The alphabet for this grammar is {S, A, B, a, b, c, d}. The non-terminals are S, A, B being

the symbols occurring on the left-hand-side of productions, with S being identified as the start

symbol. The terminal symbols are a, b, c, d, these being the characters that only appear on the right hand side. Sentences that this grammar generates include, for instance:

abc

abcbbc

abcbca

abcbbcaabca

 

 
 

Where the last sentence, for instance, is generated from the sentence symbol by means of the following productions:

 

A grammar is ambiguous if there are two or more ways of generating the same sentence.

Convince yourself that the follow three grammars are ambiguous:

a) S −> A B

A −> a | a c

B −> b | c b

b) S −> a T b | T T

T −> a b | b a

c) C −> if E then C else C | if E then C

This is the most restrictive of the phrase structured grammars. In it all productions are limited to being one of the following two forms:

A −> a

A −> a B

That is, the right hand side must consist of a single terminal symbol possibly followed by a single non-terminal. It is sometimes possible to convert a type 2 grammar into an equivalent type 3 grammar. Try this for the grammar for floating point constants given earlier.

Type 3 grammars can clearly be parsed using a finite state recogniser, and for this reason

they are often called regular grammars.

 

6.1 Relevant JVM instruction set

The JVM has already been covered in the course "Computer Design". This section can be seen as a reminder which focuses on particular points relevant to compilation.

In these notes the stack is upwards growing,6 so that subsequent parameters and local variables occupy increasing memory locations (this is not necessary for the JVM but makes the pictures easier to understand). The version of the JVM we present has a stack-pointer SP register which points to the first free location on the stack. In addition, it has a so-called frame-pointer FP register which points to a constant place during the activation of a procedure. Although strictly not necessary, this simplifies explanations and aids debuggers, since local variables will remain at the same constant offset from FP but their offset from SP will vary during pushes and pops to the stack. Evaluation of a procedure uses a stack frame which will be addressed from FP. Note that both arguments and local variables are addressed as negative offsets from FP. A stack frame has

the following form:

 

Here FP points to (or rather into) the current stack frame. FP points to the so-called linkage information which allows a function call to return. This consists of the previous value of FP, here called FP0 and a pointer L (the return address) to the machine instruction to be next evaluated after the function's return. To the left (at lower addresses) of the linkage information are stored the function's parameters and (user-declared) local variables; on the right is the local stack (used for evaluating temporaries and for storing arguments), SP is a pointer to the first free cell of this (which coincides with the first free cell of the whole stack). To clarify explanations of function call, we will use the word parameters to refer to the identifiers holding arguments to the current procedure and restrict the word arguments to refer to expressions which are being, or have been,

evaluated on the stack in order to be passed to a further function. At the end of each Java

statement we can expect the local stack to be empty and hence SP to point exactly two words further along the stack from FP. In general these notes will assume that the JVM stack is an array of (32-bit) words, and so addresses of stack items will differ by one.

 

8.2 Dynamic linking

Consider a situation in which a user has many small programs (maybe 50k bytes each in terms of object files) each of which uses a graphics library which is several megabytes big. The classical idea of linking (static linking) presented above would lead to each executable file being megabytes big too. In the end the user's disc space would fill up essentially because multiple copies of library code rather than because of his/her programs. Another disadvantage of static linking is the following.

Suppose a bug is found in a graphics library. Fixing it in the library (.OBJ) file will only fix it in my program when I re-link it, so the bug will linger in the system in all programs which have not been re-linked--possibly for years.

An alternative to static linking is dynamic linking. We create a library which defines stub

procedures for every name in the full library.

 

9.13 Static and dynamic scoping

This final point is worth a small section on its own; the normal state in modern programming languages is that free variables are looked up in the environment existing at the time the function was defined rather than when it is called. This is called static scoping or static binding or even lexical scoping; the alternative of using the calling environment is called dynamic binding and was used in many dialects of Lisp. The difference is most easily seen in the following example:

let a = 1;

let f() = a;

let g(a) = f();

print g(2);

Check your understanding of static and dynamic scoping by observing that this prints 1 under the former and 2 under the latter.

You might be tempted to believe that rebinding a variable like `a' in dynamic scoping is

equivalent to assigning to `a'. This is untrue, since when the scope ends (in the above by g being exited) the previous binding of `a' (of value one) again becomes visible, whereas assignments are

not undone on procedure exit.

 

10.1 Evaluation Using a Stack--Static Link Method

We saw in the first part of the notes how the JVM uses a stack to evaluate expressions and function calls. Essentially, two registers (FP and SP) respectively point to the current stack frame and its fringe.

Evaluation on a stack is more efficient than in the lambda-interpreter presented above in

that no search happens for variables, they are just extracted from their location. However, the JVM as defined only provides instructions which can access local variables (i.e. those on the local stack frame accessed by FP) and static variables (often called global or top-level variables in other languages) which are allocated once at a fixed location.

Indeed Java forbids the textual nesting of one function within another, so the question of how to access the local variables of the outer function from within the inner function does not need to be addressed in the JVM. However, for more general languages we need to address this issue.

The usual answer is to extend the linkage information so that in addition to holding the return address L and the old frame pointer FP0, also known as the dynamic link as it points to the frame of its caller, it also holds a pointer S, the static link17 which points to the frame of its definer.

 

Other notes

1. x+y*z:

<Exp> => <Exp> + <Term> => <Term> + <Term> => <Factor> + <Term> => x + <Term> => x + <Term> * <Factor> => x + <Factor> * <Factor> => x + y * <Factor> => x + y * z


x+y*z: Full derivation-tree of non-terminals and replacements.

 

The result of parsing is often returned as a parse-tree based on a simpler abstract syntax. The abstract syntax corresponds to the datatype of parse trees returned by the parser.


x+y*z: Parse tree.

 

2. (x+y)*z:

Parentheses '(' and ')' can be used to force a different parsing:


(x+y)*z: Full derivation-tree of non-terminals and replacements.

 


(x+y)*z: Parse tree.

 

Note that parentheses do not appear in any parse tree. They directed the parser to build a certain tree and after that they are unnecessary. The structure of the tree shows the binding of operators to operands without the need for (). Parentheses are only needed in strings.

·       Formal Grammars and Languages:

·       Definition of formal grammar (start symbol, terminal symbols, non-terminal symbols, grammar rules). (Also called a context-free grammar or a BNF grammar.)

·       Derivation sequences (leftmost and rightmost).

·       Language accepted by (or recognized by) a formal grammar.

·       Parse trees. (Each derivation sequence determines a unique parse tree, but a parse tree might have many derivations.)

·       Ambiguity. (Definition: There is a sentence with more than one parse tree, or with more than one leftmost derivation, or with more than one rightmost derivation.)

·       Ambiguity is bad.

·       Remove ambiguity by either:

·       rewriting the grammar, such as changing

 

E ----> E + E (AMBIGUOUS)

E ----> E * E

E ----> ( E )

E ----> id

to

 

E ----> E + T (NOT AMBIGUOUS)

E ----> T

T ----> T * F

T ----> F

F ----> ( E )

F ----> id

·       introducing special rules, such as

·       Precedence of one operator over another. (Present with + and * in the first example above.)

·       Associativity of an operator (left-to right or right-to-left). (Present with both + and * in the first example above.)

·       Rule that an else matches then nearest unmatched if. (Present in the following example:

 

stmt ----> "if" "(" expr ")" stat (AMBIGUOUS)

stmt ----> "if" "(" expr ")" stat "else" stat

·       Parsers and the Use of Grammars:

·       Recursive Descent Parsers:

·       Top-down parser, starts with the start symbol and works down toward the given sentence (string of terminal symbols), constructing a leftmost derivation).

·       Must rewrite grammar to avoid left recursion. (There are other possible problems also, beyond the scope of this course.) The simple grammar above would be rewritten:

 

E ----> T { "+" T } ({} means "zero of more times")

T ----> F { "*" F } (terminals are in quotes or id)

F ----> "(" E ")"

F ----> id

·       Write a function corresponding to each non-terminal symbol in the grammar. The code of the function mirrors the right side of the rule.dfdfd

 

Important notes (new)

Compilation as a sequence of transformations

 
 

●      Each transformation preserves semantics

●      Each transformation addresses only one aspect of "the gap"

●      Each transformation is fairly easy to understand.

 

In: Program Text

The compiler composes of a front, middle and back end

Front-end

 

Lexical Analysis

Analysis of the text to find meaning and validity, based on finite automaton and regular expressions.

The lexical tokens that the program text is split into are passed down to Parsing

 

Parsing

Parsing theory based on push-down automaton and context-free grammars.

This stage outputs error messages and a Parse Tree.


A simple parse tree

Middle End

Analysis of symbol definitions and use

This stage outputs errors such as variables not being declared. The parse tree is annotated and then passed on..

Type Checking

Type errors are reported here. The annotated parse tree is then passed on again..

Intermediate code generation

Depending on the compiler, intermediate code is now created. It may be optimised using optimisation algorithms, then a targeted assembly language file is outputted.

 

Back End

The assembly code is then assembled into an object code file, then linked with object code libraries.

Out: Machine code program

 

 


http://www.cs.ucl.ac.uk/staff/l.capra/teaching/2010/2010-regexp.pdf

 
 

 

 

 

 

 

 

Common example:

let rec compose f g =

let rec composed x = g (f x) in

composed in

let rec dbl x = x + x in

let rec inc x = x + 1 in

let rec dec x = x - 1 in

let h = compose inc (compose dbl dec) in

print_int (h 123)

 

let LABEL(composed) [f, g] x = let TMP1 = APPLY_CLOSURE(f x) in APPLY_CLOSURE(g TMP1) in

let LABEL(compose) [] f g = let cl = (LABEL(composed); [f, g] ) in cl in

 

let LABEL(dbl) [] x = (x + x) in

let LABEL(inc) [] x = let TMP1 = 1 in (x + TMP1) in

let LABEL(dec) [] x = let TMP1 = 1 in (x - TMP1) in

 

let closure dbl = ( LABEL(dbl); [] ) in

let closure inc = ( LABEL(inc); [] ) in

let closure dec = ( LABEL(dec); [] ) in

 

let TMP2 = APPLY_DIRECT(LABEL(compose) dbl dec) in

let h = APPLY_DIRECT(LABEL(compose) inc TMP2) in

let TMP3 = 123 in

let TMP4 = APPLY_CLOSURE(h TMP3) in

APPLY_DIRECT(LABEL(min_caml_print_int) TMP4)

 

 

Solutions:

 

Restrict programming language so the problem goes away! That is, use FORTRAN.

Force programmer to worry about it (use malloc and free in C...)

Automation

Reference Counting

Mark and Sweep

Copy Collection

Generational Collection

... there are other techniques...