Compiler Technology Compiler – software to translate high level language program into a lower lev

Compiler Technology Compiler – software to translate high level language program into a lower lev Compiler Technology Compiler – software to translate high level language program into a lower lev - Start

2018-11-03 21K 21 0 0


usually machine language, sometimes assembly language as an intermediate. for Java and . .net. , compilers convert code into an intermediate format. a just-in-time (JIT) compiler or interpreter then converts it into machine code. ID: 712683 Download Presentation

Embed code:
Download Presentation

Compiler Technology Compiler – software to translate high level language program into a lower lev

Download Presentation - The PPT/PDF document "Compiler Technology Compiler – softwar..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Presentations text content in Compiler Technology Compiler – software to translate high level language program into a lower lev


Compiler Technology

Compiler – software to translate high level language program into a lower level language

usually machine language, sometimes assembly language as an intermediate

for Java and


, compilers convert code into an intermediate format

a just-in-time (JIT) compiler or interpreter then converts it into machine code

Assembler – software to translate assembly language program into machine language


– software to translate a single instruction from the given language into machine code and execute the instruction

interpreters work within an environment of previously executed instructions

interpreted languages include LISP and its dialects, Unix/Linux shells, Ruby and Python

most web browsers contain a Java Virtual Machine – an interpreter to convert Java byte code into machine code and execute the code


More Terminology


– program that translates from a lower level language (usually assembly) into a higher-level language

this process is ambiguous because there are many ways a series of low level instructions can be represented in a high level way

Linker – program that connects the external references of source code to their definitions in a library

linking is often part of the compilation process although compilers allow you to separate this process for the translation process

Loader – program that loads executable code into memory

a linker/loader locates the external definitions and loads them with the linker

a dynamic linker/loader loads at run-time only the external definitions needed – a static linker/loader will load all definitions found in the program

Language rewriter – program that translates a program in a language to a new program in the same language

aspect-oriented programming relies on language rewriters for instance

it could also be used to replace deprecated instructions


Yet More Terminology

Preprocessing – a step taken by some compilers prior to compilation to handle various simple chores

in C for instance, # statements are preprocessor operations like textual substitution for #define

Symbol table – collection of user-defined entities (variables, subroutines, classes, constants)

for each entity, the symbol table records the name, type, size/internal structure

this is used by the compiler to bind references in the source code to a memory location and to determine the amount of memory that needs to be allocated to run the given subroutine (note that constants may be stored separately in a “literal table”)

Tokens – the reserved words and syntactic constructs that are defined in the language

for instance, ; is the semicolon operator and == is the equality operator


One-Pass Compilers

A compiler operates on the source program instruction-by-instruction

In a one-pass compiler, the compiler only makes one pass through the program, translating each instruction into machine code

this drastically limits the complexity of the source language in that there can be no forward references

subroutine calls cannot be to code listed beneath the call within the file

variable declarations must be placed in one location so that a symbol table can be generated – this would be the case in languages like Pascal and C

Java’s for loop would not be permissible because the compiler needs to first scan the for loop to determine if it is a counting loop or iterator loop before it can start translating the code

Pascal uses a one-pass compiler

C may be implemented in a one-pass compiler but


and Visual Studio are multi-pass compilers


Multi-pass Compilers

The multi-pass compiler makes multiple passes through the source code

usually, each pass corresponds to a different process of compilation

lexical analysis – breaking code into lexical units

syntactic parsing – identifying the type of instruction and its component parts, generating a parse tree

semantic parsing – applies semantic rules to the parse tree to ensure that the code meets the semantic requirements of the language

code generation – generating either the machine code or an intermediate form (e.g., assembly code, byte code)

code optimization – rearranging instructions to better first the parallel processing/pipelining aspects of the hardware

a variation is the two-pass compiler

one pass to perform lexical and syntactic analysis and form the symbol table

a second pass to perform semantic analysis and translation


Why One-Pass?

Such compilers are smaller and execute much faster

they are also far easier to write than multi-pass compilers

Pascal and C were both written with one-pass compilers in mind

thus we have restrictions in these languages that we won’t have in C++, Java,


Most languages today use multi-pass compilers because the language is too complex for a single pass

this allows for better language features

declarations anywhere

characters whose usage depends on context (for instance the period in COBOL used as the end of a sentence or a decimal point)

forward referencing instructions (such as subroutines defined lower in the file)

in addition, multiple passes allows for errors generated only for the given pass rather than either all found errors or the next error found (in the case of Pascal)

also the ability to optimize code


Compiler Passes: Preprocessing

LISP offers macros to replace code before substitution

this is more an interpreter operation than a compiler preprocessing step

PL/I was the first language to offer preprocessing – three types

%INCLUDE – include text into the source file (%XINCLUDE includes text that is not already in the file)

%DECLARE – declare preprocessing variable (character or fixed only)

%assignment – assign a variable a value

%ACTIVATE/%DEACTIVATE – activate/deactivate a variable

%DO/%END – start and end of a block of several preprocessor statements

%PROCEDURE – define a preprocessor subroutine

%SELECT – conditional statement for preprocessing, also %IF, %THEN, %ELSE

%REPLACE – textual substitution

C was the next language to offer this and simplified what PL/I had

OCAML (predecessor of F#) has preprocessing known as Camlp4


Compiler Passes: Lexical Analysis

Treat the source file as a stream of characters

Break the stream into meaningful units

tokens (or language primitives)

example: for(





analysis shown to the right

Simple left-to-right parsing

compare each grouping with the available tokens

apply rules for identifier names, literal numbers, special syntactic structures like



As we are recognizing regular expressions, we can use a finite state automataErrors caught in this pass are lexical errors other errors may still exist in the code but won’t be caught during this pass










 assignment-operator

0  literal-number

;  semicolon-operator


 identifier

<  relational-operator

n  identifier

;  semicolon-operator


 identifier

++  increment-operator

)  close-




Example Lexical Analysis


Example FSA

Let’s see how to recognize a legal identifier in C

rule: consist of letters, digits, underscores but starts with a letter or underscore

The FSA starts in state 1: start state

there are two links from state 1

one link is traversed if the input character is a letter or underscore and leads to state 2

the other link is traversed for any other character and leads to an error state

state 2 is reached if the first character is legal and has three links

one link loops back to state 2 and is traversed if the next input character is a letter, digit or underscore

one link leads to the end state (meaning a legal identifier) and taken if we have reached the end of this token

one link leads to an error state and is taken under any other situation

code from this FSA is given on the next slide


Implementing the Identifier FSA

c =





(c)&&c!=‘_’) return false; // error






(c)&&c!=‘_’) return false; // error

return true; // reached the end of token, return true

This code works as follows

get the first character

if not legal, return false

if legal, continue

while there are more characters if a character is not legal, return false when done, return true

We make the assumption that

there is

a way to determine the end

of this

token (denoted as EOF


but in actuality, this is more challenging and requires



Constructing a Lexical Analyzer

Define all of the tokens for the language

if a token can vary (e.g., identifier, literal value) further define it as a regular expression

build a FSA for each token

connect the FSAs together into a single, large FSA

The program LEX, built in 1975, is a tool for generating the lexical analyzer given the tokens (and their rules) of the language

an example input for LEX is given on the next slide of a language that has

numbers (integers only)

simple identifiers (consist only of lower case letters)

reserved words WHILE, DO, IF, THEN, ELSE, FI, OD and PRINT

punctuation marks of :=, ;, =, !=, <=, >=, <, >, =, +, -, *, /, (, )



#include "








"=" { return EQ; }

"!=" { return NE; }

"<" { return LT; }

"<=" { return LE; }

">" { return GT; }

">=" { return GE; }

"+" { return PLUS; }

"-" { return MINUS; }

"*" { return MULT; }

"/" { return DIVIDE; }

")" { return RPAREN; }

"(" { return LPAREN; }

":=" { return ASSIGN; } ";" { return SEMICOLON; } "IF" { return IF; } "THEN" { return THEN; } "ELSE" { return ELSE; } "FI" { return FI; }

"WHILE" { return WHILE; }

"DO" { return DO; }

"OD" { return OD; }

"PRINT" { return PRINT; }

[0-9]+ {







return NUMBER; }

[a-z]+ {


= sum of all ASCII

chars; return NAME; }

. {


("illegal token"); }






() { return 1; }




Syntactic Analysis

Given a sequence of lexical items, do they form a syntactically valid construct

e.g., a C for loop will be:





optionally: identifier assignment-operator expression (or literal number or identifier)

semicolon operator



expression (or literal number or identifier)

semicolon operator

optionally: identifier assignment-operator expression (or identifier with prefix/postfix increment operator)



optionally: open-curly-brace-operator


optionally: close-curly-brace-operator


e need to write a parser (recognizer) piece of code for every construct



The parser code will all be divided into functions, one per syntactic unit

these functions can be called recursively

functions will return true (valid parse) or false (error)

on error, the parser must identify what rule failed and generate a syntax error message based on that rule

these messages are hardcoded into the parser, one per rule

functions will have a side effect of building a parse tree for the unit

A parser function will contain function calls to retrieve the next token and then test that token against the expected token

e.g., in the function to recognize a for loop, the first token must be the for-operator followed by the open-




the parser’s logic is based on the grammar and can be automatically generated from the grammar but then we need to add other rules (such as balancing { })


Recursive Decent Parsing

The approach described is called recursive decent parsing

because grammar rules can have multiple right hand sides, parsing is O(2^n)

for instance, imagine the follow two rules



| C

B 


if we are currently in a state of A and our next token is ‘b’, do we select


or C and then


? if we pick the wrong one, we have to backtrack and this is where the complexity becomes exponential in nature

in reality, our programming languages do not have the type of ambiguity that we find in human languages (e.g., English) so there are few situations where we see rules like the above and thus the parsers will usually operate at a much better complexity

To use a RDP, we need the grammar to be non-left-recursive

If the grammar has left-recursive rules, we instead opt for a bottom-up parser, or chart parser

as we already examined these in chapter 4, we skip the details here



Yet Another Compiler


this is a tool, often used with LEX, to generate a parser given the grammar of a language

YACC generates a shift-reduce parsers that


“1 look ahead”

this means that the parser looks ahead by 1 token before deciding which rule to employ

although this makes the compiler more complicated, it reduces the amount of backtracking

Input into YACC is

the lexemes from LEX

the root node of the grammar (the start state)

all of the grammar rules

example input shown on the next several slides


Input into YACC: Lexemes

%start ROOT

%token EQ

%token NE

%token LT

%token LE

%token GT

%token GE

%token PLUS

%token MINUS

%token MULT

%token DIVIDE

%token RPAREN

%token LPAREN

%token ASSIGN


%token IF

%token THEN

%token ELSE

%token FI

%token WHILE

%token DO

%token OD %token PRINT %token NUMBER %token NAME



Input into YACC: Grammar



{ execute($1); }



designator ASSIGN expression { $$ = assignment($1, $3); }

| PRINT expression { $$ = print($2); }

| IF expression THEN





{ $$ =


($2, $4, $6); }

| IF expression THEN



{ $$ =


($2, $4, empty()); }

| WHILE expression DO stmtseq OD { $$ = whilestmt($2, $4); } ;




SEMICOLON statement { $$ =


($1, $3); }

| statement { $$ = $1; }




expr2 { $$ = $1; }

| expr2 EQ expr2 { $$ =


($1, $3); }

| expr2 NE expr2 { $$ = ne($1, $3); }

| expr2 LT expr2 { $$ = le($1, $3); }

| expr2 LE expr2 { $$ = le($1, $3); }

| expr2 GT expr2 { $$ =


($1, $3); }

| expr2 GE expr2 { $$ =


($1, $3); }



expr3 { $$ == $1; }

| expr2 PLUS expr3 { $$ = plus($1, $3); }

| expr2 MINUS expr3 { $$ = minus($1, $3); }




expr4 { $$ = $1; }

| expr3 MULT expr4 { $$ =


($1, $3); }

| expr3 DIVIDE expr4 { $$ = divide ($1, $3); }



PLUS expr4 { $$ = $2; }

| MINUS expr4 { $$ =


($2); }

| LPAREN expression RPAREN { $$ = $2; }

| NUMBER { $$ = number($1); }

| designator { $$ = $1; }



NAME { $$ = name($1); }



Parse Tree

For the expression b *b – 4 * a * c



void add_to_array(int *x, int y, int n)


int i = 0;

for (i = 0; i < n; i++) {

x[i] = x[i] + y;





Semantic Analysis

With a parse available, attributes are added to the stored content in the parse

for instance, types are added to identifier names

rules are then applied to ensure that the semantics of the language are being upheld, for example

proper types are used on the left and right hand side of an assignment statement

proper number and type of parameters between the actual and formal parameter list

proper operator(s) applied in the given context (+ used for numbers or Strings, ++ used for ordinal values, < and > used on comparable types of values)

type coercions, object binding (if possible), initializing local variables (if possible) are handled in this step

the compiler must implement the attribute grammar rules for semantic analysis


Code Generation

This is a planning step

given a parse, there are many ways to convert this into assembly code, for instance

which data are preloaded into registers

which addressing mode is used when referencing a datum in memory

which instruction format will be used (if the platform has multiple instruction formats for a type of instruction)

often, compilers use implement a solution to the graph coloring problem to determine variable to register loaded (this is an NP complete problem!)



How should

a = b * c + d * e; be compiled?

Assume the hardware has 8 registers, R0-R7, and ALU operations can reference up to 1 memory location

below are two solutions – the one on the


is more efficient if b, c, d and e are not referenced again,



all data into variables is more advantageous

load R1, b

load R2, c

mult R3, R1, R2

load R4, d

load R5, e

mult R6, R4, R5

add R7, R3, R6

store R7, a

load R1, b


R2, R1, c

load R3, d


R4, R3, eadd R5, R2, R4store R5, a


Table of Costs

To determine what addressing modes to use, the compiler may utilize a table of costs like the one below







Add M




Add R1



c + (R)

Add 100(R1)




Add (M)


Register Indirect


Add (R1)


Indirect Indexed


+ (R))

Add *100(R1)




Add #5



Code Optimization

There are many strategies that can reduce the run-time cost of executing code

code rearranging – fill what would be stalls in the pipeline with neutral instructions

algebraic simplification – move a result that is used again into a register

x = 7 * y + 5 – a; // move 7 * y into a register

z = 7 * y * c;

dead code elimination – remove code that cannot be reached



– replace the function call with the actual function body to save runtime overhead of using the run-time stack

loop optimization – rearrange nested loops that process data in non-contiguous order into one that processes data in contiguous order (this might improve cache performance)

loop unrolling – used to fill branch penalty spots with additional instructions

We explore many of these concepts in CSC 462/562

About DocSlides
DocSlides allows users to easily upload and share presentations, PDF documents, and images.Share your documents with the world , watch,share and upload any time you want. How can you benefit from using DocSlides? DocSlides consists documents from individuals and organizations on topics ranging from technology and business to travel, health, and education. Find and search for what interests you, and learn from people and more. You can also download DocSlides to read or reference later.