https://docs.google.com/document/d/1hs6rbidqd_rpV8WfR_aqjHfULgz8Bycv/edit?usp=sharing&ouid=112711162618141236570&rtpof=true&sd=true
The document provides a detailed overview of compiler design, covering the compilation process, formal languages, parsing techniques, memory management, and code optimization.
Here is a full summary of the key concepts organised by unit:

UNIT 1: Introduction to Compiler
Compiler and its Phases
A compiler is a translator that converts a high-level language (HLL) program into machine language. Its main goal is to transform code without changing its meaning and to report errors. Compilation proceeds in two stages: first translating the source program into an object program (in a low-level language), and then translating the object program into the target program using an assembler.
The compilation process involves six main phases:
- Lexical Analysis (Scanner): Reads source code character-by-character, converting it into meaningful lexemes and representing them as tokens.
- Syntax Analysis: Takes tokens as input and generates a parse tree, checking for syntactic correctness.
- Semantic Analysis: Checks if the parse tree follows language rules, tracking identifiers and types, and produces an annotated syntax tree.
- Intermediate Code Generation: Generates code between HLL and machine language, designed for easy conversion to target machine code.
- Code Optimization (Optional): Improves the intermediate code for faster execution and less memory consumption by removing unnecessary lines.
- Code Generation: The final stage; translates the optimized intermediate code into the target machine language.
Error Handling is the mechanism in each phase to detect, report, and recover from errors.
- Lexical Errors: Illegal characters or invalid token formation (e.g., @var).
- Syntax Errors: Grammatical errors like missing semicolons or unmatched parentheses.
Categories of Tokens:
- Keywords: Reserved words with specific meanings (e.g., if, void).
- Identifiers: User-defined names (variables, functions) starting with a letter or underscore.
- Constants: Fixed values that do not change (literals).
- Operators: Symbols performing actions (e.g., +, =, >).
- Special Symbols: Used for structure and grouping (e.g., ;, {}, []).
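The token categories above can be sketched as a small regex-based scanner. The token set and patterns below are illustrative assumptions, not taken from the source document:

```python
import re

# Minimal tokenizer sketch: classify lexemes into the five categories above.
# The keyword list and patterns are hypothetical, chosen for illustration.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|void|int|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("SYMBOL",     r"[;{}\[\](),]"),
    ("SKIP",       r"\s+"),          # whitespace is discarded, not tokenized
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Scan the source left to right, yielding (category, lexeme) tokens."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("if (x > 1) { y = 42; }"))
```

Note that alternation order matters: KEYWORD must precede IDENTIFIER, or `if` would match as an identifier.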
- Finite State Machine (FSM) / Finite Automata (FA): Used to recognize patterns in input, accepting or rejecting a string based on reaching a final state. FA is defined by a finite set of states (Q), input symbols ($\Sigma$), an initial state ($q_0$), a set of final states (F), and a transition function ($\delta$).
- DFA (Deterministic FA): The input character transitions to only one state and does not accept the null move.
- NDFA (Non-Deterministic FA): Can transition to any number of states for a particular input and accepts the NULL move.
- Regular Expression: A sequence of patterns defining a string, used to denote regular languages. Key operations include Union ($L \cup M$), Concatenation ($LM$), and Kleene closure ($L^*$, zero or more occurrences).
- LEX: A program used to generate a lexical analyzer, typically paired with the YACC parser generator.
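The 5-tuple definition of a DFA can be made concrete with a short simulation. The machine below is a hypothetical example, chosen here for illustration: it accepts binary strings containing an even number of 1s:

```python
# DFA simulation sketch for the 5-tuple (Q, Sigma, delta, q0, F) above.
def run_dfa(states, alphabet, delta, start, finals, s):
    """Return True iff the DFA ends in a final state after consuming s."""
    q = start
    for ch in s:
        if ch not in alphabet:
            return False            # illegal symbol: reject outright
        q = delta[(q, ch)]          # deterministic: exactly one next state
    return q in finals

# Hypothetical machine: accept binary strings with an even number of 1s.
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd",   ("odd", "1"): "even"}
print(run_dfa({"even", "odd"}, {"0", "1"}, delta, "even", {"even"}, "1001"))
```

An NDFA simulation would instead track a *set* of current states, since a symbol may lead to several states at once.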
CFG is a formal grammar used to generate all possible strings in a language, often used to describe programming languages and nested structures (like balanced parentheses). It is defined by $G=(V, T, P, S)$, which represents non-terminal symbols, terminal symbols, production rules, and the start symbol, respectively.
Derivation is the sequence of production rules used to generate the input string.
- Left-most Derivation: At each step, replaces the left-most non-terminal of the sentential form (e.g., for $E \to E + E \mid id$, deriving $id + id$: $E \Rightarrow E+E \Rightarrow id+E \Rightarrow id+id$).
- Right-most Derivation: At each step, replaces the right-most non-terminal of the sentential form.
A parser breaks the sequence of tokens into smaller elements, outputting a parse tree.
- Top-Down Parsing (Predictive): Starts from the start symbol and derives the input string.
- Bottom-Up Parsing (Shift-Reduce): Starts with the input symbol and constructs the parse tree up to the start symbol by reversing the rightmost derivations.
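Top-down (predictive) parsing is easiest to see as recursive descent, where each non-terminal becomes one function and parsing starts from the start symbol. The grammar below is a hypothetical example, not taken from the source document:

```python
# Recursive-descent sketch for the hypothetical grammar:
#   E -> T { "+" T }
#   T -> NUM | "(" E ")"
def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def E():                      # E -> T { "+" T }
        T()
        while peek() == "+":
            eat("+")
            T()

    def T():                      # T -> NUM | "(" E ")"
        if peek() == "(":
            eat("(")
            E()
            eat(")")
        elif peek() is not None and peek().isdigit():
            eat(peek())
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    E()                           # start from the start symbol
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True

print(parse_expr(["1", "+", "(", "2", "+", "3", ")"]))
```

A real parser would build and return a parse tree instead of True; this sketch only checks syntactic correctness.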
Operator Precedence Parsing is a shift-reduce method that applies only to a class of grammars where: 1) no right-hand side has $\epsilon$, and 2) no two non-terminals are adjacent. Precedence is established between terminals using relations $\succ$ (higher), $\prec$ (lower), and $\doteq$ (equal).
LR Parser scans input Left-to-right (L) and constructs a Rightmost derivation in reverse (R). It is divided into four main types:

| Parser Type | Basis | Reduce Move Placement |
|---|---|---|
| LR(0) | Canonical collection of LR(0) items (production with a dot) | Entire row |
| SLR(1) | Canonical collection of LR(0) items | Only in the Follow of the left-hand side non-terminal |
| CLR(1) | Canonical collection of LR(1) items (LR(0) + look ahead symbol) | Only in the lookahead symbols |
| LALR(1) | Combines LR(1) items with the same core production but different lookahead symbols (e.g., $I_{3}$ and $I_{6}$ become $I_{36}$) | Only in the lookahead symbols |
- Syntax Directed Translation (SDT): A technique where semantic rules are associated with grammar productions ($Grammar + semantic\ rule = SDT$). These rules evaluate attributes (e.g., E.val) for non-terminals.
- Three Address Code (TAC): An intermediate code that uses at most three addresses and one operator, in the form $a = b\ op\ c$.
- Quadruple: Uses 4 fields: op, arg1, arg2, and result. Easy for global optimization.
- Triples: Uses 3 fields: op, arg1, and arg2. Temporaries are implicit, making code rearrangement difficult.
- Indirect Triples: Uses pointers to a separate listing of computations, allowing for easier rearrangement than triples.
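The three representations can be compared side by side. The statement $a = b*c + b*c$ below is an example chosen here, not from the source document:

```python
# The same three-address code in quadruple and triple form,
# for the example statement  a = b*c + b*c.
quadruples = [
    # (op, arg1, arg2, result): temporaries t1..t3 are explicit names,
    # so the rows can be reordered or moved during global optimization.
    ("*", "b",  "c",  "t1"),
    ("*", "b",  "c",  "t2"),
    ("+", "t1", "t2", "t3"),
    ("=", "t3", None, "a"),
]
triples = [
    # (op, arg1, arg2): temporaries are implicit; "(0)" means "the result
    # of triple 0", which is why reordering triples is difficult.
    ("*", "b",   "c"),
    ("*", "b",   "c"),
    ("+", "(0)", "(1)"),
    ("=", "a",   "(2)"),
]
# Indirect triples add a separate list of pointers into `triples`;
# reordering this pointer list does not renumber the triples themselves.
statement_list = [0, 1, 2, 3]
```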
- Basic Block: A sequence of three address statements with a single entry and a single exit point, used as a unit for optimization.
- Loop Detection: Uses Control Flow Analysis (CFA) and a Program Flow Graph (PFG); a cycle in the PFG indicates a loop.
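Basic blocks are usually found by identifying "leaders": the first statement, every jump target, and every statement immediately after a jump. A minimal sketch over a hypothetical numbered TAC listing:

```python
# Leader-based basic-block partitioning sketch. Each statement is a
# hypothetical (op, jump_target) pair; target is None for non-jumps.
def find_leaders(code):
    leaders = {0}                        # the first statement is a leader
    for i, (op, target) in enumerate(code):
        if op in ("goto", "if_goto"):
            leaders.add(target)          # the jump target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)       # the fall-through also starts one
    return sorted(leaders)

code = [("assign", None),   # 0
        ("if_goto", 4),     # 1: conditional jump to 4
        ("assign", None),   # 2
        ("goto", 1),        # 3: back edge to 1 -> a cycle, i.e. a loop
        ("assign", None)]   # 4
print(find_leaders(code))
```

Each leader starts a basic block that runs up to (but not including) the next leader; the back edge from statement 3 to statement 1 is the cycle that loop detection would find in the program flow graph.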
- Stack Allocation: A runtime management technique where Activation Records (ARs) are pushed onto the stack when a procedure is invoked and popped when it ends. ARs store local variables.
- Activation Record Contents: Actual Parameters, Return Address, Return Value, Local Data, Old Stack Pointer (to the caller's AR), and optional access/control links.
- Accessing Non-Local Data: Accessing variables from an outer scope.
- Static Scoping: Scope determined at compile-time.
- Dynamic Scoping: Scope determined at run-time based on the execution stack.
- Mechanisms: Access Links (traverse the static scope chain) and Display (array of pointers to ARs).
- Heap Management: Manages dynamic memory allocation and deallocation. Strategies include First-Fit, Best-Fit, and Worst-Fit allocation.
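The allocation strategies differ only in which free block they pick. A first-fit sketch, with the free list modeled as (start, size) pairs (a simplifying assumption, not the document's representation):

```python
# First-fit heap allocation sketch: take the FIRST free block large enough.
# (Best-fit would take the smallest such block, worst-fit the largest.)
def first_fit(free_list, size):
    """Allocate `size` bytes; return start address or None. Mutates free_list."""
    for i, (start, block) in enumerate(free_list):
        if block >= size:
            if block == size:
                free_list.pop(i)                              # exact fit
            else:
                free_list[i] = (start + size, block - size)   # shrink block
            return start
    return None                                               # nothing fits

free = [(0, 8), (16, 32), (64, 16)]
print(first_fit(free, 12))   # skips the 8-byte hole, carves from (16, 32)
```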
- Garbage Collection (GC): Automatically identifies and reclaims unused memory.
- Types: Mark-and-Sweep (marks reachable objects, reclaims unmarked), Reference Counting (reclaims objects with zero references), and Generational GC (collects younger objects more frequently).
- Trace-based GC: Starts from a Root Set (global, stack variables) and recursively traces all reachable objects to mark them as live.
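Mark-and-sweep over a toy object graph makes the trace-based idea concrete. The heap representation below (object id mapped to the ids it references) is an illustrative assumption:

```python
# Mark-and-sweep sketch: `heap` maps an object id to the ids it references;
# `roots` is the root set (globals, stack variables) reachable directly.
def mark_and_sweep(heap, roots):
    marked = set()
    worklist = list(roots)
    while worklist:                    # mark phase: trace from the root set
        obj = worklist.pop()
        if obj not in marked:
            marked.add(obj)
            worklist.extend(heap.get(obj, []))
    for obj in list(heap):             # sweep phase: reclaim unmarked objects
        if obj not in marked:
            del heap[obj]
    return marked

heap = {"A": ["B"], "B": [], "C": ["D"], "D": ["C"]}  # C<->D is a garbage cycle
mark_and_sweep(heap, roots={"A"})
print(sorted(heap))
```

Note that the unreachable cycle C↔D is reclaimed here, whereas plain reference counting would leak it: C and D each keep the other's count at one.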
A code generator translates intermediate representations into machine code or bytecode.
Key Responsibilities:
- Instruction Selection: Choosing the most efficient machine instructions.
- Instruction Scheduling: Ordering instructions to minimize execution time.
- Register Allocation: Assigning variables to registers to reduce memory accesses.
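Register allocation can be sketched with a deliberately naive usage-count heuristic (an assumption for illustration; production compilers use liveness information and graph coloring of an interference graph):

```python
from collections import Counter

# Naive usage-count register allocation sketch: with k registers, keep the
# k most frequently used variables in registers; the rest stay in memory.
def allocate_registers(uses, k):
    """uses: variable names in order of use; k: number of registers."""
    counts = Counter(uses)
    # Sort by descending use count, breaking ties by name for determinism.
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return {var: f"R{i}" for i, var in enumerate(ranked[:k])}

print(allocate_registers(["a", "b", "a", "c", "a", "b"], k=2))
```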
Machine-independent optimization transforms intermediate code without reference to specific hardware, eliminating redundancies and reducing code size.
Function Preserving Optimization:
- Common Subexpression Elimination: Removing repeated calculations of the same expression.
- Constant Folding: Replacing expressions computed at compile-time (e.g., $5+7$) with their constant value (e.g., $12$).
- Constant Propagation: Replacing a variable with its assigned constant value in later computations.
- Dead Code Elimination: Removing code that is never executed or whose result is never used.
- Copy Propagation: Replacing a variable with the source variable in copy statements (e.g., if $x=y$, use $y$ instead of $x$).
- Frequency Reduction (Code Motion): Moving loop-invariant code (statements whose values do not change) outside the loop.
- Loop Unrolling: Performing the loop operation multiple times within a single iteration to reduce the overall number of loop runs.
- Loop Jamming: Combining multiple loops that perform related operations into a single loop.
- Strength Reduction: Replacing costly operations (e.g., multiplication) with cheaper ones (e.g., addition or bit shifts).
- Redundancy Elimination: Evaluating a repeated expression once and substituting the value in subsequent occurrences.
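Two of these passes can be sketched together over three-address code. The quadruple format and the input program below are illustrative assumptions:

```python
# Constant propagation followed by constant folding over TAC quadruples
# of the form (op, arg1, arg2, result). Only "=" and "+" are handled,
# which is enough to show both transformations.
def optimize(code):
    consts = {}                       # variables known to hold a constant
    out = []
    for op, a1, a2, res in code:
        a1 = consts.get(a1, a1)       # constant propagation: substitute
        a2 = consts.get(a2, a2)       # known-constant operands
        if op == "=" and isinstance(a1, int):
            consts[res] = a1
            out.append(("=", a1, None, res))
        elif op == "+" and isinstance(a1, int) and isinstance(a2, int):
            consts[res] = a1 + a2     # constant folding: compute at compile time
            out.append(("=", a1 + a2, None, res))
        else:
            out.append((op, a1, a2, res))
    return out

code = [("=", 5, None, "x"), ("=", 7, None, "y"), ("+", "x", "y", "z")]
print(optimize(code))                 # z = x + y folds to z = 12
```

After this pass, x and y are dead if nothing else reads them, so dead code elimination could then remove their assignments entirely.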
Peephole Optimization
A local, machine-dependent technique that optimizes small sequences of 2–5 instructions (a "peephole") by replacing them with more efficient code. Techniques include redundant load/store elimination, constant folding, strength reduction, and eliminating null sequences (e.g., $a := a+0$).

Data Flow Analysis
A technique used to analyze how data flows through a program to identify optimization opportunities.
Types of Data Flow Properties:
- Available Expression: An expression whose value has already been computed and can be reused, used for common subexpression elimination.
- Reaching Definition: Tracks the definition of a variable that "reaches" a specific point, used in constant and variable propagation.
- Live Variable: A variable whose value is still needed for a future computation; used for register allocation and dead code elimination.
- Busy Expression: An expression whose evaluation exists along a path and none of its operands are redefined before evaluation; used for code movement optimization.
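The live-variable property can be computed for a single basic block with one backward pass: at each statement, live = (live − defined) ∪ used. The statement encoding and input block below are illustrative assumptions:

```python
# Live-variable sketch for one basic block, scanned backwards.
# Each statement is (variable defined, tuple of variables used).
def live_in_sets(block, live_exit):
    """Return the set of variables live immediately BEFORE each statement."""
    live = set(live_exit)
    before = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        defined, used = block[i]
        live = (live - {defined}) | set(used)   # kill the def, add the uses
        before[i] = set(live)
    return before

block = [("t", ("b", "c")),    # t = b * c
         ("a", ("t",)),        # a = t
         ("d", ("a", "b"))]    # d = a + b
print(live_in_sets(block, live_exit={"d"}))
```

A variable absent from every "before" set after its definition is dead there, which is exactly the information dead code elimination and register allocation consume.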