CD NOTES

https://docs.google.com/document/d/1hs6rbidqd_rpV8WfR_aqjHfULgz8Bycv/edit?usp=sharing&ouid=112711162618141236570&rtpof=true&sd=true

The document provides a detailed overview of compiler design, covering the compilation process, formal languages, parsing techniques, memory management, and code optimization.

Here is a full summary of the key concepts organised by unit:

UNIT 1: Introduction to Compiler

Compiler and its Phases

A compiler is a translator that converts high-level language (HLL) into machine language. Its main goal is to transform code without changing its meaning and to report errors. Compilation proceeds in two steps: first the source program is translated into an object program (in a low-level language), and then the object program is translated into the target program by an assembler.

The compilation process involves six main phases:
  1. Lexical Analysis (Scanner): Reads source code character-by-character, converting it into meaningful lexemes and representing them as tokens.
  2. Syntax Analysis: Takes tokens as input and generates a parse tree, checking for syntactic correctness.
  3. Semantic Analysis: Checks whether the parse tree follows the language's semantic rules, tracking identifiers and their types, and produces an annotated syntax tree.
  4. Intermediate Code Generation: Generates code between HLL and machine language, designed for easy conversion to target machine code.
  5. Code Optimization (Optional): Improves the intermediate code for faster execution and less memory consumption by removing unnecessary lines.
  6. Code Generation: The final stage; translates the optimized intermediate code into the target machine language.
The Symbol Table is a data structure maintained by the compiler to hold identifier names and their types for quick lookup.

Error Handling and Lexical Analysis

Error Handling is the mechanism in each phase to detect, report, and recover from errors.
  • Lexical Errors: Illegal characters or invalid token formation (e.g., @var).
  • Syntax Errors: Grammatical errors such as a missing semicolon or unmatched parentheses.
The Lexical Analyzer (Scanner) cleans up source code (removes spaces, tabs, comments) and groups characters into Tokens, which are the basic building blocks of the program.

Categories of Tokens:
  • Keywords: Reserved words with specific meanings (e.g., if, void).
  • Identifiers: User-defined names (variables, functions) starting with a letter or underscore.
  • Constants: Fixed values that do not change (literals).
  • Operators: Symbols performing actions (e.g., +, =, >).
  • Special Symbols: Used for structure and grouping (e.g., ;, {}, []).
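As an illustrative sketch (Python; the keyword list and token patterns below are invented examples, not a full language specification), a scanner can strip whitespace and classify lexemes into these token categories using regular expressions:

```python
import re

# Toy scanner: classifies lexemes into the token categories above.
KEYWORDS = {"if", "else", "void", "int", "while", "return"}

TOKEN_SPEC = [
    ("CONSTANT",   r"\d+"),                     # fixed literal values
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),  # starts with letter/underscore
    ("OPERATOR",   r"[+\-*/=<>]"),
    ("SPECIAL",    r"[;{}\[\](),]"),
    ("SKIP",       r"[ \t\n]+"),                # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"   # reserved words are reclassified
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("if (x > 10) { y = 1; }"))
```

A real LEX-generated scanner works the same way in spirit: a table of patterns, longest match wins, and keywords are distinguished from identifiers.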
Finite Automata, Regular Expressions, and LEX
  • Finite State Machine (FSM) / Finite Automata (FA): Used to recognize patterns in input, accepting or rejecting a string based on reaching a final state. FA is defined by a finite set of states (Q), input symbols ($\Sigma$), an initial state ($q_0$), a set of final states (F), and a transition function ($\delta$).
    • DFA (Deterministic FA): For each input symbol, the machine moves to exactly one next state; null ($\epsilon$) moves are not allowed.
    • NDFA (Non-Deterministic FA): May move to any number of states on a given input symbol, and null ($\epsilon$) moves are allowed.
  • Regular Expression: A sequence of patterns defining a string, used to denote regular languages. Key operations include Union ($L \cup M$), Concatenation ($LM$), and Kleene closure ($L^*$, zero or more occurrences).
  • LEX: A program used to generate a lexical analyzer, typically paired with the YACC parser generator.
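A DFA given by $(Q, \Sigma, \delta, q_0, F)$ can be simulated directly. The sketch below (Python; the three-state machine is an invented example) accepts binary strings ending in "01":

```python
# DFA over {0, 1} accepting strings that end in "01".
# delta maps (state, symbol) -> exactly one next state (deterministic).
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}
START, FINAL = "q0", {"q2"}

def accepts(s):
    state = START
    for ch in s:
        state = DELTA[(state, ch)]   # one transition per input character
    return state in FINAL            # accept iff we end in a final state

print(accepts("1101"))  # True
print(accepts("110"))   # False
```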
UNIT 2: Context Free Grammar and Parsing

Context Free Grammar (CFG) and Derivation

CFG is a formal grammar used to generate all possible strings in a language, often used to describe programming languages and nested structures (like balanced parentheses). It is defined by $G=(V, T, P, S)$, which represents non-terminal symbols, terminal symbols, production rules, and the start symbol, respectively.

Derivation is the sequence of production rules used to generate the input string.
  • Left-most Derivation: At each step, the left-most non-terminal in the sentential form is replaced using a production rule.
  • Right-most Derivation: At each step, the right-most non-terminal is replaced using a production rule.
A Parse Tree is a graphical representation where the root is the start symbol, all interior nodes are non-terminals, and all leaf nodes are terminals. A grammar is ambiguous if some input string has more than one parse tree (equivalently, more than one leftmost or rightmost derivation).

Parsing Techniques

A parser breaks the sequence of tokens into smaller elements, outputting a parse tree.
  • Top-Down Parsing (Predictive): Starts from the start symbol and derives the input string.
  • Bottom-Up Parsing (Shift-Reduce): Starts with the input symbol and constructs the parse tree up to the start symbol by reversing the rightmost derivations.
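Top-down parsing can be sketched as recursive descent: one function per non-terminal, choosing a production by looking at the next input symbol. The Python sketch below recognizes the balanced-parentheses grammar $S \rightarrow (S)S \mid \epsilon$ (a toy grammar chosen for illustration):

```python
# Recursive-descent (top-down) parser for S -> ( S ) S | epsilon.
def parse(s):
    pos = 0
    def S():
        nonlocal pos
        if pos < len(s) and s[pos] == "(":   # predict S -> ( S ) S
            pos += 1
            S()
            if pos >= len(s) or s[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            S()
        # otherwise predict S -> epsilon and consume nothing
    try:
        S()
        return pos == len(s)   # accept only if all input was consumed
    except SyntaxError:
        return False

print(parse("(()())"))  # True
print(parse("(()"))     # False
```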
Shift-Reduce Parsing reduces a string to the grammar's start symbol using a stack (holding grammar symbols) and an input tape. It uses two primary actions: shift (pushes the current input symbol onto the stack) and reduce (replaces symbols on the right side of a production with the non-terminal on the left side).
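The shift/reduce cycle can be traced by hand for a tiny grammar. The sketch below (Python; the grammar $E \rightarrow E + id \mid id$ is an invented example) shows the stack-based mechanics:

```python
# Shift-reduce recognizer for E -> E + id | id.
# The stack holds grammar symbols; "reduce" replaces a production's
# right-hand side on the stack with its left-hand-side non-terminal.
def shift_reduce(tokens):
    stack, trace, i = [], [], 0
    tokens = tokens + ["$"]                      # end-of-input marker
    while not (stack == ["E"] and tokens[i] == "$"):
        if stack[-3:] == ["E", "+", "id"]:
            stack[-3:] = ["E"]
            trace.append("reduce E -> E + id")
        elif stack[-1:] == ["id"]:
            stack[-1:] = ["E"]
            trace.append("reduce E -> id")
        elif tokens[i] != "$":
            stack.append(tokens[i])              # shift next input symbol
            trace.append(f"shift {tokens[i]}")
            i += 1
        else:
            raise SyntaxError("cannot shift or reduce")
    return trace

for step in shift_reduce(["id", "+", "id"]):
    print(step)
```

The final trace line is the last reduction to the start symbol, i.e., the rightmost derivation in reverse.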

Operator Precedence Parsing is a shift-reduce method that applies only to a class of grammars where: 1) no right-hand side has $\epsilon$, and 2) no two non-terminals are adjacent. Precedence is established between terminals using relations $\succ$ (higher), $\prec$ (lower), and $\doteq$ (equal).

LR Parser scans input Left-to-right (L) and constructs a Rightmost derivation in reverse (R). It is divided into four main types:
  • LR(0): Built from the canonical collection of LR(0) items (productions with a dot); a reduce move is placed across the entire row of the parsing table.
  • SLR(1): Also built from the canonical collection of LR(0) items; a reduce move is placed only in the Follow set of the left-hand-side non-terminal.
  • CLR(1): Built from the canonical collection of LR(1) items (LR(0) items plus a lookahead symbol); a reduce move is placed only under the lookahead symbols.
  • LALR(1): Merges LR(1) states with the same core productions but different lookahead symbols (e.g., $I_{3}$ and $I_{6}$ become $I_{36}$); a reduce move is placed only under the lookahead symbols.
UNIT 3: Intermediate Code and Translation
  • Syntax Directed Translation (SDT): A technique where semantic rules are associated with grammar productions ($Grammar + semantic\ rule = SDT$). These rules evaluate attributes (e.g., E.val) for non-terminals.
  • Three Address Code (TAC): An intermediate code that uses at most three addresses and one operator in the form $a = b\ op\ c$.
    • Quadruple: Uses 4 fields: op, arg1, arg2, and result. Easy for global optimization.
    • Triples: Uses 3 fields: op, arg1, and arg2. Temporaries are implicit, making code rearrangement difficult.
    • Indirect Triples: Uses pointers to a separate listing of computations, allowing for easier rearrangement than triples.
  • Basic Block: A sequence of three address statements with a single entry and a single exit point, used as a unit for optimization.
  • Loop Detection: Uses Control Flow Analysis (CFA) and a Program Flow Graph (PFG); a cycle in the PFG indicates a loop.
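Three address code stored as quadruples can be sketched directly. Below (Python; the temporary names t1–t3 and the expression $a = b*c + b*c$ are invented for illustration), each quadruple has the four fields op, arg1, arg2, result:

```python
# Emit quadruples for: a = b * c + b * c
quads = []
def emit(op, a1, a2, res):
    quads.append((op, a1, a2, res))   # fields: op, arg1, arg2, result

temp_count = 0
def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

t1 = new_temp(); emit("*", "b", "c", t1)   # t1 = b * c
t2 = new_temp(); emit("*", "b", "c", t2)   # t2 = b * c  (repeated!)
t3 = new_temp(); emit("+", t1, t2, t3)     # t3 = t1 + t2
emit("=", t3, None, "a")                   # a  = t3

for q in quads:
    print(q)
```

The repeated computation of $b*c$ in t1 and t2 is exactly the kind of common subexpression the optimizer later eliminates; because quadruples name their temporaries explicitly, such rearrangement is easy, whereas triples refer to results by position.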
UNIT 4: Runtime Environment and Code Generation

Storage Allocation and Memory Management
  • Stack Allocation: A runtime management technique where Activation Records (ARs) are pushed onto the stack when a procedure is invoked and popped when it ends. ARs store local variables.
    • Activation Record Contents: Actual Parameters, Return Address, Return Value, Local Data, Old Stack Pointer (to the caller's AR), and optional access/control links.
  • Accessing Non-Local Data: Accessing variables from an outer scope.
    • Static Scoping: Scope determined at compile-time.
    • Dynamic Scoping: Scope determined at run-time based on the execution stack.
    • Mechanisms: Access Links (traverse the static scope chain) and Display (array of pointers to ARs).
  • Heap Management: Manages dynamic memory allocation and deallocation. Strategies include First-Fit, Best-Fit, and Worst-Fit allocation.
  • Garbage Collection (GC): Automatically identifies and reclaims unused memory.
    • Types: Mark-and-Sweep (marks reachable objects, reclaims unmarked), Reference Counting (reclaims objects with zero references), and Generational GC (collects younger objects more frequently).
    • Trace-based GC: Starts from a Root Set (global, stack variables) and recursively traces all reachable objects to mark them as live.
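Mark-and-sweep can be sketched on a toy object graph (Python; the object names and references below are invented). The mark phase traces everything reachable from the root set; the sweep phase reclaims the rest:

```python
# Heap as a graph: object -> list of objects it references.
heap = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
    "D": ["D"],        # unreachable self-referencing cycle
}
roots = ["A"]          # root set: globals, stack variables, etc.

def mark(obj, marked):
    if obj in marked:
        return
    marked.add(obj)                 # mark as live
    for ref in heap[obj]:
        mark(ref, marked)           # recursively trace references

marked = set()
for r in roots:
    mark(r, marked)                 # mark phase

garbage = [o for o in heap if o not in marked]   # sweep phase
print(sorted(garbage))  # ['D']
```

Note that the cycle D → D is collected even though its reference count never drops to zero, which is precisely the case plain reference counting cannot handle.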
Code Generator

A code generator translates intermediate representations into machine code or bytecode.

Key Responsibilities:
  • Instruction Selection: Choosing the most efficient machine instructions.
  • Instruction Scheduling: Ordering instructions to minimize execution time.
  • Register Allocation: Assigning variables to registers to reduce memory accesses.
Issues in Design: Complexity in instruction selection, managing register spills, optimizing for specific architectures, and ensuring platform independence.

UNIT 5: Code Optimization and Data Flow Analysis

Machine Independent Code Optimization

This optimization transforms intermediate code without involving specific hardware components to eliminate redundancies and reduce code size.

Function Preserving Optimization:
  1. Common Subexpression Elimination: Removing repeated calculations of the same expression.
  2. Constant Folding: Replacing expressions computed at compile-time (e.g., $5+7$) with their constant value (e.g., $12$).
  3. Constant Propagation: Replacing a variable with its assigned constant value in later computations.
  4. Dead Code Elimination: Removing code that is never executed or whose result is never used.
  5. Copy Propagation: Replacing a variable with the source variable in copy statements (e.g., if $x=y$, use $y$ instead of $x$).
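Constant folding and constant propagation can be sketched as a single forward pass over straight-line code (Python; the tiny expression encoding below is invented for illustration, not a real IR):

```python
# Straight-line assignments: (target, expression), where an expression is
# ("const", n), ("var", name), or ("add", lhs, rhs).
code = [
    ("x", ("const", 5)),
    ("y", ("add", ("const", 5), ("const", 7))),   # foldable: 5 + 7
    ("z", ("add", ("var", "x"), ("const", 1))),   # x has a known constant
]

env = {}          # variable -> known constant value
optimized = []
for target, expr in code:
    if expr[0] == "add":
        lhs, rhs = expr[1], expr[2]
        # constant propagation: substitute variables with known values
        if lhs[0] == "var" and lhs[1] in env:
            lhs = ("const", env[lhs[1]])
        if rhs[0] == "var" and rhs[1] in env:
            rhs = ("const", env[rhs[1]])
        # constant folding: evaluate compile-time-known additions
        if lhs[0] == "const" and rhs[0] == "const":
            expr = ("const", lhs[1] + rhs[1])
        else:
            expr = ("add", lhs, rhs)
    if expr[0] == "const":
        env[target] = expr[1]       # record for later propagation
    optimized.append((target, expr))

print(optimized)
```

After the pass, y becomes the folded constant 12 and z becomes 6 via propagation of x followed by folding.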
Loop Optimizations: Aims to reduce time spent inside loops.
  1. Frequency Reduction (Code Motion): Moving loop-invariant code (statements whose values do not change) outside the loop.
  2. Loop Unrolling: Performing the loop operation multiple times within a single iteration to reduce the overall number of loop runs.
  3. Loop Jamming: Combining multiple loops that perform related operations into a single loop.
  4. Strength Reduction: Replacing costly operations (e.g., multiplication) with cheaper ones (e.g., addition or bit shifts).
  5. Redundancy Elimination: Evaluating a repeated expression once and substituting the value in subsequent occurrences.
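Strength reduction can be sketched as a pattern rewrite on textual TAC (Python; the statement format is invented for illustration): multiplication by a power of two becomes a cheaper left shift.

```python
import re

# Rewrite "res = src * 2^k" as "res = src << k".
def strength_reduce(line):
    m = re.fullmatch(r"(\w+) = (\w+) \* (\d+)", line)
    if m:
        res, src, k = m.group(1), m.group(2), int(m.group(3))
        if k > 0 and (k & (k - 1)) == 0:          # k is a power of two
            shift = k.bit_length() - 1
            return f"{res} = {src} << {shift}"
    return line                                    # leave others unchanged

print(strength_reduce("t1 = i * 8"))   # t1 = i << 3
print(strength_reduce("t2 = i * 6"))   # unchanged
```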
Peephole Optimization

A local, machine-dependent technique that optimizes small sequences of 2–5 instructions (a "peephole") by replacing them with more efficient code. Techniques include redundant load/store elimination, constant folding, strength reduction, and elimination of null sequences (e.g., $a := a+0$).

Data Flow Analysis

A technique used to analyze how data flows through a program to identify optimization opportunities.

Types of Data Flow Properties:
  • Available Expression: An expression whose value has already been computed and can be reused, used for common subexpression elimination.
  • Reaching Definition: Tracks the definition of a variable that "reaches" a specific point, used in constant and variable propagation.
  • Live Variable: A variable whose value is still needed for a future computation; used for register allocation and dead code elimination.
  • Busy Expression: An expression that is evaluated along every path from a given point before any of its operands is redefined (a "very busy" expression); used for code movement optimization.
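Live-variable analysis is a backward pass: a variable is live at a point if some later statement uses it before redefining it. The sketch below (Python; the three-statement block and its live-out set are invented) computes the live set before each statement of one basic block:

```python
# One basic block; each statement is (defined_var, set_of_used_vars):
#   a = b + c ; b = a * 2 ; d = a + b
block = [
    ("a", {"b", "c"}),
    ("b", {"a"}),
    ("d", {"a", "b"}),
]
live = {"d"}                       # assumed live on exit from the block

live_before = [None] * len(block)
for i in range(len(block) - 1, -1, -1):        # walk backwards
    defined, used = block[i]
    live = (live - {defined}) | used           # kill the def, add the uses
    live_before[i] = set(live)

print(live_before[0])   # variables that must be live on entry: b and c
```

The result feeds register allocation (live variables need registers) and dead code elimination (a definition whose target is not live afterwards can be removed).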
