Compiler Construction

General compiler setup

Analysis
- Scanner
  - Tokenization
  - Siever
    - Filtering of tokens (spaces, comments)
    - Collecting pragmas
    - Token-Replacements (Constants, names)
    - In practice: User-Code Callbacks per token type
  - Generated by specification
- Parser
  - Generated by specification
    - hierarchical structure = context-free grammar
    - Pushdown-Automata
- Type Checker
Internal Representation
Synthesis

Lexing

From Regex to NFA

Thompson's Algorithm
- Stitching together of predefined building blocks
- Linear number of states for given input
Berry-Sethi Approach (=Glushkov Automaton)
- n + 1 states without $\epsilon$-transitions
- Pre-computation of attributes during DFS
  1. empty
    - can current sub-expression consume $\epsilon$
    - Post-Order
  2. first
    - set of child leaves which may be reached first
    - Post-Order
  3. next
    - set of leaves which may be reached first after current
    - Pre-Order
    - At root: $\mathrm{next}[e]=\emptyset$
  4. last
    - set of child leaves which may be reached last
    - Post-Order
- Construction using collected sets
  - States: Start + non-$\epsilon$-leaves
  - Final states: $! \mathrm{empty}[e] ~?~ \mathrm{last}[e] : \{ \cdot e \} \cup \mathrm{last}[e]$
  - Transitions $\forall \text{ literals } a:~\{(\cdot e,a,i\cdot ) ~|~ \forall~ i \in \mathrm{first}[e]\} \cup \{(i\cdot ,a,i'\cdot ) ~|~ \forall~ i' \in \mathrm{next}[i]\}$

From NFA to DFA: Powerset-Construction

Merging of states with same input transitions
Exponential size of powersets: NFA($n$) -> DFA($2^n$)
In practice: Only sets directly necessary for given input generated while processing

Scanner Design

Generation of NFA with set final states for given rule-expressions
Maximal prefix
Management of two pointers A, B on the text and their current state in the NFA
- A: Pointing at last position after last executed final state
- B: Reading head
- When B = $\emptyset$, consumption of text between A & B, Reset of A
Differentiation of Scanner state (for comments)

Parsing

Context-free grammar

start-symbol
terminals
non-terminals
- productive
  - derivation of symbol ends in word
  - Modeled by And-Or-Graph of grammar
    - AND-Nodes: Rules
    - OR-Nodes: NT
    - Construction
      1. Pick NT ending in T
      2. Arrow to Rules producting this NT
      3. Arrows from Rule to that rule's NT
      4. Goto 2
- reachable
  - can be reached from start NT
  - Depdency graph
rules
derivations (DFS down the tree)
- Leftmost
  - Pre-Order left-to-right
  - Top-Down syntax tree construction
- Rightmost
  - Pre-Order right-to-left
- Reverse Rightmost
  - Bottom-Up syntax tree construction
  - Post-Order left-to-right
Unique grammar: Derivation of all words is unique
Reduced grammar
- Grammar without useless NT
- Convertible in linear time
- Process
  1. productive subset $N_1\subseteq N$
  2. Construction of $P_1 = \{A\rightarrow \alpha \in P ~|~ A \in N_1 \wedge \alpha\in(N_1 \cup T)*\}$
  3. Productive and reachable subset $N_2 \subseteq N_1$
  4. Construction of $P_2 = \{A\rightarrow \alpha \in P ~|~A\in N_2 \wedge \alpha\in (N_2 \cup T)*\}$
  5. Result $G' = (N_2, T, P_2, S)$
$LL(1)$ grammar
- Reduced grammar
- $\forall~ A\rightarrow \alpha, A\rightarrow \alpha' \in P ~.~ \forall~ \text{ derivation } S\rightarrow_L^* uA\beta ~|~ u \in T*: \mathrm{First}_1(\alpha\beta)\cap\mathrm{First}_1(\alpha'\beta)=\emptyset$
- Unique decision of production rule by only looking at first symbol
Leftrecursive reduced grammar is not $LL(k)$ for any $k$

Top-Down

Pushdown-Automata

$LL(k)$ Parser: ability to peek $k$ symbols ahead (Left to right, Leftmost derivation)
Transitions based on stack value
- Types
  - Expansions
  - Shifts
  - Reduces
Acceptance when terminating with empty input and empty stack
Construction from CFG
- From leftmost derivations
- From reverse rightmost derivations
- Process
  - Convert entry and exit positions per Rule into Stack value following derivation order
  - Transitions from stack items of entry, literal, exit value
  - Problem: Duplicate rules from construction, Solutions:
    - GLL Parsing: Duplicate configuration at conflict and continue in parallel
    - Recursive Descent & Backtracking: DFS for appropriate derivation
    - Recursive Descent & Lookahead: Conflict resolution by next-input lookup

Lookahead Sets

$A \odot B = A\{\epsilon\} ~|~ B$ Concat but left side without epsilon
$\mathrm{First}_1(L)=\{\epsilon~|~\epsilon \in L\}\cup\{u\in T ~|~ \exists~ v \in T^*:uv\in L\}$ Prefix der Länge 1
If all rules of $G$ are productive, all $\mathrm{First}_1(A)\neq \emptyset$
$\forall~ \alpha \in (N \cup T)^*: \mathrm{First}(\alpha)=\mathrm{First}_1(\{w\in T^* ~|~\alpha \rightarrow^*w\})$
$\mathrm{empty}(X) \iff X\rightarrow^*\epsilon$
$\epsilon$-free First sets
- $F_\epsilon(a) = \{a\} \quad \text{ if }a\in T$
- $F_\epsilon(A)\supseteq F_\epsilon(X_j)\quad \text{ if }A\rightarrow X_1 \dots X_m \in P, \mathrm{empty}(X_1) \wedge \dots \wedge \mathrm{empty}(X_{j-1})$
- Pure unification problem: Solvable in linear space & time
- Fast computation using Variable dependency graph ($\supseteq$ is dependency operator)
  - Grouping of nodes into Strongly Connected Components
  - Collection of smallest set of shared symbols within components
  - Moving / Traveling of symbols along in-going edges between SCC
Necessity to consider right context for alternatives when producing epsilon
- $\mathrm{Follow}_1(B) \supseteq F_\epsilon(X_j) \quad\text{ if }A\rightarrow \alpha B X_1 \dots X_m \in P, \mathrm{empty}(X_1)\wedge \dots \wedge \mathrm{empty}(X_{j-1})$
- $\mathrm{Follow}_1(B) \supseteq \mathrm{Follow}_1(A) \quad\text{ if }A\rightarrow \alpha B X_1 \dots X_m \in P, \mathrm{empty}(X_1)\wedge \dots \wedge \mathrm{empty}(X_m)$
Computation of lookahead table $M$
- $M[B, w] = i$ with $B\rightarrow\gamma^i$ if $w\in\mathrm{First}_1(\gamma) \odot \mathrm{Follow}_1(B)$

Right-Regular Context-Free Parsing (RR-CFG)

considered $RLL(1)$ if corresponding CFG is also $LL(1)$
Addressing the recurring theme of lists of things
CFG with Regex as production rules
Conversion into regular CFG
- Replacing Regex expression with corresponding set of production rules
- Recursive Descent RLL Parsers

Recursive Descent RLL Parsers

generate($\alpha$) metaprogram, decomposition of given expression
- $r^*$: while loop
- $r_1|\dots|r_k$ switch over $labels(\mathrm{First}_1(r_i))$
Lookahead function expect()
Only subset of CFG parseable this way, Issue: non-disjoint First-sets
- Solution 1: More aggressive prefix-based grouping of rules
- Solution 2: Ranked Grammars for clear order in case of conflicts
- Solution 3: Going from $LL(1)$ to $LL(k)$
  - Problem: Still not all CFG parseable
  - Solution: Regular Lookahead $LL(*)$

Bottom-Up

Shift-Reduce Parsers

Push input on stack until matching full right side of production
Transitions
- Shift: From NT to NT + matched symbol
- Reduce: From production to left-hand NT
- Finish marker
Sequence of reductions = Reverse rightmost derivation
Non-deterministic
Viable prefix $\alpha \gamma$ of item $[B\rightarrow\gamma\bullet]$ if $\alpha\gamma \vdash \alpha B$
Core problem: Finding of admissable items where the stack is a viable prefix
- Item $[B\rightarrow \gamma\bullet\beta] $ admissable for $\alpha\gamma\iff S\rightarrow_R^* \alpha B v$
- Characteristic Automation $c(G)$ of $G$

Characteristic Automation $c(G)$ of $G$

Consumption of right-hand sides
Tracing of admissable items in its states (entry, exit, shifting $\bullet$ around)
Construction
- States: Items
- Start state: $[S'\rightarrow\bullet S]$
- Final states: $\{[B\rightarrow\gamma\bullet]~|~B\rightarrow \gamma\in P\}$
- Transitions
  1. Bullet processing within rule by consuming transactions: $([A\rightarrow\alpha\bullet X \beta], X, [A\rightarrow\alpha X \bullet\beta]), \quad X\in(N\cup T), A\rightarrow\alpha X\beta\in P$
  2. When bullet in front of NT, $\epsilon$ to all its rules: $([A\rightarrow\alpha\bullet B \beta], \epsilon, [B\rightarrow\bullet\gamma]), \quad A\rightarrow\alpha B\beta, B\rightarrow\gamma\in P$

Canonical $LR(0)$-Automaton

Deterministic characteristic automaton
Creation from $c(G)$
1. Performing arbitrarily many $\epsilon$-Transitions after every consuming transition
2. Performing powerset construction
3. Alternatively: Apply $c(G)$ to powersets directly
Construction directly from the grammar
- Needed helper function $\delta_\epsilon^*$ ($\epsilon$-closure): $\delta_\epsilon^*(q)=q\cup\{[B\rightarrow\bullet\gamma]~|~B\rightarrow\gamma\in P, [A\rightarrow\alpha\bullet B'\beta']\in q, B'\rightarrow^* B\beta\}$
- States: Sets of items
- Start state: $\delta_\epsilon^*\{[S'\rightarrow \bullet S]\}$
- Final states: $\{q~|~[A\rightarrow\alpha\bullet]\in q\}$
- Transitions: $\delta(q,X)=\delta_\epsilon^*\{[A\rightarrow\alpha X\bullet\beta]~|~[A\rightarrow\alpha\bullet X\beta]\in q\}$

$LR(0)$-Parser

Management of prefix $\alpha=X_1 \dots X_m$ on pushdown and usage of $LR()G$ for reduction spot identification
Non deterministic:
- conflicts from multiple items in final states (=$LR(0)$-unsuited states)
  - Reduce-Reduce conflict
  - Shift-Reduce conflict
- Solution: Right context lookahead $LR(k)$ with $k>0$
Reduction with $A\rightarrow\gamma$ if $[A\rightarrow\gamma\bullet]$ admissible for $\alpha$
Optimization: Pushing of states instead of $X_i$
- Upon reduction with $A\rightarrow\gamma$: Popping of $|\gamma|$ pushdown states and continuing from now top state with symbol $A$
- Only rules within final states of interest
Construction
- States: $Q\cup \{f\}$
- Start state: $q_0$
- Final state $f$
- Transitions
  - Shift: $(p,a,pq)\quad \text{ if }\quad q=\delta(p,a)\neq\emptyset$
  - Reduce: $(pq_1\dots q_m,\epsilon, pq)\quad \text{ if }\quad [A\rightarrow X_1\dots X_m\bullet]\in q_m, q=\delta(p,A)$
  - Finish: $(q_0 p,\epsilon, f)\quad \text{ if }\quad [S'\rightarrow S\bullet]\in p$
  - With canonical automaton $LR(G)=(Q,T,\delta,q_0,F)$

$LR(1)$-Parser

$LR(1)$-item: $[B\rightarrow\alpha\bullet\beta,x]$ with $x\in\mathrm{Follow}_1(B)=\bigcup\{\mathrm{First}_1(\nu)~|~S\rightarrow^*\mu B\nu\}$
Admissible $LR(1)$-item $[B\rightarrow\gamma\bullet\beta,x]$ for $\alpha\gamma$ if $S\rightarrow_R^*\alpha B w$ with $\{x\}=\mathrm{First}_1(w)$
$LR(1)$-Automaton
- Resolution of all previous conflicts in $LR(0)$ states
- Characteristic $LR(1)$-Automaton $c(G, 1)$: like $c(G)$ but additional management of 1-prefix of $\mathrm{Follow}_1$ of left hand sides
- Canonical $LR(1)$-Automaton: Conversion like previous from $c(G,1)$
- New $\epsilon$-closure: Propagation of lookahead as well
- Issue: transitions will break due to mismatch of given and expected lookahead between states
  - Removing of old transitions and creation of new target states with correct lookahead
  - Vast increase of states
Action table
- In practice states just as integer ids
- Lookup table for reduction / shift steps at final states
- $\mathrm{action}:Q\times T \rightarrow LR(0)\text{-Items }\cup\{s, error\}$
goto table
- Encoding of transitions: $\mathrm{goto}[q, X]=\delta(q,X)\in Q$
Possibility of conflicts as with $LR(0)$ if lookahead equal for two productions
- Precedences
  - Assigning operator token priorities
  - Associativity setting per token
- Increase of lookahead: Exponential explosion of lookahead sets
- Locally varying lookahead size

$LR(k)$ to $LR(1)$

Iterative algorithm, grouping together ambiguous parts of sub-productions and making new rules from them
Drawing the actually differentiating symbol closer to the start of the rule
Process
1. Right-context extraction from Nonterminal into Terminal
2. Right-context propagation into new rules

Semantic Analysis

Attribute Grammars

CFGs with additional sets attributes per $N$ and $T$
- inherited attributes: value defined by top-down
- synthesized attributes: value defined by bottom-up
Local computations: Attribute equations evaluated at individual nodes
Sequential dependency of equation components
General Attribute Systems
- Unified way of addressing family nodes in tree
- $\mathrm{attribute}_k[0]$: attribute of currently selected top nodes
- $\mathrm{attribute}_k[i]$: attribute of i-th child (1-based)
- Challenge: Application of static evaluation strategy
  - Require acyclicity
  - Calculation time exponentially complex
  - Solutions
    - User-given strategy
    - Dynamic strategy determination
    - Automation of subclasses only
- Subclasses
  - Strongly Acyclic Grammar

Strongly Acyclic Grammar

Strongly Acyclic Attribute Dependencies
- Calculation of overapproximation of global dependencies
- $D(p)$ local dependencies of nodes
- $L[i]= \{(a[i],b[i]) ~|~(a,b)\in L\}$ projection of relations into a given tree-context
- $\pi_0(S)=\{(a,b)~|~(a[0],b[0])\in S\}$ un-projection of root-node relevant relations from context
- $[[p]]^\#(L_1,\dots,L_k)=\pi_0((D(p)\cup L_1[1]\cup\dots\cup L_k[k])^+)$ root-projection of transitive closure of all children
- $\mathcal{R}(X) \supseteq (\bigcup\{[[p]]^\#(\mathcal{R}(X_1),\dots,\mathcal{R}(X_k))~|~p: X\rightarrow X_1\dots X_k\})^+ ~|~p\in P$ mapping of symbols to relations (global attribute dependencies)
  - Inequality system with unique least solution $\mathcal{R}^\star(X)$
  - Process for Grammar testing
    1. No contribution by terminals
    2. Re-decoration and embedding of child nodes' $\mathcal{R}$
    3. transitive closure of all relations in union of all now-local dependencies
    4. Checking for cycles
    5. Pruning of cycle-affected attributes
    6. Application of $\pi_0$
if all $D(p)\cup\mathcal{R}^\star(X_1)[1]\cup\dots\cup\mathcal{R}^\star(X_k)[k] ~\forall~ p\in G$, then G is strongly acyclic

From dependencies to evaluation strategies

Linear order from dependency partial order
- demand-driven evaluation
  - evaluate dependency only as soon as required by current attribute
  - required pointer to parent nodes
  - not local
- evaluation in passes
  - Computation flow
    - Comptation of inherited attributes before descending (pre-order)
    - Computation of synthesized attributes after ascending (post-order)
  - L-Attributed Grammars
    - If for all productions $S\rightarrow S_1\dots S_n$ every inherited attribute of $S_j$ with $1\leq j \leq n$ only depends on attributes of $S_1, \dots, S_{j-1}$ and the inherited attributes of $S$
    - Can be evaluated during parsing
    - Fixed evaluation strategy: single DFS traversal
    - Partitioning of attributes into L-attributed sets for conformity

Type Checking

Type system $\Gamma$ rules on inner nodes of AST
Predicate logic statements $\frac{\text{condition}}{\text{statement}}$, e.g array access: $\displaystyle \frac{\Gamma\vdash e_1 : t*\quad\Gamma\vdash e_2 : \mathrm{int}}{\Gamma\vdash e_1[e_2] : t}$
Axioms for Constants and Variables (lookup in the type-table)
Structural type equality
- Semantic equality (same access patterns), but differently constructed structs
- Does not hold in C
- Example: Pulling in of struct pointer struct
- Testing process
  - Equivalence (deduction) rules for decomposition: Again predicate logic-esque diagram
    - Raw type comparison
    - Pointer stripping
    - Type expansion
  - Resolving of types until contradiction (failure) or syntactic equivalence (success)
Subtyping
- Same procedure as with structural type equality
- Accept if type is $\leq$ other type
- For structs: at least all members of parent
- For functions: parameter sub-super type comparison is reversed (contra-variance)

Code Synthesis

Virtual machine -> Interpreter
Unlimited number of registers
- local registers: temporary results
- global registers: parameters and return values
only ints as primitives
No syscalls
Comparison to real systems
- Ease of restriction lifts
- Dalvik & LLVM quite similar
- Platform independent interpreter

Translation

Function $\mathrm{code}_R^i ~c~ \rho$ with $i$ keeping track of the first available register, our expression $c$, and name-to-address mapping environment $\rho$
- $\mathrm{code}_R^i ~c~ \rho = \mathrm{loadc} ~R_i~c$ load constant
- $\mathrm{code}_R^i ~x~ \rho = \mathrm{move} ~R_i~R_a$ move var $x$ at address $a$ into $R_i$
Statements: Expression but ignoring of result register
Switch: Jump table, indirect jumps: $\mathrm{check}^i ~l~u~B = \mathrm{jump-if} ~l\leq R_i\leq u$
- Jump to $B + u$ using $\mathrm{jumpi}~R_i~B$ with value $u$ in $R_i$
- else jumps to end of B block $B + k;~k = u-l$
- Alternative / Optimization
  - Omitting of checks if information about switch value range known
  - If-cascade $\mathcal{O}(\log n)$ tests: Further optimizing by reordering to prioritize frequent paths
Functions
- Convention: $i$-th parameter in $R_{-i}$
- Result in $R_0$
- Process: Push, Call, Pop