Maximilian Stark (mail@dakror.de), SS2020
Stand: 18.07.2020
Compiler Construction
General compiler setup
- Analysis
- Scanner
- Tokenization
- Siever
- Filtering of tokens (spaces, comments)
- Collecting pragmas
- Token-Replacements (Constants, names)
- In practice: User-Code Callbacks per token type
- Generated by specification
- Parser
- Generated by specification
- hierarchical structure = context-free grammar
- Pushdown-Automata
- Type Checker
- Internal Representation
- Synthesis
Lexing
From Regex to NFA
- Thompson's Algorithm
- Stitching together of predefined building blocks
- Linear number of states for given input
- Berry-Sethi Approach (=Glushkov Automaton)
- n + 1 states without $\epsilon$-transitions
- Pre-computation of attributes during DFS
- empty
- can current sub-expression consume $\epsilon$
- Post-Order
- first
- set of child leaves which may be reached first
- Post-Order
- next
- set of leaves which may be reached first after current
- Pre-Order
- At root: $\mathrm{next}[e]=\emptyset$
- last
- set of child leaves which may be reached last
- Post-Order
- Construction using collected sets
- States: Start + non-$\epsilon$-leaves
- Final states: $! \mathrm{empty}[e] ~?~ \mathrm{last}[e] : \{ \cdot e \} \cup \mathrm{last}[e]$
- Transitions $\forall \text{ literals } a:~\{(\cdot e,a,i\cdot ) ~|~ \forall~ i \in \mathrm{first}[e]\} \cup \{(i\cdot ,a,i'\cdot ) ~|~ \forall~ i' \in \mathrm{next}[i]\}$
From NFA to DFA: Powerset-Construction
- Merging of states with same input transitions
- Exponential size of powersets: NFA($n$) -> DFA($2^n$)
- In practice: Only sets directly necessary for given input generated while processing
Scanner Design
- Generation of NFA with set final states for given rule-expressions
- Maximal prefix
- Management of two pointers A, B on the text and their current state in the NFA
- A: Pointing at last position after last executed final state
- B: Reading head
- When B = $\emptyset$, consumption of text between A & B, Reset of A
- Differentiation of Scanner state (for comments)
Parsing
Context-free grammar
- start-symbol
- terminals
- non-terminals
- productive
- derivation of symbol ends in word
- Modeled by And-Or-Graph of grammar
- AND-Nodes: Rules
- OR-Nodes: NT
- Construction
- Pick NT ending in T
- Arrow to Rules producting this NT
- Arrows from Rule to that rule's NT
- Goto 2
- reachable
- can be reached from start NT
- Depdency graph
- rules
- derivations (DFS down the tree)
- Leftmost
- Pre-Order left-to-right
- Top-Down syntax tree construction
- Rightmost
- Reverse Rightmost
- Bottom-Up syntax tree construction
- Post-Order left-to-right
- Unique grammar: Derivation of all words is unique
- Reduced grammar
- Grammar without useless NT
- Convertible in linear time
- Process
- productive subset $N_1\subseteq N$
- Construction of $P_1 = \{A\rightarrow \alpha \in P ~|~ A \in N_1 \wedge \alpha\in(N_1 \cup T)*\}$
- Productive and reachable subset $N_2 \subseteq N_1$
- Construction of $P_2 = \{A\rightarrow \alpha \in P ~|~A\in N_2 \wedge \alpha\in (N_2 \cup T)*\}$
- Result $G' = (N_2, T, P_2, S)$
- $LL(1)$ grammar
- Reduced grammar
- $\forall~ A\rightarrow \alpha, A\rightarrow \alpha' \in P ~.~ \forall~ \text{ derivation } S\rightarrow_L^* uA\beta ~|~ u \in T*: \mathrm{First}_1(\alpha\beta)\cap\mathrm{First}_1(\alpha'\beta)=\emptyset$
- Unique decision of production rule by only looking at first symbol
- Leftrecursive reduced grammar is not $LL(k)$ for any $k$
Top-Down
Pushdown-Automata
- $LL(k)$ Parser: ability to peek $k$ symbols ahead (Left to right, Leftmost derivation)
- Transitions based on stack value
- Types
- Expansions
- Shifts
- Reduces
- Acceptance when terminating with empty input and empty stack
- Construction from CFG
- From leftmost derivations
- From reverse rightmost derivations
- Process
- Convert entry and exit positions per Rule into Stack value following derivation order
- Transitions from stack items of entry, literal, exit value
- Problem: Duplicate rules from construction, Solutions:
- GLL Parsing: Duplicate configuration at conflict and continue in parallel
- Recursive Descent & Backtracking: DFS for appropriate derivation
- Recursive Descent & Lookahead: Conflict resolution by next-input lookup
Lookahead Sets
- $A \odot B = A\{\epsilon\} ~|~ B$ Concat but left side without epsilon
- $\mathrm{First}_1(L)=\{\epsilon~|~\epsilon \in L\}\cup\{u\in T ~|~ \exists~ v \in T^*:uv\in L\}$ Prefix der Länge 1
- If all rules of $G$ are productive, all $\mathrm{First}_1(A)\neq \emptyset$
- $\forall~ \alpha \in (N \cup T)^*: \mathrm{First}(\alpha)=\mathrm{First}_1(\{w\in T^* ~|~\alpha \rightarrow^*w\})$
- $\mathrm{empty}(X) \iff X\rightarrow^*\epsilon$
- $\epsilon$-free First sets
- $F_\epsilon(a) = \{a\} \quad \text{ if }a\in T$
- $F_\epsilon(A)\supseteq F_\epsilon(X_j)\quad \text{ if }A\rightarrow X_1 \dots X_m \in P, \mathrm{empty}(X_1) \wedge \dots \wedge \mathrm{empty}(X_{j-1})$
- Pure unification problem: Solvable in linear space & time
- Fast computation using Variable dependency graph ($\supseteq$ is dependency operator)
- Grouping of nodes into Strongly Connected Components
- Collection of smallest set of shared symbols within components
- Moving / Traveling of symbols along in-going edges between SCC
- Necessity to consider right context for alternatives when producing epsilon
- $\mathrm{Follow}_1(B) \supseteq F_\epsilon(X_j) \quad\text{ if }A\rightarrow \alpha B X_1 \dots X_m \in P, \mathrm{empty}(X_1)\wedge \dots \wedge \mathrm{empty}(X_{j-1})$
- $\mathrm{Follow}_1(B) \supseteq \mathrm{Follow}_1(A) \quad\text{ if }A\rightarrow \alpha B X_1 \dots X_m \in P, \mathrm{empty}(X_1)\wedge \dots \wedge \mathrm{empty}(X_m)$
- Computation of lookahead table $M$
- $M[B, w] = i$ with $B\rightarrow\gamma^i$ if $w\in\mathrm{First}_1(\gamma) \odot \mathrm{Follow}_1(B)$
Right-Regular Context-Free Parsing (RR-CFG)
- considered $RLL(1)$ if corresponding CFG is also $LL(1)$
- Addressing the recurring theme of lists of things
- CFG with Regex as production rules
- Conversion into regular CFG
- Replacing Regex expression with corresponding set of production rules
- Recursive Descent RLL Parsers
Recursive Descent RLL Parsers
- generate($\alpha$) metaprogram, decomposition of given expression
- $r^*$: while loop
- $r_1|\dots|r_k$ switch over $labels(\mathrm{First}_1(r_i))$
- Lookahead function expect()
- Only subset of CFG parseable this way, Issue: non-disjoint First-sets
- Solution 1: More aggressive prefix-based grouping of rules
- Solution 2: Ranked Grammars for clear order in case of conflicts
- Solution 3: Going from $LL(1)$ to $LL(k)$
- Problem: Still not all CFG parseable
- Solution: Regular Lookahead $LL(*)$
Bottom-Up
Shift-Reduce Parsers
- Push input on stack until matching full right side of production
- Transitions
- Shift: From NT to NT + matched symbol
- Reduce: From production to left-hand NT
- Finish marker
- Sequence of reductions = Reverse rightmost derivation
- Non-deterministic
- Viable prefix $\alpha \gamma$ of item $[B\rightarrow\gamma\bullet]$ if $\alpha\gamma \vdash \alpha B$
- Core problem: Finding of admissable items where the stack is a viable prefix
- Item $[B\rightarrow \gamma\bullet\beta] $ admissable for $\alpha\gamma\iff S\rightarrow_R^* \alpha B v$
- Characteristic Automation $c(G)$ of $G$
Characteristic Automation $c(G)$ of $G$
- Consumption of right-hand sides
- Tracing of admissable items in its states (entry, exit, shifting $\bullet$ around)
- Construction
- States: Items
- Start state: $[S'\rightarrow\bullet S]$
- Final states: $\{[B\rightarrow\gamma\bullet]~|~B\rightarrow \gamma\in P\}$
- Transitions
- Bullet processing within rule by consuming transactions: $([A\rightarrow\alpha\bullet X \beta], X, [A\rightarrow\alpha X \bullet\beta]), \quad X\in(N\cup T), A\rightarrow\alpha X\beta\in P$
- When bullet in front of NT, $\epsilon$ to all its rules: $([A\rightarrow\alpha\bullet B \beta], \epsilon, [B\rightarrow\bullet\gamma]), \quad A\rightarrow\alpha B\beta, B\rightarrow\gamma\in P$
Canonical $LR(0)$-Automaton
- Deterministic characteristic automaton
- Creation from $c(G)$
- Performing arbitrarily many $\epsilon$-Transitions after every consuming transition
- Performing powerset construction
- Alternatively: Apply $c(G)$ to powersets directly
- Construction directly from the grammar
- Needed helper function $\delta_\epsilon^*$ ($\epsilon$-closure): $\delta_\epsilon^*(q)=q\cup\{[B\rightarrow\bullet\gamma]~|~B\rightarrow\gamma\in P, [A\rightarrow\alpha\bullet B'\beta']\in q, B'\rightarrow^* B\beta\}$
- States: Sets of items
- Start state: $\delta_\epsilon^*\{[S'\rightarrow \bullet S]\}$
- Final states: $\{q~|~[A\rightarrow\alpha\bullet]\in q\}$
- Transitions: $\delta(q,X)=\delta_\epsilon^*\{[A\rightarrow\alpha X\bullet\beta]~|~[A\rightarrow\alpha\bullet X\beta]\in q\}$
$LR(0)$-Parser
- Management of prefix $\alpha=X_1 \dots X_m$ on pushdown and usage of $LR()G$ for reduction spot identification
- Non deterministic:
- conflicts from multiple items in final states (=$LR(0)$-unsuited states)
- Reduce-Reduce conflict
- Shift-Reduce conflict
- Solution: Right context lookahead $LR(k)$ with $k>0$
- Reduction with $A\rightarrow\gamma$ if $[A\rightarrow\gamma\bullet]$ admissible for $\alpha$
- Optimization: Pushing of states instead of $X_i$
- Upon reduction with $A\rightarrow\gamma$: Popping of $|\gamma|$ pushdown states and continuing from now top state with symbol $A$
- Only rules within final states of interest
- Construction
- States: $Q\cup \{f\}$
- Start state: $q_0$
- Final state $f$
- Transitions
- Shift: $(p,a,pq)\quad \text{ if }\quad q=\delta(p,a)\neq\emptyset$
- Reduce: $(pq_1\dots q_m,\epsilon, pq)\quad \text{ if }\quad [A\rightarrow X_1\dots X_m\bullet]\in q_m, q=\delta(p,A)$
- Finish: $(q_0 p,\epsilon, f)\quad \text{ if }\quad [S'\rightarrow S\bullet]\in p$
- With canonical automaton $LR(G)=(Q,T,\delta,q_0,F)$
$LR(1)$-Parser
- $LR(1)$-item: $[B\rightarrow\alpha\bullet\beta,x]$ with $x\in\mathrm{Follow}_1(B)=\bigcup\{\mathrm{First}_1(\nu)~|~S\rightarrow^*\mu B\nu\}$
- Admissible $LR(1)$-item $[B\rightarrow\gamma\bullet\beta,x]$ for $\alpha\gamma$ if $S\rightarrow_R^*\alpha B w$ with $\{x\}=\mathrm{First}_1(w)$
- $LR(1)$-Automaton
- Resolution of all previous conflicts in $LR(0)$ states
- Characteristic $LR(1)$-Automaton $c(G, 1)$: like $c(G)$ but additional management of 1-prefix of $\mathrm{Follow}_1$ of left hand sides
- Canonical $LR(1)$-Automaton: Conversion like previous from $c(G,1)$
- New $\epsilon$-closure: Propagation of lookahead as well
- Issue: transitions will break due to mismatch of given and expected lookahead between states
- Removing of old transitions and creation of new target states with correct lookahead
- Vast increase of states
- Action table
- In practice states just as integer ids
- Lookup table for reduction / shift steps at final states
- $\mathrm{action}:Q\times T \rightarrow LR(0)\text{-Items }\cup\{s, error\}$
- goto table
- Encoding of transitions: $\mathrm{goto}[q, X]=\delta(q,X)\in Q$
- Possibility of conflicts as with $LR(0)$ if lookahead equal for two productions
- Precedences
- Assigning operator token priorities
- Associativity setting per token
- Increase of lookahead: Exponential explosion of lookahead sets
- Locally varying lookahead size
$LR(k)$ to $LR(1)$
- Iterative algorithm, grouping together ambiguous parts of sub-productions and making new rules from them
- Drawing the actually differentiating symbol closer to the start of the rule
- Process
- Right-context extraction from Nonterminal into Terminal
- Right-context propagation into new rules
Semantic Analysis
Attribute Grammars
- CFGs with additional sets attributes per $N$ and $T$
- inherited attributes: value defined by top-down
- synthesized attributes: value defined by bottom-up
- Local computations: Attribute equations evaluated at individual nodes
- Sequential dependency of equation components
- General Attribute Systems
- Unified way of addressing family nodes in tree
- $\mathrm{attribute}_k[0]$: attribute of currently selected top nodes
- $\mathrm{attribute}_k[i]$: attribute of i-th child (1-based)
- Challenge: Application of static evaluation strategy
- Require acyclicity
- Calculation time exponentially complex
- Solutions
- User-given strategy
- Dynamic strategy determination
- Automation of subclasses only
- Subclasses
Strongly Acyclic Grammar
- Strongly Acyclic Attribute Dependencies
- Calculation of overapproximation of global dependencies
- $D(p)$ local dependencies of nodes
- $L[i]= \{(a[i],b[i]) ~|~(a,b)\in L\}$ projection of relations into a given tree-context
- $\pi_0(S)=\{(a,b)~|~(a[0],b[0])\in S\}$ un-projection of root-node relevant relations from context
- $[[p]]^\#(L_1,\dots,L_k)=\pi_0((D(p)\cup L_1[1]\cup\dots\cup L_k[k])^+)$ root-projection of transitive closure of all children
- $\mathcal{R}(X) \supseteq (\bigcup\{[[p]]^\#(\mathcal{R}(X_1),\dots,\mathcal{R}(X_k))~|~p: X\rightarrow X_1\dots X_k\})^+ ~|~p\in P$ mapping of symbols to relations (global attribute dependencies)
- Inequality system with unique least solution $\mathcal{R}^\star(X)$
- Process for Grammar testing
- No contribution by terminals
- Re-decoration and embedding of child nodes' $\mathcal{R}$
- transitive closure of all relations in union of all now-local dependencies
- Checking for cycles
- Pruning of cycle-affected attributes
- Application of $\pi_0$
- if all $D(p)\cup\mathcal{R}^\star(X_1)[1]\cup\dots\cup\mathcal{R}^\star(X_k)[k] ~\forall~ p\in G$, then G is strongly acyclic
From dependencies to evaluation strategies
- Linear order from dependency partial order
- demand-driven evaluation
- evaluate dependency only as soon as required by current attribute
- required pointer to parent nodes
- not local
- evaluation in passes
- Computation flow
- Comptation of inherited attributes before descending (pre-order)
- Computation of synthesized attributes after ascending (post-order)
- L-Attributed Grammars
- If for all productions $S\rightarrow S_1\dots S_n$ every inherited attribute of $S_j$ with $1\leq j \leq n$ only depends on attributes of $S_1, \dots, S_{j-1}$ and the inherited attributes of $S$
- Can be evaluated during parsing
- Fixed evaluation strategy: single DFS traversal
- Partitioning of attributes into L-attributed sets for conformity
Type Checking
- Type system $\Gamma$ rules on inner nodes of AST
- Predicate logic statements $\frac{\text{condition}}{\text{statement}}$, e.g array access: $\displaystyle \frac{\Gamma\vdash e_1 : t*\quad\Gamma\vdash e_2 : \mathrm{int}}{\Gamma\vdash e_1[e_2] : t}$
- Axioms for Constants and Variables (lookup in the type-table)
- Structural type equality
- Semantic equality (same access patterns), but differently constructed structs
- Does not hold in C
- Example: Pulling in of struct pointer struct
- Testing process
- Equivalence (deduction) rules for decomposition: Again predicate logic-esque diagram
- Raw type comparison
- Pointer stripping
- Type expansion
- Resolving of types until contradiction (failure) or syntactic equivalence (success)
- Subtyping
- Same procedure as with structural type equality
- Accept if type is $\leq$ other type
- For structs: at least all members of parent
- For functions: parameter sub-super type comparison is reversed (contra-variance)
Code Synthesis
Register C-Machine (RCMa)
- Virtual machine -> Interpreter
- Unlimited number of registers
- local registers: temporary results
- global registers: parameters and return values
- only ints as primitives
- No syscalls
- Comparison to real systems
- Ease of restriction lifts
- Dalvik & LLVM quite similar
- Platform independent interpreter
Translation
- Function $\mathrm{code}_R^i ~c~ \rho$ with $i$ keeping track of the first available register, our expression $c$, and name-to-address mapping environment $\rho$
- $\mathrm{code}_R^i ~c~ \rho = \mathrm{loadc} ~R_i~c$ load constant
- $\mathrm{code}_R^i ~x~ \rho = \mathrm{move} ~R_i~R_a$ move var $x$ at address $a$ into $R_i$
- Statements: Expression but ignoring of result register
- Switch: Jump table, indirect jumps: $\mathrm{check}^i ~l~u~B = \mathrm{jump-if} ~l\leq R_i\leq u$
- Jump to $B + u$ using $\mathrm{jumpi}~R_i~B$ with value $u$ in $R_i$
- else jumps to end of B block $B + k;~k = u-l$
- Alternative / Optimization
- Omitting of checks if information about switch value range known
- If-cascade $\mathcal{O}(\log n)$ tests: Further optimizing by reordering to prioritize frequent paths
- Functions
- Convention: $i$-th parameter in $R_{-i}$
- Result in $R_0$
- Process: Push, Call, Pop