Parser
Parses CFG of the language.
Only one way to derive a thing!
LL(1) Parsing
First and Follow Sets
To construct parser for LL(1) grammar, first workout First and Follow sets.
- Generate First set
- For
- Add terminals in to .
- If , add terminals in to .
- If , add to .
- Generate Follow set
- \text{Follow}(S) = \$$ if S$ is the start symbol.
- For , add to , excluding .
- For , or if , add non-terminals in to .
Warning
- In deriving First set, be careful with non-terminals that could potentially produce !
- In deriving Follow set, add Follow set of the LHS to the RHS, not vice versa. CFG goes from left to right only. (Think of the things in the right have no “control” on where it’s placed.)
In deriving First set, go bottom-up; in deriving Follow set, go
top-down.
- CFG → LL(1) → Recursive Descent or Table Driven
Recursive Descent Parsing
- One simple function for each non-terminal.
- Function body follows the RHS of the rule:
- Non-terminals result in a call to another parse function
- Terminals result in considering the next token.
- Expect what’s in the follow set.
int parse_P() { return parse_E() && expect_token(TOKEN_EOF); }
int parse_E() { return parse_T() && parse_E_prime(); }Table-Driven Parsing
This approach is top-down. Using First set to determine which rule will be applied.
- For each rule (denoted by the number):
- , add this rule to .
- If , , including \$$, add this rule to T[A, b]$.
| Stack | Input | Action |
|---|---|---|
P$ | int * int $ | apply 1: P => E |
| … | … | … |
int T' E' $ | int * int $ | match int |
T' E' $ | * int $ |
- Parsing process
- Push top symbol and EOF into stack.
- Each time, look at the left-most token in the Input column.
- Based on the table, decide which rule to apply. Pop and push symbols into the stack accordingly.
- Some times token is matched without rule applied.
LR Parsing
Shift-Reduce Parsing
| Stack | input | Action |
|---|---|---|
id ( id + id ) $ | shift | |
id | ( id + id ) $ | shift |
id ( | id + id ) $ | shift |
id ( id | + id ) $ | reduce T -> id |
T ( id | + id ) $ | reduce E -> T |
- Shift = consume a token and push it onto the stack
- Reduce = apply a rule to the top of the stack, pop tokens from and push non-terminals onto the stack.
In this process, conflicts may occur!
LR(0) Automaton
- Represents all the possible rules that are currently under consideration by a shift-reduce parser.
- Also known as canonical collection or compact finite state machine of the grammar.
- Each state:
- Kernel, for state 0:
P -> . E. - Closure, add all rules on the right side of the dot:
E -> . E + T,E -> . T. - Takes non-terminals or terminals, and transition to other state. Move the dot accordingly.
- Kernel, for state 0:
- Conflicts
- Shift-reduce:
T -> id . ( E )versusT -> id . - Reduce-reduce:
S -> id ( E ) .versusE -> id ( E ) .
- Shift-reduce:
- LR(0) automaton tells us which rules are available, but we don’t know which rules to apply exactly!
SLR Parsing
TL;DR SLR = LR(0) automaton + Follow Set
- S stands for Simple. SLR parsing utilizes the Follow set to resolve conflicts.
- In short, take reduction to
Aonly when the next token is inFollow(A). - Grammar can be parsed by SLR ⇒ SLR Grammar
- SLR Parse Table has Goto and Action sections.
| State | E | T | + | int | `$ |
|---|---|---|---|---|---|
| 0 | |||||
| 1 | GOTO | GOTO | ACTION | … | |
| … |
- To construct the table, if the DOT is followed by:
- Terminal, simply shift to another state according to LR(0) automaton.
- Non-Terminal, goto another state according to LR(0) automaton.
- Nothing, reduce according to the rule.
| Stack | Symbols | Input | Action |
|---|---|---|---|
| 0 | id ( id + id ) $ | shift 4 | |
| 0 4 | id | ( id + id ) $ | shift 5 |
| 0 4 5 | id ( | id + id ) $ | shift 4 |
| 0 4 5 4 | id ( id | + id ) $ | reduce T -> id |
| 0 4 5 8 | id ( T | + id ) $ |
- Parsing process
- Shift tokens
- Push states into stack 1.
- Push symbols into stack 2.
- If reduce
- Pop the state at top.
- Since a non-terminal is created, now use the non-terminal and the state at the top of the stack to perform goto
- Push the new state to the stack.
- Shift tokens
Distinction between LL(1) and LR(1) Parsing Each LL(1) parsing state
considers only a single non-terminal, while each LR(1) parsing state considers a large number of possible configurations.
However, SLR is only a proper subset of LR(1)!
LR(1) Parsing
- LR(1) automaton augments LR(0) automaton by annotating the lookahead.
- Lookahead = set of tokens that can follow the non-terminal, in the current
parsing state. — I'm looking for this after this non-terminal.
- For , add new rules like with the same lookahead.
- For
- If cannot be , the lookahead is .
- If can be , the lookahead is plus the lookahead of .
- When constructing the automaton, we bring the lookahead from one state to the next.
LALR Parsing
- LA = Lookahead.
- The aforementioned automaton could be too big. In LALR parsing, we can merge states with the same core, by merging their lookahead.