DONE Partial Parsing
The partial parser implements the theoretical framework of partial forests as a memoized recursive-descent parser with regex-derivative terminal matching. Given an input $s \in \Sigma^*$, it produces a partial forest $\mathcal{F}$ containing all valid parse trees (complete and partial) consistent with the grammar $G$.
Source: src/logic/partial/parse.rs
Segmentation
The first stage of parsing tokenizes input into a sequence of segments.
A segment is a whitespace-delimited token carrying its byte span and a flag indicating whether it is a prefix of a special token. Formally, segmentation is a function:
$$\text{tokenize}: \Sigma^* \to \text{Segment}^*$$
Segments preserve positional information (byte offsets) and are the atomic unit of the span cache.
Special tokens (multi-character operators like ->, <=) are matched longest-first. If a segment at the end of input is a strict prefix of a special token (e.g., - when -> exists in the grammar), it is marked as a partial special token. This flag disables caching for spans involving that segment, since the token’s identity is ambiguous until more input arrives.
Recursive Descent with Memoization
The parser is a recursive-descent parser over segments, augmented with:
- A persistent span cache for cross-parse memoization
- Cycle detection for left-recursive grammars
- Depth-bounded recursion
The main entry point partial(s) produces a PartialParseOutcome:
- Tokenize $s$ into segments.
- Check cache validity: if the new segments extend the previous input’s segments (prefix match), reuse the cache; otherwise clear it.
- Parse from the start symbol $S$ at position 0 with a fresh
ParseState. - Filter roots to those consuming all segments.
- Return success (with possibly partial trees) or failure.
A parse state tracks per-invocation mutable data:
- $\text{visited}: (N \times \mathbb{N}) \to \mathbb{N}$, mapping (nonterminal, position) to visit count for cycle detection.
- $\text{hit_depth_limit}: \text{bool}$, a sticky flag set when any branch exceeds the recursion bound. When true, no results from this parse invocation are cached.
Nonterminal Parsing
For nonterminal $A$ at position $i$ and depth $d$:
- Depth guard: if $d > d_{\max}$, abort and set $\text{hit_depth_limit} \leftarrow \text{true}$.
- Cache lookup: if the span cache can answer for $(d_{\max}, A, i, \ell)$ for all relevant span lengths, return cached trees.
- Cycle detection: track $(A, i)$ in visited. Allow up to $\min(d_{\max}, |\text{segments}| + 2)$ re-entries before aborting. This handles left-recursive productions like
Term ::= Term BaseTerm. - Try all productions: iterate over productions for $A$ in rotated order (see below), calling $\text{parse_production}$ for each.
- Cache results: if no depth limit was hit and segments do not end with a partial special token, store complete subtrees in the span cache grouped by consumed length.
Alternative Ordering
Productions for a nonterminal are tried in a rotated order determined by recursion depth:
$$\text{order}(n, d) = [(d \bmod n), (d+1 \bmod n), \ldots, (d-1 \bmod n)]$$
where $n$ is the number of productions and $d$ is the current recursion depth.
This rotation distributes search effort across recursion depths, preventing systematic bias toward the first production. It is particularly important for left-recursive grammars where always trying the recursive alternative first would waste effort at shallow depths.
Symbol Sequence Parsing
A production’s right-hand side is a sequence of symbols parsed left-to-right. For each successful parse of the first symbol, the parser attempts to parse the remaining symbols with the unconsumed segments.
A critical invariant governs partial nodes: if a child node is partial (incomplete), it is accepted only if it consumed all remaining segments (i.e., it is at the frontier of the input). Partial nodes cannot appear in non-frontier positions.
Partial Terminals
Terminals are regex patterns matched via Brzozowski derivatives (see two-level language completability). Matching a segment against a terminal regex yields one of four outcomes:
| Outcome | Node Produced | Meaning |
|---|---|---|
| Complete | Terminal::Complete (no extension) | Segment fully matches the regex |
| Extensible | Terminal::Complete (with extension derivative) | Segment is a complete match AND a valid prefix of longer matches |
| Prefix | Terminal::Partial (with remainder derivative) | Segment is a valid prefix but not a complete match |
| NoMatch | (none) | Segment does not match |
At end-of-input (empty remaining segments), the parser produces a Terminal::Partial with the full regex as remainder. This enables downstream completion generation: the remainder derivative describes exactly which tokens could legally appear next.
The distinction between extension and remainder is central to the completability analysis:
- An extension derivative on a complete terminal means the terminal could accept more characters (e.g., identifier
xcould becomexy). - A remainder derivative on a partial terminal describes what characters are needed to complete the match.
Epsilon Productions
An epsilon production (empty right-hand side, $\varepsilon$) matches zero segments and produces a nonterminal with no children. Such a nonterminal is considered complete.
Recursion Limit
The default maximum recursion depth is $d_{\max} = 15$ (DEFAULT_MAX_RECURSION_DEPTH). This is overridable per-parser and is distinct from the typing depth limit of 50.
When the depth limit is hit, the parser returns a DepthLimit error. The MetaParser wrapper in src/logic/partial/meta.rs uses iterative deepening to find the minimum sufficient depth.
Incremental Parsing
The parser supports incremental parsing across successive calls. When input $s$ is extended to $s \cdot t$:
- Normalize both segment sequences to position-independent keys.
- If the new segments start with the previous segments (prefix match), reuse the span cache.
- Otherwise, clear the cache entirely.
This prefix-stability property ensures that typing a single character does not require re-parsing the entire input from scratch.