What's in a Parser Generator?
Grammar (De-)Constructed
Among other things, chapter one defined context-free grammars and chapter two introduced the version of extended BNF to specify grammars which we have been used throughout.
Given some grammar rules written in this notation, chapter three explained how to mechanically create a scanner function from a grammar to break up an input string into the literals and tokens used as terminal alphabet in the grammar. Chapter four introduced classes for objects that can represent grammar rules as trees in such a way that a grammar rule can be viewed as a function to recognize a piece of input and, in particular, the start rule will recognize sentences of the language described by the grammar.
Finally, chapter five explained how the rules, i.e., recognition functions, can be augmented by action methods to translate a sentence from the input terminals into some convenient representation, or to manipulate it in other ways.
All of this works as soon as the grammar rules are represented
with objects from Rule and the other magic classes.
This chapter shows how this is done.
The Grammars' Grammar
Throughout, grammars have been written in a specific version of extended BNF notation.
Formally speaking, the notation amounts to a language where the sentences are grammars,
and as such there has to be a grammar describing this language.
Indeed, this grammars' grammar is
Grammar.ebnf,
a static ingredient of the Grammar class,
and it is another sentence in the language of grammars:
grammar: { rule };
rule: Token ':' alt ';';
alt: seq [{ '|' seq }];
seq: { lit | ref | opt | some };
lit: Lit;
ref: Token;
opt: '[' alt ']';
some: '{' alt '}';
To begin with, the above is just another grammar which can be studied in example 9/01.
- As usual, press to represent and check the grammar and see that there are no issues — the grammars' grammar is LL(1).
The output shows the usual grammar description
created by Grammar.toString() with the literals
':'and';'to separate a rule name from the right-hand side and to terminate a rule (line 2 above),- the literal
'|'to separate alternatives (line 3), - brackets,
'['and']', to enclose an optional list of alternatives (line 7), and - braces,
'{'and'}', to enclose a list of alternatives which can be repeated (line 8).
Such a self-referential description tends to be confusing, but the output also shows
that this grammar has two tokens, aptly named Token and Lit:
-
Tokenappears in line 2 at the beginning of the grammar rule for a grammar rule (ouch!) and, therefore, corresponds to anything in the input which can be a rule name. -
Litis the other token (ouch!) and appears in line 5 and by reference in line 4. From the looks of the grammars' grammar, line 4 describes what can be in a sequence; therefore,Litmust correspond to anything in the input which can be a literal.
Unfortunately, lines 4 and 6 together point out a problem:
Token stands for any name used in a grammar, i.e., for a rule name or a token name.
One could have taken the position that any name not defined as a rule name
must eventually be defined as a token name and provided with a pattern.
Instead, Grammar construction requires the token
definitions up front and complains if they are used as rule names.
The token definitions for the grammars' grammar must be specified in the and are as expected:
Lit /'(?:[^'\\]|\\['\\])+'/
Token /[A-Za-z][A-Za-z0-9_]*/
i.e., when represented in the input, Lit are surrounded by single quotes, with single quotes
and backslashes escaped by backslashes, and Token are alphanumeric.
The references
Grammar.terminals,
another static ingredient of the Grammar class.
This suggests two experiments. In example 9/01:
- Press to represent and check the grammars' grammar for use.
- Copy the to the and
- press to see that the grammars' grammar recognizes itself, i.e., the grammars' grammar is a sentence in the language described by the grammars' grammar.
Is it the only one?
grammar is do rule end end
rule is Token \: alt \; end
alt is seq maybe do \| seq end end end
seq is do lit or ref or opt or some end end
lit is Lit end
ref is Token end
opt is \[ alt \] end
some is \{ alt \} end
The grammars' grammar would not recognize the above as a sentence — literals are just backslash-escaped single characters, rules are not marked up with colons and semicolons, etc.
However, you can check in example 9/02 that it just takes minor tweaks to literals and token definitions for the grammars' grammar to recognize the above as a sentence:
- Press to represent and check the tweaked grammars' grammar for use, and
- press to see that the above is a sentence for the tweaked grammars' grammar.
Representing the Grammars' Grammar
The examples suggest that there are different ways to spell grammar rules — there might even be extensions to the notation used so far. Example 9/03 contains:
seq: { lit | ref | opt | some | parens };
parens: '(' alt ')';
This addition would allow parentheses containing one or more alternatives — just like brackets and braces — to be part of a sequence, i.e., the parentheses would arrange for precedence and we could embed alternatives for something that appears exactly once right into a sequence. If this were the grammars' grammar, a grammar for a sum could be
sum: Number [{ ( '+' | '-' ) Number }];
Extending the grammars' grammar is the subject of an upcoming section.
First, however, there is still the question:
how does the button
on the practice page represent a grammar — specified only as a string in the
—
with objects from
Rule and all the other magic classes
so that functions to perform lexical and syntax analysis based on that grammar
can be called by the button?
The documentation of the Grammar constructor makes it look very easy:
const g = new Grammar(grammar, tokens);
The construction only requires a string grammar
straight from the
and an object tokens which maps token names to regular expressions
describing their representation.
An empty string can even be mapped to a regular expression which describes everything which is to be ignored in the and the , such as white space and comments.
The practice page cannot hand the string from the
to the Grammar constructor directly,
but it can apply
eval to the string and hand the resulting JavaScript object to the constructor.
So, the Grammar constructor must recognize
an arbitrary grammar string as a sentence — that is a job for syntax analysis based on
the grammars' grammar represented in the magic objects.
If we had this representation, we could apply it to get it — a clear case of bootstrapping.
A Grammar object has factory methods such as
rule()
which construct Rule and all the other magic objects
and handle all the bookkeeping, too.
A rule of the grammars' grammar such as
rule: Token ':' alt ';';
is represented manually as follows:
Grammar.grammar = new Grammar(Grammar.terminals);
Grammar.grammar.rule(Grammar.grammar.nt('rule'),
Grammar.grammar.seq([
Grammar.grammar.token('Token'),
Grammar.grammar.lit("':'"),
Grammar.grammar.nt('alt'),
Grammar.grammar.lit("';'")
], null)
);
Thankfully, this has to be done manually only once,
for the grammars' grammar, and the resulting JavaScript code,
Grammar.grammar,
is another static ingredient of the class.
Any other grammar — as long as it is a sentence for the grammars' grammar —
is recognized using the parse() method of a
Parser
created by the Grammar.parser() method
which only depends on the fact that the rules of the grammars' grammar
have been represented using Rule and all the other magic classes,
and that the grammar tree has been checked as described in
chapter four.
class Grammar {
constructor (grammar, tokens, configuration) {
...
Grammar.grammar.parser().parse(grammar, ... );
The Grammar.parser() factory method
creates a Parser which calls the factory method
Grammar.scanner() to create a Scanner.
The Parser.parse() method accepts a string,
uses the Scanner to break it up into a list of
Tuple objects
each of which contains an input position, part of the input string,
and a Lit or Token object
representing this piece of the input.
Finally, the Parser.parse() method
starts syntax analysis
by calling the Rule.parse() method of the start rule
if the terminal in the first Tuple is in the expect set of the rule.
Representing Any Grammar
So far,
Grammar.grammar can recognize any grammar
that is a sentence conforming to the grammars' grammar.
However, mere recognition of some other grammar does not mean that there is
lexical and syntax analysis for the other grammar.
To get that, the rules of the other grammar have to be represented with
Rule and all the other magic classes.
This could again be done manually, but that's an error-prone process.
If the Parser.parse() function
can recognize grammars, it can call action methods so that grammars
can be represented — using the very same classes that are involved
in recognition in the first place.
That is bootstrapping!
The Actions class contains the action methods
which are used during construction of a
Grammar object for some grammar string
describing a sentence conforming to the grammars' grammar:
class Grammar {
constructor (grammar, tokens, configuration) {
...
Grammar.grammar.parser().parse(grammar, new Actions(this));
The Parser.parse() function takes the sentence string
and the action methods corresponding to the rules of the grammars' grammar
which build and check the trees for the rules in the sentence.
All action methods have access to a new, empty Grammar object this.g,
set by the constructor,
and can use factory methods such as Grammar.rule()
to build the trees just as it was done manually once only for the grammar's grammar.
No more manual labor!
The action methods of the class Actions are surprisingly simple.
Each action method receives the items collected by the rule's parse() method
and returns a new object. For example:
// lit: Lit;
lit (literal) { return this.g.lit(literal, true); }
If a literal is recognized, the name is collected by the Lit
object in the rule tree.
The action method uses the factory method of this.g
to create a Lit object for the new grammar.
// ref: Token;
ref (name) {
if (name in this.g.tokensByName) return this.g.token(name, undefined, true);
return this.g.nt(name);
}
Similarly, a Token object collects a name and creates
either a Token or an NT object for the new grammar.
// seq: { lit | ref | opt | some };
seq (some) {
if (some.flat().every(node => node instanceof Opt))
throw this.g.error(some.flat().join(', ') +
': all sequence elements are optional');
return this.g.seq(some.flat(), null);
}
Braces collect a list of lists.
Here, the content of each of the inner lists is produced by a rule,
i.e., it will be new objects such as the two produced above.
flat() moves these objects to the outer list.
Spread syntax, i.e., ..., passes the objects collected by the braces
as individual arguments to
Grammar.seq(), the factory method to represent a sequence.
// alt: seq [{ '|' seq }];
alt (seq, many) {
return many ?
this.g.alt(seq, ... many.flat(1).map(elt => elt[1])) :
this.g.alt(seq);
}
Alternatives are easy if there is only one — it was represented as a
Seq object by the previous action
and it needs to be wrapped into an Alt object.
Otherwise the brackets collect a list which contains the list of lists
produced by the braces.
many.flat() lifts the innermost lists up where
map() can pick the second element of each of the innermost lists.
Each elt[0] would have been the string representing the '|',
each elt[1] is a Seq object.
Spread syntax and the factory method Grammar.alt()
take care of representing the list of alternative sequences.
Alt objects never show up in a grammar tree —
Alt is the hidden superclass to represent
rules, brackets, and braces.
The next three actions use spread syntax and factory methods
to revise the representation:
// rule: Token ':' alt ';';
rule (name, _, alt, s) { return this.g.rule(this.g.nt(name), ...alt.seqs); }
// opt: '[' alt ']';
opt (lb, alt, rb) { return this.g.opt(...alt.seqs); }
// some: '{' alt '}';
some (lb, alt, rb) { return this.g.some(...alt.seqs); }
A grammar contains some rules. By now they have been collected and the factory methods have attached everything to the new grammar. Finally, the start rule action just returns the grammar, not yet checked:
// grammar: { rule };
grammar (r) { return this.g; }
It is up to the caller of the Parser.parse() function
to check the grammar, at least far enough to create the expect sets.
In retrospect, a few comments:
-
The spread syntax
...seems unnecessary — the factory methods could have been designed to accept arrays instead. However, the grammars' grammar had to be represented manually. A look at the source ofGrammar.grammarshows how requiring individual descendants as arguments facilitates nested calls to the factory methods. -
Creating and immediately destroying the
Altobjects could have been avoided. Instead, thealtaction could have returned a list of alternative sequences. However, some action is necessary to smooth out the different nesting depths of one and more alternatives. -
Altobjects seem to be destined to represent alternative sequences nested into a sequence when the grammars' grammar is extended with parentheses for grouping. Could there be an obstacle?
Quick Summary
-
Grammar rules can be represented as nested constructor or factory method calls to create
Ruleand other objects. -
The classes implement algorithms for grammar checking such as
shallow(),deep(), andcheck(), and recognition, i.e.,parse(). -
As a result. once the grammars' grammar is (manually) represented, any grammar can be recognized.
-
Following recognition, action methods for the rules of the grammars' grammar take care of representing the rules of any grammar string which is a sentence.
-
In summary, the grammars' grammar, once represented, can represent other grammars, and this in turn implements lexical and syntax analysis for other grammars. Their rules, too, can be augmented with actions to implement translations.
Bootstrap Example
Example 9/04 demonstrates that the grammars' grammar can recognize and represent itself.
class Actions extends EBNF.Actions { // modify top-level action
static count = 0; // count created grammars
constructor () { // create next grammar
super(new EBNF.Grammar(EBNF.Grammar.terminals));
this.g.count = ++ Actions.count;
}
grammar (some) {
this.g.check(); // check next grammar
puts(this.g.toString()); // display next grammar
puts('count', this.g.count); // display it's count
// executable uses next grammar to parse it's own rules
return () => this.g.parser().parse(
this.g.count % 2 ? program : grammar, Actions);
}
}
The Actions class is extended and contains
a static variable count to identify how often an object is created (lines 2 and 6 above).
Whenever an object is created it uses a new, empty Grammar object (line 5).
The top-level action is replaced to check and display the grammar, together with it's count (lines 10 to 12).
Rather than returning the grammar it returns a function
which will use the new Grammar object to parse
the text in the or the
and represent it using a new Actions object (lines 15 and 16).
- Press to represent and check the grammars' grammar.
- Press
to let the grammars' grammar represent and check itself and store the parser
from the new representation as the result in
run. - Press to apply this parser to the or the which both contain the grammar's grammar.
- Press a few more times:
the
countvalue in the increases because the grammar keeps representing itself.
Extending the Grammars' Grammar
The grammars' grammar can represent itself. Therefore, it is possible to extend the grammars' grammar, i.e., to modify or extend the notation in which we have been writing grammars. Example 9/05 tries to add parentheses to the notation as discussed earlier, so that alternatives can be embedded into sequences.
The extended grammar's grammar is in the . The modifications are
seq: { lit | ref | opt | some | parens };
parens: '(' alt ')';
- Press to represent and check it: it is a sentence!
The new grammar for sum
sum: Number [{ ( '+' | '-' ) Number }];
is in the .
The again extends the Actions class
to provide a definition for Number and
replace the top-level action for grammar
to check and display the new grammar
and return an executable which will call the Parser.parse() method
based on the new grammar with a sum 1 + 2 - 3.
It also contains an action for parens
// parens: '(' alt ')';
parens (lp, alt, rp) { // returns the alternatives
return alt;
}
which simply returns the
Alt object to be included in the sequence.
- Press
to represent and check the grammar for
sum.
Unfortunately...
An Alt object cannot be
a descendant of a Seq object —
sequences were supposed to only contain terminals, rule references,
brackets and braces.
Not allowing parentheses in sequences is a syntax issue,
checking that there are no other descendants is a semantic issue.
The semantic check is part of the Seq constructor.
A rule reference is allowed in a sequence, and a rule contains alternatives. Example 9/06, the next attempt, tries to add a hidden rule:
// parens: '(' alt ')';
parens (lp, alt, rp) { // returns reference to auxiliary rule
const uniq = this.g.nt(); // unique non-terminal
this.g.rule(uniq, ... alt.seqs); // uniq: alt;
return uniq;
}
The only change from the previous example is a different action for parens.
This action creates a rule with a unique name
and returns a reference to it which will be added to the parent sequence.
- Press to represent and check
the extended grammars' grammar and
press
to let the extended grammars' grammar represent and check the
new grammar for
sumfound in the .
Unfortunately...
The rule for sum got lost: the Parser.parse() method looks for a grammar,
from there for a rule, alt, seq, and eventually parens —
this is the stack of open calls on the parse() methods
by the time we reach the new action which creates the hidden rule.
This action happens to create the first Rule object ever
which becomes the start rule, rather than sum.
Third time is a charm:
the action for parens in example 9/07 is more careful
and moves the new rule to the end of the list of rules (lines 2, 8, and 14 below):
class Actions extends EBNF.Actions {
#hidden = [];
// parens: '(' alt ')';
parens (lp, alt, rp) { // returns reference to auxiliary rule
const uniq = this.g.nt(); // unique non-terminal
this.g.rule(uniq, ... alt.seqs); // uniq: alt;
this.#hidden.push(this.g.rules.pop()); // can't be start rule
return uniq;
}
// grammar: { rule };
grammar (some) {
this.g.rules.push(... this.#hidden); // append hidden rules
...
- Again, press to represent and check the extended grammars' grammar,
- press
to let the extended grammars' grammar represent and check the
new grammar for
sumfound in the , and - finally, press to apply the
new grammar for
sumto the expression1+2-3.
Bottom line: the grammars' grammar can translate itself and this allows for
changes to the notation as long as Rule
and the other magic classes support the semantics or new classes are added
to support entirely new constructs.
According to the definition of bootstrapping this can go on for many iterations — but it will eventually reach the limits of the practice page...