9. Compiling Grammars

What's in a Parser Generator?

`Grammar` (De-)Constructed

Among other things, chapter one defined context-free grammars and chapter two introduced the version of extended BNF to specify grammars which we have been used throughout.

Given some grammar rules written in this notation, chapter three explained how to mechanically create a scanner function from a grammar to break up an input string into the literals and tokens used as terminal alphabet in the grammar. Chapter four introduced classes for objects that can represent grammar rules as trees in such a way that a grammar rule can be viewed as a function to recognize a piece of input and, in particular, the start rule will recognize sentences of the language described by the grammar.

Finally, chapter five explained how the rules, i.e., recognition functions, can be augmented by action methods to translate a sentence from the input terminals into some convenient representation, or to manipulate it in other ways.

All of this works as soon as the grammar rules are represented with objects from Rule and the other magic classes. This chapter shows how this is done.

The Grammars' Grammar

Throughout, grammars have been written in a specific version of extended BNF notation. Formally speaking, the notation amounts to a language where the sentences are grammars, and as such there has to be a grammar describing this language. Indeed, this grammars' grammar is Grammar.ebnf, a static ingredient of the Grammar class, and it is another sentence in the language of grammars:

grammar: { rule };
rule:    Token ':' alt ';';
alt:     seq [{ '|' seq }];
seq:     { lit | ref | opt | some };
lit:     Lit;
ref:     Token;
opt:     '[' alt ']';
some:    '{' alt '}';

To begin with, the above is just another grammar which can be studied in example 9/01.

As usual, press to represent and check the grammar and see that there are no issues — the grammars' grammar is LL(1).

The output shows the usual grammar description created by Grammar.toString() with the literals

':' and ';' to separate a rule name from the right-hand side and to terminate a rule (line 2 above),
the literal '|' to separate alternatives (line 3),
brackets, '[' and ']', to enclose an optional list of alternatives (line 7), and
braces, '{' and '}', to enclose a list of alternatives which can be repeated (line 8).

Such a self-referential description tends to be confusing, but the output also shows that this grammar has two tokens, aptly named Token and Lit:

Token appears in line 2 at the beginning of the grammar rule for a grammar rule (ouch!) and, therefore, corresponds to anything in the input which can be a rule name.
Lit is the other token (ouch!) and appears in line 5 and by reference in line 4. From the looks of the grammars' grammar, line 4 describes what can be in a sequence; therefore, Lit must correspond to anything in the input which can be a literal.

Unfortunately, lines 4 and 6 together point out a problem: Token stands for any name used in a grammar, i.e., for a rule name or a token name. One could have taken the position that any name not defined as a rule name must eventually be defined as a token name and provided with a pattern. Instead, Grammar construction requires the token definitions up front and complains if they are used as rule names.

The token definitions for the grammars' grammar must be specified in the and are as expected:

Lit /'(?:[^'\\]|\\['\\])+'/
Token /[A-Za-z][A-Za-z0-9_]*/

i.e., when represented in the input, Lit are surrounded by single quotes, with single quotes and backslashes escaped by backslashes, and Token are alphanumeric. The references Grammar.terminals, another static ingredient of the Grammar class.

This suggests two experiments. In example 9/01:

Press to represent and check the grammars' grammar for use.
Copy the to the and
press to see that the grammars' grammar recognizes itself, i.e., the grammars' grammar is a sentence in the language described by the grammars' grammar.

Is it the only one?

grammar is do rule end end
rule    is Token \: alt \; end
alt     is seq maybe do \| seq end end end
seq     is do lit or ref or opt or some end end
lit     is Lit end
ref     is Token end
opt     is \[ alt \] end
some    is \{ alt \} end

The grammars' grammar would not recognize the above as a sentence — literals are just backslash-escaped single characters, rules are not marked up with colons and semicolons, etc.

However, you can check in example 9/02 that it just takes minor tweaks to literals and token definitions for the grammars' grammar to recognize the above as a sentence:

Press to represent and check the tweaked grammars' grammar for use, and
press to see that the above is a sentence for the tweaked grammars' grammar.

Representing the Grammars' Grammar

The examples suggest that there are different ways to spell grammar rules — there might even be extensions to the notation used so far. Example 9/03 contains:

seq:    { lit | ref | opt | some | parens };
parens: '(' alt ')';

This addition would allow parentheses containing one or more alternatives — just like brackets and braces — to be part of a sequence, i.e., the parentheses would arrange for precedence and we could embed alternatives for something that appears exactly once right into a sequence. If this were the grammars' grammar, a grammar for a sum could be

sum: Number [{ ( '+' | '-' ) Number }];

Extending the grammars' grammar is the subject of an upcoming section.

First, however, there is still the question: how does the button on the practice page represent a grammar — specified only as a string in the — with objects from Rule and all the other magic classes so that functions to perform lexical and syntax analysis based on that grammar can be called by the button?

The documentation of the Grammar constructor makes it look very easy:

const g = new Grammar(grammar, tokens);

The construction only requires a string grammar straight from the and an object tokens which maps token names to regular expressions describing their representation.

An empty string can even be mapped to a regular expression which describes everything which is to be ignored in the and the , such as white space and comments.

The practice page cannot hand the string from the to the Grammar constructor directly, but it can apply eval to the string and hand the resulting JavaScript object to the constructor.

So, the Grammar constructor must recognize an arbitrary grammar string as a sentence — that is a job for syntax analysis based on the grammars' grammar represented in the magic objects.

If we had this representation, we could apply it to get it — a clear case of bootstrapping.

A Grammar object has factory methods such as rule() which construct Rule and all the other magic objects and handle all the bookkeeping, too. A rule of the grammars' grammar such as

rule: Token ':' alt ';';

is represented manually as follows:

Grammar.grammar = new Grammar(Grammar.terminals);
Grammar.grammar.rule(Grammar.grammar.nt('rule'),
  Grammar.grammar.seq([
    Grammar.grammar.token('Token'),
    Grammar.grammar.lit("':'"),
    Grammar.grammar.nt('alt'),
    Grammar.grammar.lit("';'")
  ], null)
);

Thankfully, this has to be done manually only once, for the grammars' grammar, and the resulting JavaScript code, Grammar.grammar, is another static ingredient of the class.

Any other grammar — as long as it is a sentence for the grammars' grammar — is recognized using the parse() method of a Parser created by the Grammar.parser() method which only depends on the fact that the rules of the grammars' grammar have been represented using Rule and all the other magic classes, and that the grammar tree has been checked as described in chapter four.

class Grammar {
  constructor (grammar, tokens, configuration) {
    ...
    Grammar.grammar.parser().parse(grammar, ... );

The Grammar.parser() factory method creates a Parser which calls the factory method Grammar.scanner() to create a Scanner. The Parser.parse() method accepts a string, uses the Scanner to break it up into a list of Tuple objects each of which contains an input position, part of the input string, and a Lit or Token object representing this piece of the input. Finally, the Parser.parse() method starts syntax analysis by calling the Rule.parse() method of the start rule if the terminal in the first Tuple is in the expect set of the rule.

Representing Any Grammar

So far, Grammar.grammar can recognize any grammar that is a sentence conforming to the grammars' grammar.

However, mere recognition of some other grammar does not mean that there is lexical and syntax analysis for the other grammar. To get that, the rules of the other grammar have to be represented with Rule and all the other magic classes.

This could again be done manually, but that's an error-prone process. If the Parser.parse() function can recognize grammars, it can call action methods so that grammars can be represented — using the very same classes that are involved in recognition in the first place. That is bootstrapping!

The Actions class contains the action methods which are used during construction of a Grammar object for some grammar string describing a sentence conforming to the grammars' grammar:

class Grammar {
  constructor (grammar, tokens, configuration) {
    ...
    Grammar.grammar.parser().parse(grammar, new Actions(this));

The Parser.parse() function takes the sentence string and the action methods corresponding to the rules of the grammars' grammar which build and check the trees for the rules in the sentence. All action methods have access to a new, empty Grammar object this.g, set by the constructor, and can use factory methods such as Grammar.rule() to build the trees just as it was done manually once only for the grammar's grammar. No more manual labor!

The action methods of the class Actions are surprisingly simple. Each action method receives the items collected by the rule's parse() method and returns a new object. For example:

// lit: Lit;
lit (literal) { return this.g.lit(literal, true); }

If a literal is recognized, the name is collected by the Lit object in the rule tree. The action method uses the factory method of this.g to create a Lit object for the new grammar.

// ref: Token;
ref (name) { 
  if (name in this.g.tokensByName) return this.g.token(name, undefined, true);
  return this.g.nt(name);
}

Similarly, a Token object collects a name and creates either a Token or an NT object for the new grammar.

// seq: { lit | ref | opt | some };
seq (some) {
  if (some.flat().every(node => node instanceof Opt))
    throw this.g.error(some.flat().join(', ') +
      ': all sequence elements are optional');
  return this.g.seq(some.flat(), null);
}

Braces collect a list of lists. Here, the content of each of the inner lists is produced by a rule, i.e., it will be new objects such as the two produced above. flat() moves these objects to the outer list. Spread syntax, i.e., ..., passes the objects collected by the braces as individual arguments to Grammar.seq(), the factory method to represent a sequence.

// alt: seq [{ '|' seq }];
alt (seq, many) { 
  return many ?
    this.g.alt(seq, ... many.flat(1).map(elt => elt[1])) :
    this.g.alt(seq);
}

Alternatives are easy if there is only one — it was represented as a Seq object by the previous action and it needs to be wrapped into an Alt object.

Otherwise the brackets collect a list which contains the list of lists produced by the braces. many.flat() lifts the innermost lists up where map() can pick the second element of each of the innermost lists. Each elt[0] would have been the string representing the '|', each elt[1] is a Seq object. Spread syntax and the factory method Grammar.alt() take care of representing the list of alternative sequences.

Alt objects never show up in a grammar tree — Alt is the hidden superclass to represent rules, brackets, and braces. The next three actions use spread syntax and factory methods to revise the representation:

// rule: Token ':' alt ';';
rule (name, _, alt, s) { return this.g.rule(this.g.nt(name), ...alt.seqs); }

// opt: '[' alt ']';
opt (lb, alt, rb) { return this.g.opt(...alt.seqs); }

// some: '{' alt '}';
some (lb, alt, rb) { return this.g.some(...alt.seqs); }

A grammar contains some rules. By now they have been collected and the factory methods have attached everything to the new grammar. Finally, the start rule action just returns the grammar, not yet checked:

// grammar: { rule };
grammar (r) { return this.g; }

It is up to the caller of the Parser.parse() function to check the grammar, at least far enough to create the expect sets.

In retrospect, a few comments:

The spread syntax ... seems unnecessary — the factory methods could have been designed to accept arrays instead. However, the grammars' grammar had to be represented manually. A look at the source of Grammar.grammar shows how requiring individual descendants as arguments facilitates nested calls to the factory methods.
Creating and immediately destroying the Alt objects could have been avoided. Instead, the alt action could have returned a list of alternative sequences. However, some action is necessary to smooth out the different nesting depths of one and more alternatives.
Alt objects seem to be destined to represent alternative sequences nested into a sequence when the grammars' grammar is extended with parentheses for grouping. Could there be an obstacle?

Quick Summary

Grammar rules can be represented as nested constructor or factory method calls to create Rule and other objects.
The classes implement algorithms for grammar checking such as shallow(), deep(), and check(), and recognition, i.e., parse().
As a result. once the grammars' grammar is (manually) represented, any grammar can be recognized.
Following recognition, action methods for the rules of the grammars' grammar take care of representing the rules of any grammar string which is a sentence.
In summary, the grammars' grammar, once represented, can represent other grammars, and this in turn implements lexical and syntax analysis for other grammars. Their rules, too, can be augmented with actions to implement translations.

Bootstrap Example

Example 9/04 demonstrates that the grammars' grammar can recognize and represent itself.

class Actions extends EBNF.Actions {  // modify top-level action
  static count = 0;                   // count created grammars

  constructor () {                    // create next grammar
    super(new EBNF.Grammar(EBNF.Grammar.terminals));
    this.g.count = ++ Actions.count;
  }
  
  grammar (some) { 
    this.g.check();                   // check next grammar
    puts(this.g.toString());          // display next grammar
    puts('count', this.g.count);      // display it's count
    
    // executable uses next grammar to parse it's own rules
    return () => this.g.parser().parse(
      this.g.count % 2 ? program : grammar, Actions);
  }
}

The Actions class is extended and contains a static variable count to identify how often an object is created (lines 2 and 6 above). Whenever an object is created it uses a new, empty Grammar object (line 5).

The top-level action is replaced to check and display the grammar, together with it's count (lines 10 to 12). Rather than returning the grammar it returns a function which will use the new Grammar object to parse the text in the or the and represent it using a new Actions object (lines 15 and 16).

Press to represent and check the grammars' grammar.
Press to let the grammars' grammar represent and check itself and store the parser from the new representation as the result in run.
Press to apply this parser to the or the which both contain the grammar's grammar.
Press a few more times: the count value in the increases because the grammar keeps representing itself.

Extending the Grammars' Grammar

The grammars' grammar can represent itself. Therefore, it is possible to extend the grammars' grammar, i.e., to modify or extend the notation in which we have been writing grammars. Example 9/05 tries to add parentheses to the notation as discussed earlier, so that alternatives can be embedded into sequences.

The extended grammar's grammar is in the . The modifications are

seq:     { lit | ref | opt | some | parens };
parens:  '(' alt ')';

Press to represent and check it: it is a sentence!

The new grammar for sum

sum: Number [{ ( '+' | '-' ) Number }];

is in the .

The again extends the Actions class to provide a definition for Number and replace the top-level action for grammar to check and display the new grammar and return an executable which will call the Parser.parse() method based on the new grammar with a sum 1 + 2 - 3. It also contains an action for parens

// parens: '(' alt ')';
parens (lp, alt, rp) {              // returns the alternatives
  return alt;
}

which simply returns the Alt object to be included in the sequence.

Press to represent and check the grammar for sum.

Unfortunately...

An Alt object cannot be a descendant of a Seq object — sequences were supposed to only contain terminals, rule references, brackets and braces. Not allowing parentheses in sequences is a syntax issue, checking that there are no other descendants is a semantic issue. The semantic check is part of the Seq constructor.

A rule reference is allowed in a sequence, and a rule contains alternatives. Example 9/06, the next attempt, tries to add a hidden rule:

// parens: '(' alt ')';
parens (lp, alt, rp) {   // returns reference to auxiliary rule
  const uniq = this.g.nt();              // unique non-terminal
  this.g.rule(uniq, ... alt.seqs);                // uniq: alt;
  return uniq;
}

The only change from the previous example is a different action for parens. This action creates a rule with a unique name and returns a reference to it which will be added to the parent sequence.

Press to represent and check the extended grammars' grammar and press to let the extended grammars' grammar represent and check the new grammar for sum found in the .

Unfortunately...

The rule for sum got lost: the Parser.parse() method looks for a grammar, from there for a rule, alt, seq, and eventually parens — this is the stack of open calls on the parse() methods by the time we reach the new action which creates the hidden rule. This action happens to create the first Rule object ever which becomes the start rule, rather than sum.

Third time is a charm: the action for parens in example 9/07 is more careful and moves the new rule to the end of the list of rules (lines 2, 8, and 14 below):

class Actions extends EBNF.Actions {
  #hidden = [];
  
  // parens: '(' alt ')';
  parens (lp, alt, rp) {   // returns reference to auxiliary rule
    const uniq = this.g.nt();              // unique non-terminal
    this.g.rule(uniq, ... alt.seqs);                // uniq: alt;
    this.#hidden.push(this.g.rules.pop()); // can't be start rule
    return uniq;
  }

  // grammar: { rule };
  grammar (some) {
    this.g.rules.push(... this.#hidden);   // append hidden rules
      ...

Again, press to represent and check the extended grammars' grammar,
press to let the extended grammars' grammar represent and check the new grammar for sum found in the , and
finally, press to apply the new grammar for sum to the expression 1 + 2 - 3.

Bottom line: the grammars' grammar can translate itself and this allows for changes to the notation as long as Rule and the other magic classes support the semantics or new classes are added to support entirely new constructs.

According to the definition of bootstrapping this can go on for many iterations — but it will eventually reach the limits of the practice page...