4. Recognizing Sentences

What's in a Sentence?

Check Before You Call!


Chapter two explained how to write grammars and chapter three explained how to extract terminals from input. This chapter discusses how to decide if an input string, i.e., a sequence of terminals delivered by lexical analysis, is a sentence of a language described by a grammar. Chapter one called this syntax analysis and it should result in a syntax tree for the sentence.

A grammar contains all the information required to create a Parser object with a method parse() which implements syntax analysis — as long as the grammar is up to the job. This chapter also shows how to check a grammar to see if it can be used to create a Parser.

Both this implementation of syntax analysis and the corresponding grammar check are very intuitive, and the technique is called LL(1) — left-to-right parsing with leftmost derivation. Chapter 10 discusses SLR(1) — left-to-right parsing with rightmost derivation in reverse — an approach which is not quite as intuitive but which can use a larger class of grammars.

Trees

Here is a grammar for a sum of one or more optionally signed numbers:

sum: term [{ more }];
more: '+' term | '-' term;
term: [ '+' | '-' ] Number | '(' sum ')';

The following input is in fact a sentence:

1 - (2 + - 3)

Arithmetic expressions such as this sum are often represented as expression trees where branch nodes are labeled with operators and leaf nodes are the numbers and variables in the expression.

arithmetic tree

The levels of such an expression tree imply precedence, i.e., evaluation order; parentheses will not appear in an expression tree.

Similarly, we can visualize the rules of a grammar as trees. Here is sum:

rule sum

Recall that the right-hand side of a rule such as sum consists of one or more alternatives. Each of these is a sequence. In this case there is only one and it consists of a reference term to another rule, followed by an optional part: brackets enclosing braces, i.e., something which may be absent but, if present, is repeated one or more times.

Braces, in turn, contain alternatives, and each of those is again a sequence. In this case there is one alternative, namely a sequence consisting of a reference more to another rule. Here is the tree for the rule more:

rule more

more has two alternatives. Each must again be a sequence. One contains the literal '+', the other contains '-', and each sequence then references the third rule, term:

rule term

term also has two alternative sequences. One contains brackets with signs and a Number token; the other contains parenthesis literals enclosing a reference to the rule sum.

Brackets indicate something optional and contain alternatives, just as braces do. In this case there are two sequences and each only contains a literal with a sign.

Recognition

For the purpose of recognition, rules can be viewed as functions. The diagrams suggest that the rule trees consist of objects which belong to a small set of classes, each of which can have methods.
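To make the rules-as-functions view concrete before turning to the object-oriented implementation, here is a hypothetical hand-coded sketch in plain JavaScript where each rule of the sum grammar becomes one function. This is an illustration only, not the book's generated parser; scan() is a toy stand-in for the lexical analysis of chapter three.

```javascript
// Toy stand-in for lexical analysis: collect numbers and single-character
// literals, skip everything else (such as white space).
function scan(input) {
  return input.match(/\d+|[-+()]/g) ?? [];
}

function recognize(input) {
  const tokens = scan(input);
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function sum() {                           // sum: term [{ more }];
    term();
    while (peek() === '+' || peek() === '-') more();
  }
  function more() {                          // more: '+' term | '-' term;
    next();                                  // sign, checked by the caller
    term();
  }
  function term() {     // term: [ '+' | '-' ] Number | '(' sum ')';
    if (peek() === '(') {
      next();
      sum();
      if (next() !== ')') throw new Error("expected ')'");
    } else {
      if (peek() === '+' || peek() === '-') next();    // optional sign
      if (!/^\d+$/.test(peek() ?? '')) throw new Error('expected Number');
      next();
    }
  }

  sum();
  return pos === tokens.length;              // all input consumed?
}
```

Note how every function inspects the next terminal before committing to an alternative — the "check before you call" idea that the rest of this chapter makes systematic.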

At the top level, the parse() method of the Parser object generated by the parser() method from the grammar delegates the recognition problem to the start rule sum by calling a parse() method on the Rule object which is the root of the tree:

rule sum

Within the sum rule, there is only one alternative and that sequence delegates to its first element, which references and delegates to the rule term:

rule term

term has two alternatives — time to look at the first terminal provided by lexical analysis. The proposed sentence is

1 - (2 + - 3)

and it starts with 1. This Number only fits the first alternative of term, and only after the optional part of the sequence is skipped.

accept 1

We are on our way to building the syntax tree which will prove that the input is a sentence. The process can be viewed in example 4/01:

  • Prepare the grammar by pressing and press to see what tuples lexical analysis creates.

  • Next toggle and press to watch how syntax analysis, i.e., the parse() method of the Parser object and the parse() methods in the rule tree, consume the tuples — one up front, and a next one whenever syntax analysis reaches the current terminal:

> g.config.lookahead = true
> g.parser().parse(program)
parser lookahead: (1) "1" Number
Number lookahead: (1) '-'
'-' lookahead: (1) '('
'(' lookahead: (1) "2" Number
Number lookahead: (1) '+'
'+' lookahead: (1) '-'
'-' lookahead: (1) "3" Number
Number lookahead: (1) ')'
')' lookahead: end of input

Back in the sum, after term has found Number 1, the sequence advances to the next element which requires finding more one or more times:

rule more

While there are two alternatives, the next input, literal '-', only fits the second alternative and we get a little further in building the parse tree:

accept 1 -

Syntax analysis continues and the parse() methods keep calling each other. The next input is the literal '(' which rule term can match with its second alternative:

accept 1 - lpar

Next, the sequence in term calls sum recursively before it will eventually match the trailing literal ')'. Here is the result, where the rule calls and the terminals have been colored for emphasis:

parse tree

The leaf nodes are all terminals, i.e., literals or tokens, and they match the proposed sentence

1 - (2 + - 3)

in order, i.e., we have found a parse tree and confirmed that the input is a sentence!

Strictly speaking, the diagram only shows that the input is a sequence of terminals (turquoise nodes) which are the leaves of an ordered tree. The definition of a syntax tree in chapter one requires that each branch node is labeled with a rule name and that the rule name and the sequence of subtrees (branch nodes and leaves) correspond to an ordered pair among the rules of the grammar. Chapter two extended the grammar notation and the extensions account for the white nodes in the diagram — sequences and the iteration constructs with brackets [ opt ] and braces { some }. Still, the red and white branch nodes interpreted together as extended notation must appear in the grammar.

Parser and parse()

How do the Parser's parse() method and the other parse() methods work? The grammar diagrams in the preceding section showed that the job can be elegantly distributed among the nodes in the grammar trees for the rules sum, more, and term:

rule sum rule more
rule term

The colors suggest that the nodes belong to the following classes:

class purpose
Rule represents a rule with a name and sequences as alternative descendants.
Seq represents a sequence with terminals, rule references, braces, and brackets as descendants.
NT represents a reference to a rule.
Some represents braces with sequences as alternative descendants which are used one or more times.
Lit represents a literal.
Token represents a token.
Opt represents brackets with sequences as alternative descendants which are optional.

For convenience there are two more classes:

  • Lit and Token extend the class T which represents terminals in general.

  • Rule, Some, and Opt extend the class Alt which deals with alternative sequences in general.

We require that the branch nodes Rule, Seq, Some, and Opt always have one or more descendants, and that a Seq cannot have only Opt descendants. This means that every parse() method will eventually find something.

Syntax analysis is implemented as parse() methods in six of the nine classes, the remaining three, Opt, Lit, and Token, simply inherit. The methods share a Parser object which owns the input tuples; the next() method is called to move on to the next tuple and it can trace lookahead. The actual implementation of syntax analysis is fairly simple:

  • The Grammar's parser() method creates the Parser object, and its parse() method is the function which will be called with the input string to create the tuples, check that the start rule can handle the first tuple, and call the parse() method for the start rule.

  • parse() for a Rule delegates to the superclass — chapter five explains what else the method can do to create a convenient representation of the sentence.

  • The superclass Alt delegates to the particular descendant which can actually deal with the current tuple. Such a descendant exists — otherwise the method calls would not have reached this point.

  • parse() for a Some, i.e., for braces, delegates to the superclass Alt until there is no longer a descendant which can deal with the current tuple — the next section explains how this is determined.

  • parse() for a Seq delegates to one descendant after another — after first checking that the descendant can handle the current tuple and throwing an error if not — with one exception: a descendant Opt is skipped if it cannot deal with the current tuple.

  • parse() for a non-terminal delegates to the referenced rule.

  • Finally, parse() for literals and tokens calls the next() method to move on to the next tuple.

Yes, all of this is just a lot of delegating the work — with the requirement to "check before you call." Alt and Seq perform some traffic control and literals and tokens nudge the scanner when they are reached.
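The delegation scheme can be sketched with a few self-contained classes. The names follow the table above, but this toy is built on my own simplifying assumptions (no Opt, no Some, expect sets computed in the constructors) and is not the book's implementation:

```javascript
// Minimal sketch of "check before you call" delegation (assumed names).
class Parser {
  constructor(tokens) { this.tokens = tokens; this.pos = 0; }
  get lookahead() { return this.tokens[this.pos]; }
  next() { return this.tokens[this.pos++]; }
}
class T {                                       // literal or token
  constructor(name) { this.expect = new Set([name]); }
  parse(parser) { return parser.next(); }       // nudge the scanner
}
class Seq {                                     // sequence of items
  constructor(...items) { this.items = items; this.expect = items[0].expect; }
  parse(parser) {
    return this.items.map(item => {
      if (!item.expect.has(parser.lookahead))   // check before you call!
        throw new Error('syntax error at ' + parser.lookahead);
      return item.parse(parser);
    });
  }
}
class Alt {                                     // alternative sequences
  constructor(...seqs) {
    this.seqs = seqs;
    this.expect = new Set(seqs.flatMap(s => [...s.expect]));
  }
  parse(parser) {                               // the caller checked, so a
    return this.seqs                            // fitting sequence exists
      .find(s => s.expect.has(parser.lookahead)).parse(parser);
  }
}

// more: '+' term | '-' term; with term reduced to a Number token stand-in
const num = new T('Number');
const more = new Alt(new Seq(new T('+'), num), new Seq(new T('-'), num));
```

Alt never searches blindly: its expect set is the union of the alternatives' sets, and the lookahead selects exactly one alternative.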

The process can be observed in detail in example 4/01:

  • represent and check the grammar by pressing .

  • Toggle and and press to watch how syntax analysis proceeds:

> g.config.lookahead = true
> g.config.parse = true
> g.parser().parse(program)
parser lookahead: (1) "1" Number
sum| Rule parse {
sum| super(sum: term [ { more } ];) parse {
sum| Seq(term [ { more } ]) parse {
sum| NT(term) parse {
  term| Rule parse {
  term| super(term: [ '+' | '-' ] Number | '(' sum ')';) parse {
  term| Seq([ '+' | '-' ] Number) parse {
  term| Token(Number) parse {
Number lookahead: (1) '-'
  term| Token(Number) parse }: '1'
  term| Seq([ '+' | '-' ] Number) parse }: [ null '1' ]
  term| super(term: [ '+' | '-' ] Number | '(' sum ')';) parse }: [ null '1' ]
  term| Rule parse }: [ null '1' ]
sum| NT(term) parse }: [ null '1' ]

Each line of output is prefixed with the rule name which has been called, nested rule calls are indented. Each line contains the class name of the node in the rule tree — or super for a call to the superclass, e.g., Rule to Alt (line 6) — and the rule fragment corresponding to the node. If a line ends with { it traces a call, otherwise it ends with }: and the return value — which will be explained in chapter five.

The above shows in complete detail how the function created by the parser() method calls on sum (line 5) which calls on term (line 9) to recognize the Number 1 (line 12). In particular, lines 12 to 14 above show that Token advances to the next tuple.

expect

"Check before you call" seems like excellent advice, but just exactly what needs to be checked?

The parse() methods described in the previous section could be implemented in an error-tolerant fashion so that every alternative is tried before syntax analysis throws in the towel when it really cannot determine that some input is in fact a sentence. This works if there is no left recursion, but it is horribly inefficient, and it would usually require moving backwards in the input in order to try some other alternative.

Instead, each rule tree object contains a Set, called expect, of terminals, one of which the object wants to see as next tuple — the lookahead — when the object's parse() method is called. This can be seen in the output above and demonstrated in example 4/01 as suggested above.

Initially, the function created by the parser() method sets up the first lookahead, and once the shared parse() method for Lit and Token is reached it calls the next() method to move on to the next lookahead.

"Check before you call" only allows calling parse() on an object if the current lookahead is in the object's expect set.

Obviously, terminals expect themselves, i.e., the expect sets for literals and tokens are singletons which can be defined during construction. What about the other objects in the rule tree?

Naively

  • a sequence should get expect from its first descendant,

  • expect sets from alternative sequences should be merged,

  • rules, braces and brackets, i.e., Rule, Some and Opt, let the superclass Alt of alternative sequences worry about expect,

and we are done?

Almost. Rule references, i.e., NT, have to take expect from their rules, and there things can get recursive... Therefore, computing expect takes two algorithms:

  • shallow is applied to all rules, one after another, and proceeds along a sequence only far enough to determine expect for the sequence. It will detect (and fail for) left recursion.

  • deep is applied to the start rule, down to all descendants, and along each sequence to the end, to determine expect for all nodes in the entire rule tree. deep cannot fail after shallow succeeded, and it detects rules that cannot be reached from the start rule.
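As an illustration, here is a sketch of the shallow idea over a toy grammar encoding of my own; brackets and braces are omitted, so the optional sign and the iteration are missing. Literal names keep their quotes, token names are capitalized, and lower-case names reference rules:

```javascript
// Toy grammar encoding (assumed for this sketch): each rule maps to
// alternatives, and each alternative is an array of items.
const rules = {
  sum:  [['term', 'more']],
  more: [["'+'", 'term'], ["'-'", 'term']],
  term: [['Number'], ["'('", 'sum', "')'"]],
};

// shallow(): look only far enough into each alternative to determine
// expect; the busy set detects (and fails for) left recursion.
function shallow(name, busy = new Set()) {
  if (busy.has(name)) throw new Error('left recursion via ' + name);
  busy.add(name);
  const expect = new Set();
  for (const alt of rules[name]) {
    const head = alt[0];                        // first item is enough here
    if (head in rules) shallow(head, busy).forEach(t => expect.add(t));
    else expect.add(head);                      // a terminal expects itself
  }
  busy.delete(name);
  return expect;
}
```

With this encoding, shallow('sum') delegates to term and returns { Number, '(' }; a rule such as bad: bad Number; would be rejected for left recursion.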

All objects representing rules have expect sets. Therefore, all classes representing rules define expect and two methods shallow() and deep().

Representing expect is easy: all literal names are distinct single-quoted strings and all token names are distinct unquoted strings, so the two kinds cannot collide. expect is a JavaScript map from a name to true, indicating presence in the set. For convenience, this is encapsulated in a dedicated class Set because the built-in class Set currently does not provide the operations needed here.

The rule tree classes implement shallow() as follows:

class shallow()
Lit, Token set expect to a singleton set with the terminal's name.
Alt delegates to all descendant sequences and merges their expect sets.
NT delegates to Rule and detects undefined rules.
Rule marks, avoids, and complains about recursion, delegates to superclass Alt.
Seq proceeds from left to right, merging descendants' expect sets until the descendant is not Opt; note that we require that there will be such a descendant.

The remaining classes Opt and Some inherit from Alt.

The rule tree classes implement deep() as follows:

class deep()
Alt needs to delegate and merge again to get to the end of all sequences.
NT delegates to Rule again and might detect more undefined rules.
Rule marks a rule as reached from the start rule, delegates to superclass Alt again to finish all sequences; recursion cannot happen.
Seq proceeds completely from right to left, merging descendants' expect sets while the descendant is an Opt; the final merged set becomes expect for the sequence.

For Lit and Token there is nothing to do. Opt and Some again inherit from Alt.

The effects of shallow() and deep() can be seen in example 4/01:

  • Toggle or and press to watch how each algorithm proceeds. Note that the algorithms are only applied once — during grammar checking.

shallow() processes each rule, starting with sum:

rule sum

The (shortened) output shows that term has to be analyzed before the result for sum is known:

> g = new EBNF.Grammar(grammar, tokens, { shallow: true })
> g.check()
sum| Rule shallow {
sum| super(sum: term [ { more } ];) shallow {
sum| Seq(term [ { more } ]) shallow {
sum| NT(term) shallow {
  term| Rule shallow {
    ...
  term| Rule shallow }: '+', '-', Number, '('
sum| NT(term) shallow }: '+', '-', Number, '('
sum| Seq(term [ { more } ]) shallow }: '+', '-', Number, '('
sum| super(sum: term [ { more } ];) shallow }: '+', '-', Number, '('
sum| Rule shallow }: '+', '-', Number, '('

As before, each line of output is prefixed with the rule name which has been called, nested rule calls are indented. Each line contains the class name of the node in the rule tree — or super for a call to the superclass, e.g., Rule to Alt (line 4) — and the rule fragment corresponding to the node. If a line ends with { it traces a call, otherwise it ends with }: and the elements of the resulting set.

rule more

The output shows that two alternative sequences have to be merged for more (line 11 below):

more| Rule shallow {
more| super(more: '+' term | '-' term;) shallow {
more| Seq('+' term) shallow {
more| Lit('+') shallow {
more| Lit('+') shallow }: '+'
more| Seq('+' term) shallow }: '+'
more| Seq('-' term) shallow {
more| Lit('-') shallow {
more| Lit('-') shallow }: '-'
more| Seq('-' term) shallow }: '-'
more| super(more: '+' term | '-' term;) shallow }: '+', '-'
more| Rule shallow }: '+', '-'

The output for sum also shows that deep() is required to compute expect for the sequence in sum hidden under the Some node at right. Here is the (shortened) output from deep:

> g = new EBNF.Grammar(grammar, tokens, { deep: true })
> g.check()
sum| Rule deep {
sum| super(sum: term [ { more } ];) deep {
sum| Seq(term [ { more } ]) deep {
sum| Opt([ { more } ]) deep {
sum| Seq({ more }) deep {
sum| Some({ more }) deep {
sum| Seq(more) deep {
sum| NT(more) deep {
  more| Rule deep {
      ...
  more| NT(term) deep {
    term| Rule deep {
      ...
    term| Rule deep }: '+', '-', Number, '('
  more| NT(term) deep }: '+', '-', Number, '('
      ...
  more| Rule deep }: '+', '-'
sum| NT(more) deep }: '+', '-'
sum| Seq(more) deep }: '+', '-'
sum| Some({ more }) deep }: '+', '-'
sum| Seq({ more }) deep }: '+', '-'
sum| Opt([ { more } ]) deep }: '+', '-'
sum| NT(term) deep {
  term| Rule deep {
  term| Rule deep }: '+', '-', Number, '('
sum| NT(term) deep }: '+', '-', Number, '('
sum| Seq(term [ { more } ]) deep }: '+', '-', Number, '('
sum| super(sum: term [ { more } ];) deep }: '+', '-', Number, '('
sum| Rule deep }: '+', '-', Number, '('

The computation happens in lines 7 to 22 above. In a Seq, deep() proceeds from right to left, i.e., in this case more is entered in line 10 before sum references term.

However, more depends on term, i.e., term will be processed in lines 14 to 16 before more is done in line 19. Later, when sum needs term in line 25, deep() does not have to enter the rule a second time.

Of course, deep() does not change expect for any of the rules themselves — that has been computed by shallow().

Ambiguity

Clearly, expect is essential for the "check before you call" policy of calling a parse() method only if the callee expects the next input symbol.

Unfortunately, this is not the whole story as example 4/02 demonstrates. Here, the grammar has a new start rule sums which should allow input to consist of more than one sum of signed numbers:

sums: { sum };
sum: term [{ more }];
more: '+' term | '-' term;
term: [ '+' | '-' ] Number | '(' sum ')';

The following input contains three sums:

1 - (2 + - 3)
4 + 5
- 6 - 7

It looks as if everything works just fine:

  • Toggle to avoid the ambiguity check and represent the grammar by pressing .

  • Then toggle and and press to watch how syntax analysis proceeds.

There is voluminous output and there are no error messages. Chapter five will explain why the output contains so many brackets.

However, upon closer inspection (and considerable pruning) the output reveals a surprise:

> g.config.lookahead = true
> g.config.parse = true
> g.parser().parse(program)
parser lookahead: (1) "1" Number
sums| Rule parse {
        ...
  sum| Rule parse {
        ...
')' lookahead: (2) "4" Number
        ...
  sum| Rule parse }: [ [ null '1' ] [ [ [ [ '-' [ '('
         [ [ null '2' ] [ [ [ [ '+' [ [ '-' ] '3' ] ] ] ] ] ]
         ')' ] ] ] ] ] ]
        ...
  sum| Rule parse {
        ...
    more| Rule parse {
        ...
      term| Token(Number) parse {
Number lookahead: (3) '-'
      term| Token(Number) parse }: '5'
        ...
    more| Rule parse }: [ '+' [ null '5' ] ]

1 - (2 + - 3), the first sum, is recognized as before (line 11), and 4 + 5, the second sum, is found as well. Unfortunately, more is happy to see '-' as a lookahead and continues to gobble up - 6 - 7 as part of a longer second sum:

        ...
  sum| Rule parse }: [ [ null '4' ] [ [ [ [ '+' [ null '5' ] ] ]
         [ [ '-' [ null '6' ] ] ] [ [ '-' [ null '7' ] ] ] ] ] ]

Surprise: expect is consulted before descendants of Some or Opt are asked to parse(), but the process is greedy, i.e., whatever follows in a sequence containing Some or Opt only gets a chance if those two let go.

A grammar is called ambiguous if different parse trees can be constructed for the same sentence.

In this example, the rule sum could succeed three times

1 - (2 + - 3)
4 + 5
- 6 - 7

as the three lines of input might suggest. However, the three lines are just one long string — white space is skipped by lexical analysis — and the output above shows that sum can recognize 4 + 5 - 6 - 7 as a longer second sum. Overall, there are two different parse trees, with two or three top-level rule:sum nodes.

This grammar is ambiguous — a typical issue when two arithmetic expressions with signed numbers can follow each other. It is easy to repair, check out example 4/03.

Grammar preparation will complain about ambiguities — unless the follow() and check() algorithms, which normally are part of preparing the grammar, are suppressed.

We have ambiguity issues

  • if two or more alternative sequences have overlapping expect sets, or
  • if braces, i.e., a Some object, and their successor in a sequence have overlapping expect sets, or
  • if brackets, i.e., an Opt object, and their successor in a sequence have overlapping expect sets.

Another ambiguity issue would arise if a sequence had a way to avoid getting to any terminal — in this case the expect of the sequence could overlap with whatever can follow the sequence. This is why our grammar definitions do not permit a sequence to only consist of brackets.

In all these cases "check before you call" has a choice within the overlaps, and blind trust in the closest expect set is not advisable.

follow()

How do we check for these ambiguities?

Overlapping expect sets among alternative sequences are easy to find, but the other cases involve two successive items in a sequence. Unfortunately, even in example 4/02 with multiple sums

sums: { sum };
sum: term [{ more }];
more: '+' term | '-' term;
term: [ '+' | '-' ] Number | '(' sum ')';

it is not obvious which expect sets overlap.

Rather than hunting for successive items, the follow() algorithm computes an individual follow set for each object in the grammar rules. The set contains all terminals which can directly follow in the input after whatever the object itself recognizes during syntax analysis. Given the follow set, we can reason about ambiguities on a per-object basis.

Just like shallow() and deep(), the follow() algorithm is implemented as a method in the various classes which are used to represent grammar rules. The method is called for an object with the Set of all terminals that can follow the object as a parameter.

The first step is that nothing can follow the start rule — at least at the top level. Therefore we call follow() for the start rule with an empty set to get things started.

class follow(set)
Alt sends the incoming set to all descendant sequences because the set can follow each descendant.
Some adds its own expect set to the incoming set and sends the result to the descendant sequences because each descendant can be repeated and thus follow any other.
Seq processes all descendants from right to left: it sends the current set to a descendant and sends the descendant's expect to the predecessor — moving from the end to the beginning of the sequence. If the current descendant is an Opt, its expect is merged with the current set before it is sent to the predecessor, because both Opt's expect and Opt's follow, i.e., the current set, can follow the predecessor.
NT adds the incoming set to the referenced rule's follow, but only if that changes something.

Rule and Opt inherit from their superclass Alt.

There is nothing to do for Lit and Token because they don't require checking.
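The right-to-left pass in Seq can be sketched as a plain function; the item representation (precomputed expect sets, an optional flag marking Opt descendants) is assumed for the illustration:

```javascript
// Sketch of Seq's follow() pass: walk the items from right to left,
// handing each item the set of terminals that can follow it.
function seqFollow(items, incoming) {
  let current = incoming;                       // what can follow the sequence
  for (let i = items.length - 1; i >= 0; --i) {
    const item = items[i];
    item.follow = current;                      // send the current set down
    current = item.optional
      ? new Set([...item.expect, ...current])   // Opt may be skipped over,
      : item.expect;                            // otherwise expect alone can
  }                                             // follow the predecessor
}
```

For a sequence "'a' [ 'b' ]" with incoming follow { EOF }, the optional 'b' receives { EOF }, and 'a' receives { 'b', EOF } because the Opt may or may not be present.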

NT is tricky: obviously it has to send the parameter to the referenced Rule because all the terminals in the incoming set can follow the rule. But at this point, e.g., the start rule might find out that there are more terminals which can follow, and the entire process has to start over. Therefore, NT only sends the parameter on to the referenced rule if it actually adds terminals to the rule's current follow set. This prevents infinite recursion because every set can at most contain the finite number of terminals which the grammar has.

It should be noted that there is a more efficient way to compute follow: If the immediate follow relationship between any two items in all rules is noted in a matrix, Warshall's algorithm is an efficient way to compute the "infinite" product of that matrix with itself and the resulting matrix describes the so-called transitive closure of the follow relation, which is just what we need. Brackets and braces, however, complicate the specification of the first matrix.
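For reference, Warshall's algorithm itself is short; this generic sketch computes the transitive closure of a relation given as a boolean matrix:

```javascript
// Warshall's algorithm: r[i][j] ends up true exactly if j can be reached
// from i through any chain of relation steps. The input is not modified.
function closure(m) {
  const n = m.length, r = m.map(row => row.slice());
  for (let k = 0; k < n; ++k)
    for (let i = 0; i < n; ++i)
      if (r[i][k])
        for (let j = 0; j < n; ++j)
          if (r[k][j]) r[i][j] = true;
  return r;
}
```

With three items and the immediate relation 0 → 1 → 2, the closure also contains 0 → 2.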

check()

The expect sets enable syntax analysis to "check before you call" and avoid trial and error.

The follow sets are used once during grammar preparation to ensure that "check before you call" always gets a unique answer, i.e., that syntax analysis cannot get greedy and must produce a unique parse tree.

A context-free grammar is called LL(1) if syntax analysis can be performed by processing input left to right, looking for the left-most derivation, i.e., starting with the start rule of the grammar, and with one terminal lookahead. The approach is also called top-down because it starts with the root of the syntax tree, i.e., the start rule of the grammar.

The following check() determines if a grammar is LL(1). It is based on the expect and follow sets and is again distributed over the classes for the objects representing the grammar rules.

class check()
Seq each descendant must be checked.
Alt the alternatives' expect sets must not overlap.
Opt expect and follow must not overlap.
Some expect and follow must not overlap.
NT does nothing(!) because check() is applied to every rule directly, not just recursively from the start rule.

In Alt, check() is very simple to implement: an Alt node has its own expect set which is the union of the alternatives' expect sets. Therefore, the alternatives do not overlap exactly if the sizes of their expect sets add up to the size of the Alt node's own expect set.
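This counting argument works for any collection of sets and can be sketched as a small helper (hypothetical, not the book's code):

```javascript
// A collection of sets is pairwise disjoint exactly if the individual
// sizes add up to the size of the union.
function disjoint(sets) {
  const union = new Set();
  let total = 0;
  for (const s of sets) {
    total += s.size;
    s.forEach(t => union.add(t));
  }
  return union.size === total;
}
```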

Ambiguity Revisited

An ambiguous grammar might still be useful — as long as it recognizes only what is intended. Example 4/04 illustrates an if statement:

stmt: Text | if;
if: 'if' Text 'then' stmt [ 'else' stmt ] 'fi';

This grammar is not ambiguous and it recognizes for example the following program:

if a then
  if b then c
  else d fi
fi

  • Press to represent and check the grammar and
  • press to perform syntax analysis.

The output can be reformatted

[ [ 'if' 'a' 'then' [ 
    [ 'if' 'b' 'then' [ 'c' ] [ 'else' [ 'd' ] ] 'fi' ]
  ] null 'fi' ] ]

and shows that else d is recognized as part of the inner if statement.

It turns out that this works because in the grammar above 'if' and 'fi' are balanced.

Example 4/05 drops 'fi' from the grammar to illustrate a more typical if statement; the input now reads:

if a then
  if b then c
  else d

This grammar is reported to be ambiguous:

  • Press to represent and check the grammar.

  • Toggle to suppress ambiguity checking,

  • press to represent and check the grammar again, and

  • press to perform syntax analysis.

Again, the output can be reformatted for clarity

[ [ 'if' 'a' 'then' [
    [ 'if' 'b' 'then' [ 'c' ] [ 'else' [ 'd' ] ] ]
  ] null ] ]

Obviously, the output no longer contains the literal 'fi' but it is otherwise unchanged!

Recognition is greedy, i.e., if it can accept an else because there is an if to be continued, it will not check whether the else might belong to an outer if. Fortunately, this is how programming languages like to interpret else clauses, and in this particular case an ambiguous grammar does the right thing.

Quick Summary

  • Syntax analysis receives terminals from lexical analysis and determines if the input is a sentence conforming to a grammar.

  • Grammar rules can be viewed as functions performing recognition, and syntax analysis starts "top-down" by calling the start rule of the grammar.

  • A grammar rule can be represented as a tree of nodes, i.e., objects of a few classes such as Rule.

  • Algorithms, such as syntax analysis, are distributed as methods over these classes, such as parse().

  • Top-down, deterministic syntax analysis is based on the principle of "check before you call": parse() can only be called on a node if the lookahead, i.e., the next input symbol, matches what the node expects.

  • "check before you call" is possible if a grammar is LL(1) (and therefore not ambiguous). This needs to be checked once when the rules are represented as trees.

  • The algorithms shallow() and deep() compute for each tree node the set expect of terminals which can be in the lookahead.

  • The algorithm follow() computes for each tree node the set of terminals that can follow the input that the node recognizes.

  • The algorithm check() compares these sets for each tree node and determines if a grammar is LL(1) and can be used for this style of syntax analysis.

  • Ambiguity is not always a bad thing; however, recognition based on an ambiguous grammar would have to be very carefully tested before it can be accepted for a project.

  • Wherever classes and methods are mentioned in this book they are linked to the documentation and from there to the (syntax-colored) source lines.

Programming Note: Set

Implementing specific set operations for LL(1) checking and syntax analysis in JavaScript (rather than using the built-in Set class) turns out to be very easy.

A Set is represented as an object with terminal names as properties which have an arbitrary value, e.g., true, just to denote their presence in the set. Note that token and literal names cannot overlap because literal names are single-quoted and tokens are not.

// is other a subset of this set?
Object.keys(other.set).every(key => key in this.set)

// intersection of this set and other, returned as a new Set
Object.keys(this.set).reduce((result, key) => {
  if (key in other.set) result.set[key] = true;
  return result;
}, new Set());

One word of caution: these sets are objects and, therefore, passed by reference. A new set has to be constructed, e.g., when a Some node adds its own expect and follow sets to send them to the descendants...
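Putting the fragments together, a complete class could look as follows. The method names are assumptions for this sketch, and the class is called EbnfSet here only to avoid shadowing JavaScript's built-in Set:

```javascript
// A possible complete set class built around the two fragments above.
class EbnfSet {
  constructor() { this.set = {}; }
  add(name) { this.set[name] = true; return this; }
  get size() { return Object.keys(this.set).length; }
  contains(other) {                     // is other a subset of this set?
    return Object.keys(other.set).every(key => key in this.set);
  }
  intersect(other) {                    // new set with the common names
    return Object.keys(this.set).reduce((result, key) => {
      if (key in other.set) result.set[key] = true;
      return result;
    }, new EbnfSet());
  }
  merge(other) {                        // union as a NEW set — see the
    const result = new EbnfSet();       // caution above about references
    Object.assign(result.set, this.set, other.set);
    return result;
  }
}
```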

Programming Note: trace()

A prudent approach to programming is to assume a priori that something will go wrong and to always instrument code so that algorithms can be observed. Unfortunately, this is likely to hide the actual algorithm behind obscure scaffolding which should not be there for production runs, e.g.,

if (debug) console.debug('some label', 'about to compute');
// ... compute something here ...
if (debug) console.debug('some label', 'something to see');

In the object-oriented approach to syntax analysis described here, algorithms such as parse(), shallow(), deep(), follow(), and check() are distributed as methods of the classes involved in representing grammar rules.

The job of each method is quite small but the methods call each other recursively across the classes and the return values get more and more complicated. Example 4/01 illustrates how instructive it is to see method calls and return values.

  • Toggle some of , , and , then press and watch how the expect and follow sets are put together.

  • Next toggle , then press and watch how the parse() methods call on each other.

Tracing function calls and return values amounts to function composition:

  • the algorithm is carried out by a method,

  • a tracer proxy announces the method call and arguments, calls the algorithm method, reports the end of the method call and the result values, and returns the method's results.

Methods are functions stored in a class' prototype, i.e., a method can be cached and replaced by a tracer proxy which informs about and internally calls the cached method, e.g.

const cache = class_.prototype.method;          // cache
class_.prototype.method = function (...arg) {   // proxy
  console.debug('about to call', class_.name, cache.name);
  const result = cache.call(this, ...arg);
  console.debug('result:', result);
  return result;
}

trace() manages caching and reporting for the essential algorithms of syntax analysis.
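The caching-and-proxy pattern can be tried on any class; here is a toy demonstration where Counter is made up purely for the illustration:

```javascript
// A toy class to be traced.
class Counter {
  constructor() { this.n = 0; }
  bump(by) { this.n += by; return this.n; }
}

const cache = Counter.prototype.bump;           // cache the original method
Counter.prototype.bump = function (...arg) {    // install the tracer proxy
  console.debug('about to call', Counter.name, cache.name, arg);
  const result = cache.call(this, ...arg);
  console.debug('result:', result);
  return result;
};
```

Callers are unaffected: new Counter().bump(3) still returns 3, but the call and its result are reported on the debug console.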

Previous: 3. Scanning Input Next: 5. Translating Sentences