Terminals, Tokens, and Literals
Taking Input Apart
Chapter one explained that lexical analysis, the first phase of the translation process in a compiler, breaks the source program — usually a string — into parts called terminals in order to simplify the job of the next phase, syntax analysis.
Chapter two explained that a grammar describes sentences, i.e., sequences of terminals which conform to the rules of the grammar.
Therefore, a grammar will at least describe what terminals it needs for the rules, even if it does not really need to describe what the terminals look like in an actual source program.
Our grammar notation distinguishes two kinds of terminals, literals and tokens:
- A literal is a single-quoted string in a rule; the content of the string represents the literal in the source program. Backslashes are used to escape single quotes and backslashes in the string content of the rule.
- A token is just a name in a rule, different from all non-terminal names. Tokens are defined as key-value pairs (properties) of a JavaScript object where the key is the token name and the value is a regular expression (search pattern) which describes how the token is represented in the source program.
If the grammar in example 2/10 is represented and checked, the output is a short description of the grammar:
0 sum: Number { '+' Number | '-' Number };
literals: '+', '-'
tokens: Number
The description contains a numbered list of rules, a list of single-quoted literals, and a list of token names used in the rules. The token patterns themselves are defined separately as a JavaScript object:
{ Number: /[0-9]+/ }
Together with the convention to ignore white space this is enough information to break the source program up into terminals:
- Press to represent and check the grammar and
- press to perform lexical analysis.
The output shows how the input string
12 - 345 + 6789
is cut into pieces:
> g.scanner().pattern = /(\s+)|([0-9]+)|(\-)|(\+)/gm
> g.scanner().scan(program)
(1) "12" Number
(1) '-'
(1) "345" Number
(1) '+'
(1) "6789" Number
The first line in the output displays the search pattern which is used for lexical analysis. It will be discussed below. The second line contains the JavaScript expression to create a scanner and perform lexical analysis.
Each of the remaining lines in the output
describes one Tuple object,
consisting of a line number,
a piece of the input,
and a classification as a terminal:
- For a literal the piece of input is enclosed in single quotes.
- For a token the piece of input is enclosed in double quotes and the token name is shown, too.
- Illegal input is enclosed in double quotes and there is no token name.
The output is independent of any white space which may or may not be in the input.
If, for example, some newline characters are inserted between the numbers and operators,
only the line numbers in the Tuple objects will change.
The button uses the scan() method
of a Scanner object
which is constructed from the grammar.
This chapter explains how an object of the Scanner
class chops up the input.
Search Patterns
Computer users always search for specific pieces of text — buried in files or among collections of file names. Even the earliest versions of operating systems such as DOS, Unix®, or Windows, provided more than "literal" searches — they supported search patterns. For example
rm abc d?f g*i [0-9*a-z]\*
This Unix® command might not be terribly realistic, but it demonstrates the principle. It would remove
- the file literally named abc,
- all files with three-character names starting with d and ending with f,
- all files with names starting with g, ending with i, and containing zero or more characters in-between,
- and, finally, all files with a name consisting of a single digit, asterisk, or lower-case letter, followed by a literal asterisk.
Search patterns are usually called regular expressions because they employ a very concise notation for regular grammars, i.e., context-free grammars where the right-hand side of a rule may only contain a terminal or a terminal followed by a single non-terminal.
The patterns above demonstrate that this allows for iteration, but it does not allow, e.g., for a construction that can balance parentheses.
Regular expressions can have very efficient implementations based on the theory of finite state automata.
One has to be cautious, however. While most modern programming languages support regular expressions for searching and text validation, neither the syntax nor the semantics of regular expressions is consistent across different languages, e.g.,
(a.*b)|[cde]+
should find a followed by any number of characters and then b,
or find a sequence of one or more of the characters c, d, or e.
However, depending on the platform,
the expression might well find the first match, or the longest match,
or even all matches...
The EBNF module/parser generator is implemented in JavaScript; therefore these tutorials will use and discuss regular expressions as they are supported in JavaScript.
Different flavors of regular expressions can be tested on the Regular Expressions 101 website.
Regular Expressions 101
This section is a crash course on patterns as they are supported in JavaScript, specifically, to deal with the question how typical terminals in programming languages can be expressed through patterns. Many of the more esoteric aspects of regular expressions are not covered here.
A pattern is used to recognize part of an input string. Patterns are a shorthand notation for a type of grammars; therefore, we should expect that there is a concept of literals and tokens for patterns:
- Letters, digits, and (very few) special characters are literals and recognize themselves. This can be used to recognize words in a programming language:
if then else while do
- A period is a token in the language of regular expressions which recognizes any character with the exception of a line separator:
.
- Brackets, [ and ], enclose a sequence of characters, which is termed a character class. This token recognizes any single one of its constituent characters:
[0123456789]
- If the sequence inside brackets starts with ^ the token will recognize any single character not in the sequence — most of the time this is more than one bargained for:
[^a-z]
A - between two characters inside the sequence denotes an inclusive range. All of this together can be used to recognize single digits, single lower-case letters, or all special characters in the ASCII character set:
[0-9] [a-z] [!-/:-@[-`{-~]
- Most non-alphanumeric characters have a special meaning in a pattern. This can be defeated by preceding the character with a \ (backslash). In particular, \\ is the literal to recognize one backslash. Simply preceding every character in a pattern with a backslash is unwise because some letters acquire special meaning when preceded by a backslash. For example, this literal recognizes any single white space character, even outside the ASCII character set:
\s
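These classes and escapes can be verified directly in JavaScript. The throwaway checks below are a sketch; each pattern is anchored with ^ and $ so that exactly one character must match:

```javascript
// Character classes from above, anchored to match a single character only.
const digit   = /^[0-9]$/;
const lower   = /^[a-z]$/;
const special = /^[!-\/:-@\[-`{-~]$/;  // all ASCII special characters
const space   = /^\s$/;

if (!digit.test('7'))      throw new Error('7 is a digit');
if (lower.test('A'))       throw new Error('A is not lower-case');
if (!special.test('+'))    throw new Error('+ is an ASCII special character');
if (!space.test('\u00a0')) throw new Error('\\s matches beyond ASCII');
```

The last check uses a non-breaking space (U+00A0) to confirm that \s indeed reaches beyond the ASCII character set.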
Chapter two explained that grammars use rules built from literals and tokens to describe more complicated sentences.
Instead of rules, regular expressions use certain operators to build complicated patterns from simpler pieces. From a semantic perspective the constructs have much in common with the extensions to BNF discussed previously.
- To begin with, a sequence of pattern pieces recognizes a sequence of input parts each of which matches the corresponding pattern piece.
For example, this sequence recognizes command language
variable names $0 through $9:
\$[0-9]
- One should always escape $ because $ at the end of a pattern anchors recognition, i.e., it demands that recognition extend all the way to the end of the input.
- ^ at the beginning of a pattern also anchors recognition, i.e., recognition must start at the beginning of the input. For example, this pattern recognizes a string only if it consists exactly of two command variable names separated by exactly one white space character:
^\$[0-9]\s\$[0-9]$
- Iteration is expressed by ?, +, or * directly following a pattern piece to indicate that the piece is optional, recognized one or more times, or both, in the input.
This can be used, e.g., to recognize optionally signed integer numbers (line 1 below), alphanumeric variable names starting with a letter (line 2), or such a name and number optionally surrounded and definitely separated by white space as the only input (line 3):
[-+]?[0-9]+
[a-zA-Z][a-zA-Z0-9]*
^\s*[a-zA-Z][a-zA-Z0-9]*\s+[-+]?[0-9]+\s*$
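These three patterns can be tried out in JavaScript; test() returns true exactly if the pattern matches, and the anchors in the third pattern force it to cover the whole input:

```javascript
// The three patterns from above: a signed integer, a variable name,
// and a whole line consisting of a name and a number.
const signedInt = /[-+]?[0-9]+/;
const name      = /[a-zA-Z][a-zA-Z0-9]*/;
const line      = /^\s*[a-zA-Z][a-zA-Z0-9]*\s+[-+]?[0-9]+\s*$/;

if (!signedInt.test('-42')) throw new Error('signed integer expected');
if (!name.test('x1'))       throw new Error('variable name expected');
if (!line.test(' x1 -42 ')) throw new Error('name and number, separated');
if (line.test('x1-42'))     throw new Error('white space separation is required');
```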
- Patterns support alternatives: pattern pieces can be joined with the operator |.
- Just as in grammars, sequences take precedence over alternatives.
- Iterations take precedence over sequences.
Here is a pattern for a number which would be interpreted as decimal, hexadecimal, or octal, depending on the prefix:
[1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*
The order of alternatives can make a difference
because JavaScript is happy with the first alternative that matches.
Use the Regular Expressions 101 website
to see that the previous pattern will not recognize 0xA
if the last alternative is specified first, i.e.,
0[0-7]*|[1-9][0-9]*|0[xX][0-9a-fA-F]+
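The effect of the alternative order can also be observed in Node.js: exec() returns the match as the first array element, and with the octal alternative first the match for 0xA stops after the leading 0. A quick sketch to confirm the point:

```javascript
// Alternatives in the recommended order and in the broken order.
const good = /[1-9][0-9]*|0[xX][0-9a-fA-F]+|0[0-7]*/;
const bad  = /0[0-7]*|[1-9][0-9]*|0[xX][0-9a-fA-F]+/;

if (good.exec('0xA')[0] !== '0xA') throw new Error('hexadecimal alternative wins');
if (bad.exec('0xA')[0]  !== '0')   throw new Error('octal alternative grabs the 0 first');
```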
- Parentheses, (?: and ), can be used to group pattern pieces and thus change precedence and nest alternatives into sequences and both into iterations.
Here is a pattern for a non-empty, single-quoted string
which uses backslash to escape a single quote, backslash, or newline (represented as \n):
'(?:\\['\\\n]|[^'\\\n])+'
Admittedly, JavaScript patterns also allow simple parentheses; however, simple parentheses introduce capture groups which serve a very special purpose that is critical for the scan() method developed in the next section.
We have already seen the patterns needed to describe most of the tokens for a typical programming language. Among them, strings are the most complicated but the last example showed how escape conventions and newlines can be handled.
Slightly more complicated is the removal of comments from input. Patterns cannot balance nests of leading and trailing sequences; therefore, patterns cannot deal with nested comments. Other than that:
- Line comments are simple — they range from a leading sequence to the end of a line or of the input:
\/\/[^\n]*(?:\n|$)
- JavaScript uses // for the lead of a comment and also uses / to enclose regular expressions; therefore, the pattern above for JavaScript comments escapes the leading //.
- Comments enclosed by unique single characters use the same design. E.g., Pascal encloses comments in braces and allows multiple lines in a comment:
{[^}]*}
- Finally, comments enclosed in overlapping sequences, such as /* and */ in JavaScript, require careful specification:
\/\*(?:[^*]|\*+[^/*])*\*+\/
Here, zero or more character groups follow the opening /* and
one or more asterisks precede the closing slash. Each character group
consists of any single character but an asterisk, or of one or more
asterisks followed by a single character which is neither a slash nor an asterisk.
Use the Regular Expressions 101 website
to see that, e.g., /***/ is recognized but /*/ is not.
As far as possible, example 3/01 contains all the pattern examples in this section.
- Press to represent and check the grammar, and then
- press to see what tuples are found in the input.
- Some definitions are not used in the grammar. Why not?
- Note that if you click on the label of a text area the area will expand. Shift-click will restore the default layout, and alt/option-click tries to open a separate, non-editable window with syntax highlighting.
scan()
Given the literals and descriptions for tokens,
how does the scan() method
perform lexical analysis,
i.e., take an input string
and return the list of Tuple objects
which describe the terminals in the string?
JavaScript represents regular expressions as objects of the class RegExp.
An object can be constructed by enclosing a pattern in slashes,
or it can be constructed from a string — but then, backslash and quotes would have to be
escaped according to string rules before they play their role in a pattern.
The exec() method for a RegExp object accepts a string.
If the regular expression matches the string somewhere
the method returns an array, otherwise it returns null.
The first element in the array is that part of the string which was matched.
The remaining elements in the array, if any,
contain the components of the matched part
which the capture groups in the regular expression, if any, matched.
This gets tricky if capture groups are nested,
but here is a simple example
which should suggest how exec() can be set up
to do the job of lexical analysis.
Consider a string with the alphabet
abcdefghijklmnopqrstuvwxyz
and consider a pattern with three capture groups
([a-f]+)(.*)([u-z]+)
- Matches are greedy, i.e., the first group matches the beginning of the alphabet, and it will grab all of the letters abcdef,
- the third group only matches the end of the alphabet, i.e., in this case z,
- because the second group is greedy, too, and matches all the letters in-between.
Therefore, the JavaScript function call
/([a-f]+)(.*)([u-z]+)/.exec('abcdefghijklmnopqrstuvwxyz')
returns an array with four elements of length 26, 6, 19, and 1.
The pattern would also match the string JavaScript
because the first capture group is happy with the first a,
the second group can match nothing,
and the third group matches v.
The resulting array again contains four strings:
the matched part av, followed by a, the empty string, and v —
check it out on the Regular Expressions 101 website.
exec() can be configured by appending certain flags to the pattern,
such as
- g (global) arranges that exec() can be called repeatedly to find all matches of the pattern in the string.
- m (multiline) allows the anchors ^ and $ to also match at line breaks within the string.
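With the g flag, exec() remembers in the pattern's lastIndex property where the previous match ended and continues from there on the next call. A sketch of the resulting loop:

```javascript
// Collect all numbers in the input by calling exec() repeatedly.
const numbers = /[0-9]+/g;
const input = '12 - 345 + 6789';

const found = [];
let match;
while ((match = numbers.exec(input)) !== null) {
  found.push(match[0]);  // matched part; numbers.lastIndex now points past it
}
if (found.join() !== '12,345,6789') throw new Error('three numbers expected');
```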
A capture group extracts a piece of input and classifies it in the resulting list by the group's position in the pattern — extraction and classification is what lexical analysis has to do.
Example 3/02 creates the following pattern with four alternatives,
each a capture group,
namely white space, "names" (which include literals like if),
"numbers", and the operator <=:
> g.scanner().pattern = /(\s+)|([a-z]+)|([0-9]+)|(\<\=)/gm
Depending on the string matched against the pattern, exactly one group will capture something — unless the entire match fails.
As for lexical analysis:
- If nothing is matched or if the match does not start at the beginning of the input string there are illegal characters in the input. They can be represented by a specific Tuple object which does not classify the input.
- A match of the first group has to be trimmed off the input — the scanner skips, e.g., white space.
- The next set of groups consists of the tokens of the grammar which might have to be screened further. In this example, these are sequences of lower-case letters ("names") and sequences of digits ("numbers").
- Note that something like iffy should be recognized as a name and not as the literal if followed by the name fy.
- The last set of groups consists of the literals of the grammar which are not matched by the tokens, in this case <=.
The lexical analysis pattern can be composed as alternatives of capture groups with
- one pattern for white space and comments,
- one pattern for each token,
- and finally one pattern for each literal, in order of decreasing length so that, e.g., <= is recognized before <.
- Patterns for literals which are matched by a token pattern can be omitted.
This assumes that white space and token patterns all match different input, and that the result of a token match such as "names" is screened, e.g., by looking the matched input up in a map of literal values which take precedence over the token.
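A minimal sketch of this technique in plain JavaScript (not the actual Scanner class of the EBNF module) combines the capture groups of example 3/02 with a loop over exec(): the index of the capturing group classifies each match, and gaps between matches are reported as illegal input:

```javascript
// One capture group per terminal class; group 1 (white space) is skipped.
const pattern = /(\s+)|([a-z]+)|([0-9]+)|(\<\=)/gm;
const classes = [null, null, 'Name', 'Number', "'<='"];

function scan(input) {
  const tuples = [];
  let match, last = 0;
  pattern.lastIndex = 0;                      // restart the global pattern
  while ((match = pattern.exec(input)) !== null) {
    if (match.index > last)                   // gap: illegal characters
      tuples.push(['illegal', input.slice(last, match.index)]);
    const group = match.findIndex((m, i) => i > 0 && m !== undefined);
    if (group > 1)                            // group 1 is white space: skip
      tuples.push([classes[group], match[0]]);
    last = pattern.lastIndex;
  }
  if (last < input.length)                    // trailing illegal characters
    tuples.push(['illegal', input.slice(last)]);
  return tuples;
}
```

scan('iffy <= 12') then returns Name iffy, the literal <=, and Number 12; in particular, iffy is reported as one name and not as if followed by fy.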
In general, overlapping patterns can cause problems.
It is not possible to check that the patterns are "different" enough
when the Scanner is constructed
but example 3/01 and example 3/02
show how a grammar and some input can be used to test lexical analysis thoroughly:
- The grammar should contain all tokens and literals as simple alternatives. Press to represent the grammar.
- The input should contain representations of all tokens and literals to be tested — the input does not have to be a sentence. Press to see what tuples are found.
Example 3/03 illustrates that the token patterns may have to be ordered. Consider integer constants, represented as digit strings, and floating point constants, required to contain a decimal point:
Int: /0|[1-9][0-9]*/,
Real: /(?:0|[1-9][0-9]*)\.[0-9]*/
The input string is
1 2.3
If the scanner()
places the Int pattern before the Real pattern
lexical analysis will find three Int tokens and one illegal character:
> g.scanner().pattern =
/(\s+)|(0|[1-9][0-9]*)|((?:0|[1-9][0-9]*)\.[0-9]*)/gm
> g.scanner().scan(program)
(1) "1" Int
(1) "2" Int
(1) "."
(1) "3" Int
As a crutch to fix this problem,
the token patterns are sorted by name
before they are combined for
the lexical analysis pattern.
In example 3/04 the token name is changed from Real to Float.
- Press to prepare the grammar, and
- press to test the lexical analysis pattern.
Now the order of the token patterns is interchanged, Float before Int.
The floating point pattern will find the floating point number
and the output consists of two tuples:
> g.scanner().pattern =
/(\s+)|((?:0|[1-9][0-9]*)\.[0-9]*)|(0|[1-9][0-9]*)/gm
> g.scanner().scan(program)
(1) "1" Int
(1) "2.3" Float
Note that 0. and 0.1 are valid but .1 is not.
This could be changed with an alternative in the Float pattern.
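The effect of the token order can be reproduced with plain alternatives, outside the EBNF module: the first alternative that matches wins, so Int before Real truncates the match at the decimal point:

```javascript
// The two token patterns combined in both orders.
const intFirst  = /(?:0|[1-9][0-9]*)|(?:0|[1-9][0-9]*)\.[0-9]*/;
const realFirst = /(?:0|[1-9][0-9]*)\.[0-9]*|(?:0|[1-9][0-9]*)/;

if (intFirst.exec('2.3')[0]  !== '2')   throw new Error('Int wins, stops at the point');
if (realFirst.exec('2.3')[0] !== '2.3') throw new Error('Real first matches the whole number');
```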
Screening
Many programming languages use reserved words,
i.e., they forbid that literals such as if, then, and else
are used as variable names — these literals should always
indicate a control structure.
From the perspective of a grammar for a programming language,
variable names are a class of inputs,
i.e., they are a token in the grammar,
and the scanner must not report a reserved word
like if as the particular token for variable names.
Screening prevents that.
A grammar is represented as a Grammar object.
The scanner() method
is called with an optional pattern argument
which describes the parts of input which should be ignored,
by default white space.
scanner() returns a Scanner object
which contains the lexical analysis pattern described above and a list, terminals,
of the unique Lit and Token objects
corresponding by position to the capture groups of the pattern.
The scan() method
of this Scanner object
takes an input string,
applies the lexical analysis pattern,
and classifies the matching part of the input
as a literal or token, represented as a
Lit or
Token object,
by using the index of the capture group and the terminals list.
If the scan() method
proposes a part of the input as a token while it could be a literal
it is used as an index into a map of literals with which the token overlaps —
this should be an efficient operation in JavaScript.
If the string is found, the result is the corresponding
Lit object
in place of the proposed Token object.
Thus, literals masquerading as tokens are detected
and lexical analysis can report them properly.
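A sketch of screening with a plain object as the map follows; the EBNF module's internal representation may differ:

```javascript
// Reserved words hidden inside the Name token; Object.create(null) avoids
// accidental hits on inherited properties such as toString.
const screen = Object.create(null);
for (const word of ['if', 'then', 'else']) screen[word] = "'" + word + "'";

// classify() turns a proposed Name match into a literal if it is reserved.
const classify = match => screen[match] !== undefined ? screen[match] : 'Name';

if (classify('if')   !== "'if'") throw new Error('if is a reserved word');
if (classify('iffy') !== 'Name') throw new Error('iffy is an ordinary name');
```

The lookup is a single property access, which is why screening is cheap per match even though it is best avoided where no overlap exists.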
Screening slows input processing down and should be used only if really necessary; e.g., a token for numbers will not require screening if all reserved words in the language consist of letters and all operators consist of special characters.
To streamline the process
all literal values which can appear in the input
are checked against all token patterns
when the Scanner object is constructed.
If a literal value is classified as a token
the corresponding Lit object
is added to a map .screen for the Token object.
If this map exists, it is consulted during scan()
as described above.
Additionally, the literal value is dropped from the lexical analysis pattern
because it would not match anything on its own.
Example 3/05 shows that things can go really wrong
if a token pattern partially overlaps a literal.
Patterns for "numbers" and alphabetic "names" would both overlap the
literal 'Formula1'.
Fortunately, this case is rare in the world of programming languages
but the scanner() method has an optional parameter
so that the pattern pieces can be sorted explicitly.
The parameter can be used in a JavaScript program;
example 3/05 would be corrected as follows:
// as a static alternative: import * as EBNF from 'modules/ebnf.js';
const EBNF = await import("./ebnf.js");
const grammar = " examples: Name | Number | 'Formula1'; ",
tokens = { Name: /[A-Za-z]+/, Number: /[0-9]+/ },
program = " Formula 1 \n Formula1 ";
const g = new EBNF.Grammar(grammar, tokens); g.check();
console.log(g.toString());
console.log(g.scanner().pattern);
console.log(g.scanner().scan(program).join('\n')); // the mistake
const terminals = [ g.lit("'Formula1'"), g.token('Name'), g.token('Number') ];
const s = g.scanner(/\s+/, terminals); // the correction
console.log(s.pattern);
console.log(s.scan(program).join('\n')); // the intended output
In Node.js this produces the intended output:
> console.log(s.pattern);
/(\s+)|(Formula1)|([A-Za-z]+)|([0-9]+)/gm
> console.log(s.scan(program).join('\n')); // the intended output
(1) "Formula" Name
(1) "1" Number
(2) 'Formula1'
Alternatively, an alphanumeric Name
Name: /[A-Za-z][A-Za-z0-9]*/
would overlap Formula1 completely and screening would resolve Formula1 correctly
as a literal.
Quick Summary
- Lexical analysis breaks input into parts represented by the literals and tokens used in a grammar. It skips the irrelevant parts of the input — usually white space and comments. Sequences of one or more unrecognized input characters are also reported.
- Regular expressions (search patterns) are the tool of choice to describe literals and tokens and they are the basis for automatic implementation of lexical analysis.
- scanner() creates a Scanner object with a scan() method which accepts an input string and returns a list of Tuple objects describing the input parts.
- scan() uses a RegExp composed of the patterns for all tokens and some literals in the grammar, plus a pattern for input parts to be skipped.
- The order of patterns is important: (usually) tokens, sorted by name, before literals, with literals sorted by decreasing length.
- Overlapping token patterns may cause issues, in particular, if a token pattern partially overlaps a literal. As a last resort, the pattern order can be set explicitly when the Scanner object is created.
- Some tokens, such as the class of identifiers in a programming language, might hide reserved words, i.e., literals such as if, then, and else which have a special meaning as a control structure.
- Such tokens are marked when the Scanner object is created and matched input is checked against a token-specific map of hidden literals so that the input part can be reported correctly.
- The Regular Expressions 101 website is great for testing.
- Example 3/01 contains many patterns for terminals used in programming languages. A few of these are too general to be used in the example...