1. Compiler Terminology

What's in a Compiler?

What's in a Parser Generator?


A compiler translates a source program into an executable. It should also check that the source program is correct — at least in some sense.

The source program is written in the source language, the executable is written in the target language, the compiler itself is written in the implementation language. For this book, target and implementation language, both, are JavaScript because we will not discuss how to deal with low level languages such as assembler or machine language as target languages. However, beginning in chapter six, we will implement a small, typical programming language and target a simple stack machine which we will simulate in JavaScript.

Often, source and implementation language are the same, i.e., a compiler for a language is written in that language and can compile itself. Chapter nine illustrates how self-compilation for parser generators works.

Translation

The source program is presented as a text string. Translation is accomplished by cooperating tools which focus on specific aspects:

  • A scanner represents the source program as a sequence of terminals, similar to words forming a sentence, and discards, e.g., white space and comments.

  • A parser tries to construct a syntax tree from the terminals to represent the program structure.

Translation proceeds in several steps which might run sequentially or all at once:

  • Lexical analysis is the process of running a scanner. Lexical analysis detects illegal characters which are neither terminals nor allowed to be ignored.

  • Syntax analysis is the process of running a parser. The result is often called a parse tree or syntax tree. Syntax analysis detects syntax errors, e.g., unbalanced parentheses, unbalanced or missing keywords, bad punctuation, etc.

  • Semantic analysis performs type checking and adds information to the parse tree, e.g., the data type produced by certain leaves or branches during execution. Semantic analysis should detect semantic errors, e.g., undefined values, illegal assignments, unsuitable function arguments, and type conflicts in general.

  • Finally, the executable is created from the parse tree and the information collected by semantic analysis, if any.

Tools

Lexical and syntax analysis are fairly formal processes which are well supported by programming tools such as EBNF, BNF, and their Grammar classes:

  • A scanner generator takes a description of terminals and constructs a scanner. The description is usually based on search patterns, also known as regular expressions. Chapter three explains how to create a scanner from regular expressions.

  • A parser generator takes a description of a grammar and constructs a parser. Chapter two explains how to describe grammars and chapter four explains one technique to create a parser from a grammar; chapter ten discusses a more powerful technique. Chapter nine explains how to implement parser generators.

  • Parser generators either produce trees to which semantic analysis can add annotations, or they provide a programming interface for the benefit of semantic analysis. Chapter five explains how to represent a syntax tree, and chapter six introduces a typical programming interface where semantic analysis or the creation of an executable can be attached.

Grammar and Trees

Syntax analysis depends on grammars and syntax trees which can be defined quite formally:

  • A context-free grammar consists of a finite set of terminals, a finite set of non-terminals — different from terminals and among them one called the start symbol, and a finite set of rules.

  • Each rule is an ordered pair of a non-terminal and a finite, ordered sequence consisting of terminals and non-terminals. There may be different pairs with the same non-terminal and different sequences.

A context-free grammar defines a syntax tree as follows:

  • The root node of the syntax tree is labeled with the start symbol of the grammar.

  • All branch nodes are labeled with non-terminals, all leaf nodes are labeled with terminals.

  • A syntax tree is an ordered tree, i.e., subtrees appear in a branch node in a fixed order.

  • Every ordered pair consisting of the non-terminal for a branch node and the ordered sequence of labels taken from the roots of the subtrees for the branch node must be in the rules of the grammar.

Language and Sentences

The rules of a grammar describe the branch nodes of every syntax tree, i.e., a grammar is the finite description of a language:

  • A sentence is the (finite) ordered sequence of terminals which are the leaves of a syntax tree. There might be more than one syntax tree for a sentence.

  • A language is the (usually) infinite set of all sentences which use the same grammar for their syntax trees. There might be more than one grammar for the same language.

Note that ordered pairs (rules) with empty sequences would result in leaf nodes which are labeled with non-terminals. They cannot be part of a sentence as defined above. Empty sequences look like a glitch in these definitions but chapter ten will illustrate that they are necessary. The previous definitions still work if a sentence is defined to consist of the leaves of a syntax tree with the exception of those leaves which are labeled with non-terminals.

Previous: Index Next: 2. Writing Grammars