A tour of the compiler internals

This section contains a brief introduction to the internal structure of the compiler. This introduction is neither comprehensive nor tutorial; it is merely intended as a stepping stone for the courageous.

The compiler has undergone much evolution. It started as a project to build a simple, easy-to-maintain compiler. Somewhere along the way we decided to compile Modula-3. Much later we decided to generate C. Now it generates native code directly.

The initial observation was that most compilers' data structures were visible and complex. This situation makes it necessary to understand a compiler in its entirety before attempting non-trivial enhancements or bug fixes. By keeping most of the compiler's primary data structures hidden behind opaque interfaces, we hoped to avoid this pitfall. So far, bugs have been easy to find. During early development, it was relatively easy to track the weekly language changes.

The compiler is decomposed by language feature rather than by the more traditional compiler passes. We attempted to confine each language feature to a single module. For example, the parsing, name binding, type checking and code production for each statement are all in that statement's own module. This separation means that only the CaseStmt module needs to know what data structures exist to implement CASE statements. Other parts of the compiler need only know that the CASE statement is a statement. This fact is captured by the object subtype hierarchy: a CaseStmt.T is a subtype of a Stmt.T.
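The decomposition described above can be illustrated with a small sketch. It is written in Python rather than Modula-3 and is not taken from the compiler sources. The class and method names (Stmt, CaseStmt, AssignStmt, check, compile, compile_block) are assumptions chosen only to mirror the Stmt.T / CaseStmt.T relationship, and the "generated code" is plain text.

```python
class Stmt:
    """Stands in for Stmt.T: the only view other modules have of a statement."""

    def check(self):
        raise NotImplementedError

    def compile(self, out):
        raise NotImplementedError


class CaseStmt(Stmt):
    """Stands in for CaseStmt.T: owns every detail of CASE statements."""

    def __init__(self, selector, arms):
        self._selector = selector   # name of the selector variable
        self._arms = arms           # list of (label, body_text) pairs

    def check(self):
        labels = [label for label, _ in self._arms]
        # Report duplicate labels; an empty list means no errors were found.
        return [f"duplicate label {l}"
                for l in sorted(set(labels)) if labels.count(l) > 1]

    def compile(self, out):
        for label, body in self._arms:
            out.append(f"if {self._selector} == {label}: {body}")


class AssignStmt(Stmt):
    """A second statement kind, to show the driver treats all kinds alike."""

    def __init__(self, target, value):
        self._target = target
        self._value = value

    def check(self):
        return []

    def compile(self, out):
        out.append(f"{self._target} := {self._value}")


def compile_block(stmts):
    """Generic driver: it relies on the Stmt interface and nothing else."""
    errors, out = [], []
    for s in stmts:
        errors.extend(s.check())
    if not errors:
        for s in stmts:
            s.compile(out)
    return errors, out
```

The point of the sketch is that compile_block never inspects the private fields of CaseStmt; adding a new statement kind means adding one class, not editing every pass.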

The main object types within the compiler are: values, statements, expressions, and types. "Values" is a misnomer; "bindings" would be better. This object class includes anything that can be named: modules, procedures, variables, exceptions, constants, types, enumeration elements, record fields, methods, and procedure formals. Statements include all of the Modula-3 statements. Expressions include all the Modula-3 expression forms that have a special syntax. And finally, types include the Modula-3 types.

The compiler retains the traditional separation of input streams, scanner, symbol table, and output stream.

The compilation process retains the usual phases. Symbols are scanned as needed by the parser. A recursive descent parser reads the entire source and builds the internal syntax tree. All remaining passes simply add decorations to this tree. The next phase binds all identifiers to values in scopes. Modula-3 allows arbitrary forward references, so it is necessary to accumulate all names within a scope before binding any identifiers to values. The next phase divides the types into classes of structurally equivalent types. This phase actually occurs in two steps. First, the types are divided into classes such that each class will have a unique C representation. Then, those classes are refined into what Modula-3 defines as structurally equivalent types. After the types have been partitioned, the entire tree is checked for type errors. Finally, the intermediate language (IL) is emitted.
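The need to accumulate all names in a scope before binding can be shown with a minimal two-pass sketch. It is in Python, not Modula-3, and it is not the compiler's own algorithm or data structure: the function name bind_scope and the representation of a declaration as a (name, referenced names) pair are assumptions made for illustration.

```python
def bind_scope(decls):
    """Bind the identifiers used inside one scope.

    decls is a list of (name, referenced_names) pairs in source order.
    Returns (bindings, errors); bindings maps (declaring name, used name)
    to the index of the declaration that the use refers to.
    """
    # Pass 1: accumulate every name declared in the scope.
    table = {}
    errors = []
    for index, (name, _) in enumerate(decls):
        if name in table:
            errors.append(f"{name} redeclared")
        else:
            table[name] = index

    # Pass 2: only now resolve the identifiers each declaration uses,
    # so a use may refer to a declaration that appears later in the source.
    bindings = {}
    for name, uses in decls:
        for used in uses:
            if used in table:
                bindings[(name, used)] = table[used]
            else:
                errors.append(f"{used} undefined in {name}")
    return bindings, errors
```

With mutually recursive declarations such as [("List", ["Node"]), ("Node", ["List"])], a single left-to-right pass would fail on the first use of Node, while the two-pass version binds both references.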

The intermediate language is defined as a sequence of calls on an M3CG.T object. In the compiler front end there is a thin veneer interface, CG, that makes the actual calls. The veneer converts between the bit-level addressing used in the front end and the byte-level addressing used by the back ends. There are several implementations of M3CG.T. The one that we currently use is M3CG_Wr. It converts the stream of calls into a parseable stream of ASCII characters.
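As a rough picture of this arrangement, the following Python sketch shows an IL expressed as method calls, a writer that turns each call into a line of text, and a veneer that converts bit-level quantities to byte-level ones. None of it is taken from M3CG, CG, or M3CG_Wr: the operation names (load, store), the text format, and the class names are hypothetical, and rejecting unaligned values is a simplification of this sketch only.

```python
import io

BITS_PER_BYTE = 8  # assumption made for this sketch


class TextWriterCG:
    """Plays the role of a writer-style back end: one text line per call."""

    def __init__(self, stream):
        self._stream = stream

    def load(self, var, byte_offset, byte_size):
        self._stream.write(f"load {var} {byte_offset} {byte_size}\n")

    def store(self, var, byte_offset, byte_size):
        self._stream.write(f"store {var} {byte_offset} {byte_size}\n")


class Veneer:
    """Plays the role of the thin front-end layer: it accepts bit-level
    offsets and sizes and forwards byte-level ones to the target."""

    def __init__(self, target):
        self._target = target

    @staticmethod
    def _to_bytes(bits):
        # Simplification: this sketch refuses values that are not byte-aligned.
        if bits % BITS_PER_BYTE != 0:
            raise ValueError(f"{bits} bits is not byte-aligned")
        return bits // BITS_PER_BYTE

    def load(self, var, bit_offset, bit_size):
        self._target.load(var, self._to_bytes(bit_offset),
                          self._to_bytes(bit_size))

    def store(self, var, bit_offset, bit_size):
        self._target.store(var, self._to_bytes(bit_offset),
                           self._to_bytes(bit_size))
```

Because the front end talks only to the veneer, a different target object (for example one that builds code in memory instead of writing text) could be substituted without changing the callers.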

The current back end is constructed as a gcc front end. It reads the M3CG output of the front end and produces the tree structures needed by gcc's back end.

The front end is in the compiler package. Within that package the following directories exist:

    builtinOps    ABS, ADR, BITSIZE, ...
    builtinTypes  INTEGER, CHAR, REFANY, ...
    builtinWord   Word.And, Word.Or, ...
    exprs         +, -, [], ^, AND, OR, ...
    main          main program
    misc          scanner, symbol tables, ...
    stmts         :=, IF, TRY, WHILE, ...
    types         ARRAY, BITS FOR, RECORD, ...
    values        MODULE, PROCEDURE, VAR, ...

The intermediate language is in the m3cg package.

The back end is in the m3cc package. The single file that implements the M3CG reader is gcc/m3.c.

Linking

SRC Modula-3 uses a special two-phase linker. The first phase of the linker checks that all version stamps are consistent. It then generates, in m3main, declarations for the runtime type structures and initialization code for the collection of objects to be linked. The second phase calls the host system's linker to actually link the program.

The information needed by the first phase is generated by the compiler in a file ending in .m3x. For every symbol X.Z exported or imported by a module, the compiler generates a version stamp. These stamps are used to ensure that all modules linked into a single program agree on the type of X.Z. The linker will refuse to link programs with inconsistent version stamps.
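The consistency check can be sketched as follows, in Python. The actual stamp computation and the .m3x file format are not described in this section, so everything here is an assumption: version_stamp uses a truncated SHA-256 digest as a stand-in for a real stamp, and check_stamps takes an in-memory mapping instead of reading .m3x files.

```python
import hashlib


def version_stamp(symbol, type_signature):
    """Hypothetical stamp: a short digest of a symbol name and its type."""
    data = f"{symbol}:{type_signature}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:16]


def check_stamps(units):
    """Find the symbols on which compilation units disagree.

    units maps a unit name to a {symbol: stamp} dictionary covering the
    symbols that unit imports or exports. Returns a sorted list of the
    symbols that appear with more than one distinct stamp.
    """
    seen = {}
    conflicts = set()
    for stamps in units.values():
        for symbol, stamp in stamps.items():
            # Remember the first stamp seen; any later mismatch is a conflict.
            if seen.setdefault(symbol, stamp) != stamp:
                conflicts.add(symbol)
    return sorted(conflicts)
```

A pre-linker built along these lines would refuse to continue whenever check_stamps returns a non-empty list, which corresponds to two units having been compiled against different declarations of the same X.Z.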

File extensions

SRC Modula-3 uses files with the following customizable suffixes:

    .i3 (.m3)   Modula-3 interface (module) sources
    .ig (.mg)   generic interface (module) sources
    .ic (.mc)   M3CG intermediate language files
    .is (.ms)   intermediate assembly files
    .io (.mo)   object files
    .c          C source files
    .h          C header files
    .s          foreign assembly source files
    .o          foreign object files
    .a          library archives
    .m3x        pre-linker information, including version stamps
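
The suffix list above can be read as a lookup table. The Python sketch below encodes the default suffixes from that list. The names SUFFIX_KINDS and classify are hypothetical, and passing a different table to classify is only one assumed way of modelling the fact that the suffixes are customizable.

```python
import os

# Default suffixes as listed above; the descriptions are paraphrased.
SUFFIX_KINDS = {
    ".i3": "Modula-3 interface source",
    ".m3": "Modula-3 module source",
    ".ig": "generic interface source",
    ".mg": "generic module source",
    ".ic": "M3CG intermediate language file",
    ".mc": "M3CG intermediate language file",
    ".is": "intermediate assembly file",
    ".ms": "intermediate assembly file",
    ".io": "object file",
    ".mo": "object file",
    ".c": "C source file",
    ".h": "C header file",
    ".s": "foreign assembly source file",
    ".o": "foreign object file",
    ".a": "library archive",
    ".m3x": "pre-linker information",
}


def classify(path, kinds=SUFFIX_KINDS):
    """Return the kind of file that path names, judged by its suffix."""
    _, suffix = os.path.splitext(path)
    return kinds.get(suffix, "unknown")
```

For example, classify("CaseStmt.m3") yields "Modula-3 module source", and a path with no recognized suffix yields "unknown".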