The lexer is based on the re module, so TPG benefits from the full power of Python regular expressions. This document assumes the reader is familiar with regular expressions.
You can use the syntax of regular expressions as expected by the re module, except for the grouping syntax, which TPG uses internally to decide which token has been recognized.
Tokens can be explicitly defined by the token and separator keywords.
A token is defined by a name, a regular expression and an optional action, i.e. a function applied to the matched text. Token definitions end with a ; .
See figure 6.1 for examples.
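In the spirit of figure 6.1, here is a minimal sketch (the Calc parser and its grammar are illustrative, not taken from the figure):

    import tpg

    class Calc(tpg.Parser):
        r"""
        # spaces are separators: matched and wiped out by the lexer
        separator spaces '\s+' ;

        # the matched text of integer is transformed by int()
        token integer '\d+' int ;

        # ident has no action and is returned as a plain string
        token ident '\w+' ;

        START/n -> integer/n ;
        """

    print(Calc()("42"))   # prints 42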
The order of the token declarations is important: the first token that matches is returned. Regular expressions describing keywords get a special treatment: TPG also looks for a word boundary after the keyword. If you try to match the keywords if and ifxyz, TPG will internally search for if\b and ifxyz\b. This way, if won’t match the beginning of ifxyz and won’t interfere with general identifiers (\w+ for example).
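As a sketch of this behaviour (the token names kw_if and ident are illustrative):

    import tpg

    class KW(tpg.Parser):
        r"""
        separator spaces '\s+' ;
        token kw_if 'if' ;    # internally matched as if\b
        token ident '\w+' ;

        START -> kw_if ident ;
        """

    KW()("if ifxyz")   # ok: ifxyz is matched by ident, not cut into if + xyz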
There are two kinds of tokens: tokens defined by the token keyword are returned to the parser, while tokens defined by the separator keyword are considered as separators (white space or comments for example) and are wiped out by the lexer.
Tokens can also be defined on the fly. Their definitions are then inlined in the grammar rules. This feature may be useful for keywords or punctuation signs. Unlike predefined tokens, inline tokens cannot be transformed by an action; they always return the matched text as a string.
See figure 6.2 for examples.
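In the spirit of figure 6.2, a sketch with inline tokens (the Cond parser is illustrative):

    import tpg

    class Cond(tpg.Parser):
        r"""
        separator spaces '\s+' ;
        token ident '\w+' ;

        # 'if' and 'then' are inline tokens defined on the fly
        START -> 'if' ident 'then' ident ;
        """

    Cond()("if x then y")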
Inline tokens have a higher precedence than predefined tokens to avoid conflicts (an inlined if won’t be matched as a predefined identifier).
TPG works in two stages. The lexer first splits the input string into a list of tokens and then the parser parses this list.
The lexer splits the input string according to the token definitions (see 6.2). When the input string cannot be matched, a tpg.LexerError exception is raised.
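For example (a sketch; the Digits parser is illustrative):

    import tpg

    class Digits(tpg.Parser):
        r"""
        token digits '\d+' ;
        START -> digits ;
        """

    try:
        Digits()("abc")   # no token definition matches at position 0
    except tpg.LexerError as e:
        print(e)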
Beware that the lexer may loop indefinitely if a token can match the empty string, since the empty string can be found at every position (a token defined by \d* for example).
In grammar rules, tokens are matched in the same way as non-terminal symbols are recognized. Predefined tokens have the same syntax as non-terminal symbols. The token text (or the result of the function associated with the token) can be saved by the infix / operator (see figure 6.3).
Inline tokens have a similar syntax: you just write the regular expression (in a string). Their text can also be saved (see figure 6.4).
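Combining both cases in one sketch (the Save parser is illustrative):

    import tpg

    class Save(tpg.Parser):
        r"""
        separator spaces '\s+' ;
        token integer '\d+' int ;

        # n receives the int() result of the predefined token;
        # op receives the text of the inline token
        START/n -> integer/n '[-+]'/op ;
        """

    print(Save()("42 +"))   # prints 42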
There are some special tokens that have been requested by some users but that cannot easily be described by TPG using a classical token definition (see 6.2).
TPG is written in Python, so it should be easy to handle INDENT and DEINDENT tokens as in the Python language. These tokens are introduced into the source to be parsed by a preprocessor, before the lexer is activated. Spaces at the beginning of lines are replaced by indent and deindent tokens when needed. These special tokens are the characters whose ASCII codes are 16 and 17, which are not normally found in regular text files.
The indent option (see 5.3) has been added to define the indentation. It has two values: the first one is a regular expression describing the indentation, usually spaces and tabulations; the second one is a regular expression describing the lines not to be taken into account, usually comments. This second parameter describes the beginning of the line, e.g. "#" will match lines starting with a #.
When the indent option is active, the indent and deindent tokens are defined. They can be used like any other token.
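As a sketch, once the indent option is set as described in 5.3, a grammar rule can consume the special tokens like ordinary ones; the BLOCK rule below is hypothetical:

    # hypothetical rule: a block is an indented sequence of statements
    BLOCK -> indent STATEMENT+ deindent ;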