parse(), that recognizes correct instances of the grammar.
In this chapter the organization and specification of such a grammar file is discussed in detail.
Having read this chapter you should be able to define a grammar for which Bisonc++ can generate a class, including a member that will recognize correctly formulated (in terms of your grammar) input, using all the features and facilities offered by bisonc++ to specify a grammar. In principle this grammar will be in the class of LALR(1) grammars (see, e.g., Aho, Sethi & Ullman, 2003 (Addison-Wesley)).
    Bisonc++ directives
    %%
    Grammar rules

Readers familiar with Bison will note that there is no C declaration section and no section to define additional C code. With bisonc++ these sections are superfluous: since a bisonc++ parser is a class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so the `additional C code' section could be omitted from the Bisonc++ grammar file as well.
The `%%' is punctuation that appears in every bisonc++ grammar file to separate the two sections.
The bisonc++ directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Furthermore, this section is also used to specify bisonc++ directives. These bisonc++ directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class will be defined. All bisonc++ directives are covered in section 5.6.
The grammar rules define how to construct nonterminal symbols from their parts. The grammar rules section contains one or more bisonc++ grammar rules, and nothing else. See section 5.3, covering the syntax of grammar rules.
There must always be at least one grammar rule, and the first `%%' (which precedes the grammar rules) may never be omitted even if it is the first thing in the file.
Bisonc++'s grammar file may be split into several files. Each file may be given a suggestive name. This allows quick identification of where a particular section or rule is found, and improves readability of the designed grammar. The %include-directive (see section 5.6.6) can be used to include a partial grammar specification file into another specification file.
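As an illustration, a minimal grammar file using the two sections could look as follows (a sketch; the token and rule names are merely examples):

```
%token NUM

%%

expr:
    expr '+' NUM
|
    NUM
;
```

The directives section declares the token NUM; the rules section defines a single left-recursive nonterminal.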
A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bisonc++ parser by a numeric code, and the parser's lex() member function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.
A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.
Symbol names can contain letters, digits (not at the beginning), and underscores. Bisonc++ currently does not support periods in symbol names (Users familiar with Bison may observe that Bison does support periods in symbol names, but the Bison user guide remarks that `Periods make sense only in nonterminals'. Even so, it appears that periods in symbols are hardly ever used).
There are two ways to write terminal symbols in the grammar:

A named token type is written with an identifier, by convention in all upper case. Such names must be declared using the %token directive. See section 5.6.19.

A character token type (or literal character token) is written in the grammar using the same syntax used in C++ for character constants; for example, '+' is a character token type. A character token type doesn't need to be declared unless you need to specify its semantic value data type (see section 5.5.1), associativity, or precedence (see section 5.6.7).
By convention, a character token type is used only to represent a token that consists of that particular character. Thus, the token type '+' is used to represent the character `+' as a token. Nothing enforces this convention, but if you depart from it, your program will confuse other readers.
All the usual escape sequences used in character literals in C++ can be used in bisonc++ as well, but you must not use the 0 character as a character literal because its ASCII code, zero, is the code lex() must return for end-of-input (see section 6.3.1). If your program must be able to return 0-byte characters, define a special token (e.g., ZERO_BYTE) and return that token instead.
Note that literal string tokens, formally supported in Bison, are not supported by bisonc++. Again, such tokens are hardly ever encountered, and the dominant lexical scanner generators (like flex(1)) do not support them. Common practice is to define a symbolic name for a literal string token. So, a token like EQ may be defined in the grammar file, with the lexical scanner returning EQ when it matches ==.
How you choose to write a terminal symbol has no effect on its grammatical meaning. That depends only on where it appears in rules and on when the parser function returns that symbol.
The value returned by the lex() member is always one of the terminal symbols (or 0 for end-of-input). Whichever way you write the token type in the grammar rules, you write it the same way in the definition of lex(). The numeric code for a character token type is simply the ASCII code for the character, so lex() can use the identical character constant to generate the requisite code. Each named token type becomes a C++ enumeration value in the parser base-class header file, so lex() can use the corresponding enumeration identifiers. When using an externally (to the parser) defined lexical scanner, the lexical scanner should include the parser's base class header file, returning the required enumeration identifiers as defined in the parser class. So, if a token NUM (%token NUM) is defined in the parser class Parser, then the externally defined lexical scanner may return Parser::NUM.
The symbol `error' is a terminal symbol reserved for error recovery (see chapter 8). The error symbol should not be used for any other purpose. In particular, the parser's member function lex() should never return this value. Several other identifiers should not be used as terminal symbols. See section 5.6.19.1 for a description.
    result: components ... ;

where result is the nonterminal symbol that this rule describes and components are various terminal and nonterminal symbols that are put together by this rule (see section 5.2). With respect to the way rules are defined, note the following:
    exp: exp '+' exp ;

means that two groupings of type exp, with a `+' token in between, can be combined into a larger grouping of type exp.
    { C++ statements }

Usually there is only one action and it follows the components. See section 5.5.4.
    result:
        rule1-components ...
    |
        rule2-components ...
    ;

They are still considered distinct rules even when joined in this way. The nonterminal result could also have been defined as:

    result:
        rule1-components ...
    ;
    result:
        rule2-components ...
    ;

However, this is a potentially dangerous practice, since one of the two result rules could in fact contain a misspelled rule-name (e.g., the second result should have been results). Therefore, bisonc++ generates a warning if the same nonterminal is used repeatedly when defining production rules.
Here is an example: rules defining a comma-separated sequence of zero or more exp groupings:

    expseq:
        expseq1
    |
        // empty
    ;

    expseq1:
        expseq1 ',' exp
    |
        exp
    ;

Convention calls for a comment `// empty' in each empty production rule.
    expseq1: expseq1 ',' exp | exp ;

Since the recursive use of expseq1 is the leftmost symbol in the right hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:
    expseq1: exp ',' expseq1 | exp ;

Any kind of sequence can be defined using either left recursion or right recursion, but you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space. Right recursion uses up space on the bisonc++ stack in proportion to the number of elements in the sequence, because all the elements must be shifted onto the stack before the rule can be applied even once. See chapter 7 for further explanation of this.
Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side. For example:
    expr: primary '+' primary | primary ;
    primary: constant | '(' expr ')' ;

defines two mutually-recursive nonterminals, since each refers to the other.
For example, the calculator calculates properly because the value associated with each expression is the proper number; it adds properly because the action for the grouping `x + y' is to add the numbers associated with x and y.
This section addresses the definition of the semantics of a language. The following topics are covered:
the rpn and infix calculator examples (see, e.g., sections 4.1 and 4.2).
Bisonc++'s default is to use type int for all semantic values. To specify some other type, the directive %stype must be used, like this:
    %stype double

Any text following %stype up to the end of the line, up to the first of a series of trailing blanks or tabs, or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %stype definition in a semicolon.
This directive must go in the directives section of the grammar file (see section 5.1). As a result, the parser class defines a private type STYPE__ as double: semantic values of language constructs always have the type STYPE__, and (assuming the parser class is named Parser) an internally used data member d_val__, which can be used by the lexical scanner to associate a semantic value with a returned token, is defined as:

    Parser::STYPE__ d_val__;
Often, though, different kinds of tokens and groupings need different data types for their semantic values: a number may need type int or double, while a string needs type std::string, and an identifier might need a pointer to an entry in a symbol table.
To use more than one data type for semantic values in one parser, bisonc++ offers the following feature: specify the entire collection of possible data types with the %union directive (see section 5.6.21), and then declare one of those types for each symbol having a semantic value. This is done for tokens with the %token directive (see section 5.6.19) and for groupings with the %type directive (see section 5.6.20).
The drawback of all this is that union fields cannot contain class objects, and thus such objects must be accessed through pointers. So, if the semantic value of a textual token is returned in a std::string object, then the union should have a std::string * field. Consequently, the programmer must take care to delete the memory pointed to by the std::string * fields whenever these fields become obsolete. In practice this requirement is not too hard to meet, and so union constructions containing pointers to class type objects are commonly seen.
One may wonder why a union is still used by Bisonc++, as C++ offers inherently superior approaches to combining multiple types in one type. The C++ way to do so is by defining a polymorphic base class and a series of derived classes implementing the various exclusive data types. The union approach is supported by Bisonc++ since it is supported by bison(1) and bison++; dropping the union would impede backward portability.
The alternative to using a union is using a polymorphic base class. This class will be developed below as the class Base. Since it's a polymorphic base class it should have the following characteristics: a virtual destructor; a clone() member implementing a so-called virtual constructor; and an insert() member with an overloaded operator<<() allowing derived objects to be inserted into ostream objects.
The class Base has the following interface:
    #ifndef _INCLUDED_BASE_
    #define _INCLUDED_BASE_

    #include <iosfwd>

    class Base
    {
        public:
            virtual ~Base();
            virtual Base *clone() const = 0;
            virtual std::ostream &insert(std::ostream &os) const = 0;
    };

    inline Base::~Base()
    {}

    inline std::ostream &operator<<(std::ostream &out, Base const &obj)
    {
        return obj.insert(out);
    }

    #endif
The alternatives defined by a classical union are now defined by classes derived from the class Base. For example, objects of the class Int contain int values. Here is its interface (and implementation):
    #ifndef _INCLUDED_INT_
    #define _INCLUDED_INT_

    #include <ostream>
    #include <bobcat/a2x>
    #include "../base/base.h"

    class Int: public Base
    {
        int d_value;

        public:
            Int(char const *text);
            Int(int v);
            virtual Base *clone() const;
            int value() const;              // directly access the value
            virtual std::ostream &insert(std::ostream &os) const;
    };

    inline Int::Int(char const *txt)
    :
        d_value(FBB::A2x(txt))
    {}

    inline Int::Int(int v)
    :
        d_value(v)
    {}

    inline Base *Int::clone() const
    {
        return new Int(*this);
    }

    inline int Int::value() const
    {
        return d_value;
    }

    inline std::ostream &Int::insert(std::ostream &out) const
    {
        return out << d_value;
    }

    #endif
Similarly, objects of the class Text contain text. These objects can be used, e.g., to store the names of identifiers recognized by a lexical scanner. Here is the interface of the class Text:
    #ifndef _INCLUDED_TEXT_
    #define _INCLUDED_TEXT_

    #include <string>
    #include <ostream>
    #include "../base/base.h"

    class Text: public Base
    {
        std::string d_text;

        public:
            Text(char const *id);
            virtual Base *clone() const;
            std::string const &id() const;  // directly access the name
            virtual std::ostream &insert(std::ostream &os) const;
    };

    inline Text::Text(char const *id)
    :
        d_text(id)
    {}

    inline Base *Text::clone() const
    {
        return new Text(*this);
    }

    inline std::string const &Text::id() const
    {
        return d_text;
    }

    inline std::ostream &Text::insert(std::ostream &out) const
    {
        return out << d_text;
    }

    #endif
Realize that Base itself can't be the parser's semantic value type: a Base class object cannot contain a derived class's data members. Neither can a Base reference serve as a semantic value, as vectors and other containers cannot store references. A plain pointer to Base could be used; although a pointer would offer programmers the benefits of the polymorphic nature of the Base class, it would also require them to keep track of memory used by Base objects, which would counter many of the benefits of using a polymorphic base class.
Consequently, a wrapper class Semantic around a Base pointer is used. The wrapper class takes care of the proper destruction of objects accessed through a Base pointer and offers facilities to obtain the proper derived class. The interface of the class Semantic is:
    #ifndef _INCLUDED_SEMANTIC_
    #define _INCLUDED_SEMANTIC_

    #include <ostream>
    #include "../base/base.h"

    class Semantic
    {
        Base *d_bp;

        public:
            Semantic(Base *bp = 0);     // Semantic will own the bp
            Semantic(Semantic const &other);
            ~Semantic();
            Semantic &operator=(Semantic const &other);

            Base const &base() const;

            template <typename Class>
            Class const &downcast();

        private:
            void copy(Semantic const &other);
    };

    inline Base const &Semantic::base() const
    {
        return *d_bp;
    }

    inline Semantic::Semantic(Base *bp)
    :
        d_bp(bp)
    {}

    inline Semantic::~Semantic()
    {
        delete d_bp;
    }

    inline Semantic::Semantic(Semantic const &other)
    {
        copy(other);
    }

    inline Semantic &Semantic::operator=(Semantic const &other)
    {
        if (this != &other)
        {
            delete d_bp;
            copy(other);
        }
        return *this;
    }

    inline void Semantic::copy(Semantic const &other)
    {
        d_bp = other.d_bp ? other.d_bp->clone() : 0;
    }

    template <typename Class>
    inline Class const &Semantic::downcast()
    {
        return dynamic_cast<Class &>(*d_bp);
    }

    inline std::ostream &operator<<(std::ostream &out, Semantic const &obj)
    {
        if (&obj.base())
            return out << obj.base();
        return out << "<UNDEFINED>";
    }

    #endif
Semantic is a slightly more `complete' class than Base and its derivatives, since it contains a pointer which must be handled appropriately. So it needs a copy constructor, an overloaded assignment operator, and a destructor. Apart from that, it supports a member to obtain a reference to the base class, which is used by the overloaded operator<<() to allow insertion of objects of classes derived from Base into streams. It also offers a small member template function returning a reference to a derived class object from the semantic value's Base class pointer. This member effectively implements (and improves upon) the type safety that is otherwise strived for by typed nonterminals and typed tokens.
In this design the type specified with %stype is of course Semantic. For this illustrative example the parser's grammar is a simple one, expecting input according to the following two rules:
rule: IDENTIFIER '(' IDENTIFIER ')' ';' | IDENTIFIER '=' INT ';' ;
The rule's actions simply echo the received identifiers and int values to cout. Here is an example of an action block:
IDENTIFIER '=' INT ';' { cout << $1 << " " << $3 << endl; }
Alternative actions could easily be defined, e.g., using the downcast() member:
IDENTIFIER '=' INT ';' { int value = $3.downcast<Int>().value(); }
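Similarly, a sketch assuming the Text class shown earlier: the other rule alternative could retrieve the identifiers' names as std::string objects:

```
IDENTIFIER '(' IDENTIFIER ')' ';'
{
    cout << $1.downcast<Text>().id() << " "
         << $3.downcast<Text>().id() << endl;
}
```

Here $1 and $3 are Semantic values wrapping Text objects, so downcast<Text>() recovers them.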
So, in these cases Bisonc++'s parser stores all semantic values on the semantic values stack, even though multiple tokens were returned in a particular rule. At any time all semantic values associated with previously recognized tokens are available in the action block. Once the semantic value stack is reduced, the Semantic class takes care of the proper destruction of the objects pointed to by the Semantic d_bp data member.
The scanner must of course be able to access the parser's data member representing the most recent semantic value. This data member is d_val__, which can be offered to the scanner object when it's constructed. E.g., with a scanner expecting an STYPE__ * the parser's constructor could simply be defined as:

    inline Parser::Parser()
    :
        d_scanner(&d_val__)
    {}
The scanner itself, however, cannot directly use the parser's d_val__ data member. Therefore the Scanner class defines a Semantic *d_semval member, which is initialized to the Semantic pointer received by the Scanner's constructor:

    inline Scanner::Scanner(Semantic *semval)
    :
        d_semval(semval)
    {}
The scanner (as implemented, e.g., by flex(1)) recognizes input patterns, returns Parser tokens (e.g., Parser::INT), and returns a semantic value when applicable. E.g., when recognizing a Parser::INT the rule is:
{ *d_semval = new Int(yytext); return Parser::INT; }
Note that, as the Semantic constructor expects but one argument, automatic promotion from Base * to Semantic can be used in the assignments to *d_semval.
The IDENTIFIER's semantic value is obtained as follows:
    [a-zA-Z_][a-zA-Z0-9_]*  {
                                *d_semval = new Text(yytext);
                                return Parser::IDENTIFIER;
                            }
An action consists of C++ statements surrounded by braces, much like a compound statement in C++. It can be placed at any position in the rule; it is executed at that position. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and should be used only for special purposes (see section 5.5.6).
The C++ code in an action can refer to the semantic values of the components matched by the rule with the construct $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$. (Bisonc++ translates both of these constructs into array element references when it copies the actions into the parser file.) Here is a typical example:
    exp: ... | exp '+' exp { $$ = $1 + $3; } | ...

This rule constructs an exp from two smaller exp groupings connected by a plus-sign token. In the action, $1 and $3 refer to the semantic values of the two component exp groupings, which are the first and third symbols on the right hand side of the rule. The sum is stored into $$ so that it becomes the semantic value of the addition-expression just recognized by the rule. If there were a useful semantic value associated with the `+' token, it could be referred to as $2.
If you don't specify an action for a rule, bisonc++ supplies a default: $$ = $1. Thus, the value of the first symbol in the rule becomes the value of the whole rule. Of course, the default rule is valid only if the two data types match. There is no meaningful default action for an empty rule; every empty rule must have an explicit action unless the rule's value does not matter. Note that the default $$ value is assigned at the beginning of an action block. Any changes to $1 will therefore not automatically be propagated to $$. E.g., assuming that $1 == 3 at the beginning of the following action block, then $$ will still be equal to 3 after executing the statement in the action block:
    {                   // assume: $1 == 3
        $1 += 12;       // $1 now is 15, $$ remains 3
    }

If $$ should receive the value of the modified $1, then $1 must explicitly be assigned to $$. E.g.,

    {                   // assume: $1 == 3
        $1 += 12;       // $1 now is 15, $$ remains 3
        $$ = $1;        // now $$ == 15 as well
    }
Using $n with n equal to zero or a negative value is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
    foo:
        expr bar '+' expr { ... }
    |
        expr bar '-' expr { ... }
    ;
    bar:
        // empty
        { previous_expr = $0; }
    ;

As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
But as mentioned: it's a risky practice, which should be avoided if at all
possible. See also section 6.6.
When a single data type is used for all semantic values (e.g., the type specified with %stype), the $$ and $n constructs always have that data type.
If you have used a %union directive to specify a variety of data types, then you must declare a choice among these types for each terminal or nonterminal symbol that can have a semantic value. Then each time you use $$ or $n, its data type is determined by which symbol it refers to in the rule. In this example,
exp: ... | exp '+' exp { $$ = $1 + $3; }
$1 and $3 refer to instances of exp, so they all have the data type declared for the nonterminal symbol exp. If $2 were used, it would have the data type declared for the terminal symbol '+', whatever that might be.
Alternatively, you can specify the data type when you refer to the value, by inserting `<type>' after the `$' at the beginning of the reference. For example, if you have defined types as shown here:

    %union
    {
        int u_int;
        double u_double;
    };

then you can write $<u_int>1 to refer to the first subunit of the rule as an integer, or $<u_double>1 to refer to it as a double.
A mid-rule action may refer to the components preceding it using $n, but it cannot refer to subsequent components, because it is executed before they are parsed.
The mid-rule action itself counts as one of the components of the rule. This makes a difference when there is another action later in the same rule (and usually there is another at the end): you have to count the actions along with the symbols when working out which number n to use in $n.
The mid-rule action can also have a semantic value. The action can set its value with an assignment to $$, and actions later in the rule can refer to the value using $n. Since there is no symbol to name the action, there is no way to declare a data type for the value in advance, so you must use the `$<...>' construct to specify a data type each time you refer to this value.
There is no way to set the value of the entire rule with a mid-rule action, because assignments to $$ do not have that effect. The only way to set the value for the entire rule is with an ordinary action at the end of the rule.
Here is an example from a hypothetical compiler, handling a let statement that looks like `let (variable) statement' and serves to create a variable named variable temporarily for the duration of statement. To parse this construct, we must put variable into the symbol table while statement is parsed, then remove it afterward. Here is how it is done:
    stmt:
        LET '(' var ')'
        {
            $<u_context>$ = pushSymtab();
            temporaryVariable($3);
        }
        stmt
        {
            $$ = $6;
            popSymtab($<u_context>5);
        }

As soon as `let (variable)' has been recognized, the first action is executed. It saves a copy of the current symbol table as its semantic value, using alternative u_context in the data-type union. Then it uses temporaryVariable() to add the new variable (using, e.g., a name that cannot normally be used in the parsed language) to the current symbol table. Once the first action is finished, the embedded statement (stmt) can be parsed. Note that the mid-rule action is component number 5, so `stmt' is component number 6.
Once the embedded statement is parsed, its semantic value becomes the value of the entire let-statement. Then the semantic value from the earlier action is used to restore the former symbol table. This removes the temporary let-variable from the list so that it won't appear to exist while the rest of the program is parsed.
Taking action before a rule is completely recognized often leads to conflicts since the parser must commit to a parse in order to execute the action. For example, the following two rules, without mid-rule actions, can coexist in a working parser because the parser can shift the open-brace token and look at what follows before deciding whether there is a declaration or not:
    compound: '{' declarations statements '}' | '{' statements '}' ;

But when we add a mid-rule action as follows, the rules become nonfunctional:
    compound: { prepareForLocalVariables(); } '{' declarations statements '}' | '{' statements '}' ;

Now the parser is forced to decide whether to execute the mid-rule action when it has read no farther than the open-brace. In other words, it must commit to using one rule or the other, without sufficient information to do it correctly. (The open-brace token is what is called the look-ahead token at this time, since the parser is still deciding what to do about it; see section 7.1.4.)
You might think that the problem can be solved by putting identical actions into the two rules, like this:
    { prepareForLocalVariables(); } '{' declarations statements '}' | { prepareForLocalVariables(); } '{' statements '}' ;

But this does not help, because bisonc++ never parses the contents of actions, and so it will not realize that the two actions are identical.
If the grammar is such that a declaration can be distinguished from a statement by the first token (which is true in C, but not in C++, which allows statements and declarations to be mixed), then one solution is to put the action after the open-brace, like this:
    compound: '{' { prepareForLocalVariables(); } declarations statements '}' | '{' statements '}' ;

Now the next token following a recognized '{' token would be either the first declarations token or the first statements token, which would in any case tell bisonc++ which rule to use, thus solving the problem.
Another (much used) solution is to bury the action inside a support non-terminal symbol which recognizes the first block-open brace and performs the required preparations:
    openblock:
        '{'
        {
            prepareForLocalVariables();
        }
    ;
    compound:
        openblock declarations statements '}'
    |
        openblock statements '}'
    ;

Now bisonc++ can execute the action in the rule for openblock without deciding which rule for compound it will eventually use. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what bisonc++ actually does to implement mid-rule actions.
By the way, note that in a language like C++ the above construction is obsolete anyway, since C++ allows mid-block variable- and object declarations. In C++ a compound statement could be defined, e.g., as follows:
    stmnt_or_decl:
        declarations
    |
        pure_stmnt      // among which: compound_stmnt
    ;
    statements:
        // empty
    |
        statements stmnt_or_decl
    ;
    compound_stmnt:
        open_block statements '}'
    ;

Here, the compound_stmnt would begin with the necessary preparations for local declarations, which would then have been completed by the time they would really be needed by declarations.
All token type names (but not single-character literal tokens such as '+' and '*') must be declared. If you need to specify which data type to use for the semantic value (see section 5.5.2) of nonterminal symbols, these symbols must be declared as well.
The first rule in the file by default specifies the start symbol. If you want some other symbol to be the start symbol, you must use an explicit %start directive (see section 3.1).
In this section all bisonc++ declarations are discussed. Some of the declarations have already been mentioned, but several more are available. Some declarations define how the grammar will parse its input (like %left, %right); other declarations are available defining, e.g., the name of the parsing function (by default parse()), or the name(s) of the files generated by bisonc++.
In particular readers familiar with Bison (or Bison++) should read this section thoroughly, since bisonc++'s directives are more extensive and different from the `declarations' offered by Bison, and the macros offered by Bison++.
Several directives expect filename arguments. Filenames must be specified on the same line as the directive itself, and they start at the first non-blank character following the directive. Filenames may contain escape sequences (e.g., if you must: use `\ ' to include a blank in a filename) and continue until the first blank character thereafter. Alternatively, filenames may be surrounded by double quotes ("...") or pointed brackets (<...>). Pointed brackets surrounding filenames merely serve to delimit filenames; they do not refer to, e.g., C++'s include path. No escape sequences are required for blanks within delimited filenames.
Sometimes directives are paralleled by analogous options. In those cases options take priority over directives.
%baseclass-preinclude header
Use header as the pathname to the file pre-included in the parser's base-class header. See the description of the --baseclass-preinclude option for details about this directive. The filename header will by default be surrounded by double quotes; it's OK, however, to provide them yourself. When the argument is surrounded by pointed brackets, #include <header> will be included.
%class-name parser-class-name
By default the parser class is named Parser. The default can be changed using this directive, which defines the name of the C++ class that will be generated. It may be defined only once, and parser-class-name must be a C++ identifier. (Readers familiar with the Bison++ program should note that bisonc++ handles the class name differently from Bison++.)
%debug
Provide parse() and its support functions with debugging code, showing the actual parsing process on the standard output stream. When included, the debugging output is active by default, but its activity may be controlled using the setDebug(bool on-off) member. Note that no #ifdef DEBUG macros are used anymore. Rerun bisonc++ without the --debug option to generate an equivalent parser not containing the debugging code.
%error-verbose
When this directive is used, the parser dumps its state stack when a syntax error is encountered by the parse() member function. Following a call of the error() function, the stack is dumped from the top of the stack (highest offset) down to its bottom (offset 0). Each stack element is prefixed by the stack element's index.
%expect number
This directive is comparable to Bison's %expect declaration. The argument number is a decimal integer. The declaration says there should be no warning if there are number shift/reduce conflicts and no reduce/reduce conflicts. The usual warning is given if there are either more or fewer conflicts, or if there are any reduce/reduce conflicts.
In general, using %expect involves these steps: first compile your grammar without %expect, using the `-V' option to get a verbose list of where the conflicts occur (bisonc++ will also print the number of conflicts); then add the %expect declaration, copying the number of (shift/reduce) conflicts printed by bisonc++.
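As a sketch (the token names are merely illustrative), a grammar accepting the classical dangling-else shift/reduce conflict could then declare:

```
%token IF ELSE EXPR STMNT

%expect 1   // the dangling-else shift/reduce conflict is acceptable

%%

stmnt:
    IF '(' EXPR ')' stmnt
|
    IF '(' EXPR ')' stmnt ELSE stmnt
|
    STMNT
;
```

With %expect 1 in place, bisonc++ stays silent about this single expected conflict, but warns again as soon as grammar changes alter the conflict count.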
%include filename
This directive includes the contents of filename into the current grammar specification file. For example, a grammar file could be organized as follows:

    %include declarations.gr
    %%
    %include rules.gr

where declarations.gr contains the declarations and rules.gr contains the rules. Each of the files included using %include may itself use %include directives. The default nesting limit for %include directives is 10, but the option --max-inclusion-depth can be used to change this default. %include directives should be specified on a line of their own, and not from within the specification of another directive or rule.
%left [ <type> ] terminal(s)
%nonassoc [ <type> ] terminal(s)
%right [ <type> ] terminal(s)

These directives are called precedence directives (see also section 5.6.7 for general information on operator precedence).
The %left, %right and %nonassoc directives are used to declare tokens and to specify their precedence and associativity, all at once.
The associativity of an operator op determines how repeated uses of the operator nest: whether `x op y op z' is parsed by grouping x with y first or by grouping y with z first. %left specifies left-associativity (grouping x with y first) and %right specifies right-associativity (grouping y with z first). %nonassoc specifies no associativity, which means that `x op y op z' is not a defined operation, and could be considered an error.
The <type> specification is optional, and specifies the type of the semantic value when a token specified to the right of a <type> specification is received. The pointed arrows are part of the type specification; the type itself must be a field of a %union specification (see section 5.6.21).
When multiple tokens are listed they must be separated by whitespace or by commas. Note that the precedence directives also serve to define token names: symbolic tokens mentioned with these directives should not be defined using %token directives.
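A hedged sketch combining these aspects (the %union field u_val and the token names are made up); the precedence directives themselves declare the tokens, so no separate %token directives are used for them:

```
%union {
    double u_val;
};

// PLUS, MINUS, STAR and SLASH are declared by these directives;
// STAR and SLASH receive a higher precedence than PLUS and MINUS,
// and all four carry `double' (u_val) semantic values
%left <u_val> PLUS MINUS
%left <u_val> STAR SLASH
```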
parse() function. It acts identically to the -l command line option and is suppressed by the --no-lines option.
struct-definition
struct LTYPE__
{
    int timestamp;
    int first_line;
    int first_column;
    int last_line;
    int last_column;
    char *text;
};

Note that defining this struct type does not imply that its fields are also assigned. Some form of communication with the lexical scanner is probably required to initialize the fields of this struct properly.
typename
should be the name of an alternate (predefined) type (e.g., size_t). It should not be used together with a %locationstruct specification. From within the parser class, this type may be used as LTYPE__.
Any text following %ltype up to the end of the line, up to the first of a series of trailing blanks or tabs, or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %ltype definition in a semicolon.
namespace
namespace. By default no namespace is defined.
If this option is used the implementation header will contain a commented-out using namespace directive for the requested namespace. This directive is overridden by the --namespace command-line option.
$-1 or $<type>-1, where type is a %union field-name. See also the sections 5.5.4 and 6.6.
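A hedged sketch, modeled after the classic negative-reference example (rule and variable names are made up): in the empty rule for `bar', $-1 refers to the semantic value of the `expr' that was recognized just before `bar':

```
foo:
    expr bar '+' expr
|
    expr bar '-' expr
;

bar:
    // empty
    {
        previous = $-1;     // hypothetical: saves the value of the
                            // `expr' preceding this empty `bar'
    }
;
```

This technique only works reliably when every context in which the rule is used guarantees the same kind of preceding element.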
token
token may be used in production rules to overrule the actual precedence of an operator in a particular production rule. Well known is the construction

expression:
    '-' expression %prec UMINUS
    { ... }

Here, the default priority and precedence of the `-' token as the subtraction operator is overruled by the precedence and priority of the UMINUS token, which is frequently defined as:

%right UMINUS

E.g., a list of arithmetic operators could consist of:

%left '+' '-'
%left '*' '/' '%'
%right UMINUS

giving '*', '/' and '%' a higher priority than '+' and '-', ensuring at the same time that UMINUS is given both the highest priority and right-associativity.
In the above production rule the operator order would cause the construction '-' expression to be evaluated from right to left, having a higher precedence than either the multiplication or the addition operators.
ntokens
The directive %required-tokens can be used to specify this number. E.g., the specification %required-tokens 10 requires the parsing function to successfully process a series of 10 tokens before another syntactic error is reported (and counted). If a syntactic error is encountered before processing 10 tokens then the counter counting the number of successfully processed tokens is reset to zero, no error is reported, but the error recovery procedure continues as usual. The number of required tokens can also be set using the option --required-tokens. By default the number of required tokens is initialized to 0.
header
header is used as the pathname of a file to include in the parser's class header. See the description of the --scanner option for details about this option. This directive also implies the automatic definition of a composed Scanner d_scanner data member in the generated parser, as well as a predefined int lex() member, returning d_scanner.yylex(). If this is inappropriate, a user defined implementation of int lex() must be provided.
The specified header file will be surrounded by double quotes if no delimiters were provided. If pointed brackets (<...>) are used, they are kept.
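A hedged sketch (the file name Scanner.h and the class name Scanner follow the conventions described with the --scanner option):

```
%scanner Scanner.h          // becomes: #include "Scanner.h"

// implied members in the generated parser class:
//
//     Scanner d_scanner;
//
//     int lex()
//     {
//         return d_scanner.yylex();
//     }
```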
nonterminal symbol
By default bisonc++ uses the LHS of the first rule in a grammar specification file as the start symbol. I.e., the parser will try to recognize that nonterminal when parsing input.
This default behavior may be overridden using the %start directive. The nonterminal symbol specifies an LHS that may be defined anywhere in the rules section of the grammar specification file. This LHS becomes the grammar's start symbol.
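A minimal sketch (rule names are made up): without the %start directive, `second' would become the start symbol, being the first rule's LHS:

```
%start input

%%

second:
    'a'
;

input:
    // empty
|
    input second
;
```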
typename
should be the name of an unstructured type (e.g., size_t). By default it is int. See YYSTYPE in bison. It should not be used if a %union specification is used. Within the parser class, this type may be used as STYPE__.
Any text following %stype up to the end of the line, up to the first of a series of trailing blanks or tabs, or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %stype definition in a semicolon.
%token terminal token(s)
%token [ <type> ] terminal token(s)

The %token directive is used to define one or more symbolic terminal tokens. When multiple tokens are listed they must be separated by whitespace or by commas.
The <type> specification is optional, and specifies the type of the semantic value when a token specified to the right of a <type> specification is received. The pointed arrows are part of the type specification; the type itself must be a field of a %union specification (see section 5.6.21).
Bisonc++ will convert the symbolic tokens (including those defined by the precedence directives (cf. section 5.6.7)) into a Parser::Tokens enumeration value (where `Parser' is the name of the generated parser class, see section 5.6.2). This allows the lexical scanner member function lex() to return these token values by name directly, and it allows externally defined lexical scanners (called by lex()) to return token values as Parser::name.
When an externally defined lexical scanner is used, it should include Parserbase.h, the parser's base class header file, in order to be able to use the Parser::Tokens enumeration type.
Although it is possible to specify explicitly the numeric code for a token type by appending an integer value in the field immediately following the token name (as in %token NUM 300) this practice is deprecated. It is generally best to let bisonc++ choose the numeric codes for all token types. Bisonc++ will automatically select codes that don't conflict with each other or with ASCII characters.
In particular, token names should not end in two underscore characters (__).
ABORT, ACCEPT, ERROR, clearin, debug, error, setDebug

Except for error, which is a predefined terminal token, these identifiers are the names of functions traditionally defined by bisonc++. The restriction on the above identifiers could be lifted, but then the resulting generated parser would no longer be backward compatible with versions before Bisonc++ 2.0.0. It appears that imposing minimal restrictions on the names of tokens is a small penalty to pay for keeping backward compatibility.
<type> symbol(s)
When %union is used to specify multiple value types, the value type of each rule returning a semantic value can be defined. For this the %type directive is used.
With this directive, symbol(s) represents one or more blank- or comma-delimited grammatical symbols (i.e., terminal and/or nonterminal symbols); type is a field name defined in the %union specification that you want to associate with the specified nonterminal(s). The type itself must be a field of a %union specification (see section 5.6.21). The pointed arrows are part of the type specification.
When the semantic value type of a terminal symbol is defined, the lexical scanner rather than the parser's actions must assign the appropriate semantic value to d_val__ just prior to returning the token. To associate terminal symbols with semantic values, terminal symbols can also be specified in a %type directive.
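A hedged sketch (field, token and rule names are made up):

```
%union {
    double u_val;
};

%token <u_val> NUMBER   // the scanner assigns d_val__.u_val
%type  <u_val> expr     // rules for `expr' return a double

%%

expr:
    expr '+' expr
    {
        $$ = $1 + $3;   // $$, $1 and $3 refer to the u_val fields
    }
|
    NUMBER
;
```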
union-definition body
The %union directive specifies the entire collection of possible data types for semantic values. The keyword %union is followed by a pair of braces containing the same thing that goes inside a union in C++. For example:

%union
{
    double u_val;
    symrec *u_tptr;
};

This says that the two alternative types are double and symrec *. They are given the names u_val and u_tptr; these names are used in the %token and %type directives to pick one of the types for a terminal or nonterminal symbol (see section 5.6.20).
Notes:
Class type objects cannot be members of C++ union definitions. Consequently, they cannot be used in %union directives either. When a class type variant is required, a pointer to a class object should be used. The responsibility for the correct construction and destruction of the pointed-to class objects lies with the programmer. See also section 5.5.2.
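A hedged sketch of the pointer approach (names are made up; `process' is a hypothetical function):

```
%union {
    std::string *u_string;      // a pointer: std::string itself cannot
                                // be a member of this union
};

%token <u_string> IDENTIFIER

%%

variable:
    IDENTIFIER
    {
        process(*$1);   // use the pointed-to string
        delete $1;      // the programmer must destroy the object
    }
;
```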
<CLASS> should be interpreted as the name of the parser's class, Parser by default, but configurable using %class-name (see section 5.6.2).
<Class>base.h, configurable using %baseclass-header (see section 5.6.22.1) or %filenames (see section 5.6.22.3);
<Class>.h, configurable using %class-header (see section 5.6.22.2) or %filenames (see section 5.6.22.3);
<Class>.ih, configurable using %implementation-header (see section 5.6.22.4) or %filenames (see section 5.6.22.3);
header
header
header
header
source
parse(). This directive is overridden by the --parse-source (-p) command-line option.
|). Each alternative should begin with a unique identifying terminal token. The terminal token may actually be hidden in a nonterminal rule, in which case that nonterminal can be used as an alias for the terminal token. In fact identical terminal tokens may be used if at some point the terminal tokens differ over different alternatives. Here are some examples:

// Example 1: plain terminal distinguishing tokens
expr:
    ID
|
    NUMBER
;

// Example 2: nested terminal distinguishing tokens
expr:
    id
|
    number
;

id:
    ID
;

number:
    NUMBER
;

// Example 3: eventually diverting routes
expr:
    ID id
|
    ID number
;

id:
    ID
;

number:
    NUMBER
;
ASCII characters define a string, and multiple white-space delimited strings are handled as one single string:

"hello"             // multiple ws-delimited strings
" " "world"
"hello world"       // same thing

Usually a parser is responsible for concatenating the individual string parts, receiving one or more STRING tokens from the lexical scanner. A string rule handles one or more incoming STRING tokens:
string:
    STRING
|
    string STRING

The above rule can be used as a prototype for recognizing a series of elements. The token STRING may of course be embedded in another rule. The generic form of this rule could be formulated as follows:

series:
    unit
|
    series unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly.
opt_statements:
    // empty
|
    opt_statements statements

The above rule can be used as a prototype for recognizing a series of elements: the generic form of this rule could be formulated as follows:

opt_series:
    // empty
|
    opt_series unit

Note that the empty element is recognized first, even though it is empty, whereafter the left-recursive alternative may be recognized repeatedly. In practice this means that an action block may be defined at the empty alternative, which is then executed prior to the left-recursive alternative. Such an action block could be used to perform initializations necessary for the proper handling of the left-recursive alternative.
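A hedged sketch of such an initializing action block (the function names are made up):

```
opt_series:
    // empty
    {
        initialize();   // hypothetical: runs once, before the
                        // first `unit' is recognized
    }
|
    opt_series unit
    {
        process();      // hypothetical: runs for every `unit'
    }
;
```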
variables:
    IDENTIFIER
|
    variables ',' IDENTIFIER

The above rule can be used as a prototype for recognizing a series of delimited elements. The generic form of this rule could be formulated as follows:

series:
    unit
|
    series delimiter unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly. In fact, this rule is not really different from the standard rule for a series, which does not hold true for the rule to recognize an optional series of delimited elements, covered in the next section.
opt_parlist:
    // empty
|
    opt_parlist ',' parameter

This rule is flawed: it requires a comma before every parameter, including the first. To define an optional series of delimited elements two rules are required: one rule handling the optional part, the other the delimited series of elements. So, the correct definition is as follows:

opt_parlist:
    // empty
|
    parlist
;

parlist:
    parameter
|
    parlist ',' parameter
;

Again, the above rule pair can be used as a prototype for recognizing an optional series of delimited elements. The generic form of these rules could be formulated as follows:
opt_series:
    // empty
|
    series
;

series:
    element
|
    series delimiter element

Note that the opt_series rule neatly distinguishes the no-elements case from the case where elements are present. Usually these two cases need to be handled quite differently, and the opt_series rule's empty alternative easily allows us to recognize the no-elements case.
expr:
    NUMBER
|
    ID
|
    expr '+' expr
|
    ...
|
    '(' expr ')'
;

This definition is simply characterized by the fact that the nonterminal expr appears within a set of parentheses, which is not too complex.
The definition of opt_statements, however, is a bit more complex. But acknowledging the fact that a statement contains, among other elements, a compound statement, and that a compound statement, in turn, contains opt_statements, an opt_statements construction can be formulated accordingly:

opt_statements:         // define an optional series
    // empty
|
    opt_statements statement
;

statement:              // define alternatives for `statement'
    expr_statement
|
    if_statement
|
    ...
|
    compound_statement
;

compound_statement:     // define the compound statement itself
    '{' opt_statements '}'
;
%class-name directive, and construct parser objects of each of the defined classes.