Chapter 5: Bisonc++ grammar files

Bisonc++ takes as input a context-free grammar specification and produces a C++ class offering various predefined members, among which the member parse(), that recognizes correct instances of the grammar.

In this chapter the organization and specification of such a grammar file is discussed in detail.

Having read this chapter you should be able to define a grammar for which Bisonc++ can generate a class, including a member that will recognize correctly formulated (in terms of your grammar) input, using all the features and facilities offered by Bisonc++ to specify a grammar. In principle this grammar will be in the class of LALR(1) grammars (see, e.g., Aho, Sethi & Ullman, 19xx).

5.1: Outline of a Bisonc++ Grammar File

The input file for the Bisonc++ utility is a Bisonc++ grammar file. Different from Bison++ and Bison grammar files, Bisonc++ grammar file consist of only two sections. The general form of a Bisonc++ grammar file is as follows:

    B() directives
    %%
    Grammar rules
        
Readers familiar with Bison will note that there is no C declaration section and no section to define Additional C code. With Bisonc++ these sections are superfluous since, due to the fact that a Bisonc++ parser is a class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so also the `additional C code' section could be omitted from the Bisonc++ grammar file.

The `%%' is a punctuation that appears in every Bisonc++ grammar file to separate the two sections.

The Bisonc++ directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Furthermore, this section is also used to specify Bisonc++ directives. These Bisonc++ directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class will be defined. All Bisonc++ directives are covered in section 5.6.

The grammar rules define how to construct nonterminal symbols from their parts. The grammar rules section contains one or more Bisonc++ grammar rules, and nothing else. See section 5.3, covering the syntax of grammar rules.

There must always be at least one grammar rule, and the first `%%' (which precedes the grammar rules) may never be omitted even if it is the first thing in the file.

5.2: Symbols, Terminal and Nonterminal Symbols

Symbols in Bisonc++ grammars represent the grammatical classifications of the language.

A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bisonc++ parser by a numeric code, and the parser's lex() member function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.

A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.

Symbol names can contain letters, digits (not at the beginning), and underscores. Bisonc++ currently does not support periods in symbol names (Users familiar with Bison may observe that Bison does support periods in symbol names, but the Bison user guide remarks that `Periods make sense only in nonterminals'. Even so, it appears that periods in symbols are hardly ever used).

There are two ways to write terminal symbols in the grammar:

Note that literal string tokens, formally supported in Bison, is not supported by Bisonc++. Again, such tokens are hardly ever encountered, and the dominant lexical scanner generators (like flex(1)) do not support them. Common practice is to define a symbolic name for a literal string token. So, a token like EQ may be defined in the grammar file, with the lexical scanner returning EQ when it matches ==.

How you choose to write a terminal symbol has no effect on its grammatical meaning. That depends only on where it appears in rules and on when the parser function returns that symbol.

The value returned by the lex() member is always one of the terminal symbols (or 0 for end-of-input). Whichever way you write the token type in the grammar rules, you write it the same way in the definition of yylex. The numeric code for a character token type is simply the ASCII code for the character, so lex() can use the identical character constant to generate the requisite code. Each named token type becomes a C++ enumeration value in the parser base-class header file, so lex() can use the corresponding enumeration identifiers. When using an externally (to the parser) defined lexical scanner, the lexical scanner should include the parser's base class header file, returning the required enumeration identifiers as defined in the parser class. So, if (%token NUM) is defined in the parser class Parser, then the externally defined lexical scanner may return Parser::NUM.

The symbol error is a terminal symbol reserved for error recovery (see chapter 8). The error symbol should not be used for any other purpose. In particular, the parser's member function lex() should never return this value.

5.3: Syntax of Grammar Rules

A Bisonc++ grammar rule has the following general form:
 
    result: 
        components
        ...
    ;
        
where result is the nonterminal symbol that this rule describes and components are various terminal and nonterminal symbols that are put together by this rule (see section 5.2). For example,

    exp:    
        exp '+' exp
    ;
        
means that two groupings of type exp, with a `+' token in between, can be combined into a larger grouping of type exp.

Whitespace in rules is significant only to separate symbols. You can add extra whitespace as you wish.

Scattered among the components can be actions that determine the semantics of the rule. An action looks like this:


    {
        C++ statements
    }
        
Usually there is only one action and it follows the components. See section 5.5.3.

Multiple rules for the same result can be written separately or can be joined with the vertical-bar character `|' as follows:


    result: 
        rule1-components
        ...
    | 
        rule2-components...
        ...
    ;
        
They are still considered distinct rules even when joined in this way.

If components in a rule is empty, it means that result can match the empty string. For example, here is how to define a comma-separated sequence of zero or more exp groupings:


    expseq:   
        // empty 
    | 
        expseq1
    ;

    expseq1:  
        exp
    | 
        expseq1 ',' exp
    ;
        
It is customary to write a comment `// empty' in each rule with no components.

5.4: Writing recursive rules

A rule is called recursive when its result nonterminal appears also on its right hand side. Nearly all Bisonc++ grammars need to use recursion, because that is the only way to define a sequence of any number of somethings. Consider this recursive definition of a comma-separated sequence of one or more expressions:

    expseq1:  
            expseq1 ',' exp
    | 
            exp
    ;
        
Since the recursive use of expseq1 is the leftmost symbol in the right hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:

    expseq1:  
            exp ',' expseq1
    | 
            exp
    ;
        
Any kind of sequence can be defined using either left recursion or right recursion, but you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space. Right recursion uses up space on the Bisonc++ stack in proportion to the number of elements in the sequence, because all the elements must be shifted onto the stack before the rule can be applied even once. See chapter 7 for further explanation of this.

Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side. For example:


    expr:     
        primary '+' primary
    |
        primary
        ;

    primary:  
        constant
    | 
        '(' expr ')'
    ;
        
defines two mutually-recursive nonterminals, since each refers to the other.

5.5: Defining Language Semantics

The grammar rules for a language determine only the syntax. The semantics are determined by the semantic values associated with various tokens and groupings, and by the actions taken when various groupings are recognized.

For example, the calculator calculates properly because the value associated with each expression is the proper number; it adds properly because the action for the grouping `x + y' is to add the numbers associated with x and y.

In this section defining the semantics of a language will be addressed. The following topics will be covered:

5.5.1: Data Types of Semantic Values

In a simple program it may be sufficient to use the same data type for the semantic values of all language constructs. This was true in the RPN and infix calculator examples (see, e.g., sections 4.1 and 4.2).

Bisonc++'s default is to use type int for all semantic values. To specify some other type, the directive %stype must be used, like this:


    %stype double
        
This directive must go in the directives section of the grammar file (see section 5.1). As a result of this, the parser class will define a private type STYPE as double: Semantic values of language constructs always have the type STYPE, and an internally used data member d_val that could be used by the lexical scanner to associate a semantic value with a returned token is defined as

    STYPE d_val;
        

5.5.2: More Than One Value Type

In most programs, you will need different data types for different kinds of tokens and groupings. For example, a numeric constant may need type int or double, while a string needs type std::string, and an identifier might need a pointer to an entry in a symbol table.

To use more than one data type for semantic values in one parser, Bisonc++ offers the following feature:

This approach has the advantage that Bisonc++ is able to enforce the correct association between semantic types and rules and/or tokens, and that Bisonc++ is able to check the type-correctness of assignments to rule results. It has the drawback that union fields cannot contain class objects as fields, and thus that such objects must be accessed through pointers. So, if the semantic value of a textual token is returned in std::string object, then the union should have a std::string * field. Consequently, the programmer must take care to delete the memory pointed to by the string * fields whenever these fields become obsolete. However, this requirement is not too hard to meet, and so union constructions containing pointers to class type objects are frequently used in practice.

5.5.3: Actions

An action accompanies a syntactic rule and contains C++ code to be executed each time an instance of that rule is recognized. The task of most actions is to compute a semantic value for the grouping built by the rule from the semantic values associated with tokens or smaller groupings.

An action consists of C++ statements surrounded by braces, much like a compound statement in C++. It can be placed at any position in the rule; it is executed at that position. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and should be used only for special purposes (see section 5.5.5).

The C++ code in an action can refer to the semantic values of the components matched by the rule with the construct $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$. (Bisonc++ translates both of these constructs into array element references when it copies the actions into the parser file.)

Here is a typical example:


    exp:
        ...
    |
        exp '+' exp
        { 
            $$ = $1 + $3; 
        }
    |
        ...    
        
This rule constructs an exp from two smaller exp groupings connected by a plus-sign token. In the action, $1 and $3 refer to the semantic values of the two component exp groupings, which are the first and third symbols on the right hand side of the rule. The sum is stored into $$ so that it becomes the semantic value of the addition-expression just recognized by the rule. If there were a useful semantic value associated with the `+' token, it could be referred to as $2.

If you don't specify an action for a rule, Bisonc++ supplies a default: $$ = $1. Thus, the value of the first symbol in the rule becomes the value of the whole rule. Of course, the default rule is valid only if the two data types match. There is no meaningful default action for an empty rule; every empty rule must have an explicit action unless the rule's value does not matter.

Using $n with n equal to zero or a negative value is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:


    foo:      
        expr bar '+' expr  
        { ... }
    | 
        expr bar '-' expr  
        { ... }
    ;

    bar:
        // empty
    |
        { 
            previous_expr = $0; 
        }
    ;
        
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo. But as mentioned: it's a risky practice, which should be avoided if at all possible. See also section 6.5.1.

5.5.4: Data Types of Values in Actions

If you have chosen a single data type for semantic values, the $$ and $n constructs always have that data type.

If you have used a %union directive to specify a variety of data types, then you must declare a choice among these types for each terminal or nonterminal symbol that can have a semantic value. Then each time you use $$ or $n, its data type is determined by which symbol it refers to in the rule. In this example,


    exp:    
        ...
    | 
        exp '+' exp
        { 
            $$ = $1 + $3; 
        }
        
$1 and $3 refer to instances of exp, so they all have the data type declared for the nonterminal symbol exp. If $2 were used, it would have the data type declared for the terminal symbol '+', whatever that might be.

Alternatively, you can specify the data type when you refer to the value, by inserting `<type>' after the `$' at the beginning of the reference. For example, if you have defined types as shown here:


    %union 
    {
        int u_int;
        double u_double;
    };
        
then you can write $<u_int>1 to refer to the first subunit of the rule as an integer, or $<u_double>1 to refer to it as a double.

5.5.5: Actions in Mid-Rule

Occasionally it is useful to put an action in the middle of a rule. These actions are written just like usual end-of-rule actions, but they are executed before the parser recognizes the components that follow them.

A mid-rule action may refer to the components preceding it using $n, but it may not (cannot) refer to subsequent components because it is executed before they are parsed.

The mid-rule action itself counts as one of the components of the rule. This makes a difference when there is another action later in the same rule (and usually there is another at the end): you have to count the actions along with the symbols when working out which number n to use in $n.

The mid-rule action can also have a semantic value. The action can set its value with an assignment to $$, and actions later in the rule can refer to the value using $n. Since there is no symbol to name the action, there is no way to declare a data type for the value in advance, so you must use the `$<...>' construct to specify a data type each time you refer to this value.

There is no way to set the value of the entire rule with a mid-rule action, because assignments to $$ do not have that effect. The only way to set the value for the entire rule is with an ordinary action at the end of the rule.

Here is an example from a hypothetical compiler, handling a let statement that looks like ``let (variable) statement' and serves to create a variable named variable temporarily for the duration of statement. To parse this construct, we must put variable into the symbol table while statement is parsed, then remove it afterward. Here is how it is done:


    stmt:   
        LET '(' var ')'
        {
            $<u_context>$ = pushSymtab();
            temporaryVariable($3); 
        }
        stmt    
        { 
            $$ = $6;
            popSymtab($<u_context>5); 
        }
        
As soon as `let (variable)' has been recognized, the first action is executed. It saves a copy of the current symbol table as its semantic value, using alternative u_context in the data-type union. Then it uses temporaryVariable() to add the new variable (using, e.g., a name that cannot normally be used in the parsed language) to the current symbol table. Once the first action is finished, the embedded statement (stmt) can be parsed. Note that the mid-rule action is component number 5, so `stmt' is component number 6.

Once the embedded statement is parsed, its semantic value becomes the value of the entire let-statement. Then the semantic value from the earlier action is used to restore the former symbol table. This removes the temporary let-variable from the list so that it won't appear to exist while the rest of the program is parsed.

Taking action before a rule is completely recognized often leads to conflicts since the parser must commit to a parse in order to execute the action. For example, the following two rules, without mid-rule actions, can coexist in a working parser because the parser can shift the open-brace token and look at what follows before deciding whether there is a declaration or not:


    compound: 
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;
        
But when we add a mid-rule action as follows, the rules become nonfunctional:

    compound: 
        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;
        
Now the parser is forced to decide whether to execute the mid-rule action when it has read no farther than the open-brace. In other words, it must commit to using one rule or the other, without sufficient information to do it correctly. (The open-brace token is what is called the look-ahead token at this time, since the parser is still deciding what to do about it. See section 7.1.

You might think that the problem can be solved by putting identical actions into the two rules, like this:


        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        { 
            prepareForLocalVariables(); 
        }
        '{' statements '}'
    ;
        
But this does not help, because Bisonc++ does not parse the contents of actions, and so it will not realize that the two actions are identical (Note that Bisonc++ never tries to understand the C++ code in an action).

If the grammar is such that a declaration can be distinguished from a statement by the first token (which is true in C, but not in C++, which allows statements and declarations to be mixed)), then one solution is to put the action after the open-brace, like this:


    compound: 
        '{'
        { 
            prepareForLocalVariables(); 
        }
        declarations statements '}'
    | 
        '{' statements '}'
    ;
        
Now the next token following a recognized '{' token would be either the first declarations token or the first statements token, which would in any case tell Bisonc++ which rule to use, thus solving the problem.

Another (much used) solution is to bury the action inside a support non-terminal symbol which recognizes the first block-open brace and performs the required preparations:


    openblock:
        '{'
        { 
            prepareForLocalVariables(); 
        }
    ;

    compound: 
            openblock declarations statements '}'
    | 
            openblock statements '}'
    ;
        
Now Bisonc++ can execute the action in the rule for subroutine without deciding which rule for compound it will eventually use. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what Bisonc++ actually does to implement mid-rule actions.

By the way, note that in a language like C++ the above construction is obsolete anyway, since C++ allows mid-block variable- and object declarations. In C++ a compound statement could be defined, e.g., as follows:


    stmnt_or_decl:
        declarations
    |
        pure_stmnt      // among which: compound_stmnt
    ;

    statements:
        // empty
    |
        statements stmnt_or_decl
    ;

    compound_stmnt:                 
        open_block statements '}'
    ;
        
Here, the compound_stmnt would begin with the necessary preparations for local declarations, which would then have been completed by the time they would really be needed by declarations.

5.6: Bisonc++ Directives

The Bisonc++ declarations section of a Bisonc++ grammar defines the symbols used in formulating the grammar and the data types of semantic values. See section 5.2.

All token type names (but not single-character literal tokens such as '+' and '*') must be declared. Nonterminal symbols must be declared if you need to specify which data type to use for the semantic value (see section 5.5.2).

The first rule in the file also specifies the start symbol, by default. If you want some other symbol to be the start symbol, you must use an explicit %start directive (see section 3.1).

In this section all Bisonc++ declarations will be discussed. Some of the declarations have already been mentioned, but several more are available. Some declarations define how the grammar will parse its input (like %left, %right); other declarations are available, defining, e.g., the name of the parsing function (by default parse()), or the name(s) of the files generated by Bisonc++.

In particular readers familiar with Bison (or Bison++) should read this section thoroughly, since Bisonc++'s directives are more extensive and different from the `declarations' offered by Bison, and the macros offered by Bison++.

5.6.1: %token: Token Type Names

The basic way to declare a token type name (terminal symbol) is as follows:

    %token name
        
Bisonc++ will convert this into a Parser::Tokens enumeration value (where `Parser' is the name of the generated parser class, see section 5.6.7). This allows the lexical scanner member function lex() to return these token values by name directly, and it allows externally defined lexical scanners (called by lex()) to return token values as Parser::name. Note that when an externally defined lexical scanner is used, it should include Parserbase.h, the parser's base class header file, in order to be able to use the Parser::Tokens enumeration type.

Alternatively, you can use %left, %right, or %nonassoc instead of %token, if you wish to specify precedence. See section 5.6.2.

It is possible to specify explicitly the numeric code for a token type by appending an integer value in the field immediately following the token name:


    %token NUM 300
        
However, this practice is strongly discouraged. It is generally best to let Bisonc++ choose the numeric codes for all token types. Bisonc++ will automatically select codes that don't conflict with each other or with ASCII characters.

In the event that the semantic stack type contain union elements, the %token or other token declaration may be augmented with the data type alternative delimited by angle-brackets (see section 5.5.2). For example:


    %union              // define semantic stack elements
    {              
      double u_val;
      symrec *u_tptr;
    };

    %token <val> NUM    // define token NUM and its type
        

5.6.2: %left, %right, %nonassoc: Operator Precedence

Use the %left, %right or %nonassoc directive to declare a token and specify its precedence and associativity, all at once. These are called precedence directives. See section 5.6.2, for general information on operator precedence. The syntaxis of a precedence directive is comparable to that of %token: either

    %left symbols...
        
or

    %left <type> symbols...
        
And indeed any of these directives serves the purposes of %token. But in addition, they specify the associativity and relative precedence for all the symbols:

5.6.3: %prec: Overruling default precedences

The construction %prec token may be used in production rules to overrule the defined precendence of an operator for a particular grammatical rule. Well known is the construction

    expression:
        '-' expression %prec UMINUS
        {
            ...
        }
                
Here, the default priority and precedence of the `-' token as the subtraction operator is overruled by the precedence and priority of the UMINUS token, which is commonly defined as

    %right UMINUS
                
following, e.g., the '*' and '/' operators.

In this case acommon list of operators consists of:


    %left '+' '-'
    %left '*' '/' '%'
    %right UMINUS
        
giving '*' '/' and '%' a higher priority than '+' and '-', ensuring at the same time that UMINUS is given both the highest priority and a right-associativity.

In the abovementioned production rule the operator-order would cause the construction


    '-' expression
        
to be evaluated from right to left, which a higher precedence than either the multiplication or the addition operators.

5.6.4: %type: Nonterminal Symbols

When %union is used to specify multiple value types, the value type of each nonterminal symbol for which values are used must be specified. This is done with a %type declaration, like this:

    %type <type> nonterminal...
        
Here nonterminal is the name of a nonterminal symbol, and type is the name given in the %union to the alternative that you want (see section 5.6.11). Any number of nonterminal symbols can be specified in the same %type declaration, if they have the same value type. Use spaces to separate the symbol names.

It is also possible to declare the value type of a terminal symbol. To do this, use the same <type> construction in a declaration for the terminal symbol. All kinds of token declarations allow <type>.

5.6.5: %expect: Suppressing Conflict Warnings

Bisonc++ normally warns if there are any conflicts in the grammar (see section 7.2), but many real grammars have harmless shift/reduce conflicts which are resolved in a predictable way and would be difficult to eliminate. It is desirable to suppress the warning about these conflicts unless the number of conflicts changes. You can do this with the %expect declaration. The declaration looks like this:

    %expect n
        
Here n is a decimal integer. The declaration says there should be no warning if there are n shift/reduce conflicts and no reduce/reduce conflicts. The usual warning is given if there are either more or fewer conflicts, or if there are any reduce/reduce conflicts.

In general, using %expect involves these steps:

Now Bisonc++ will stop annoying you about the conflicts you have checked, but it will warn you again if changes in the grammar result in additional conflicts.

5.6.6: %start: The Start-Symbol

Bisonc++ assumes by default that the start symbol for the grammar is the first nonterminal specified in the grammar specification section. The programmer may override this restriction with the %start directive as follows:

    %start symbol
        

5.6.7: %class-name: Choosing the Name of the Parser Class

By default, Bisonc++ will define a parser-class named Parser. If this is not appropriate, the directive %class-name may be used as follows:

    %class-name parser-class-name
        
This directive defines the name of the C++ class that will be generated. It may be defined only once.

If you're familiar with the Bison++ program, please note:

5.6.8: %namespace: Using a namespace

%namespace namespace
Define the parser class in the namespace namespace. By default no namespace is defined. If this options is used the implementation header will contain a commented out using namespace directive for the requested namespace. This directive is overridden by the --namespace command-line option.

5.6.9: %negative-dollar-indices: Using constructions like $-1

%negative-dollar-indices
Accept (do not generate warnings) zero- or negative dollar-indices in the grammar's action blocks. Zero or negative dollar-indices are commonly used to implement inherited attributes, and should normally be avoided. When used, they can be specified like $-1 or $<type>-1, where type is a %union field-name. See also the sections 5.5.3 and 6.5.1.

5.6.10: %stype: The semantic stack type

%stype typename
The type of the semantic value of tokens. The specification typename should be the name of an unstructured type (e.g., size_t/*unsigned*/). By default it is int. See YYSTYPE in bison. It should not be used if a %union specification is used. Within the parser class, this type may be used as STYPE.

5.6.11: %union: The Collection of Value Types

The %union directive specifies the entire collection of possible data types for semantic values. The keyword %union is followed by a pair of braces containing the same thing that goes inside a union in C++. For example:

    %union {
      double u_val;
      symrec *u_tptr;
    };
        
This says that the two alternative types are double and symrec *. They are given names u_val and u_tptr; these names are used in the %token and %type directives to pick one of the types for a terminal or nonterminal symbol (see section 5.6.4).

Notes:

5.6.12: %lsp-needed: Using the default location type

%lsp-needed
Defining this causes bisonc++ to include code into the generated parser using the standard location stack. The token-location type defaults to the following struct, defined in the parser's base class when this directive is specified:

    struct LTYPE
    {
        int timestamp;
        int first_line;
        int first_column;
        int last_line;
        int last_column;
        char *text;
    };
           

5.6.13: %ltype: Using an existing location type

%ltype typename
Specifies a user-defined token location type. If %ltype is used, typename should be the name of an alternate (predefined) type (e.g., size_t/*unsigned*/). It should not be used if a %locationstruct specification is used (see below). Within the parser class, this type may be used as LTYPE.

5.6.14: %locationstruct: Specifying a dedicated location struct

%locationstruct struct-definition
Defines the organization of the location-struct data type LTYPE. This struct should be specified analogously to the way the parser's stacktype is defined using %union (see below). The location struct is named LTYPE. If neither locationstruct nor LTYPE is specified, the aforementioned default struct is used.

5.6.15: %scanner: Using a standard scanner interface

%scanner header
Use header as the pathname to the file pre-included in the parser's class header. See the description of the --scanner option for details about this option. Similar to the convention adopted for this argument, header will (by default) be surrounded by double quotes. However, when the argument is surrounded by pointed brackets #include <header> will be included. Note that using this directive implies the definition of a composed Scanner d_scanner data member into the generated parser, as well as a predefined int lex() member, returning d_scanner.yylex(). If this is inappropriate, a user defined implementation of int lex() must be provided.

5.6.16: Directives controlling generated filenames

Unless otherwise specified, bisonc++ uses the name of the parser-class to derive the names of most of the files it may generate. Below <CLASS> should be interpreted as the name of the parser's class, Parser by default, but configurable using %class-name (see section 5.6.7

5.6.16.1: %baseclass-header: the Parser's Base Class header %baseclass-header header
Defines the pathname of the file containing the parser's base class. This directive is overridden by the --baseclass-header or -b command-line options.

5.6.16.2: %class-header: the Parser's Class header %class-header header
Defines the pathname of the file containing the parser class. This directive is overridden by the --class-header or -c command-line options.

5.6.16.3: %implementation-header: the Implementation Header %implementation-header header
Defines the pathname of the file containing the implementation header. This directive is overridden by the --implementation-header or -i command-line options.

5.6.16.4: %parsefun-source: the parse() function's sourcefile %parsefun-source source
Defines the pathname of the file containing the parser member parse(). This directive is overridden by the --parse-source or -p command-line options.

5.6.16.5: %filenames: the generic filenames %filenames header
Defines the generic name of all generated files, unless overridden by specific names. This directive is overridden by the --filenames or -f command-line options.

5.6.17: %baseclass-preinclude: header included by the baseclass

%baseclass-preinclude header
Use header as the pathname to the file pre-included in the parser's base-class header. See the description of the --baseclass-preinclude option for details about this option. Like the convention adopted for this argument, header will (by default) be surrounded by double quotes. However, when the argument is surrounded by pointed brackets #include <header> will be included.

5.6.18: %debug: Adding debugging code to the `parse()' member

%debug
Provide parse() and its support functions with debugging code, showing the actual parsing process on the standard output stream. When included, the debugging output is active by default, but its activity may be controlled using the setDebug(bool on-off) member. Note that no #ifdef DEBUG macros are used anymore. By rerunning bisonc++ without the --debug option an equivalent parser is generated not containing the debugging code.

5.6.19: %error-verbose: (To Do) Dumping the parser stack

%error-verbose
(This directive is not yet available)
if defined the parser stack is dumped when an error is detected by the parse() member function.

5.6.20: %lines: Insert `#line' directives

%lines
Put #line preprocessor directives in the file containing the parser's parse() function. It acts identically to the -l command line option, and is suppressed by the --no-lines option.

5.7: Basic Grammatical Constructions

In the following sections several basic grammatical constructions are presented in their prototypical and generic forms. When these basic constructions are used to construct a grammar, the resulting grammar will usually be accepted by bisonc++. Moreover, these basic constructions are frequently encountered in programming languages. When designing your own grammar, try to stick as closely as possible to the following basic grammatical constructions.

5.7.1: Plain Alternatives

Simple alternatives can be specified using the vertical bar (|). Each alternative should begin with a unique identifying terminal token. The terminal token may actually be hidden in a non-terminal rule, in which case that nonterminal can be used as an alias for the non-terminal. In fact identical terminal tokens may be used if at some point the terminal tokens differ over different alternatives. Here are some examples:

    // Example 1:  plain terminal distinguishing tokens
    expr:
        ID        
    |
        NUMBER
    ;

    // Example 2:  nested terminal distinguishing tokens
    expr:
        id
    |
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;

    // Example 3:  eventually diverting routes
    expr:
        ID
        id
    |
        ID
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;
        

5.7.2: One Or More Alternatives, No Separators

A series of elements normally use left-recursion. For example, C++ supports string concatenation: series of double quote delimited ASCII characters define a string, and multiple white-space delimited strings are handled as one single string:

    "hello"         // multiple ws-delimited strings
    " " 
    "world"

    "hello world"   // same thing
        
Usually a parser is responsible for concatenating the individual string-parts, receiving one or more STRING tokens from the lexical scanner. A string rule handles one or more incoming STRING tokens:

    string:
        STRING
    |
        string STRING
        
The above rule can be used as a prototype for recognizing a series of elements. The token STRING may of course be embedded in another rule. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series unit
        
Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly.

5.7.3: Zero Or More Alternatives, No Separators

An optional series of elements also use left-recursion, but the single element alternative remains empty. For example, in C++ a compound statement may contain statements or declarations, but it may also be empty:

    opt_statements:
        // empty
    |
        opt_statements statements
        
The above rule can be used as a prototype for recognizing a series of elements: the generic form of this rule could be formulated as follows:

    opt_series:
        // empty
    |
        opt_series unit
        
Note that the empty element is recognized first, even though it is empty, whereafter the left-recursive alternative may be recognized repeatedly. In practice this means that an action block may be defined at the empty alternative, which is then executed prior to the left-recursive alternative. Such an action block could be used to perform initializations necessary for the proper handling of the left-recursive alternative.

5.7.4: One Or More Alternatives, Using Separators

A series of elements which are separated from each other using some delimiter again normally uses left-recursion. For example, a C++ variable definition list consists of one or more identifiers, separated by comma's. If there is only one identifier no comma is used. Here is the rule defining a list using separators:

    variables:
        IDENTIFIER
    |
        variables ',' IDENTIFIER
        
The above rule can be used as a prototype for recognizing a series of delimited elements. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series delimiter unit
        
Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly. In fact, this rule is not really different from the standard rule for a series, which does not hold true for the rule to recognize an optional series of delimited elements, covered in the next section.

5.7.5: Zero Or More Alternatives, Using Separators

An optional series of elements, separated from each other using delimiters occurs frequently in programming languages. For example, C++ functions have parameter lists which may or may not require arguments. Since a parameter list may be defined empty, an empty alternative is required. However, a simple generalization of the optional series construction (section 5.7.3) won't work, since that would imply that the first argument is preceded by a separator, which is clearly not the intention. So, the following construction is wrong:

    opt_parlist:
        // empty
    |
        opt_parlist ',' parameter
        
To define an optional series of delimited elements two rules are required: one rule handling the optional part, the other the delimited series of elements. So, the correct definition is as follows:

    opt_parlist:
        // empty
    |
        parlist
    ;

    parlist:
        parameter
    |
        parlist ',' parameter
    ;
        
Again, the above rule pair can be used as a prototype for recognizing an optional series of delimited elements. The generic form of these rules could be formulated as follows:

    opt_series:
        // empty
    |
        series
    ;

    series:
        element
    |
        series delimiter element
        
Note that the opt_series rules neatly distinguishes the no-element case from the case were elements are present. Usually these two cases need to be handled quite differently, and the opt_series rules empty alternative easily allows us to recognize the no-elements case.

5.7.6: Nested Blocks

Finally, we add the nested rule to our bag of rule-tricks. Again, nested rules appear frequently: parenthesized expressions and compound statements are two very well known examples. These kind of rules are characterized by the fact that the nested variant is itself an example of the element appearing in the nested variant. The definition of a statement is actually a bit more complex than the definition of an expression, since the statement appearing in the compound statement is in fact an optional series of elements. Let's first have a look at the nested expression rule. Here it is, in a basic form:

    expr:
        NUMBER
    |
        ID
    |
        expr '+' expr
    |
        ...
    |
        '(' expr ')'
    ;
        
This definition is simply characterized that the non-terminal expr appears within a set of parentheses, which is not too complex.

The definition of opt_statements, however, is a bit more complex. But acknowledging the fact that a statement contains among other elements a compound statement, and that a compound statement, in turn, contains opt_statements an opt_statements construction can be formulated accordingly:


    opt_statements:         // define an optional series
        // empty
    |
        opt_statements statement
    ;
        
    statement:              // define alternatives for `statement'
        expr_statement
    |
        if_statement
    |
        ...
    |
        compound_statement
    ;

    compound_statement:     // define the compound statement itself
        '{' opt_statements '}'
    ;
        

5.8: Multiple Parsers in the Same Program

Most programs that use Bisonc++ parse only one language and therefore contain only one Bisonc++ parser. But what if you want to parse more than one language with the same program? Since Bisonc++ constructs class rather than a parsing function, this problem can be easily solved: simply define your second (third, fourth, ...) parser class, each having its own unique class-name, using the %class-name directive, and construct parser objects of each of the defined classes.