previous         next         contents

2. An example: nanoPascal

This chapter illustrates the construction of a language processor by use of Depot4 for a tiny language. We start from a parser for that language and continue by developing it into a translator, which produces semantically equivalent C code.

2.1. The language

We take a small language called nanoPascal (a very poor subset of Pascal) from R. Gerady: Experimental Comparison of Some Parsing Methods. SIGPLAN Notices 22, 79-88 (1987) . Its syntax is given in Listing 1.

The taken EBNF notation is based on N. Wirth's suggestions. All literal terminals must be quoted. Identifiers for productions (= nonterminals) must start with a letter and can contain digits or letters. There are some restrictions regarding the number of significant characters and forbidden names. See the installation guide for detailed information. (Identifiers are propagated into the host language programs to make them better readable. Thus, one has to avoid clashes with the host language's keywords.)

	program = block '.'.
	block = ['VAR' ident {',' ident} ';']
	    'BEGIN' statement {';' statement } 'END' .
	statement = 'WRITE' expression |
	    'READ' ident|
	     ident ':=' expression |
	    'BEGIN' statement {';' statement } 'END' |
	    'IF' expression 'THEN' statement ['ELSE' statement] |
	    'WHILE' expression 'DO' statement | .
	expression = simpleExpr
	    [('='|'<'|'>'|'<='|'>='|'<>') simpleExpr ].
	simpleExpr = ['+'|'-'] term
	    {('+'|'-'|'OR') term }.
	term = factor {('*'|'DIV'|'AND') factor}.
	factor = ident | num | '(' expression ')'| 'NOT' factor.
Listing 1: Syntax of nanoPascal

2.2. The Parser

In producing a parser, it is the first step to adjust the grammar due to the restrictions of the parsing scheme. First, this means elimination of left recursion for which some algorithms are known. The second restriction comes from the sequential evaluation of alternative branches: If two (or more) of them start with equal prefixes then the longer must precede to shorter. Our example needs only two changes with regard to these restrictions:

(1) In expression the order of comparisons must be rearranged to:

expression = simpleExpr [('='|'<='|'>='|'<'|'>'|'<>') simpleExpr ].

(2) Rather similar, in factor one has to move the ident branch to the end (or at least after the 'NOT' factor branch):

factor = num | '(' expression ')'| 'NOT' factor | ident.

Here, this is enough to create a parser for the nanoPascal language from the slightly modified EBNF description.

Now you can generate a parser for this simple language by processing each of its rules, i.e. by their translation. In fact, you are producing a bundle of mutually dependent parsers as - in principle - every nonterminal can act as root of a (sub-)language.

If this language processor parses some input it delivers just a boolean value (true - text belongs to the defined language resp. false - otherwise). This state may be sent to the log.

2.3. Translation

2.3.1. Basics

The description of translations is based on a fairly singular approach. Its main focus is the similarity between source and target description. This is achieved by recalling that grammar productions can serve not only for parsing but also to synthesize text. Each rule of the source language is supplied with a production for the target language. By this, the application of a rule in parsing gets a value: its translation, which is defined by the associated target production. To distinguish this value from the rule, an underscore is added. For simple tokens such as identifier, number, etc. this value is equal to the accepted source text. E.g., a rule ex1= ident -> ident_. would accept an identifier and translate it into itself. The translation of a sequence of numbers like 1 2 3 into a reverted list 3,2,1 would be described by
ex2= num [ ex2] -> [ex2_ ','] num_.
Although we are speaking about structural translations, every implementation must start with an attempt to map the semantic concepts of the source onto those of the target. Obviously, this holds without regard to the applied tools. The seeming easiness with that structural translations may be described must not obscure that the real challenge is usually the identification of suitable matches for the different semantic concepts involved. Thus, let us stress once more the importance of a careful preceding conceptual design.

2.3.2. Defining the translation

In our case the conceptual considerations are simple. All involved concepts, e.g., variable declaration and statement types, have clearly their respective counterparts in C.

We will demonstrate the step by step implementation of a translator. It is useful to insert some default (maybe empty) target generation into each rule first to avoid error messages when applying productions, which have not been supplemented yet. Now it is possible to check the target description of every translation rule individually.

Let us begin with the translation of variable declarations, which are an optional component of block. Here nothing is known about their type but it seems sensible to assume them as integer ones. I.e., some text VAR a,b,c; should be translated into int a,b,c;.

Thus we will get

block = ['VAR' id:ident {',' ident[i]} ';']
    'BEGIN' statement {';' statement} 'END'
-> ['int ' id_ {',' ident_[i]} ';\n'].
Remember, spaces and line feeds have to be added explicitly. The notion ident occurs twice in the source part of rule block, once in an iterative clause, where the single incarnations are distinguished by indices. This pattern is also applied for statement, but as the default variable for the number of iterations N is already used for the number of declarations, here the value is stored in the (otherwise unused default variable for alternatives) c. Now the complete translation for block is:

block = ['VAR' ident {',' ident[i]} ';']
    'BEGIN' statement {/..c/';' statement[i] } 'END' 
-> ['int ' ident_ {',' ident_[i]} ';\n']
    'main() \n{\n' statement_ {/..c/ '\n' statement_[i] } '}\n'.
Making block the root production, one gets the following translation (assuming ->'STATEMENT' the default target of statement):
VAR a,b,c;                     int a,b,c;
BEGIN                          main(){
 a:= 2;                 into   STATEMENT
 IF b<3 THEN c:= a             STATEMENT
END                            }
Next is to figure out a translation for statement. One also may start with a simplified translation like
statement = 'WRITE' expression |
    'READ' ident|
    'BEGIN' statement[1] N:=0;  {';' statement[i+1] } 'END' |
    'IF' expression 'THEN' st:statement ['ELSE' ste:statement] |
    'WHILE' expression 'DO' st:statement | 
    ident ':=' expression |
-> 'WRITE ' |
   'READ ' |
   'COMPOUND ' |
   'IF '|
   'WHILE ' |
   ':= '|
   'EMPTY '.

and then substitute the strings by correct descriptions. The handling of semicolons and line feeds needs some care for a fairly pretty looking result. The translation of the compound statement uses an index shift to avoid renaming. The empty statement is silently included as last (empty) branch in both the source and the target.

In the same way a translator for expression etc. can be developed. Listing 2 shows the complete description. T he rule for term illustrates the use of an additional data structure for preserving branch information in an iterative alternative structure.

program = block '.' -> '#include <stdio.h>\n' block_ .
block = ['VAR' ident {',' ident[i]} ';']
    'BEGIN' statement {/..c/';' statement[i] } 'END' 
->  ['int ' ident_ {',' ident_[i]} ';\n']
    'main() \n{\n'
    statement_ {/..c/ '\n' statement_[i] }
    '}\n'.
statement = 'WRITE' expression |
    'READ' ident|
    'BEGIN' statement[1] {';' statement[i+1] } 'END' |
    'IF' expression 'THEN' st:statement ['ELSE' ste:statement] |
    'WHILE' expression 'DO' st:statement | 
    ident ':=' expression |
-> 'printf("%d",' expression_ ');' |
   'scanf("%d", &' ident_ ');' |
   '{'  {/..N+1/ statement_[i] '\n' } '}' |
   'if (' expression_ ') ' st_  ['else ' ste_ ]|
   'while (' expression_ ') ' st_ |
   ident_ '=' expression_ ';'| .
expression = simpleExpr
   [('='|'<='|'>='|'<>'|'<'|'>') sE: simpleExpr ]
-> simpleExpr_
   [('=='|'<='|'>='|'!='|'<'|'>') sE_ ].
simpleExpr =
   VAR op: FLEX OF INT;
   ['+'|'-'] term
   {(/op[i]/'+'|'-'|'OR') term[i] }
-> ['+'|'-'] term_
   {(/op[i]/'+'|'-'|'||') term_[i] }.
term =
   VAR op: FLEX OF INT;
   factor {(/op[i]/'*'|'DIV'|'AND') f2:factor[i]}
-> factor_ {(/op[i]/ '*'|'/'|'&&') f2_[i]}.
   factor = num | '(' expression ')'| 'NOT' factor | ident
-> num_ |'(' expression_ ')'|'!' factor_ | ident_ .
Listing 2: Complete nanoPascal to C translator

A small example translation is given in Listing 3. Of course, some things could still be improved, but let us mention that up to now this translation is defined on a purely syntactic level. Whether these are pros or cons, depends from the application context. The main advantage of the restriction to syntax is its simplicity. A possible drawback may be that so called semantic errors, e.g., missing declaration, in the source are not detected. The consequences may vary depending from how strict generated targets are checked in subsequent steps. If errors are reliably caught later on, it may be a matter of taste where to find them out in the most sensible way.

      SOURCE                      TARGET

VAR X, Y, H, GGT;               #include <stdio.h>
BEGIN                           int X,Y,H,GGT;
  READ X; READ Y;               main()
  WHILE X<>0 DO                 {
    BEGIN                       scanf("%d", &X);
     IF X<Y THEN BEGIN          scanf("%d", &Y);
       H:= X; X:= Y; Y:= H      while (X!=0) {if (X<Y) {H=X;
     END;                       X=Y;
     X:= X-Y;                   Y=H;
     WHILE X<Y DO               X:= X-Y }
       X:= X-Y                  X=X-Y;
    END;                        while (X<Y) X=X-Y;
  GGT:= Y; WRITE GGT            }
END.                            GGT=Y;
                                printf("%d",GGT);}
Listing 3: Example translation

Finally, let us add a translator which converts nanoPascal programs into XML representations reflecting the grammar. It can be applied to the same source. Its output gives a striking impression of the advantages of syntax based notations if they are intended to be read by humans. Remarkably, without adding any information or making decoding significantly easier the size of the XML description is more than twice the size of the original document.

program = block '.' -> '<program>' block_ '\n</program>'.

block = ['VAR' ident {',' ident[i]} ';']
       'BEGIN' statement {/..c/';' statement[i] } 'END'
-> '\n<block>' ['\n<ident value= "' ident_ '"/>' {'\n<ident value= "' ident_[i]'"/>'} ]
   statement_ {/..c/  statement_[i] }
   '\n</block>'.

statement = 'WRITE' expression |
    'READ' ident|
    'BEGIN' statement[1] {';' statement[i+1] } 'END' |
    'IF' expression 'THEN' st:statement ['ELSE' ste:statement] |
    'WHILE' expression 'DO' st:statement |
    ident ':=' expression |
-> '\n<statement action= "write">' expression_ '\n</statement>' |
    '\n<statement action= "read"> <ident value= "' ident_ '"/>\n</statement>' |
    {/..N+1/ statement_[i]  } |
    '\n<statement action= "if">' expression_ ' ' st_  [' ' ste_ ]'\n</statement>'|
    '\n<statement action= "while"> ' expression_ ' ' st_ '\n</statement>'|
    '\n<statement action= "="><ident value= "' ident_ '"/> ' expression_  '\n</statement>'| .
expression = simpleExpr
    [('='|'<='|'>='|'<>'|'<'|'>') sE: simpleExpr ]
-> '\n<expression>' simpleExpr_
    ['<operator="'('=='|'<='|'>='|'!='|'<'|'>')'"/>' sE_ ] '\n</expression>'.
simpleExpr =
    VAR op: FLEX OF INT;
    ['+'|'-'] term
    {(/op[i]/'+'|'-'|'OR') term[i] }
-> '\n<simpleexpression>' ['<sign value="' ('+'|'-') '"/> '] term_
    {(/op[i]/'+'|'-'|'||') term_[i] } '\n</simpleexpression>'.
term =
    VAR op: FLEX OF INT;
    factor {(/op[i]/'*'|'DIV'|'AND') f2:factor[i]}
-> '\n<factor> 'factor_ {'<operation="'(/op[i]/ '*'|'/'|'&&')'"/>' f2_[i]} '\n</factor>'.
factor = num | '(' expression ')'| 'NOT' factor | ident
-> '<number value="' num_ '"/>'
   | expression_
   |'<neagtion> ' factor_  '</negation>'
   | '<ident value= "' ident_  '"/> '.
Listing 4: Example translation converting the structure into XML notation

Suggestion: One may consider it a interesting exercise to write a translator which builds XML DTDs (or similar) from EBNF grammars.


    previous         next         contents

© J. Lampe 1997-2010