
3. The Ml4 language

3.1. Structure of Ml4

Ml4 (Meta-Language of Depot4) is based on EBNF. In fact, it is a true extension of the notation introduced by N. Wirth.

An Ml4 program as such does not exist; instead there is a set of Ml4 productions, which can be translated independently of each other. Thus, Ml4 features production-based (rule-based) modularization. Translators are configured dynamically by selecting one of the rules as the root production, i.e., the nonterminal on the left-hand side of that production is declared the start symbol of the grammar. This forms a usable language processor. The choice of the language root can be changed dynamically at any time. Together with the dynamic loading of modules, this makes it possible to test parts of the language processor before all productions have been implemented.
The formal description of the EBNF in section 3.3. is already a set of valid Ml4 productions, which can be translated by the Depot4 metalanguage translator into executable code. By choosing Rule as start symbol we get an acceptor for EBNF productions.

An Ml4 production has the general structure:

   identifier = sourceExpression -> targetExpression .
where the part starting with ->, called the target production, is optional and may occur more than once.
The possibility of describing both the source and the target language by similar means is one of Ml4's unique features. All the structure operators of EBNF are available on both the source and the target side. There are further extensions such as declarations, assignment and call statements, etc.
Elements of Ml4 that are not part of basic EBNF (i.e., extensions) are separated from one another and from the basic elements by semicolons. (In fact, semicolons may be used within the EBNF parts, too.)

3.2. Lexical elements

The Ml4 language is free-format, that is, white space and newlines can be used anywhere between lexemes.
Ml4 is case sensitive.
Comments are delimited by (* and *) and may be nested, i.e., a comment must not contain any unbalanced *), even if quoted.

3.2.1 Identifiers

Identifiers (in the meta-language) must start with a letter and can contain only letters and digits. They can be of arbitrary length, but there may be an (implementation-dependent) limit on the number of characters that are recognised as significant.
To make the intermediate code as readable as possible, most identifiers are kept during translation into the host language. It is therefore wise to avoid identifiers that may conflict with keywords of possible host languages. (E.g., DO or do are very likely to clash and thus should be avoided.)
There is also a set of reserved identifiers of the Ml4 language:
ARR DCL END FLEX GLOBVAR IMPORTS INIT MODULE REC TYPE TYPEND USE VAR

3.2.2 Literal terminals

Literals are written as strings and have to be enclosed in apostrophes, e.g. ':=' or 'BEGIN'. If the character ' itself is needed in a literal, it has to be written twice, e.g. '''Hallo!'''.
For special (non-printable) symbols there are substitutions:
	\n	newline		chosen according to the current operating system
	\c	carriage return
	\l	line feed
	\f	form feed
	\t	horizontal tabulator
	\v	vertical tabulator
	\B	bell
	\b	backspace
	\\	\
	\0	null byte
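As a rough illustration of the substitution table and the doubled-apostrophe rule, the following Python sketch decodes the text between the enclosing apostrophes of a literal. It is not Depot4 code: the name decode_literal is hypothetical, and leaving unknown backslash sequences unchanged is an assumption that the text above does not confirm.

```python
import os

# Substitutions from the table above; \n follows the current operating system.
SUBSTITUTIONS = {
    "n": os.linesep,
    "c": "\r",   # carriage return
    "l": "\n",   # line feed
    "f": "\f",   # form feed
    "t": "\t",   # horizontal tabulator
    "v": "\v",   # vertical tabulator
    "B": "\a",   # bell
    "b": "\b",   # backspace
    "\\": "\\",  # backslash itself
    "0": "\0",   # null byte
}


def decode_literal(body):
    """Decode the text between the enclosing apostrophes of a literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        nxt = body[i + 1] if i + 1 < len(body) else ""
        if ch == "'" and nxt == "'":
            out.append("'")                  # doubled apostrophe
            i += 2
        elif ch == "\\" and nxt in SUBSTITUTIONS:
            out.append(SUBSTITUTIONS[nxt])   # known substitution
            i += 2
        else:
            out.append(ch)                   # assumption: anything else is kept
            i += 1
    return "".join(out)
```

For the literal '''Hallo!''' the body passed to the function is ''Hallo!'', which decodes to 'Hallo!'.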
In the source part of an Ml4 production, an additional feature is available:
For literals consisting of several characters it is possible to allow abbreviations. For that, the string has to start with $i, where i is one of the digits 1...9 giving the minimum number of characters required. So '$3INTEGER' accepts the strings INT, INTE, INTEG, INTEGE, and INTEGER, but not INTEGERS. If a literal starts with the character '$', then the character '$' has to be written twice, e.g. '$$a$' accepts the string $a$.
To guarantee the separation of a literal from the succeeding text the separating symbol $ can be applied after the string. For instance, the literal 'REAL' would also accept the starting sequence of a string REALUM, but writing 'REAL' $ prevents this.
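The abbreviation rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Depot4 implementation: the function names are hypothetical, and the check compares whole words, which roughly corresponds to a literal followed by the separating symbol $.

```python
def parse_literal_spec(spec):
    """Return (text, minimum_length) for a literal written in a source part."""
    if spec.startswith("$$"):
        text = spec[1:]                  # '$$' stands for one leading '$'
        return text, len(text)
    if len(spec) >= 2 and spec[0] == "$" and spec[1] in "123456789":
        return spec[2:], int(spec[1])    # '$i' allows abbreviation to i characters
    return spec, len(spec)               # ordinary literal: no abbreviation


def accepts_word(spec, word):
    """True if word is the literal itself or a permitted abbreviation of it."""
    text, minimum = parse_literal_spec(spec)
    return len(word) >= minimum and text.startswith(word)
```

With this sketch, accepts_word('$3INTEGER', 'INTEG') holds, while 'IN' (too short) and 'INTEGERS' (not a prefix of the literal) are rejected.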
A literal can also be defined as the value of a variable of type SYM.

3.3. Base EBNF

EBNF is a production-based notation, i.e., the grammar is described by the set of productions P. The sets N (nonterminals) and T (terminals) are implicitly given by the occurrence of their elements. The start symbol n0 has to be marked. All productions with the same nonterminal on the left-hand side are collected into one EBNF production. An EBNF production has the general form:
   identifier = expression.
identifier is the name of a nonterminal. The dot marks the end of the production. expression is the collection of all right-hand sides of the productions with identifier on the left-hand side.
For this it is possible to use the following structure operators:
  1. Sequencing
    A sequence is simply represented by concatenation of elements (terminals and nonterminals).
    	B = A1 A2 ... An.
  2. Alternation
    Alternates are separated by vertical bars.
    	B = A1 | A2 | ... | An.
  3. Option (Zero-or-one occurrence)
    Optional parts are enclosed in square brackets.
       B = [ A ].
    Because this notation overlaps with indexing in the enhanced language (Ml4), an option following an identifier must be separated from it by at least one space (or other delimiter).

  4. Iteration (Zero-or-many occurrence)
    Iteration may be directly represented (without using recursion) by curly braces. Iteration is useful to express left association when left recursion is forbidden (as in Depot4).
       B = { A }.
  5. Parentheses (grouping)
    Parentheses may be used to override the relative priority of alternation and sequencing. E.g.
       B = 'a' ('b'|'c') 'd'.
    describes the language {'abd', 'acd'}.
These structure operators can be combined in any production. An example is the syntax of the EBNF itself:
	Rule   = ident '=' Expr '.'.
	Expr   = Term { '|' Term }.
	Term   = { Factor }.
	Factor = string | ident |'('Expr')'|'['Expr']'|'{'Expr'}'.
Ml4 allows empty productions, i.e. empty = . is valid.
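To make the four productions above more concrete, here is a minimal recursive-descent acceptor for them in Python, with Rule as the start symbol. It is a sketch under assumptions, not what the Depot4 translator generates: the tokenizer, the class name Acceptor, and the function accepts_rule are hypothetical, and comments and the $ features of literals are not handled.

```python
import re

# ident, string ('...' with doubled apostrophes), or one of the EBNF symbols
TOKEN = re.compile(
    r"\s*(?:(?P<ident>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<string>'(?:[^']|'')*')"
    r"|(?P<sym>[=.|()\[\]{}]))"
)


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


class Acceptor:
    CLOSING = {"(": ")", "[": "]", "{": "}"}

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def expect_sym(self, sym):
        if self.peek() != ("sym", sym):
            raise SyntaxError(f"expected {sym!r} at token {self.pos}")
        self.pos += 1

    def rule(self):        # Rule = ident '=' Expr '.'.
        if self.peek()[0] != "ident":
            raise SyntaxError("expected identifier")
        self.pos += 1
        self.expect_sym("=")
        self.expr()
        self.expect_sym(".")

    def expr(self):        # Expr = Term { '|' Term }.
        self.term()
        while self.peek() == ("sym", "|"):
            self.pos += 1
            self.term()

    def term(self):        # Term = { Factor }.
        while self.factor():
            pass

    def factor(self):      # Factor = string | ident | '('Expr')' | '['Expr']' | '{'Expr'}'.
        kind, value = self.peek()
        if kind in ("ident", "string"):
            self.pos += 1
            return True
        if kind == "sym" and value in self.CLOSING:
            self.pos += 1
            self.expr()
            self.expect_sym(self.CLOSING[value])
            return True
        return False


def accepts_rule(text):
    """True if text is exactly one production of the EBNF described above."""
    try:
        acceptor = Acceptor(tokenize(text))
        acceptor.rule()
        return acceptor.pos == len(acceptor.tokens)
    except SyntaxError:
        return False
```

Because Term is an iteration that may match nothing, the sketch also accepts the empty production empty = . mentioned above.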

3.4. Types

There are three kinds of data types in Ml4: primitive types, structured types, and opaque types. The latter are of interest only in connection with the import feature and allow a simple handling (declaration, parameter passing) of foreign data.

3.4.1 Primitive types

These types are predefined in every Ml4 production.
INT - actually $3INTEGER
The integer type is mapped to the corresponding type of the host language.
REAL
Floating-point type, likewise mapped to a real type of the host language. There is only limited support for this type, e.g., no conversions are available.
BOOL - actually $4BOOLEAN
The boolean type, whose values are TRUE and FALSE
SYM
A type, whose values are symbols, i.e., possibly limited strings of characters. They can be, at least, concatenated and compared.
TXT
This is the basic target type. Values of this type can only be concatenated.
TAR - actually $3TARGET
Strictly speaking, this is not a primitive type, as it is the result of a nonterminal's invocation, i.e., the collection of its targets. There are no operations other than assignment and selecting a certain target with the trailing underscore notation.

3.4.2 Structured types

There are three kinds of data structures: records, arrays, and flexible arrays.
RECORD - actually $3RECORD
The syntax of a record definition follows that of Pascal/Modula (without variants).
ARRAY - actually $3ARRAY
An array is a fixed-size vector of elements (which may in turn be of array type). Only the number of elements is given; indices start at zero.
FLEX - also FLEX1, resp. FLEX2
Flexible arrays (FLEXes) are suited to store information in connection with EBNF's iteration construct. They have no upper limit for the number of their elements. Accessing a non-existing element f[i] will create it.
The index range of FLEX starts with one.
The use of this data type requires runtime management of the associated data structures and is therefore expected to be less efficient than ordinary arrays in most host languages.
Flexible arrays may be of dimension one (FLEX / FLEX1) or two (FLEX2).
Examples:
ARR 20 OF INT
REC name, town: SYM; age: INT; gen: BOOLEAN END
FLEX OF SYM
FLEX2 OF RECORD F: FLEX OF INTEGER;
                AAR: ARRAY 10 OF ARRAY 5 OF REAL END
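The behaviour described for one-dimensional FLEXes (no upper limit, indices starting at one, elements created on first access) can be imitated in Python as follows. This is an illustrative sketch only: the class name Flex and its sparse dictionary storage are assumptions, and the text does not say whether Depot4 also creates the elements below an accessed index.

```python
class Flex:
    """Sparse, one-based flexible array that creates elements on first access."""

    def __init__(self, make_default):
        self._make_default = make_default   # called to create a missing element
        self._items = {}

    def __getitem__(self, index):
        if index < 1:
            raise IndexError("FLEX indices start at one")
        if index not in self._items:
            # Accessing a non-existing element creates it.
            self._items[index] = self._make_default()
        return self._items[index]

    def __setitem__(self, index, value):
        if index < 1:
            raise IndexError("FLEX indices start at one")
        self._items[index] = value

    def __len__(self):
        # Number of elements created so far (not the highest index).
        return len(self._items)
```

For example, after names = Flex(str), reading names[3] yields an empty string and leaves exactly one element in the structure.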

3.5. Productions and modules

Productions are the standard units of Ml4. For efficiency reasons, Ml4 allows several productions to be combined into a module. This is restricted to groups of nonterminals of which only one is called from outside, while the others are needed only locally. The name of the module has to be the name of the nonterminal called from outside.
The use of modules should be restricted to closed parts of grammars that are not expected to change. In particular, overriding a production in a module can have surprising results.

Productions and modules are translated separately. A nonterminal used on the right-hand side of a rule does not need to be defined yet.
The nonterminal's identifier (i.e., the left-hand side of the rule) becomes the identifier of all the generated entities (host language source file, object file, etc.). This means that if there are two productions with the same left-hand side, translating one of them may overwrite the implementation of the other.

There exists just one global name space for all productions. Thus, it is useful to follow a naming convention when defining new rules.
Depot4 supports prefixing, i.e., if an identifier contains a lowercase letter or digit followed by a capital letter, everything before the first such capital is regarded as a common prefix. (E.g. Dp4 is the prefix of Dp4ExAmPlE1.) This avoids name collisions and is also applied for automatic structuring (into subsystems/packages) if the host system supports this.
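The prefix rule can be expressed compactly with a regular expression. The following Python sketch is an illustration, not Depot4 code; the function name common_prefix is hypothetical, and returning an empty string for identifiers without a prefix is an assumption.

```python
import re

# A lowercase letter or digit immediately followed by a capital letter.
_BOUNDARY = re.compile(r"[a-z0-9][A-Z]")


def common_prefix(identifier):
    """Return everything before the first capital that follows a lowercase letter or digit."""
    match = _BOUNDARY.search(identifier)
    if match is None:
        return ""                            # assumption: no boundary, no prefix
    return identifier[: match.start() + 1]   # keep the lowercase letter or digit itself
```

For Dp4ExAmPlE1 the first boundary is between 4 and E, so the function returns Dp4.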

3.6. Source elements

These are all the elements that may occur in the source part of an Ml4 production. The basic structure of this part is given by EBNF.

3.6.1 Class terminals

Class terminals are entities like identifier, number, etc. that are usually called terminals although they stand for a whole class of symbols. There is no real distinction here. Purely for efficiency, a set of prefabricated class terminals, implemented directly in the host language, is supplied.
Class terminals play an important role with respect to extensibility. That is why there is no special tool for them. As for nonterminals, for every class terminal there is a module containing the corresponding acceptor procedure, called the scanning procedure. Detailed information about the terminal is needed only in this module; to the outside it is known only by its name. This enables the flexible extension of Depot4, even beyond the processing of texts.

The following terminals are supplied with every Depot4 implementation:

digit
Accepts a single digit character.
'0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
letter
Accepts a single letter character.
hexdigit
Accepts a single hexadecimal digit character, a digit or one of the letters A - F or a - f.
id, ident
Identifier according to letter {letter|digit}. See also Treatment of Keywords.
intn, num, integer
Integer number in decimal format:
digit{digit}
number
Real (float) number:
digit{digit}['.'{digit}[('E'|'D') ['+'|'-']digit{digit}]]
filename
Accepts a filename. A filename may be quoted, in which case it is a synonym for str. (To conform to str, filename has a second target defined, too.) Otherwise, it accepts a sequence of characters not containing space ' ' (or any character with a code less than space), the character '#', and the platform specific file separator character. (Thus, the result may still be an invalid filename string.)
line
Accepts all characters to the end-of-line (inclusive), for an example see Make linefeed ended comments
ident4root
Accepts an identifier and takes its value as the name of a nonterminal, which is called afterwards. This is useful for tests; for an example see Individual test of productions in complex environments
any
This is not a real class terminal, as it has no translation. (Thus, its implementation cannot be changed.) Its purpose is to skip one significant character, i.e., one that is neither a comment nor white space. (Example: Skip text until keyword/symbol)
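The EBNF definitions given for digit, hexdigit, ident, integer and number translate directly into regular expressions. The Python sketch below is meant only to make these definitions easy to try out; the constant names and the helper is_token are hypothetical, and restricting letter to ASCII is an assumption, since the text does not define the letter class.

```python
import re

DIGIT = r"[0-9]"
LETTER = r"[A-Za-z]"        # assumption: ASCII letters only
HEXDIGIT = r"[0-9A-Fa-f]"
IDENT = rf"{LETTER}(?:{LETTER}|{DIGIT})*"    # letter {letter|digit}
INTEGER = rf"{DIGIT}+"                       # digit{digit}
# digit{digit}['.'{digit}[('E'|'D') ['+'|'-']digit{digit}]]
NUMBER = rf"{DIGIT}+(?:\.{DIGIT}*(?:[ED][+-]?{DIGIT}+)?)?"


def is_token(pattern, text):
    """True if the whole text matches the given class terminal pattern."""
    return re.fullmatch(pattern, text) is not None
```

Note that, according to the definition of number, an exponent part is only possible after a decimal point, so 12.5E+3 matches NUMBER as a whole while 12E3 does not.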

Handling of string lexemes is a little trickier with regard to delimiters and escaped characters. Sometimes the value of the string is of interest, sometimes its full representation as in the source. That is why a string terminal sterm always has three results:
sterm_0: the value as SYM
sterm_, sterm_1: the value as TAR
sterm_2: the representation as found in the source as TAR
Escaping also has several pitfalls. To circumvent most of them, pairs of backslashes do not escape the following character.
Example: For "ab\\"cd" stresc accepts only "ab\\" giving the value ab\\.
In contrast "ab\\\"cd" will deliver ab\\cd.

str, string
String: a sequence of characters quoted by ' or ", which can contain the quoting character if it is doubled. The value is only the accepted sequence of characters, without quotes.
E.g., "abc""de'" has the value abc"de'
stresc, stringesc
Accepts a string quoted by " or ' which may contain the quoting character escaped by \.
"abc\"de" is accepted as abc"de.
dqstr
Accepts a string quoted by " which may contain the quoting character, if doubled.
"abc""de" is accepted as abc"de.
dqstresc
Accepts a string quoted by " which may contain the quoting character escaped by \.
"abc\"de" is accepted as abc"de.
sqstr
Accepts a string quoted by ' which may contain the quoting character, if doubled.
'abc''de' is accepted as abc'de.
sqstresc
Accepts a string quoted by ' which may contain the quoting character escaped by \.
'abc\'de' is accepted as abc'de.
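For the variants that use doubling (str, dqstr, sqstr), the scanning can be sketched in Python as shown below. This is a simplified illustration, not the Depot4 scanning procedure: the function name scan_doubled and its return convention (value plus number of consumed characters) are assumptions, and the backslash-escaping variants are deliberately left out.

```python
def scan_doubled(text, quotes="'\""):
    """Scan a quoted string at the start of text.

    Returns (value, consumed_length), or None if text does not start with
    a permitted quote or the string is not terminated.
    """
    if not text or text[0] not in quotes:
        return None
    quote = text[0]
    value = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == quote:
            if i + 1 < len(text) and text[i + 1] == quote:
                value.append(quote)        # doubled quote: one literal quote
                i += 2
                continue
            return "".join(value), i + 1   # closing quote reached
        value.append(ch)
        i += 1
    return None                            # unterminated string
```

Passing quotes="'" restricts the sketch to the behaviour described for sqstr, and quotes='"' to that of dqstr.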
In general, the target of any class terminal is a copy of its source. Usually class terminals deliver a symbol value (target no. 0), too. This symbol value is in most cases the sequence of accepted characters. However, if the translator is configured to treat capitalization as insignificant, all letters in the symbol value of id/ident are converted to upper case. This delivers a canonical representation, which is useful in symbol tables etc.
Treatment of Keywords
In general, there is no syntactical difference between ordinary identifiers and keywords. So one is likely to run into trouble with an expression like [ident] 'END', as the closing END will be accepted as an identifier. There are at least two ways to overcome this. First, one can change the grammar, e.g. into (ident 'END'|'END'), which solves the problem.
Depot4 now has a more convenient solution. One can write all the words that are not identifiers into a file. By default, Depot4 looks in the current directory for a file NoIdent.lst (this can be changed in module Dp4Config) and excludes all the words it contains from being recognized as identifiers.
It is also possible to change this list dynamically. This is achieved by calling procedure NoIdents from module Dp4Stdlex. The argument is the filename string. This call discards the previous list and installs a new one, which will be empty if no file was found.
There are two more procedures that can be useful in the case of nesting. Procedure pushNoIdents(filenameString) additionally saves the old settings, while popNoIdents() restores the saved status.

The syntax of an exclusion file is simple: just list the words, separated by spaces or newlines. If capitalization is insignificant, upper case letters must be used.

Example:
IMPORTS Dp4Stdlex;
lextst = Dp4Stdlex.NoIdents('PascalNoIdent.lst');
  { ident } 'END' 
         Dp4Stdlex.pushNoIdents('CNoIdent.lst')
  { ident } 'end' 
         Dp4Stdlex.popNoIdents(); 
  { ident } 'UNTIL'
.
With file PascalNoIdent.lst containing at least END and UNTIL, and file CNoIdent.lst containing end, this rule will accept
alfa beta END END UNTIL end end UNTIL
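To see why this input is accepted, the effect of NoIdents, pushNoIdents and popNoIdents on the rule lextst can be traced with a small Python model. It is only a sketch: the class NoIdentStack and the function lextst are hypothetical stand-ins, the exclusion lists are passed as sets instead of being read from files, and the input is assumed to be already split into words.

```python
class NoIdentStack:
    """Models the current exclusion list plus the saved lists."""

    def __init__(self):
        self.current = frozenset()
        self.saved = []

    def no_idents(self, words):           # discard old list, install new one
        self.current = frozenset(words)

    def push_no_idents(self, words):      # save old list, install new one
        self.saved.append(self.current)
        self.current = frozenset(words)

    def pop_no_idents(self):              # restore the saved list
        self.current = self.saved.pop()

    def is_ident(self, word):
        return word.isalnum() and word[0].isalpha() and word not in self.current


def lextst(words):
    """Mimics the rule lextst on a list of words."""
    lex = NoIdentStack()
    pos = 0

    def idents_then(keyword):             # { ident } 'keyword'
        nonlocal pos
        while pos < len(words) and lex.is_ident(words[pos]):
            pos += 1
        if pos < len(words) and words[pos] == keyword:
            pos += 1
            return True
        return False

    lex.no_idents({"END", "UNTIL"})       # stands for PascalNoIdent.lst
    if not idents_then("END"):
        return False
    lex.push_no_idents({"end"})           # stands for CNoIdent.lst
    if not idents_then("end"):
        return False
    lex.pop_no_idents()
    return idents_then("UNTIL") and pos == len(words)
```

In the middle part END and UNTIL count as ordinary identifiers because only end is excluded there; after popNoIdents the first list applies again, so the second end is an identifier and UNTIL closes the rule.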

3.6.2 Nonterminals

As mentioned earlier, there is technically no real distinction between class terminals and nonterminals. Thus, all features described here may also be applied to class terminals. However, while one can compile a reasonable set of basic class terminals, there is no such basic set of nonterminals.
Nonterminals are called by their names. It is possible to call them recursively, i.e., the left-hand side of a production may appear in the right-hand side. Then it has to be ensured that the recursion is finite. (Due to Ml4's operational model automatic detection is not possible in general.)
Often it is necessary to distinguish in a production several instances of the same nonterminal.

There are two possibilities to modify nonterminals in the description of the source:

  1. Renaming
    By Name:NT the nonterminal NT gets the new designation Name. Renaming is usually used if a nonterminal occurs on several positions in a production:
        Prod = F1:Fact [ Op F2:Fact]
          -> F1_ [Op_ F2_].
    But renaming can also be used the other way round. It is possible to give different nonterminals in different branches of an alternative the same name if they are to be treated equally:
        Stat = S:IfStat | S:AssStat | S:ForStat
          -> S_.
  2. Indexing
    By NT[index] it is possible to provide nonterminals with indices. This is usually used in connection with iterations:
        DclSeq = { Dcl[i] }
          -> { Dcl_[i] }.
    Every nonterminal can have at most two indices. To distinguish index brackets from option brackets, the following rule applies: there must be no space, newline, or comment between the nonterminal and the opening index bracket, whereas there must be a delimiter between a nonterminal and an opening option bracket.
Indexing and renaming may be combined:
    Seq = { D:ConstDef[i] | D:TypeDef[i] }
      -> { D_[i] }.

3.6.3 Skipping

Normally, there can be an arbitrary number of delimiters, i.e. spaces, newlines and comments between two successive terminals in the source. They are automatically skipped. But sometimes it is necessary to suppress this behaviour. Then all source elements in front of which delimiters are not allowed have to be enclosed in < and >. In this way class terminals can easily be implemented, too.
   Integer = digit < { digit } >.
Enclosing the iteration in < ... > prohibits delimiters inside the number. The first digit is an exception, so that delimiters in front of the number can still be skipped.
This feature may also be used to parse formatted, e.g., tab separated, input.
Skipping areas may be nested. Skipping is then disabled until the outermost scope is left.
Remark: It is essential to select the correct stretch of text, because < and > imply internal actions. So, e.g., <digit [digit>] will not work correctly if only one digit was accepted.
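The effect of Integer = digit < { digit } > can be mimicked in Python: delimiters are skipped before the first digit, but not between the following digits. The sketch below is an illustration under assumptions, not Depot4 code; the function names are hypothetical, and only white space is treated as a delimiter (comments are ignored).

```python
DIGITS = "0123456789"


def skip_delimiters(text, pos):
    """Skip white space; comments are not modelled in this sketch."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def accept_integer(text, pos=0):
    """Return (digits, new_position), or None if no integer starts here."""
    pos = skip_delimiters(text, pos)       # allowed in front of the first digit
    start = pos
    if pos >= len(text) or text[pos] not in DIGITS:
        return None
    pos += 1
    # Inside < ... > no delimiters are skipped, so the first blank ends the number.
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    return text[start:pos], pos
```

Applied to the text "  12 34", the sketch accepts 12 and stops in front of the blank; 34 would be a second integer.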

3.6.4 Procedure calls

Although Ml4 aims at translation descriptions that are largely independent of the system's actual host language, it does not take a purist's view and offers an interface to basic system features. The interface is defined by procedures (or routines or methods) encapsulated in a unit called a module, e.g. a class in Java or an Ada package. Calls to such procedures may be embedded in the source text of the parsing part. The import of modules is described in 3.14.1.
Procedure calls must contain a (possibly empty) parameter list.
Because of this generality, there is no simple way of type checking, which is therefore deferred to the host language compiler. This solution is not fully satisfactory. Nevertheless, it is usually not too hard to link the error message to the appropriate Ml4 code position.
Further versions may offer additional means.

Intrinsic procedures are described in 3.7.2, regardless of whether they are proper procedures (i.e., have no return value) or not.

Any variable can be assigned a value of its type. There are some automatic conversions into type SYM. Be aware that the translator does not know anything about the type of imported entities. Thus it cannot insert any conversion or check compatibility.

The result of an assignment is not reverted during backtracking. (This permits unbounded lookahead.)

3.7. Expressions

Expressions can be built according to rules similar to those of Pascal, i.e., with three levels of priority. Unary operators (sign, NOT) have the highest priority.

3.7.1 Operators

Add operators
+, -, OR
+ also serves for concatenation (types SYM and TXT)
Multiplication operators
*, DIV, MOD, &
& is the logical AND operator
Comparisons
=, #, <=, >=, <, >
# stands for not equal

3.7.2 Intrinsic procedures

Functional procedures
Pseudo conversion functions
These functions perform NO real conversion. Instead, they tell the translator the resulting type of an expression in places where it cannot be deduced from the code, i.e., in the case of external modules. In fact, one could call them type assertions. If the actual return type of a function call does not match the name of the conversion function, this will result in a host language compilation error.
Proper procedures

3.7.3 Predeclared variables and constants

The following variables are predeclared in every Ml4 production and, thus, must not be explicitly redeclared. They serve as default control variables (see there), but can - with some care - be applied elsewhere, too.
Integer: N, O, i, c

Variables with special function (all of type SYM):

Boolean constants: FALSE, TRUE




© J. Lampe 1997-2010