Implementing a DSL

For the end user, using a DSL is surely easier than writing XML code; however, the developer of the DSL is now left with the task of implementing it.

Implementing a DSL means developing a program that is able to read text written in that DSL, parse it, process it, and then possibly interpret it or generate code in another language. Depending on the aim of the DSL, this may require several phases, but most of these phases are typical of all implementations.

In this section we only hint at the main concepts of implementing a DSL.

Note

From now on, throughout the book we will not distinguish, unless strictly required by the context, between DSL and programming language.

Parsing

First of all, when reading a program written in a programming language, the implementation has to make sure that the program respects the syntax of that language.

To this end, we need to break the program into tokens. Each token is a single atomic element of the language; this can be a keyword (such as class in Java), an identifier (such as a Java class name), or a symbol name (such as a variable name in Java).

For instance, in the preceding example, employed is a keyword; the parentheses are operators (which can be seen as special kinds of keywords as well). All the other elements are literals (James is a string literal and 50 is an integer literal).

The process of converting a sequence of characters into a sequence of tokens is called lexical analysis, and the program or procedure that performs such an analysis is called a lexical analyzer, lexer, or simply a scanner. This analysis is usually implemented using regular expression syntax.

Having the sequence of tokens from the input file is not enough: we must make sure that they form a valid statement in our language, that is, they respect the syntactic structure expected by the language.

This phase is called parsing or syntactic analysis.

Let us recall the DSL to describe the name and age of various people and a possible input text:

James Smith (50)
John Anderson (40) employed

In this example each line of the input must respect the following structure:

  • two string literals
  • the operator (
  • one integer literal
  • the operator )
  • the optional keyword employed

In our language, tokens are separated by white spaces, and lines are separated by a newline character.

You can now deduce that the parser relies on the lexer.
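To make this layering concrete, here is a minimal hand-written sketch in Java (the class and method names are our own invention, not taken from any real tool): a regex-based lexer that breaks one line into tokens, and a parser that checks the token sequence against the structure listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a regex-based lexer producing the tokens of one line,
// and a parser validating the structure NAME NAME '(' INT ')' ['employed'].
class PersonLine {

    // The lexer: each alternative of the regular expression is a token kind
    // (a word, an integer literal, or a parenthesis).
    static List<String> lex(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[A-Za-z]+|\\d+|[()]").matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // The parser: never looks at raw characters, only at the token sequence.
    static boolean parse(List<String> tokens) {
        if (tokens.size() != 5 && tokens.size() != 6) {
            return false;
        }
        boolean ok = tokens.get(0).matches("[A-Za-z]+")
                && tokens.get(1).matches("[A-Za-z]+")
                && tokens.get(2).equals("(")
                && tokens.get(3).matches("\\d+")
                && tokens.get(4).equals(")");
        if (tokens.size() == 6) {
            ok = ok && tokens.get(5).equals("employed");
        }
        return ok;
    }
}
```

Even this tiny example shows the dependency: the parser consumes the sequence of tokens produced by the lexer, not the raw input characters.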

If you have never implemented a programming language, you might be scared at this point by the task of implementing a parser, for instance, in Java. Such fear is justified, since it is not an easy task. The DSL we just used as an example is very small (and still it would require some effort to implement), but what about a DSL that also has to deal with, say, arithmetic expressions? In spite of their apparently simple structure, arithmetic expressions are recursive by nature, thus a parser implemented in Java would have to deal with recursion as well (and, in particular, it would have to avoid endless loops).
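As a taste of what writing a parser by hand involves, the following sketch (again, all names are our own) is a recursive-descent parser that recognizes and evaluates sums and products of integer literals. Note how the naively recursive rule "an expression is an expression plus an expression" is rewritten with loops, precisely to avoid the endless recursion mentioned above:

```java
// Hypothetical recursive-descent parser for sums and products of integer
// literals, with tokens separated by spaces. The left-recursive rule
// "expression: expression '+' expression" is rewritten as a loop, so the
// parser never calls itself without first consuming input.
class ExprParser {
    private final String[] tokens;
    private int pos = 0;

    ExprParser(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    // expression: term ('+' term)*
    int parseExpression() {
        int value = parseTerm();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++; // consume '+'
            value += parseTerm();
        }
        return value;
    }

    // term: INT ('*' INT)*   -- so '*' binds tighter than '+'
    private int parseTerm() {
        int value = parseInt();
        while (pos < tokens.length && tokens[pos].equals("*")) {
            pos++; // consume '*'
            value *= parseInt();
        }
        return value;
    }

    private int parseInt() {
        return Integer.parseInt(tokens[pos++]);
    }
}
```

Splitting the grammar into two layered rules also solves operator precedence by construction, which is exactly the kind of detail that makes hand-written parsers laborious for larger languages.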

There are tools to deal with parsing so that you do not have to implement a parser by hand. In particular, there are DSLs to specify the grammar of the language, and from this specification they automatically generate the code for the lexer and parser (for this reason, these tools are called parser generators or compiler-compilers). In this context, such specifications are called grammars. A grammar is a set of rules that describe the form of the elements that are valid according to the language syntax.

Here are some examples of tools for specifying grammars.

Bison and Flex (Levine 2009) are the most famous tools in the C context: from a high-level specification of the syntactic structure (Bison) and of the lexical structure (Flex) of a language, they generate the parser and the lexer in C, respectively. Bison is an implementation of Yacc (Yet Another Compiler-compiler, Brown et al. 1995), and there are variants for other languages as well, such as Racc for Ruby.

In the Java world, the most well-known is probably ANTLR (pronounced Antler, ANother Tool for Language Recognition) (Parr 2007). ANTLR allows the programmer to specify the grammar of the language in one single file (without separating the syntactic and lexical specifications in different files), and then it automatically generates the parser in Java.

Just to have an idea of what the specification of grammars looks like in ANTLR, here is the (simplified) grammar for an expression language for arithmetic expressions (with sum and multiplication):

expression
  : INT
  | expression '*' expression
  | expression '+' expression
;

Even if you do not understand all the details, it should be straightforward to get its meaning: an expression is either an integer literal, or (recursively) two expressions with an operator in between (either * or +).

From such a specification, you automatically get the Java code that will parse such expressions.

The Abstract Syntax Tree (AST)

Parsing a program is only the first stage in a programming language implementation. Once the program is checked as correct from the syntactic point of view, the implementation will have to do something with the elements of the program.

First of all, the overall correctness of a program cannot always be determined during parsing. One of the correctness checks that usually cannot be performed during parsing is type checking, that is, checking that the program is correct with respect to types. For instance, in Java, you cannot assign a string value to an integer variable; more generally, you can assign to a variable only instances of its declared type or of its subclasses.

Trying to embed type checking in a grammar specification could either make the specification more complex, or it could be simply impossible, since some type checks can be performed only when other program parts have already been parsed.

Type checking is part of the semantic analysis of a program. This often includes managing the symbol table, that is, for instance, handling the variables that are declared and that are visible only in specific parts of the program (think of fields in a Java class and their visibility in methods).
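To give an idea of what managing the symbol table can look like, here is a hedged sketch (the class and method names are our own, not from any real framework): a stack of scopes mapping declared names to their types, where a lookup searches the innermost scope first, mimicking the visibility rules just described.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a scoped symbol table: each scope maps declared
// variable names to their types; lookup searches from the innermost scope
// outward, so inner declarations shadow outer ones.
class SymbolTable {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    void enterScope() {
        scopes.push(new HashMap<>());
    }

    void exitScope() {
        scopes.pop(); // declarations of the innermost scope become invisible
    }

    void declare(String name, String type) {
        scopes.peek().put(name, type);
    }

    // Returns the declared type, or null if the name is not visible here.
    String lookup(String name) {
        for (Map<String, String> scope : scopes) {
            if (scope.containsKey(name)) {
                return scope.get(name);
            }
        }
        return null;
    }
}
```

A type checker would then use lookup both to reject references to undeclared variables and to compare the declared type with the type of the assigned expression.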

For these reasons, during parsing, we should also build a representation of the parsed program and store it in memory so that we can perform the semantic analysis on the memory representation without needing to parse the same text over and over again. A convenient representation in memory of a program is a tree structure called the Abstract Syntax Tree (or AST). The AST represents the abstract syntactic structure of the program. In this tree, each node represents a construct of the program.

Once the AST is stored in memory, the DSL implementation will not need to parse the program anymore, and it can perform all the additional semantic checks on the AST, and if they all succeed, it can use the AST for the final stage of the implementation, which can be the interpretation of the program or code generation.

In order to build the AST we need two additional things.

We need the code for representing the nodes of such a tree; if we are using Java, this means that we need to write some Java classes, typically one for each language construct. For instance, for the expression language we might write one class for the integer literal and one for the binary expression. Remember that since the grammar is recursive, we need a base class for representing the abstract concept of an expression. For example,

interface Expression { }

class Literal implements Expression {
  Integer value;
  // constructor and set methods...
}

class BinaryExpression implements Expression {
  Expression left, right;
  String operator;
  // constructor and set methods...
}

Then, we need to annotate the grammar specification with actions that construct the AST during parsing. These actions are basically Java code blocks embedded in the grammar specification itself. The following is just a (simplified) example, and it does not necessarily respect the actual ANTLR syntax:

expression:
    INT { $value = new Literal(Integer.parseInt($INT.text)); }
  | left=expression '*' right=expression {
        $value = new BinaryExpression($left.value, $right.value);
        $value.setOperator("*");
    }
  | left=expression '+' right=expression {
        $value = new BinaryExpression($left.value, $right.value);
        $value.setOperator("+");
    }
;
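Once the actions have built the AST, the final stage can be, for example, interpretation. The following self-contained sketch repeats the AST classes from before, completed with the constructors and set methods that the earlier snippets elided, and adds a hypothetical eval method (our own illustrative name) that recursively evaluates an expression tree:

```java
// Self-contained sketch: the AST classes shown earlier, completed with
// constructors and set methods, plus a recursive interpreter over the tree.
interface Expression { }

class Literal implements Expression {
    Integer value;

    Literal(Integer value) {
        this.value = value;
    }
}

class BinaryExpression implements Expression {
    Expression left, right;
    String operator;

    BinaryExpression(Expression left, Expression right) {
        this.left = left;
        this.right = right;
    }

    void setOperator(String operator) {
        this.operator = operator;
    }
}

class Interpreter {
    // Recursively evaluates the tree: a Literal is its own value; a
    // BinaryExpression combines the values of its two subtrees.
    static int eval(Expression e) {
        if (e instanceof Literal) {
            return ((Literal) e).value;
        }
        BinaryExpression b = (BinaryExpression) e;
        int l = eval(b.left);
        int r = eval(b.right);
        return b.operator.equals("*") ? l * r : l + r;
    }
}
```

Note how the interpreter never sees the program text at all: once the AST is in memory, both the semantic checks and the final interpretation (or code generation) work only on the tree.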