Stork, An Example Programming Language, Lesson 2: Expression Parsing

  • strict warning: Non-static method view::load() should not be called statically in /home/sigpboo7/public_html/modules/views/views.module on line 842.
  • strict warning: Declaration of views_handler_field_comment::init() should be compatible with views_handler_field::init(&$view, $options) in /home/sigpboo7/public_html/modules/views/modules/comment/views_handler_field_comment.inc on line 50.
  • strict warning: Declaration of views_plugin_row::options_validate() should be compatible with views_plugin::options_validate(&$form, &$form_state) in /home/sigpboo7/public_html/modules/views/plugins/views_plugin_row.inc on line 135.
  • strict warning: Declaration of views_plugin_row::options_submit() should be compatible with views_plugin::options_submit(&$form, &$form_state) in /home/sigpboo7/public_html/modules/views/plugins/views_plugin_row.inc on line 135.

Welcome back!

For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 2 in a series of 10, so if you’re just joining now, you may want to check out lesson 1 to gear up a bit for this story.

Lesson 2 covers the basics of parsing numerical expressions using the tokenizer implemented in Lesson 1. Evaluation of these expressions will be handled in Lesson 3.

The code for this lesson is available on github under the tag lesson2, and you can follow the discussion about this lesson on reddit.

What is Parsing?

If a programming language is a (more) convenient language humans can use to describe tasks to computers, then parsing is the process of turning a program’s tokens into sentences, or “abstract syntax trees” (ASTs), that the computer can understand. For example, consider this simple mathematical expression for the area of a circle with radius 5:

3.14159*5*5

For this program text, the tokens would be 3.14159, *, 5, *, 5, and a parser would build the following AST for it:

Clearly, parsing is essentially “sentence diagramming” for a programming language.

This lesson covers how to transform a token stream into a parse tree like the above example. Looking at parse trees — the syntactic relationships among tokens — instead of the tokens themselves will make evaluating those expressions much easier in the next lesson.

How does Parsing Work?

Just like a sequence of different kinds of words — nouns, verbs, adjectives, etc. — make up a sentence, a sequence of different kinds of tokens make up a statement in a programming language. Parsers are the programs that convert a program’s tokens into ASTs that represent the semantics of that program.

Parsers are a little more complex than tokenizers, but are still fairly simple. Essentially, a parser reads in tokens and interprets them as programming language statements until it runs out of tokens. If it can’t interpret a sequence of tokens as a valid statement, then that’s an error. (Such “unexpected token” errors are a common (and frustrating) source of compile errors.) This lesson covers how to build a parser for statements that evaluate simple numerical expressions, but many more statement types will be covered in this course as more features are added to Stork.

ASTs are simple trees that have tokens as their nodes. When parsing numerical expressions, the nodes of ASTs are either numbers (like 3.14159) or operators (like *). Operators take arguments called operands, and so operators must be interior nodes with their operands as their children. Operands can be either number nodes or other operator nodes, recursively. Numbers are themselves values, so they have no children and are necessarily leaf nodes.

Example of Valid and Invalid Numerical Expression ASTs

Handling Ambiguity

Interestingly, simple numerical expressions are ambiguous, which is to say that there is more than one valid AST for a given valid numerical expression. For example, the mathematical expression for the area of a circle with radius 5 given above:

3.14159*5*5

can actually be parsed legally in two different ways:

Both ASTs are perfectly valid. In this case, both expression trees would result in the same value because multiplication is associative, but for other expressions, like 1-2-3, this ambiguity is unacceptable because evaluating the different ASTs would result in different answers. Programming languages define operator associativity and order of operations rules for all operators to eliminate this kind of ambiguity, but ultimately these rules boil down to following conventions taught in grade school.

Operator associativity is actually a familiar concept from arithmetic, but is never given a name in most classrooms. Operator associativity resolves ambiguity when the same operator (or similar operators) are applied one after the other in sequence, as in 1+2+3. There are two kinds of associativity: left-associative, where a sequence of operators binds left-first, and right-associative, where a sequence of operators binds right-first. Addition, for example, is left-associative: In the expression 1+2+3, the 1+2 is evaluated first, so the expression evaluates like (1+2)+3 rather than 1+(2+3). Unary negative, on the other hand, is right-associative, so --3 (“negative negative 3”) evaluates like -(-3). Associativity resolves ambiguity by simple convention.

Order of operations is also familiar, but most people probably remember it as PEMDAS: parentheses, exponents, multiplication, division, addition, subtraction. In an expression that uses multiple operators, like 1+2*3, order of operations dictates which part of the expression should be evaluated first. In this example, multiplication has higher precendence than addition, so it binds more tightly and is evaluated first. So, 1+2*3 is evaluated as 1+(2*3) rather than (1+2)*3. Again, order of operations resolves ambiguity by simple convention.

Implementing Stork’s Numerical Expression Parser

Creating Stork’s parsing engine requires the implementation of two primary abstractions: a parser to perform the token-to-AST conversion, and the AST object that will be emitted. These abstractions are implemented in a few new classes:

src/main/java/com/sigpwned/stork/parse/Parser.java — Converts tokens to ASTs
src/main/java/com/sigpwned/stork/ast/AST.java — AST parent class
src/main/java/com/sigpwned/stork/ast/Expr.java — Expression AST parent class
src/main/java/com/sigpwned/stork/ast/expr/BinaryOperatorExpr.java — Expression AST for binary operators, like + and -
src/main/java/com/sigpwned/stork/ast/expr/FloatExpr.java — Expression AST for floating-point numbers
src/main/java/com/sigpwned/stork/ast/expr/IntExpr.java — Expression AST for integers
src/main/java/com/sigpwned/stork/ast/expr/UnaryOperatorExpr.java — Expression AST for unary operators, like unary -

Implementing ASTs: AST.java, Expr.java, and children

ASTs are recursive data structures, which makes them extremely flexible and able to model even complex program syntax. The top-level AST class called AST (apropos!) is essentially a trivial implementation of an unsorted n-ary tree:

public abstract class AST {
    private List<AST> children;

    protected AST() {
        this.children = new ArrayList<AST>();
    }

    public List<AST> getChildren() {
        return children;
    }

    protected void addChild(AST child) {
        getChildren().add(child);
    }
}

This data structure can represent every syntactic feature Stork will use, but does not itself do anything useful. Rather, Stork’s syntactic constructs are implemented by subclasses of AST, like BinaryOperatorExpr which implements a binary operator expression:

public class BinaryOperatorExpr extends Expr {
    public static enum Operator {
        PLUS("+"), MINUS("-"), TIMES("*"), DIVIDE("/"), MOD("%");

        private String text;

        private Operator(String text) {
            this.text = text;
        }

        public String getText() {
            return text;
        }
    }

    private Operator operator;

    public BinaryOperatorExpr(Operator operator, Expr left, Expr right) {
        this.operator = operator;
        addChild(left);
        addChild(right);
    }

    public Operator getOperator() {
        return operator;
    }

    public Expr getLeft() {
        return (Expr) getChildren().get(0);
    }


    public Expr getRight() {
        return (Expr) getChildren().get(1);
    }
}

This BinaryOperatorExpr class implements five operators with arity 2 (hence “binary”), per the BinaryOperatorExpr.Operator inner class: +, -, *, /, and %. The class also exposes a convenient two-expression constructor and accessors for each operand, which makes BinaryOperatorExpr much easier to use than the AST class.

Careful readers will also notice that BinaryOperatorExpr’s parent is not AST, but rather Expr. This is a simple class that acts as a “bridge” between the raw AST type and syntactic construct implementations:

public abstract class Expr extends AST {
    public BinaryOperatorExpr asBinaryOperator() {
        return (BinaryOperatorExpr) this;
    }

    public UnaryOperatorExpr asUnaryOperator() {
        return (UnaryOperatorExpr) this;
    }

    public IntExpr asInt() {
        return (IntExpr) this;
    }

    public FloatExpr asFloat() {
        return (FloatExpr) this;
    }
}

Other expression classes — UnaryOperatorExpr, IntExpr, and FloatExpr — implement the other supported numerical expressions currently defined using similar techniques.

Building ASTs: Parser.java

Now that the individual expressions have been implemented, the next task is the parser which will build them from program tokens.

As mentioned in the previous lesson, this course will use a by-hand recursive descent parser for simplicity. Certainly other good approaches to parsing exist — parser generators and monadic parsers, for example, which were excellent suggestions in the last lesson’s reddit thread — but there is no substitute for building a parser by hand, especially when learning first principles.

At the highest level, the concept behind a recursive descent parser is simple: model program syntax on the call stack using recursive functions. More practically for this lesson, though, this lesson’s parser will use several methods to parse numerical expressions: one for each level of operator precedence, discussed above. The top-level parsing method for expressions, expr:

public Expr expr() throws IOException, ParseException {
    return expr1();
}

calls into the lowest-precedence parsing method, expr1:

protected Expr expr1() throws IOException, ParseException {
    Expr result=expr2();

    boolean replaced;
    do {
        replaced = false;
        if(getTokens().peekType() == Token.Type.PLUS) {
            getTokens().consumeType(Token.Type.PLUS);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.PLUS, result, expr2());
            replaced = true;
        } else
        if(getTokens().peekType() == Token.Type.MINUS) {
            getTokens().consumeType(Token.Type.MINUS);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.MINUS, result, expr2());
            replaced = true;
        }
    } while(replaced);
    
    return result;
}

Note that expr is public and expr1 is protected. The only method (currently) in the parser that is meaningful for code outside the parser is the expr method, so it simply acts as a publicly-accessible alias for the “real” expression parsing code, which starts with expr1.

The expr1 method implements parsing for the lowest-precedence operators, which are the A and S from PEMDAS: addition and subtraction. The method first consumes a higher-precedence Expr — maybe a multiplication expression, for example — using expr2, and then checks to see if there is a + or - token it should process. If it finds one, it consumes the token, and then builds a corresponding BinaryOperatorExpr using the previously-parsed expression and another higher-precedence expression via expr2. This approach correctly models these operators as left-associative, since it makes “left-deep” expressions.

The next-higher parsing method, expr2, uses a very similar approach, but looks for *, /, and % tokens (the MD part of PEMDAS) instead of + and -, and in turn calls into an even higher precedence parsing method, expr3, which should provide some insight into how a recursive descent parser models program syntax on the call stack:

protected Expr expr2() throws IOException, ParseException {
    Expr result=expr3();

    boolean replaced;
    do {
        replaced = false;
        if(getTokens().peekType() == Token.Type.STAR) {
            getTokens().consumeType(Token.Type.STAR);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.TIMES, result, expr3());
            replaced = true;
        } else
        if(getTokens().peekType() == Token.Type.SLASH) {
            getTokens().consumeType(Token.Type.SLASH);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.DIVIDE, result, expr3());
            replaced = true;
        } else
        if(getTokens().peekType() == Token.Type.PERCENT) {
            getTokens().consumeType(Token.Type.PERCENT);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.MOD, result, expr3());
            replaced = true;
        }
    } while(replaced);

    return result;
}

This style of parsing method finally ends with the value method, which is the highest-precedence parsing method in the current parser. value simply parses the atomic values of our numerical expressions — integers and floating point numbers — and parenthetical expressions, which it handles by simply parsing another expression between the parentheses:

protected Expr value() throws IOException, ParseException {
    Expr result;

    if(getTokens().peekType() == Token.Type.INT)
        result = new IntExpr(Long.parseLong(tokens.nextToken().getText()));
    else
    if(getTokens().peekType() == Token.Type.FLOAT)
        result = new FloatExpr(Double.parseDouble(tokens.nextToken().getText()));
    else
    if(getTokens().peekType() == Token.Type.LPAREN) {
        getTokens().consumeType(Token.Type.LPAREN);
        result = expr();
        getTokens().consumeType(Token.Type.RPAREN);
    }
    else
        throw new ParseException("Unexpected token: "+tokens.peekToken().getText(), tokens.peekToken().getOffset());

    return result;
}

By implementing the various layers of precedence in different mutually recursive coroutines, this recursive descent parser can correctly model the order of operations of Stork’s numerical expressions.

Implementing the testing harness: Stork.java

The main class Stork.java has been modified to drive the core parsing classes, as opposed to the core tokenization classes from the last lesson. This class now exposes a simple REPL that reads a single numerical expression on a line, parses it, and prints an ASCII representation of that tree to the console.

Downloading and Running Stork

Stork is designed to be easy to download, build, and run. As along as you have Git, Maven, and a JDK installed, you can get this lesson of Stork up and running on your computer like this:

$ git clone https://github.com/sigpwned/stork.git
$ cd stork
$ git checkout lesson2
$ mvn compile
$ cd target/classes
$ java com.sigpwned.stork.Stork
>>> 1+2*(3+4)
    +
    |-- 1
    |-- *
        |-- 2
        |-- +
            |-- 3
            |-- 4
>>> ^D
$

Next Lesson: Expression Evaluation

The next step in building Stork will be to transform the parser’s ASTs into an in-memory representation that can evaluate the numerical expressions those trees represent. This lesson will be the first introduction to Stork’s interpreter, so it should be very interesting for readers not familiar with interpreter internals.

Exercises

  1. Right now, Stork only understands +, -, *, /, %, and () operators (PMDAS from PEMDAS). How could parser support for exponentiation be added, for example with the ** operator?
  2. Modify the program to use the token mod to represent the modulus operator instead of %.
  3. Modify the program to make addition and subtraction right-associative as opposed to left-associative. (Referring to the method expr3 in the code may be useful.)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Got Registered

I was able to register, it seems the captcha has some domain or security issue and I had to force it to show up. Anyway, can we talk more about the ‘replaced’ business? I’m not sure what that’s all about.

Wonderful!

I was able to register, it seems the captcha has some domain or security issue and I had to force it to show up.

Glad you could register! Sorry you had such trouble. :(

Anyway, can we talk more about the ‘replaced’ business? I’m not sure what that’s all about.

Do you mean the replaced variable in the expr*() methods? The replaced variable simply determines if a method has parsed more tokens or not to replace the original result obtained by the next-highest parsing method.

In more detail, the expr*() methods parse one level of operator precedence. Each method does its work by calling the next-highest parsing method — e.g., expr1() calls expr2() — and then checking to see if it should consume more tokens.

Consider the expression 2*3+4. expr1() immediately calls expr2(), which would consume 2*3:

Expr result=expr2();

In expr1(), result now contains the expression 2*3, and the remaining tokens are +4. Next, expr1() peeks at the next token (+), and determine that yes, it should parse more tokens at if(getTokens().peekType() == Token.Type.PLUS):

    boolean replaced;
    do {
        replaced = false;
        if(getTokens().peekType() == Token.Type.PLUS) {
            getTokens().consumeType(Token.Type.PLUS);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.PLUS, result, expr2());
            replaced = true;
        } else
        if(getTokens().peekType() == Token.Type.MINUS) {
            getTokens().consumeType(Token.Type.MINUS);
            result   = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.MINUS, result, expr2());
            replaced = true;
        }
    } while(replaced);

expr1() then consumes the Token.Type.PLUS token, and then replaces its original result (from expr2()) with a new BinaryOperatorExpr that represents 2*3 on the left, + as the operator, and 4 on the right (obtained from expr2() again). expr1 then sets a flag to indicate that it has read more tokens and replaced the original result, which then causes the do while() loop to go through another iteration, which allows expr1() to check for more operators.

On the next iteration, expr1(), the next token is the EOF token, so expr1() will not see + or -, so replaced will remain false, so the loop bails out. expr1() then returns (2*3) + (4):

   return result;

Hopefully that made sense! Did that answer your question?