# Stork, An Example Programming Language, Lesson 2: Expression Parsing

### Welcome back!

For those of you just joining, Stork is an example programming language “course” designed to demonstrate the principles of programming language implementation in 10 “lessons.” This is Lesson 2 in a series of 10, so if you’re just joining now, you may want to check out lesson 1 to gear up a bit for this story.

Lesson 2 covers the basics of parsing numerical expressions using the tokenizer implemented in Lesson 1. Evaluation of these expressions will be handled in Lesson 3.

The code for this lesson is available on github under the tag lesson2, and you can follow the discussion about this lesson on reddit.

### What is Parsing?

If a programming language is a (more) convenient language humans can use to describe tasks to computers, then parsing is the process of turning a program’s tokens into sentences, or “abstract syntax trees” (ASTs), that the computer can understand. For example, consider this simple mathematical expression for the area of a circle with radius 5:

3.14159*5*5

For this program text, the tokens would be `3.14159`

, `*`

, `5`

, `*`

, `5`

, and a parser would build the following AST for it:

Clearly, parsing is essentially “sentence diagramming” for a programming language.

This lesson covers how to transform a token stream into a parse tree like the above example. Looking at parse trees — the syntactic relationships among tokens — instead of the tokens themselves will make evaluating those expressions much easier in the next lesson.

### How does Parsing Work?

Just like a sequence of different kinds of words — nouns, verbs, adjectives, etc. — make up a sentence, a sequence of different kinds of tokens make up a statement in a programming language. *Parsers* are the programs that convert a program’s tokens into ASTs that represent the semantics of that program.

Parsers are a little more complex than tokenizers, but are still fairly simple. Essentially, a parser reads in tokens and interprets them as programming language statements until it runs out of tokens. If it can’t interpret a sequence of tokens as a valid statement, then that’s an error. (Such “unexpected token” errors are a common (and frustrating) source of compile errors.) This lesson covers how to build a parser for statements that evaluate simple numerical expressions, but many more statement types will be covered in this course as more features are added to Stork.

ASTs are simple trees that have tokens as their nodes. When parsing numerical expressions, the nodes of ASTs are either numbers (like `3.14159`

) or operators (like `*`

). Operators take arguments called *operands*, and so operators must be interior nodes with their operands as their children. Operands can be either number nodes or other operator nodes, recursively. Numbers are themselves values, so they have no children and are necessarily leaf nodes.

Example of Valid and Invalid Numerical Expression ASTs

#### Handling Ambiguity

Interestingly, simple numerical expressions are ambiguous, which is to say that there is more than one valid AST for a given valid numerical expression. For example, the mathematical expression for the area of a circle with radius 5 given above:

3.14159*5*5

can actually be parsed legally in two different ways:

Both ASTs are perfectly valid. In this case, both expression trees would result in the same value because multiplication is associative, but for other expressions, like `1-2-3`

, this ambiguity is unacceptable because evaluating the different ASTs would result in different answers. Programming languages define operator associativity and order of operations rules for all operators to eliminate this kind of ambiguity, but ultimately these rules boil down to following conventions taught in grade school.

Operator associativity is actually a familiar concept from arithmetic, but is never given a name in most classrooms. Operator associativity resolves ambiguity when the same operator (or similar operators) are applied one after the other in sequence, as in `1+2+3`

. There are two kinds of associativity: left-associative, where a sequence of operators *binds* left-first, and right-associative, where a sequence of operators *binds* right-first. Addition, for example, is left-associative: In the expression `1+2+3`

, the `1+2`

is evaluated first, so the expression evaluates like `(1+2)+3`

rather than `1+(2+3)`

. Unary negative, on the other hand, is right-associative, so `--3`

(“negative negative 3”) evaluates like `-(-3)`

. Associativity resolves ambiguity by simple convention.

Order of operations is also familiar, but most people probably remember it as PEMDAS: parentheses, exponents, multiplication, division, addition, subtraction. In an expression that uses multiple operators, like `1+2*3`

, order of operations dictates which part of the expression should be evaluated first. In this example, multiplication has *higher precendence* than addition, so it binds more tightly and is evaluated first. So, `1+2*3`

is evaluated as `1+(2*3)`

rather than `(1+2)*3`

. Again, order of operations resolves ambiguity by simple convention.

### Implementing Stork’s Numerical Expression Parser

Creating Stork’s parsing engine requires the implementation of two primary abstractions: a parser to perform the token-to-AST conversion, and the AST object that will be emitted. These abstractions are implemented in a few new classes:

`src/main/java/com/sigpwned/stork/parse/Parser.java`

— Converts tokens to ASTs

`src/main/java/com/sigpwned/stork/ast/AST.java`

— AST parent class

`src/main/java/com/sigpwned/stork/ast/Expr.java`

— Expression AST parent class

`src/main/java/com/sigpwned/stork/ast/expr/BinaryOperatorExpr.java`

— Expression AST for binary operators, like`+`

and`-`

`src/main/java/com/sigpwned/stork/ast/expr/FloatExpr.java`

— Expression AST for floating-point numbers

`src/main/java/com/sigpwned/stork/ast/expr/IntExpr.java`

— Expression AST for integers

`src/main/java/com/sigpwned/stork/ast/expr/UnaryOperatorExpr.java`

— Expression AST for unary operators, like unary`-`

#### Implementing ASTs: AST.java, Expr.java, and children

ASTs are recursive data structures, which makes them extremely flexible and able to model even complex program syntax. The top-level AST class called `AST`

(apropos!) is essentially a trivial implementation of an unsorted n-ary tree:

public abstract class AST { private List<AST> children; protected AST() { this.children = new ArrayList<AST>(); } public List<AST> getChildren() { return children; } protected void addChild(AST child) { getChildren().add(child); } }

This data structure can represent every syntactic feature Stork will use, but does not itself do anything useful. Rather, Stork’s syntactic constructs are implemented by subclasses of `AST`

, like `BinaryOperatorExpr`

which implements a binary operator expression:

public class BinaryOperatorExpr extends Expr { public static enum Operator { PLUS("+"), MINUS("-"), TIMES("*"), DIVIDE("/"), MOD("%"); private String text; private Operator(String text) { this.text = text; } public String getText() { return text; } } private Operator operator; public BinaryOperatorExpr(Operator operator, Expr left, Expr right) { this.operator = operator; addChild(left); addChild(right); } public Operator getOperator() { return operator; } public Expr getLeft() { return (Expr) getChildren().get(0); } public Expr getRight() { return (Expr) getChildren().get(1); } }

This `BinaryOperatorExpr`

class implements five operators with arity 2 (hence “binary”), per the `BinaryOperatorExpr.Operator`

inner class: `+`

, `-`

, `*`

, `/`

, and `%`

. The class also exposes a convenient two-expression constructor and accessors for each operand, which makes `BinaryOperatorExpr`

much easier to use than the `AST`

class.

Careful readers will also notice that `BinaryOperatorExpr`

’s parent is not `AST`

, but rather `Expr`

. This is a simple class that acts as a “bridge” between the raw `AST`

type and syntactic construct implementations:

public abstract class Expr extends AST { public BinaryOperatorExpr asBinaryOperator() { return (BinaryOperatorExpr) this; } public UnaryOperatorExpr asUnaryOperator() { return (UnaryOperatorExpr) this; } public IntExpr asInt() { return (IntExpr) this; } public FloatExpr asFloat() { return (FloatExpr) this; } }

Other expression classes — `UnaryOperatorExpr`

, `IntExpr`

, and `FloatExpr`

— implement the other supported numerical expressions currently defined using similar techniques.

#### Building ASTs: Parser.java

Now that the individual expressions have been implemented, the next task is the parser which will build them from program tokens.

As mentioned in the previous lesson, this course will use a by-hand recursive descent parser for simplicity. Certainly other good approaches to parsing exist — parser generators and monadic parsers, for example, which were excellent suggestions in the last lesson’s reddit thread — but there is no substitute for building a parser by hand, especially when learning first principles.

At the highest level, the concept behind a recursive descent parser is simple: model program syntax on the call stack using recursive functions. More practically for this lesson, though, this lesson’s parser will use several methods to parse numerical expressions: one for each level of operator precedence, discussed above. The top-level parsing method for expressions, `expr`

:

public Expr expr() throws IOException, ParseException { return expr1(); }

calls into the lowest-precedence parsing method, `expr1`

:

protected Expr expr1() throws IOException, ParseException { Expr result=expr2(); boolean replaced; do { replaced = false; if(getTokens().peekType() == Token.Type.PLUS) { getTokens().consumeType(Token.Type.PLUS); result = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.PLUS, result, expr2()); replaced = true; } else if(getTokens().peekType() == Token.Type.MINUS) { getTokens().consumeType(Token.Type.MINUS); result = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.MINUS, result, expr2()); replaced = true; } } while(replaced); return result; }

Note that `expr`

is `public`

and `expr1`

is `protected`

. The only method (currently) in the parser that is meaningful for code outside the parser is the `expr`

method, so it simply acts as a publicly-accessible alias for the “real” expression parsing code, which starts with `expr1`

.

The `expr1`

method implements parsing for the lowest-precedence operators, which are the A and S from PEMDAS: addition and subtraction. The method first consumes a higher-precedence `Expr`

— maybe a multiplication expression, for example — using `expr2`

, and then checks to see if there is a `+`

or `-`

token it should process. If it finds one, it consumes the token, and then builds a corresponding `BinaryOperatorExpr`

using the previously-parsed expression and another higher-precedence expression via `expr2`

. This approach correctly models these operators as left-associative, since it makes “left-deep” expressions.

The next-higher parsing method, `expr2`

, uses a very similar approach, but looks for `*`

, `/`

, and `%`

tokens (the MD part of PEMDAS) instead of `+`

and `-`

, and in turn calls into an even higher precedence parsing method, `expr3`

, which should provide some insight into how a recursive descent parser models program syntax on the call stack:

protected Expr expr2() throws IOException, ParseException { Expr result=expr3(); boolean replaced; do { replaced = false; if(getTokens().peekType() == Token.Type.STAR) { getTokens().consumeType(Token.Type.STAR); result = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.TIMES, result, expr3()); replaced = true; } else if(getTokens().peekType() == Token.Type.SLASH) { getTokens().consumeType(Token.Type.SLASH); result = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.DIVIDE, result, expr3()); replaced = true; } else if(getTokens().peekType() == Token.Type.PERCENT) { getTokens().consumeType(Token.Type.PERCENT); result = new BinaryOperatorExpr(BinaryOperatorExpr.Operator.MOD, result, expr3()); replaced = true; } } while(replaced); return result; }

This style of parsing method finally ends with the `value`

method, which is the highest-precedence parsing method in the current parser. `value`

simply parses the atomic values of our numerical expressions — integers and floating point numbers — *and* parenthetical expressions, which it handles by simply parsing another expression between the parentheses:

protected Expr value() throws IOException, ParseException { Expr result; if(getTokens().peekType() == Token.Type.INT) result = new IntExpr(Long.parseLong(tokens.nextToken().getText())); else if(getTokens().peekType() == Token.Type.FLOAT) result = new FloatExpr(Double.parseDouble(tokens.nextToken().getText())); else if(getTokens().peekType() == Token.Type.LPAREN) { getTokens().consumeType(Token.Type.LPAREN); result = expr(); getTokens().consumeType(Token.Type.RPAREN); } else throw new ParseException("Unexpected token: "+tokens.peekToken().getText(), tokens.peekToken().getOffset()); return result; }

By implementing the various layers of precedence in different mutually recursive coroutines, this recursive descent parser can correctly model the order of operations of Stork’s numerical expressions.

#### Implementing the testing harness: Stork.java

The main class `Stork.java`

has been modified to drive the core parsing classes, as opposed to the core tokenization classes from the last lesson. This class now exposes a simple REPL that reads a single numerical expression on a line, parses it, and prints an ASCII representation of that tree to the console.

### Downloading and Running Stork

Stork is designed to be easy to download, build, and run. As along as you have Git, Maven, and a JDK installed, you can get this lesson of Stork up and running on your computer like this:

```
$ git clone https://github.com/sigpwned/stork.git
$ cd stork
$ git checkout lesson2
$ mvn compile
$ cd target/classes
$ java com.sigpwned.stork.Stork
>>> 1+2*(3+4)
+
|-- 1
|-- *
|-- 2
|-- +
|-- 3
|-- 4
>>> ^D
$
```

### Next Lesson: Expression Evaluation

The next step in building Stork will be to transform the parser’s ASTs into an in-memory representation that can evaluate the numerical expressions those trees represent. This lesson will be the first introduction to Stork’s interpreter, so it should be very interesting for readers not familiar with interpreter internals.

### Exercises

- Right now, Stork only understands
`+`

,`-`

,`*`

,`/`

,`%`

, and`()`

operators (PMDAS from PEMDAS). How could parser support for exponentiation be added, for example with the`**`

operator? - Modify the program to use the token
`mod`

to represent the modulus operator instead of`%`

. - Modify the program to make addition and subtraction right-associative as opposed to left-associative. (Referring to the method
`expr3`

in the code may be useful.)

## Got Registered

I was able to register, it seems the captcha has some domain or security issue and I had to force it to show up. Anyway, can we talk more about the ‘replaced’ business? I’m not sure what that’s all about.

## Wonderful!

Glad you could register! Sorry you had such trouble. :(

Do you mean the

`replaced`

variable in the`expr*()`

methods? The`replaced`

variable simply determines if a method has parsed more tokens or not toreplacethe original result obtained by the next-highest parsing method.In more detail, the

`expr*()`

methods parse one level of operator precedence. Each method does its work by calling the next-highest parsing method — e.g.,`expr1()`

calls`expr2()`

— and then checking to see if it should consume more tokens.Consider the expression

`2*3+4`

.`expr1()`

immediately calls`expr2()`

, which would consume`2*3`

:In

`expr1()`

,`result`

now contains the expression`2*3`

, and the remaining tokens are`+4`

. Next,`expr1()`

peeks at the next token (`+`

), and determine thatyes, it should parse more tokens at`if(getTokens().peekType() == Token.Type.PLUS)`

:`expr1()`

then consumes the`Token.Type.PLUS`

token, and thenreplacesits original result (from`expr2()`

) with a new`BinaryOperatorExpr`

that represents`2*3`

on the left,`+`

as the operator, and`4`

on the right (obtained from`expr2()`

again).`expr1`

then sets a flag to indicate that it has read more tokens and replaced the original result, which then causes the`do while()`

loop to go through another iteration, which allows`expr1()`

to check for more operators.On the next iteration,

`expr1()`

, the next token is the EOF token, so`expr1()`

will not see`+`

or`-`

, so`replaced`

will remain false, so the loop bails out.`expr1()`

then returns`(2*3) + (4)`

:Hopefully that made sense! Did that answer your question?